Reliability on Torizon
Reliability is an important topic for embedded systems. Once you have deployed thousands of devices to the field, malfunction may cause harm to people and equipment and may imply costs for on-site maintenance.
From Wikipedia's page on Reliability engineering, in a nutshell:
Reliability engineering is [...] the ability of equipment to function without failure.
Torizon strives to be a reliable system from its conception and at all levels. Be it on Torizon OS or our tools and features, like TorizonCore Builder and Torizon Cloud, we care about providing reliable defaults, best practices, and guidance to our customers.
This article is divided into two top-level sections, each with its own sub-sections: How Torizon is made reliable, and Torizon OS services to increase reliability.
It isn't in the scope of this article to go through Torizon security, even though insecure systems may adversely impact their reliability.
On toradex.com, additional resources about the reliability of Toradex products are available:
How Torizon is made reliable
This section provides information on how Torizon is made reliable, from the hardware to its several software layers and features.
Ultimately, on top of all our efforts, we also expect you to test and validate your products. Our focus on making Torizon a platform that embraces DevOps makes it easy for you to set up your CI/CD/CT infrastructure and take the next steps in ensuring reliability.
Torizon OS is integrated with Toradex hardware out-of-the-box and therefore it benefits from hardware tests and validation. This section derives from and complements the Quality and Reliability page on our main website:
- Design and sourcing of components: from conception, we focus on industrial-grade designs that are meant to operate 24/7 at full load, and to achieve reliability under these conditions, we only use industrial-grade SoCs. When sourcing components, we only procure original components, according to our Counterfeit Parts Prevention Policy.
- Off-the-shelf SoMs: we provide a small number of off-the-shelf configurations for each SoM model. Because many customers use each of these configurations in many different applications, a large customer base benefits from the bug fixes and improvements made to each SoM model and configuration.
- Verification and validation: for all our products, we thoroughly validate that their design meets specifications, running tests over the rated temperature range. For transparency, as soon as a problem on early hardware samples is identified, it is added to the errata/known issues section on the affected hardware page. See an example for Verdin iMX8M Plus errata/known issues. Once products are fully validated, their product life-cycle state is promoted to volume product.
- Functional testing: every Toradex SoM gets a unique serial number and goes through extensive functional testing at the end of the production line. The test results for every SoM are archived, allowing full traceability and early detection of manufacturing issues.
- Shock and vibration testing: an internationally-accredited testing laboratory performed shock and vibration compliance tests on various configurations of Toradex SoMs, carrier boards, and fastening methods, passing different EN standards.
- ISO 9001: Toradex and the electronic manufacturing services (EMS) contracted by Toradex are certified ISO 9001.
Software strategy, transparency, and testing
Torizon is an end-to-end software platform, composed of various software components. It includes:
- The Board Support Package (BSP), which comprises the bootloader and the Linux kernel, among other components.
- The Torizon OS distribution, including specific kernel configuration, the init system, base libraries, the Docker container engine, and the OSTree and Aktualizr libraries for system updates, among others.
- Torizon Cloud, which provides a series of cloud services such as remote updates, device monitoring, fleet management, and remote access, among others.
Given the number of software components that either run on the device or directly affect it, we invest heavily in continuous integration (CI), continuous delivery (CD), test automation and continuous testing (CT) to ensure that our customers' use cases are covered and working. On top of that, we run manual tests to cover product features whose tests are still in the process of being automated.
- Embedded Linux support strategy: BSP Layers, Reference images, and Torizon OS images are released as often as every day, with different qualifications and intended uses. For instance, our quarterly releases are extensively tested and, therefore, intended to be used in production. Learn the details on the Embedded Linux Support Strategy page.
- Upstream Linux kernel: the Linux kernel is one of the most important components of our Embedded Linux releases. The upstream kernel - the one maintained on kernel.org - is generally accepted to be of higher overall quality, as contributions to this project are subject to scrutiny and high standards, and the resulting code is widely used on millions of devices around the world. All contributions are continuously carried over to newer kernel releases, making the software supported for long periods. Toradex adopts the upstream Linux kernel on SoMs whose upstream support is featureful enough for our customers' use cases. Newer SoCs are typically not yet well supported upstream, and in such cases, we release and maintain in-house downstream forks based on the ones provided by the SoC vendors, such as NXP.
- New feature and issue trackers: our BSP issue tracker and Torizon issue trackers - which are also new feature trackers - are regularly updated to reflect the latest state of our software. Subscribe for updates and track SoMs, components, and subsystems that are relevant to your project. On a higher level, on each SoM page on toradex.com, there is a roadmap section that lists the software support status.
- Automated tests: BSP Layers (through the Reference Images for Yocto Project) and Torizon OS releases - nightly, monthly, quarterly, and LTS - are all tested as often as every day in our continuous testing infrastructure. The test results are not yet publicly exposed (as of Q4 2022), therefore failed automated tests are at the moment added to our issue trackers. We focus on adding reproducible tests, so if a defect or regression is identified, it can be tracked to a specific software version, if required. Our test coverage is reflected by important and relevant test cases focused on our customers' use cases.
- Manual tests: even with our continuous focus and effort to automate as many tests as possible, there is still a subset that must be run manually. We run the manual tests for all SoMs supported by Torizon OS on a quarterly basis, always before a quarterly release. Failed tests are tracked on our issue trackers.
Resilience, being the ability to endure difficulties, is a contributing factor to increased reliability. In Torizon OS, it is present in various forms, as outlined below:
- Docker data integrity checker: Torizon OS comes with a service that checks the integrity of Docker data and tries to recover from data corruption, which, although rare, may happen. Learn more in the section Docker data integrity checker.
- Docker container health monitor: bugs are inherent to software development, and most likely unknown until experienced in the field. The container health monitor allows you to set rules to continuously check if your application is running, and trigger a reset if it deviates from the expected behavior. Learn more in the section Docker container health monitor.
- Docker daemon watchdog: The Docker daemon has a watchdog configured to reset the board if Docker crashes.
- Container limiting of resources: when starting a container, you can constrain the available memory and maximum CPU usage. This is an additional layer of resilience in case a bug in your application or any of its dependencies suddenly becomes resource-hungry. Learn more on the Toradex article How to Configure CPU Usage with Torizon and the Docker documentation article Runtime options with Memory, CPUs, and GPUs.
- Applications detached from the base OS: containers include the entire root filesystem required to run an application, therefore the dependency on the base OS is reduced. It allows you to focus on ensuring your application is functional without taking into account all changes on the base OS. As long as the Docker engine and any specific hardware functionalities are working, so is your application.
- Frequent releases and updates: since Torizon is a DevOps-oriented platform, frequent releases are a core functionality. As they include the latest security and bug fixes from upstream projects, your devices tend to become more reliable over time. Frequent updates also allow you to deploy bug fixes to your application in a fast and cost-effective manner, preventing a bug identified on a specific device from occurring across your entire fleet.
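As a minimal sketch of the container resource limits mentioned in the list above, the shell function below writes a Compose file that caps memory and CPU usage for one service. The service name my-app, the image my-registry/my-app:1.0, the limit values, and the file format version are hypothetical placeholders; mem_limit and cpus are Compose options, but check them against the Compose version you actually use.

```shell
#!/bin/sh
# Sketch: generate a Compose file that constrains one container.
# Service name, image, and limit values are hypothetical placeholders.
write_limited_compose() {
    out_file="$1"
    cat > "$out_file" <<'EOF'
version: "2.4"
services:
  my-app:
    image: my-registry/my-app:1.0
    # Hard cap on the memory available to the container
    mem_limit: 256m
    # At most half of one CPU core
    cpus: 0.5
EOF
}
```

Calling write_limited_compose ./docker-compose.yml produces the file. When starting a container by hand, the same limits can be passed to docker run with the --memory and --cpus options.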
Reliable and frequent updates
Torizon, as a DevOps platform, has frequent updates designed as a core feature. Such operations must therefore be reliable so that they do not compromise the reliability of the entire system. Here are some key aspects that ensure this:
- Power-cut tolerant: at the heart of our update system, OSTree - also known as libostree - performs the file system operations that replace one system version with a newer one. It is designed to be an atomic, power-cut tolerant operation, as explained on the blog post Is Torizon OTA Safe From Power Loss?
- Automatic rollback: if a successfully deployed update does not behave as expected - for example, if the kernel hangs or the Docker service does not start - the system automatically rolls back to the previous state, which is known to be working. You may also add custom conditions for a boot to be considered successful. The rollback ensures that the device continues to operate until the failed update is investigated and fixed. To learn more, read the Update Checks and Rollbacks section of the Aktualizr article, and the Automatic Rollback section of the OSTree article.
- Allow and block updates: some products can't have their application stopped or be rebooted at any given time. Instead, the application must decide when to allow or block updates. Torizon OS provides mechanisms to allow and block updates and to enable and disable automatic reboots. Optionally, it is also possible to customize the reboot behavior to gracefully stop the application and perform predefined actions before rebooting.
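The idea of an application deciding when updates may proceed can be sketched with an advisory file lock. The example below assumes that the update client honours a lock on /run/lock/aktualizr-lock; this path and the locking behaviour are assumptions not stated in this article, so verify them against the Torizon OS documentation for your release. The function name run_with_updates_blocked is hypothetical.

```shell
#!/bin/sh
# Assumption: the update client honours an advisory lock on this file.
# Verify the path against the Torizon OS documentation for your release.
LOCK_FILE="${LOCK_FILE:-/run/lock/aktualizr-lock}"

# Run a command while holding an exclusive lock on LOCK_FILE.
# The lock is released automatically when the command exits.
run_with_updates_blocked() {
    flock -x "$LOCK_FILE" "$@"
}
```

For example, run_with_updates_blocked ./critical-job.sh (a hypothetical script) would hold the lock only for the duration of the critical job, which keeps the blocking window as short as possible.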
Device and fleet monitoring
Aging and malfunction affect products in the field in many ways. For example, a motor may require maintenance from time to time, and an SoM that frequently logs data to its internal flash memory may have a limited expected lifetime.
Tracking important system metrics, logs, and alerts enables you to act before - or in the worst case as soon as - things go wrong.
- Integrated device monitoring: device monitoring in Torizon is integrated and enabled out-of-the-box. On the device side, the open-source log processor and forwarder Fluent Bit collects and sends data to the Torizon Cloud. Learn more in the Webinar: Secure Device Monitoring - Check Health, Resources and Performance
- Customizable metrics: Torizon OS is configured to send some data parameters by default. You can add your data collection in Torizon OS and configure the Torizon Cloud to use this data. Learn more with some use cases: Customizing device metrics for Torizon Cloud and Blog: Flash Health Monitoring on Torizon
Customers' End Products
Toradex SoMs and the base software that runs on them are only a part of your final product. Toradex's Customer’s Obligation of Validation states that:
[...] customers are obliged to do a full release testing of the intended combination of Toradex Hardware and Toradex Software Products on the target system. [...]
This ensures that your final products are successful and reliable. We trust that Torizon's focus on DevOps makes it easy for you to achieve high standards while lowering the effort required to set up and run tests.
- Reproducible deployments: for your tests to be meaningful, you must be able to deploy exactly the same software that you tested. Toradex ensures that a given release of Torizon OS does not change, and you can either check it out or even Build Torizon OS from Source With Yocto Project/OpenEmbedded, though we recommend that you take our ready-to-use binary builds and customize them with TorizonCore Builder. On the application side, our Debian containers for Torizon are also versioned and allow you to build your own reproducible containerized applications on top of them.
- End-to-end application development: the Visual Studio Code Extension for Torizon allows you to create a CI/CD pipeline script for GitHub Actions with a single command. The Torizon IDE Backend Command-line Interface is at the heart of such integration, as it allows IDE commands to be run in a headless CI environment. In the emerging Torizon VS Code Extension v2 (formerly codenamed ApolloX), under Toradex Labs as of Q4 2022, DevOps integration has been a focus since the conception of the project. This leads to an optimal experience centered on CI from the beginning of a project's development, instead of only late in the cycle, just before a release.
- OS customization CI/CD: TorizonCore Builder is packaged in a container. As all relevant CI environments allow using containers, this makes it easy to integrate your custom builds into a CI pipeline. You can leverage our nightly Torizon OS builds to integrate and deploy your custom releases to a test device fleet every day, reducing the time to identify regressions.
- Torizon Cloud integration: the Torizon Cloud API makes it possible for you to control your fleet deployments from within your CI infrastructure. You can automatically trigger updates to your test fleet(s) and check the status of deployments, among other possibilities. The API is currently in beta; let us know about your use case and we can help you with documentation and integration tips.
Torizon OS services to increase reliability
This hands-on section explains how to use services that can increase the reliability of Torizon OS.
Applying Configuration to a Custom Torizon OS Image
Once you apply the features described in this section to a board, you must create a custom Torizon OS image with the exact same configuration to install on several boards during production programming.
You can do this with the TorizonCore Builder Tool - Customization for Production Programming and Torizon Cloud, more specifically by Capturing Changes in the Configuration of a Board on Torizon OS.
It is recommended that you:
- Have an SoM with Torizon OS.
- Have a brief understanding of the Torizon OS technical overview.
- Have a brief understanding of the update system technical overview.
- Adhere to the Torizon Best Practices Guide while doing development.
- Subscribe for updates on our Torizon OS Issue Tracker.
Docker data integrity checker
Docker data might get corrupted on the device. It is a rare situation but may happen in some specific cases like malfunctioning hardware or unintended power cuts during write operations in the storage device (NAND or eMMC). The risk is minimized in Torizon OS because most of the filesystem is mounted read-only and journaling is enabled on read-write mount points. Still, if corruption does occur, containers may fail to start.
To avoid such situations, there is a feature called Docker integrity checker in Torizon OS.
If the docker-compose systemd service is not able to start all containers successfully, the docker-integrity-checker systemd service is triggered.
This service performs an integrity check on all installed Docker images that are defined in the /var/sota/storage/docker-compose/docker-compose.yml file, because this is the file used by the docker-compose systemd service.
Any Docker image identified as corrupted is deleted and pulled again from the container registry.
This feature is currently disabled by default in Torizon OS, and can be enabled by creating the /etc/docker/enable-integrity-checker file:
# touch /etc/docker/enable-integrity-checker
This feature can create additional network traffic in case a corrupted container image is detected.
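A small helper can make the enable step repeatable, for instance in a provisioning script. This is only a sketch: the function names and the FLAG_FILE override are hypothetical, and the only detail taken from this article is the flag file path.

```shell
#!/bin/sh
# Flag file named in this article; FLAG_FILE can be overridden for testing.
FLAG_FILE="${FLAG_FILE:-/etc/docker/enable-integrity-checker}"

# Create the flag file if it is missing (needs root on a real device).
enable_integrity_checker() {
    [ -e "$FLAG_FILE" ] || touch "$FLAG_FILE"
}

# Report whether the feature is switched on.
integrity_checker_status() {
    if [ -e "$FLAG_FILE" ]; then
        echo "enabled"
    else
        echo "disabled"
    fi
}
```

Running enable_integrity_checker twice is harmless, because the file is only created when it does not exist yet.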
Docker container health monitor
It might happen sometimes that a container appears to be up and running, but it’s not running as desired. To improve the reliability of the system, Torizon OS is able to monitor the health of running containers, and restart them if needed.
To monitor a container in Torizon OS, one must:
- Declare a user-defined check to determine the health state of a running container
- Label the container with autoheal=true
Given the above conditions, Torizon OS will check the container for its health state every 5 minutes and restart it if the "unhealthy" state is detected.
User-defined check
Docker containers can be configured with a check to determine whether or not running containers are in a "healthy" state.
Here is an example of defining a health check. In this case, it checks for the existence of the /tmp/.X11-unix/X0 socket:
test: ["CMD", "test", "-S", "/tmp/.X11-unix/X0"]
If the file doesn’t exist, the container will become "unhealthy". More information about Docker healthcheck is available in the Docker Compose file reference.
Every container that is going to be monitored has to be labeled with "autoheal=true".
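Putting the health check and the label together, a Compose file might look like the one generated by the sketch below. The service name my-gui-app, the image, the file format version, and the interval, timeout, and retries values are hypothetical placeholders; only the test command and the autoheal=true label come from this article.

```shell
#!/bin/sh
# Sketch: Compose file with a health check and the autoheal label.
# Service name, image, and timing values are hypothetical placeholders.
write_monitored_compose() {
    out_file="$1"
    cat > "$out_file" <<'EOF'
version: "2.4"
services:
  my-gui-app:
    image: my-registry/my-gui-app:1.0
    healthcheck:
      # Healthy only while the X11 socket exists
      test: ["CMD", "test", "-S", "/tmp/.X11-unix/X0"]
      interval: 30s
      timeout: 5s
      retries: 3
    labels:
      - autoheal=true
EOF
}
```

With these placeholder values, Docker marks the container "unhealthy" after three consecutive failed checks, and the container is restarted the next time the monitor runs.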
Enabling docker-watchdog service
The docker-watchdog systemd service can be enabled by running:
# sudo systemctl enable docker-watchdog.service
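Once this service is enabled and started, all containers configured with a health check and labeled as stated above are monitored and restarted if they become "unhealthy". Note that systemctl enable on its own only takes effect on the next boot; add the --now option or run systemctl start to start the service immediately.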
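To confirm that monitoring is in place, you can query both systemd and Docker. The helper names below are hypothetical; the commands rely on the standard systemctl is-enabled and is-active subcommands and on the health status that docker inspect reports for containers that define a health check.

```shell
#!/bin/sh
# Print the health state Docker reports for a container:
# "starting", "healthy", or "unhealthy".
container_health() {
    docker inspect --format '{{.State.Health.Status}}' "$1"
}

# Succeed only if the watchdog service is both enabled and running.
watchdog_ready() {
    systemctl is-enabled --quiet docker-watchdog.service &&
        systemctl is-active --quiet docker-watchdog.service
}
```

For example, container_health my-gui-app prints the current state of a container with that (hypothetical) name, and watchdog_ready can be used as a condition in a validation script.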
Docker live restore
Docker live restore is a feature meant to keep containers alive during daemon downtime.
It is not enabled and it doesn't make sense to use it in the default context of Torizon OS, because:
- The Docker daemon systemd service is configured to reboot the board if the daemon crashes.
- The Torizon Updates system requires a reboot for OS updates, which is the case when the Docker daemon is updated.
If the default behavior of rebooting does not meet your use case, edit the corresponding systemd service, docker.service, to disable the reboot and prevent the service from restarting automatically, then enable the Docker live restore. Other changes to the default behavior may be required, and they may also introduce unknown consequences.
Use this feature at your own risk: we do not test it extensively. Be aware that we count unsuccessful reboots to roll back failed updates. By changing the default behavior, you are removing a mechanism that can detect a potentially bad update and trigger a rollback. Other unknown side effects may be present, so make sure to validate the changes for your own use case.
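Live restore itself is a Docker daemon option, set with the live-restore key in the daemon configuration file. The sketch below assumes the standard /etc/docker/daemon.json location and that no such file exists yet; both are assumptions to verify on your image. It deliberately refuses to overwrite an existing file, and it does not cover the docker.service changes, which depend on your use case.

```shell
#!/bin/sh
# Docker daemon configuration file; override DAEMON_JSON for testing.
DAEMON_JSON="${DAEMON_JSON:-/etc/docker/daemon.json}"

# Write a minimal configuration that turns live restore on.
# Refuses to overwrite an existing file: merge the key by hand instead.
enable_live_restore() {
    if [ -e "$DAEMON_JSON" ]; then
        echo "$DAEMON_JSON exists; add \"live-restore\": true manually" >&2
        return 1
    fi
    printf '{\n  "live-restore": true\n}\n' > "$DAEMON_JSON"
}
```

After the daemon has picked up the new configuration, docker info --format '{{.LiveRestoreEnabled}}' should report true.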
To test things, make sure there is a running container, forcibly kill the Docker daemon manually, and start it again:
# sudo killall --signal SIGKILL /usr/bin/dockerd
# sudo systemctl start docker
If you restart the service with sudo systemctl restart docker, the containers will be stopped and restarted as well, regardless of the Docker configuration.
To make any of the changes reproducible, capture the changes in the configuration of a board on Torizon OS.