
Reliability on Torizon

 

Article updated at 01 Jul 2021

Select the version of your OS from the tabs below. If you do not know which version you are using, run cat /etc/os-release or cat /etc/issue on the board.



Remember that you can always refer to the Torizon Documentation, where you can find many articles that may help you with application development.

Torizon 5.4.0

Introduction

Reliability is an important topic for embedded systems. Once you have deployed thousands of devices to the field, malfunctions or successful attacks may harm people and equipment, and may incur costs for on-site maintenance.

Torizon strives to be a reliable system from its conception and at all levels. Be it on TorizonCore or our tools and features, like TorizonCore Builder and our OTA update system, we care about providing safe defaults and guidance to our customers.

In this article, we highlight some of the aspects that make Torizon a reliable platform.

Prerequisites

It is recommended that you:

Docker data integrity checker

Docker data might get corrupted on the device. This is rare, but it may happen in specific cases such as malfunctioning hardware or unintended power cuts during write operations to the storage device (NAND or eMMC). TorizonCore minimizes the risk because most of the filesystem is mounted read-only and journaling is enabled on read-write mount points. If corruption does occur, it can prevent containers from starting.

To recover from such situations, TorizonCore provides a feature called Docker integrity checker.

If the docker-compose systemd service is not able to start all containers successfully, the docker-integrity-checker systemd service will be triggered.

This service performs an integrity check on all installed Docker images that are defined in /var/sota/storage/docker-compose/docker-compose.yml, which is the file used by docker-compose.service.

If any Docker image is identified as corrupted, it is deleted and pulled again from the container registry.

This feature is currently disabled by default in TorizonCore, and can be enabled by creating the /etc/docker/enable-integrity-checker file:

# touch /etc/docker/enable-integrity-checker

Warning: This feature can create additional network traffic in case a corrupted container image is detected.
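
To confirm whether the feature is switched on, one option is to test for the marker file. The following is a minimal sketch and not part of TorizonCore: the MARKER variable and the integrity_checker_enabled helper are hypothetical names introduced for illustration, and only the file path comes from this article.

```shell
#!/bin/sh
# Path of the marker file described in this article.
# MARKER is a hypothetical override so the check can be tried elsewhere.
MARKER="${MARKER:-/etc/docker/enable-integrity-checker}"

# Hypothetical helper: succeeds if the given marker file exists.
integrity_checker_enabled() {
    [ -e "$1" ]
}

if integrity_checker_enabled "$MARKER"; then
    echo "Docker integrity checker: enabled"
else
    echo "Docker integrity checker: disabled"
fi
```

Removing the marker file should, under the same assumption, return the feature to its default disabled state.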

Docker health monitor

A container may sometimes appear to be up and running without actually working as intended. To improve the reliability of the system, TorizonCore can monitor the health of running containers and restart them if needed.

To monitor a container in TorizonCore, one must:

  • Declare a user-defined check to determine the health state of a running container
  • Label the container with "autoheal=true"
  • Enable the docker-watchdog.service systemd service

When these conditions are met, TorizonCore checks the health state of the container every 5 minutes and restarts it if the "unhealthy" state is detected.
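
Conceptually, one monitoring pass resembles the loop below. This is only a sketch of the idea, not the actual docker-watchdog implementation, which this article does not show. The restart_unhealthy function name is hypothetical. It relies on the standard health and label filters of docker ps and on docker restart.

```shell
#!/bin/sh
# Sketch only: restart every running container that is labeled
# autoheal=true and currently reports the "unhealthy" state.
restart_unhealthy() {
    docker ps --quiet \
        --filter "health=unhealthy" \
        --filter "label=autoheal=true" |
    while read -r id; do
        echo "restarting $id"
        docker restart "$id" >/dev/null
    done
}

# Run a single pass only when the docker CLI is available.
if command -v docker >/dev/null 2>&1; then
    restart_unhealthy
fi
```

A container without the label, or without a health check, never matches both filters and is therefore left alone in this sketch.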

User-defined check

Docker containers can be configured with a check that determines whether or not a running container is in a "healthy" state.

Here is an example of defining a health check. In this case, it checks that /tmp/.X11-unix/X0 exists and is a socket:

healthcheck:
    test: ["CMD", "test", "-S", "/tmp/.X11-unix/X0"]
    interval: 5s
    timeout: 4s
    retries: 2
    start_period: 10s

If the socket doesn't exist and the check fails on consecutive attempts (2 retries in this example), the container will become "unhealthy". More information is available in the Docker documentation on health checks.
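
The same test can be run by hand to see how the check would evaluate. The sketch below mirrors the health check command shown above. The check_socket helper is a hypothetical name, and the printed words only imitate Docker's states for illustration.

```shell
#!/bin/sh
# Mirror the health check from the compose snippet: "test -S" succeeds
# only if the path exists and is a socket.
check_socket() {
    if test -S "$1"; then
        echo "healthy"
    else
        echo "unhealthy"
    fi
}

check_socket /tmp/.X11-unix/X0
```

For a running container, the state recorded by Docker itself can be read with docker inspect --format '{{.State.Health.Status}}' followed by the container name.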

Label

Every container that is to be monitored must be labeled with "autoheal=true":

    labels:
      - autoheal=true
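
To show where the two fragments sit relative to each other, the sketch below writes a complete example compose file to a scratch directory. The service name app, the image example/app:latest and the file format version are hypothetical placeholders. Only the healthcheck and labels entries come from this article.

```shell
#!/bin/sh
# Write an example compose file to a scratch directory.
# "app" and "example/app:latest" are hypothetical placeholders.
out_dir=$(mktemp -d)
cat > "$out_dir/docker-compose.yml" <<'EOF'
version: "3.7"
services:
  app:
    image: example/app:latest
    healthcheck:
      test: ["CMD", "test", "-S", "/tmp/.X11-unix/X0"]
      interval: 5s
      timeout: 4s
      retries: 2
      start_period: 10s
    labels:
      - autoheal=true
EOF
echo "wrote $out_dir/docker-compose.yml"
```

Both healthcheck and labels belong at the same indentation level, directly under the service they apply to. The generated file can be validated with docker-compose -f followed by the file path and the config subcommand.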

Enabling docker-watchdog service

The docker-watchdog systemd service can be enabled by running:

# sudo systemctl enable docker-watchdog.service

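After enabling and starting this service, all containers configured with a health check and label as described above will be monitored and restarted if they become "unhealthy". Note that systemctl enable only takes effect on the next boot; to start the service immediately, also run systemctl start docker-watchdog.service.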
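
To verify that the service is both enabled and running, the standard systemctl is-enabled and is-active queries can be combined. This is a minimal sketch, and the report_unit helper is a hypothetical name introduced here.

```shell
#!/bin/sh
# Print the enablement and activity state of a systemd unit.
# Both queries exit non-zero for states such as "disabled" or
# "inactive", so keep their output and fall back to "unknown" only
# when nothing was printed (for example, if systemctl is missing).
report_unit() {
    unit="$1"
    enabled=$(systemctl is-enabled "$unit" 2>/dev/null) || enabled="${enabled:-unknown}"
    active=$(systemctl is-active "$unit" 2>/dev/null) || active="${active:-unknown}"
    echo "$unit: enabled=$enabled active=$active"
}

report_unit docker-watchdog.service
```

A result showing the unit as enabled but inactive would indicate that it was enabled without being started in the current boot.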