Run TensorFlow Lite with the NPU on i.MX 8M Plus Based Modules on Torizon OS
Introduction
This article explains how to run TensorFlow Lite with the NPU (VeriSilicon VX delegate) on i.MX 8M Plus based Toradex System on Modules (SoMs) running Torizon OS. It covers a basic setup for using the NPU inside Torizon containers through libvx_delegate.so, with a CPU fallback.
Requirements
Hardware requirements:
- An i.MX 8M Plus based Toradex module with an NPU
Software requirements:
- Torizon OS installed on the module, with NPU/VX support.
- A configured build environment, as described in the Configure Build Environment for Torizon Containers article.
- A quantized .tflite model supported by the VX delegate (a CPU fallback remains available; see the quantization sketch below)
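If your model is not yet quantized, post-training quantization with the upstream TensorFlow converter is one way to produce a VX-friendly model. The sketch below is a minimal example, not a definitive recipe: the SavedModel path, input shape, and the random representative dataset are placeholders you must replace with your own.

import numpy as np
import tensorflow as tf

# Placeholder path: point this at your own SavedModel directory.
converter = tf.lite.TFLiteConverter.from_saved_model("my_saved_model")
converter.optimizations = [tf.lite.Optimize.DEFAULT]

def representative_dataset():
    # Placeholder: yield ~100 real, preprocessed samples with the
    # model's input shape instead of random data.
    for _ in range(100):
        yield [np.random.rand(1, 300, 300, 3).astype(np.float32)]

converter.representative_dataset = representative_dataset
# Full-integer quantization tends to map best onto the NPU.
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.uint8
converter.inference_output_type = tf.uint8

with open("model_quant.tflite", "wb") as f:
    f.write(converter.convert())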
Run TensorFlow Lite on the NPU
From Torizon example
To run using an example model and dataset from Torizon:
- On your target, run the following docker run command (the --device-cgroup-rule grants the container access to the GPU/NPU device node):
docker run --rm \
--name "tflite-example" \
-v /dev:/dev \
-v /tmp:/tmp \
--device-cgroup-rule "c 199:0 rmw" \
torizon/tensorflow-lite-imx8:4
- After the run, you should expect results similar to the following:
- Running using the NPU:
INFO: Copying output 362, tfl.dequantize
INFO: Copying output 380, convert_scores1
/app/images/validation/colibri_wifi4_jpg.rf.5374189653b8a1ed91b6225899687528.jpg: som (0.0299s)
INFO: Delegate::Invoke node: 0x9959a00
INFO: Copying input 0: serving_default_input:0
INFO: Invoking graph
INFO: Copying output 362, tfl.dequantize
INFO: Copying output 380, convert_scores1
/app/images/validation/colibrinowifi5_jpg.rf.2447307ad48bf09ff77c3a1d2db22b36.jpg: som (0.0307s)
INFO: Delegate::Invoke node: 0x9959a00
INFO: Copying input 0: serving_default_input:0
INFO: Invoking graph
INFO: Copying output 362, tfl.dequantize
INFO: Copying output 380, convert_scores1
/app/images/validation/colibrinowifi7_jpg.rf.f92a43a199f8593af6175bfaffd424e8.jpg: som (0.0301s)
INFO: Delegate::Invoke node: 0x9959a00
INFO: Copying input 0: serving_default_input:0
INFO: Invoking graph
INFO: Copying output 362, tfl.dequantize
INFO: Copying output 380, convert_scores1
/app/images/validation/verdin11_jpg.rf.40e36cc5e58bac016920773d25c4402d.jpg: som (0.0296s)
INFO: Delegate::Invoke node: 0x9959a00
INFO: Copying input 0: serving_default_input:0
INFO: Invoking graph
INFO: Copying output 362, tfl.dequantize
INFO: Copying output 380, convert_scores1
/app/images/validation/verdin5_jpg.rf.6c565df4d8282284fff93a7ea1322ed2.jpg: som (0.0300s)
INFO: Delegate::Invoke node: 0x9959a00
INFO: Copying input 0: serving_default_input:0
INFO: Invoking graph
INFO: Copying output 362, tfl.dequantize
INFO: Copying output 380, convert_scores1
/app/images/validation/verdin7_jpg.rf.e1d2e776604747c5ceea49fb1b3aebe4.jpg: som (0.0302s)
Images processed: 35
Mean inference time: 0.030091490064348494
Images/s: 33.23198677970322
Std deviation: 0.00035271071249791896
- Running using the CPU:
INFO: Created TensorFlow Lite XNNPACK delegate for CPU.
/app/images/validation/20170202_130713_jpg.rf.02141be2c7a33a178eb732e60d931b08.jpg: som (0.3679s)
/app/images/validation/20170202_130748_jpg.rf.59ba7f77331c66f87e1f84d681ae24af.jpg: som (0.3665s)
/app/images/validation/20170427_091556_jpg.rf.04de51bcf68c7dcaf9cef819dfb77f96.jpg: som (0.3668s)
/app/images/validation/20171215_120124_jpg.rf.42ab3a816917acbbc52f0085baca0ebc.jpg: som (0.3661s)
/app/images/validation/20180207_110215_jpg.rf.545bf8c1a4f8879453e8d50433a98186.jpg: som (0.3666s)
/app/images/validation/20200124_092137_jpg.rf.6fd46b3b2342b41af2731e965e096eec.jpg: som (0.3664s)
/app/images/validation/20211003_124958_jpg.rf.247d395827b04d847f6d1dd72763e72c.jpg: som (0.3666s)
/app/images/validation/20240208_180942_jpg.rf.50bfcc530043756d1afe42e34fae1899.jpg: som (0.3669s)
/app/images/validation/20240208_181002_jpg.rf.8daf107e086842d887652ad65e85d845.jpg: som (0.3664s)
/app/images/validation/20240208_181018_jpg.rf.d81032846504ab42fcd07caea1457f3d.jpg: som (0.3669s)
/app/images/validation/20240209_074516_jpg.rf.f63a60a8423ea7faa6d4cfb70a2fbcb4.jpg: som (0.3665s)
/app/images/validation/20240209_074532_jpg.rf.6563ef4a988952c3c6030120a6379827.jpg: som (0.3664s)
/app/images/validation/20240209_074541_jpg.rf.3125e79e48dc249c228c3dc78b601338.jpg: som (0.3669s)
/app/images/validation/20240209_074559_jpg.rf.63413dac635d4525c4a568dee4dc5bc0.jpg: som (0.3667s)
/app/images/validation/20240209_074622_jpg.rf.1367ecac766633d291c1e298297c5fea.jpg: som (0.3668s)
/app/images/validation/20240209_093936_jpg.rf.2e1e4301a5d3e7df4de4aee6f278c23c.jpg: som (0.3665s)
/app/images/validation/20240220_101428_jpg.rf.aaba70e29b442df34864333c81f317ce.jpg: som (0.3666s)
/app/images/validation/20240220_101437_jpg.rf.6cba29955ce6a40f6be9190084bd38cd.jpg: som (0.3665s)
/app/images/validation/20240228_101455_jpg.rf.7a8999b086694135841ca8bf0ea983e4.jpg: som (0.3656s)
/app/images/validation/20240229_081302_jpg.rf.c4e7ed154cb6cdf27940a738df3f37d0.jpg: som (0.3663s)
/app/images/validation/20240229_081333_jpg.rf.3aa011c33beba0cc76d367b9a348d9dd.jpg: som (0.3668s)
/app/images/validation/20240229_114225_jpg.rf.cbe35cee56aae023d9612166bf21ac0f.jpg: som (0.3667s)
/app/images/validation/20240229_120443_jpg.rf.d22060b22ae2c200ee607b5f2971a795.jpg: som (0.3668s)
/app/images/validation/DSC_0055_JPG.rf.846f69306b8ee55d4a0e02c0289b8aff.jpg: som (0.3666s)
/app/images/validation/colibri_it11_jpg.rf.e87591ae0966de61a44aa129f5fbcf76.jpg: som (0.3668s)
/app/images/validation/colibri_it13_jpg.rf.a2744fa933e6fc5950216967114a4cd0.jpg: som (0.3666s)
/app/images/validation/colibri_it2_jpg.rf.6c28a3a84c9bcaef3545adc488866eca.jpg: som (0.3666s)
/app/images/validation/colibri_it5_jpg.rf.7f4be685d931afed8e84306d3112bef0.jpg: som (0.3666s)
/app/images/validation/colibri_it7_jpg.rf.51b04122bd907e7eb255ac1e9387b6c4.jpg: som (0.3668s)
/app/images/validation/colibri_wifi1_jpg.rf.c4d21bffe663b54682c8b2ece7fc337a.jpg: som (0.3668s)
/app/images/validation/colibri_wifi4_jpg.rf.5374189653b8a1ed91b6225899687528.jpg: som (0.3667s)
/app/images/validation/colibrinowifi5_jpg.rf.2447307ad48bf09ff77c3a1d2db22b36.jpg: som (0.3663s)
/app/images/validation/colibrinowifi7_jpg.rf.f92a43a199f8593af6175bfaffd424e8.jpg: som (0.3667s)
/app/images/validation/verdin11_jpg.rf.40e36cc5e58bac016920773d25c4402d.jpg: som (0.3664s)
/app/images/validation/verdin5_jpg.rf.6c565df4d8282284fff93a7ea1322ed2.jpg: som (0.3665s)
/app/images/validation/verdin7_jpg.rf.e1d2e776604747c5ceea49fb1b3aebe4.jpg: som (0.3664s)
Images processed: 35
Mean inference time: 0.3665829249790737
Images/s: 2.7278957416172203
Std deviation: 0.00025898585318932067
As the results above show, inference on the NPU is over ten times faster than on the CPU (0.3666 s / 0.0301 s ≈ 12x), confirming that the model is indeed running on the NPU:
CPU: Mean inference time: 0.3665829249790737
NPU: Mean inference time: 0.030091490064348494
From scratch
To run your own .tflite model, follow the steps below:
- On your host machine, create a Dockerfile as shown below. The Dockerfile downloads the necessary runtime packages from the Toradex/NXP feeds and installs them in your image.
Tip: You do not need to use Toradex's packages to train your models. Toradex recommends using the upstream TensorFlow libraries for training.
Dockerfile
ARG IMAGE_ARCH=linux/arm64
FROM --platform=$IMAGE_ARCH torizon/debian-imx8:4
ARG TF_LITE_MODEL=""
ARG IMAGE_DATASET=""
ARG WORKDIR_PATH="/app"
ARG LABELMAP_FILE=""
# variables used by example.py script
ENV DATA_DIR=${IMAGE_DATASET}
ENV MODEL=${TF_LITE_MODEL}
ENV LABELS=${LABELMAP_FILE}
ENV APP_ROOT=${WORKDIR_PATH}
ENV DEBIAN_FRONTEND=noninteractive \
PYTHONDONTWRITEBYTECODE=1 \
PYTHONUNBUFFERED=1
WORKDIR ${WORKDIR_PATH}
RUN apt-get update && \
apt-get install -y --no-install-recommends \
python3 python3-venv python3-pip \
python3-numpy python3-pil \
libtim-vx \
libtensorflow-lite2.16.2 \
tflite-vx-delegate-imx \
imx-gpu-viv-wayland \
imx-gpu-viv-wayland-dev
RUN python3 -m venv ${WORKDIR_PATH}/.venv --system-site-packages
RUN . ${WORKDIR_PATH}/.venv/bin/activate && \
pip3 install --upgrade pip && \
pip3 install --no-cache-dir tflite-runtime
COPY ${TF_LITE_MODEL} ${WORKDIR_PATH}/
COPY ${LABELMAP_FILE} ${WORKDIR_PATH}
COPY ${IMAGE_DATASET} ${WORKDIR_PATH}/${IMAGE_DATASET}/
COPY example.py ${WORKDIR_PATH}/example.py
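Note that the virtual environment is created with --system-site-packages so that the apt-installed python3-numpy and python3-pil packages remain visible inside it, while tflite-runtime is installed with pip into the venv itself.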
- In the same directory where your Dockerfile is, create an example Python script to run your .tflite model. The example below falls back to the CPU if the VX delegate is unavailable. Note that model quantization is required for acceptable performance in most cases.
example.py
#!/usr/bin/env python3
import os, glob, time

import numpy as np
from PIL import Image
import tflite_runtime.interpreter as tfl


def find_vx_delegate():
    # Look for the VX delegate library in the usual locations.
    for p in (
        os.getenv("VX_DELEGATE", "/usr/lib/libvx_delegate.so"),
        "/usr/lib/libvx_delegate.so.2",
        "/usr/lib/libvx_delegate.so.1",
    ):
        if os.path.exists(p):
            return p
    return None


def preprocess(path, size, dtype):
    # Pad the image to a square canvas, then resize to the model input size.
    img = Image.open(path).convert("RGB")
    w, h = img.size
    s = max(w, h)
    canvas = Image.new("RGB", (s, s))
    canvas.paste(img, ((s - w) // 2, (s - h) // 2))
    canvas = canvas.resize((size, size))
    x = np.asarray(canvas, dtype=dtype)
    return np.expand_dims(x, 0)


def main():
    root = os.getenv("APP_ROOT", "/app")
    data_dir = os.getenv("DATA_DIR", f"{root}/images/validation")
    model = os.getenv("MODEL", f"{root}/ssd_detect_quant_only_som.tflite")
    labels_fn = os.getenv("LABELS", f"{root}/labelmap.txt")
    limit = int(os.getenv("LIMIT", "0"))

    files = sorted(glob.glob(f"{data_dir}/*.jpg"))
    if limit > 0:
        files = files[:limit]
    if not files:
        raise SystemExit(f"no images under {data_dir}")

    labels = [l.strip() for l in open(labels_fn, "r", encoding="utf-8") if l.strip()]

    # Try to load the VX delegate (NPU); fall back to the CPU on failure.
    delegates = []
    vx = find_vx_delegate()
    if vx:
        try:
            delegates.append(tfl.load_delegate(vx))
        except Exception:
            delegates = []

    itp = (
        tfl.Interpreter(model_path=model, experimental_delegates=delegates)
        if delegates
        else tfl.Interpreter(model_path=model)
    )
    itp.allocate_tensors()

    in0 = itp.get_input_details()[0]
    out0 = itp.get_output_details()[0]
    size = int(in0["shape"][1])
    dtype = in0["dtype"]

    times = []
    for i, p in enumerate(files):
        itp.set_tensor(in0["index"], preprocess(p, size, dtype))
        t1 = time.time()
        itp.invoke()
        t2 = time.time()
        y = itp.get_tensor(out0["index"])[0]
        top = int(np.argmax(y))
        print(
            f"{p}: {labels[top] if top < len(labels) else top} ({t2-t1:.4f}s)",
            flush=True,
        )
        # Skip the first (warm-up) inference when computing statistics.
        if i != 0:
            times.append(t2 - t1)

    if times:
        a = np.array(times, dtype=np.float64)
        m = float(a.mean())
        print("\nImages processed:", len(a))
        print("Mean inference time:", m)
        print("Images/s:", (1.0 / m) if m > 0 else 0.0)
        print("Std deviation:", float(a.std()))


if __name__ == "__main__":
    main()
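The script reads its configuration from the DATA_DIR, MODEL, LABELS, and VX_DELEGATE environment variables set in the Dockerfile, plus an optional LIMIT. Later, once you are inside the running container (see the steps below), a quick smoke test over only the first few images looks like this:
# . .venv/bin/activate && LIMIT=5 python3 example.py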
- Build the image with the command below, replacing <username>/<image-name> with your Docker Hub username and an image name of your choice:
$ docker build -t <username>/<image-name> \
--build-arg TF_LITE_MODEL=<user-model-file> \
--build-arg IMAGE_DATASET=<user-images-folder> \
--build-arg LABELMAP_FILE=<user-label-file> .
$ docker push <username>/<image-name>
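If your development host is not an Arm machine, the image must be built for linux/arm64 under emulation or via cross-building. One possible approach, assuming Docker buildx is set up on your host, is:
$ docker buildx build --platform linux/arm64 \
    -t <username>/<image-name> \
    --build-arg TF_LITE_MODEL=<user-model-file> \
    --build-arg IMAGE_DATASET=<user-images-folder> \
    --build-arg LABELMAP_FILE=<user-label-file> \
    --push .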
- Run the container on your target device (VX_DELEGATE tells the script where to find the delegate library):
# docker run -it \
-v /dev:/dev \
-e VX_DELEGATE=/usr/lib/libvx_delegate.so \
--device-cgroup-rule "c 199:0 rmw" \
<username>/<image-name>
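Optionally, before running the script, confirm from inside the container that the delegate library is present and that the NPU device node (typically /dev/galcore on i.MX 8M Plus, matching the c 199:0 cgroup rule) is visible:
# ls -l /usr/lib/libvx_delegate.so*
# ls -l /dev/galcore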
- Inside the container on your target, run the example script:
# . .venv/bin/activate && python3 example.py
Expected output:
images/validation/colibri_it7_jpg.rf.51b04122bd907e7eb255ac1e9387b6c4.jpg: 663 (0.0034s)
INFO: Delegate::Invoke node: 0x35921270
INFO: Copying input 88: input
INFO: Invoking graph
INFO: Copying output 87, MobilenetV1/Predictions/Reshape_1
images/validation/colibri_wifi4_jpg.rf.5374189653b8a1ed91b6225899687528.jpg: 845 (0.0037s)
INFO: Delegate::Invoke node: 0x35921270
INFO: Copying input 88: input
INFO: Invoking graph
INFO: Copying output 87, MobilenetV1/Predictions/Reshape_1
images/validation/colibrinowifi7_jpg.rf.f92a43a199f8593af6175bfaffd424e8.jpg: 482 (0.0033s)
INFO: Delegate::Invoke node: 0x35921270
INFO: Copying input 88: input
INFO: Invoking graph
INFO: Copying output 87, MobilenetV1/Predictions/Reshape_1
images/validation/verdin11_jpg.rf.40e36cc5e58bac016920773d25c4402d.jpg: 663 (0.0034s)
INFO: Delegate::Invoke node: 0x35921270
INFO: Copying input 88: input
INFO: Invoking graph
INFO: Copying output 87, MobilenetV1/Predictions/Reshape_1
images/validation/verdin5_jpg.rf.6c565df4d8282284fff93a7ea1322ed2.jpg: 663 (0.0034s)
Images processed: 34
Mean inference time: 0.0036329311483046588
Images/s: 275.2598271692155
Std deviation: 0.0003194358571149672
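If the output instead shows INFO: Created TensorFlow Lite XNNPACK delegate for CPU. and inference times close to the CPU numbers from the previous section, the VX delegate was not loaded and the script silently fell back to the CPU. In that case, check that libvx_delegate.so exists in the container and that the container was started with the --device-cgroup-rule shown above.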