Run TensorFlow Lite with the NPU on i.MX 8M Plus Based Modules on Torizon OS
Introduction
This article explains how to run TensorFlow Lite with the NPU (VeriSilicon VX delegate) on i.MX 8M Plus based Toradex System on Modules (SoMs) running Torizon OS. It covers a basic setup for using the NPU inside Torizon containers through libvx_delegate.so, with a CPU fallback.
Requirements
Hardware requirements:
- An i.MX 8M Plus based Toradex module with an NPU
Software requirements:
- Torizon OS installed on the module, with NPU/VX support.
- A configured build environment, as described in the Configure Build Environment for Torizon Containers article.
- A quantized .tflite model supported by the VX delegate (a CPU fallback remains available; see the quantization sketch below)
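If your model is not yet quantized, post-training quantization with the upstream TensorFlow converter is one way to produce a VX-friendly model. The sketch below is a minimal example, not a definitive recipe: the SavedModel path, input shape, and the random representative dataset are placeholders you must replace with your own.

import numpy as np
import tensorflow as tf

# Placeholder path: point this at your own SavedModel directory.
converter = tf.lite.TFLiteConverter.from_saved_model("my_saved_model")
converter.optimizations = [tf.lite.Optimize.DEFAULT]

def representative_dataset():
    # Placeholder: yield ~100 real, preprocessed samples with the
    # model's input shape instead of random data.
    for _ in range(100):
        yield [np.random.rand(1, 300, 300, 3).astype(np.float32)]

converter.representative_dataset = representative_dataset
# Full-integer quantization tends to map best onto the NPU.
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.uint8
converter.inference_output_type = tf.uint8

with open("model_quant.tflite", "wb") as f:
    f.write(converter.convert())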
Run TensorFlow Lite on the NPU
From Torizon example
To run using an example model and dataset from Torizon:
- On your target, run the following docker run command (the --device-cgroup-rule grants the container access to the GPU/NPU device node):
docker run --rm \
--name "tflite-example" \
-v /dev:/dev \
-v /tmp:/tmp \
--device-cgroup-rule "c 199:0 rmw" \
torizon/tensorflow-lite-imx8:4
- After the run, you should expect results similar to the following:
- Running using the NPU:
INFO: Copying output 362, tfl.dequantize
INFO: Copying output 380, convert_scores1
/app/images/validation/colibri_wifi4_jpg.rf.5374189653b8a1ed91b6225899687528.jpg: som (0.0299s)
INFO: Delegate::Invoke node: 0x9959a00
INFO: Copying input 0: serving_default_input:0
INFO: Invoking graph
INFO: Copying output 362, tfl.dequantize
INFO: Copying output 380, convert_scores1
/app/images/validation/colibrinowifi5_jpg.rf.2447307ad48bf09ff77c3a1d2db22b36.jpg: som (0.0307s)
INFO: Delegate::Invoke node: 0x9959a00
INFO: Copying input 0: serving_default_input:0
INFO: Invoking graph
INFO: Copying output 362, tfl.dequantize
INFO: Copying output 380, convert_scores1
/app/images/validation/colibrinowifi7_jpg.rf.f92a43a199f8593af6175bfaffd424e8.jpg: som (0.0301s)
INFO: Delegate::Invoke node: 0x9959a00
INFO: Copying input 0: serving_default_input:0
INFO: Invoking graph
INFO: Copying output 362, tfl.dequantize
INFO: Copying output 380, convert_scores1
/app/images/validation/verdin11_jpg.rf.40e36cc5e58bac016920773d25c4402d.jpg: som (0.0296s)
INFO: Delegate::Invoke node: 0x9959a00
INFO: Copying input 0: serving_default_input:0
INFO: Invoking graph
INFO: Copying output 362, tfl.dequantize
INFO: Copying output 380, convert_scores1
/app/images/validation/verdin5_jpg.rf.6c565df4d8282284fff93a7ea1322ed2.jpg: som (0.0300s)
INFO: Delegate::Invoke node: 0x9959a00
INFO: Copying input 0: serving_default_input:0
INFO: Invoking graph
INFO: Copying output 362, tfl.dequantize
INFO: Copying output 380, convert_scores1
/app/images/validation/verdin7_jpg.rf.e1d2e776604747c5ceea49fb1b3aebe4.jpg: som (0.0302s)
Images processed: 35
Mean inference time: 0.030091490064348494
Images/s: 33.23198677970322
Std deviation: 0.00035271071249791896
- Running using the CPU:
INFO: Created TensorFlow Lite XNNPACK delegate for CPU.
/app/images/validation/20170202_130713_jpg.rf.02141be2c7a33a178eb732e60d931b08.jpg: som (0.3679s)
/app/images/validation/20170202_130748_jpg.rf.59ba7f77331c66f87e1f84d681ae24af.jpg: som (0.3665s)
/app/images/validation/20170427_091556_jpg.rf.04de51bcf68c7dcaf9cef819dfb77f96.jpg: som (0.3668s)
/app/images/validation/20171215_120124_jpg.rf.42ab3a816917acbbc52f0085baca0ebc.jpg: som (0.3661s)
/app/images/validation/20180207_110215_jpg.rf.545bf8c1a4f8879453e8d50433a98186.jpg: som (0.3666s)
/app/images/validation/20200124_092137_jpg.rf.6fd46b3b2342b41af2731e965e096eec.jpg: som (0.3664s)
/app/images/validation/20211003_124958_jpg.rf.247d395827b04d847f6d1dd72763e72c.jpg: som (0.3666s)
/app/images/validation/20240208_180942_jpg.rf.50bfcc530043756d1afe42e34fae1899.jpg: som (0.3669s)
/app/images/validation/20240208_181002_jpg.rf.8daf107e086842d887652ad65e85d845.jpg: som (0.3664s)
/app/images/validation/20240208_181018_jpg.rf.d81032846504ab42fcd07caea1457f3d.jpg: som (0.3669s)
/app/images/validation/20240209_074516_jpg.rf.f63a60a8423ea7faa6d4cfb70a2fbcb4.jpg: som (0.3665s)
/app/images/validation/20240209_074532_jpg.rf.6563ef4a988952c3c6030120a6379827.jpg: som (0.3664s)
/app/images/validation/20240209_074541_jpg.rf.3125e79e48dc249c228c3dc78b601338.jpg: som (0.3669s)
/app/images/validation/20240209_074559_jpg.rf.63413dac635d4525c4a568dee4dc5bc0.jpg: som (0.3667s)
/app/images/validation/20240209_074622_jpg.rf.1367ecac766633d291c1e298297c5fea.jpg: som (0.3668s)
/app/images/validation/20240209_093936_jpg.rf.2e1e4301a5d3e7df4de4aee6f278c23c.jpg: som (0.3665s)
/app/images/validation/20240220_101428_jpg.rf.aaba70e29b442df34864333c81f317ce.jpg: som (0.3666s)
/app/images/validation/20240220_101437_jpg.rf.6cba29955ce6a40f6be9190084bd38cd.jpg: som (0.3665s)
/app/images/validation/20240228_101455_jpg.rf.7a8999b086694135841ca8bf0ea983e4.jpg: som (0.3656s)
/app/images/validation/20240229_081302_jpg.rf.c4e7ed154cb6cdf27940a738df3f37d0.jpg: som (0.3663s)
/app/images/validation/20240229_081333_jpg.rf.3aa011c33beba0cc76d367b9a348d9dd.jpg: som (0.3668s)
/app/images/validation/20240229_114225_jpg.rf.cbe35cee56aae023d9612166bf21ac0f.jpg: som (0.3667s)
/app/images/validation/20240229_120443_jpg.rf.d22060b22ae2c200ee607b5f2971a795.jpg: som (0.3668s)
/app/images/validation/DSC_0055_JPG.rf.846f69306b8ee55d4a0e02c0289b8aff.jpg: som (0.3666s)
/app/images/validation/colibri_it11_jpg.rf.e87591ae0966de61a44aa129f5fbcf76.jpg: som (0.3668s)
/app/images/validation/colibri_it13_jpg.rf.a2744fa933e6fc5950216967114a4cd0.jpg: som (0.3666s)
/app/images/validation/colibri_it2_jpg.rf.6c28a3a84c9bcaef3545adc488866eca.jpg: som (0.3666s)
/app/images/validation/colibri_it5_jpg.rf.7f4be685d931afed8e84306d3112bef0.jpg: som (0.3666s)
/app/images/validation/colibri_it7_jpg.rf.51b04122bd907e7eb255ac1e9387b6c4.jpg: som (0.3668s)
/app/images/validation/colibri_wifi1_jpg.rf.c4d21bffe663b54682c8b2ece7fc337a.jpg: som (0.3668s)
/app/images/validation/colibri_wifi4_jpg.rf.5374189653b8a1ed91b6225899687528.jpg: som (0.3667s)
/app/images/validation/colibrinowifi5_jpg.rf.2447307ad48bf09ff77c3a1d2db22b36.jpg: som (0.3663s)
/app/images/validation/colibrinowifi7_jpg.rf.f92a43a199f8593af6175bfaffd424e8.jpg: som (0.3667s)
/app/images/validation/verdin11_jpg.rf.40e36cc5e58bac016920773d25c4402d.jpg: som (0.3664s)
/app/images/validation/verdin5_jpg.rf.6c565df4d8282284fff93a7ea1322ed2.jpg: som (0.3665s)
/app/images/validation/verdin7_jpg.rf.e1d2e776604747c5ceea49fb1b3aebe4.jpg: som (0.3664s)
Images processed: 35
Mean inference time: 0.3665829249790737
Images/s: 2.7278957416172203
Std deviation: 0.00025898585318932067
As the results above show, inference on the NPU is over ten times faster than on the CPU (0.3666 s / 0.0301 s ≈ 12x), confirming that the model is indeed running on the NPU:
CPU: Mean inference time: 0.3665829249790737
NPU: Mean inference time: 0.030091490064348494
From scratch
To run your own .tflite model, follow the steps below:
- On your host machine, create a Dockerfile as shown below. The Dockerfile downloads the necessary runtime packages from the Toradex/NXP feeds and installs them in your image.
Tip: You do not need to use Toradex's packages to train your models. Toradex recommends using the upstream TensorFlow libraries for training.
Dockerfile
ARG IMAGE_ARCH=linux/arm64
FROM --platform=$IMAGE_ARCH torizon/debian-imx8:4
ARG TF_LITE_MODEL=""
ARG IMAGE_DATASET=""
ARG WORKDIR_PATH="/app"
ARG LABELMAP_FILE=""
# variables used by example.py script
ENV DATA_DIR=${IMAGE_DATASET}
ENV MODEL=${TF_LITE_MODEL}
ENV LABELS=${LABELMAP_FILE}
ENV APP_ROOT=${WORKDIR_PATH}
ENV DEBIAN_FRONTEND=noninteractive \
PYTHONDONTWRITEBYTECODE=1 \
PYTHONUNBUFFERED=1
WORKDIR ${WORKDIR_PATH}
RUN apt-get update && \
apt-get install -y --no-install-recommends \
python3 python3-venv python3-pip \
python3-numpy python3-pil \
libtim-vx \
libtensorflow-lite2.16.2 \
tflite-vx-delegate-imx \
imx-gpu-viv-wayland \
imx-gpu-viv-wayland-dev
RUN python3 -m venv ${WORKDIR_PATH}/.venv --system-site-packages
RUN . ${WORKDIR_PATH}/.venv/bin/activate && \
pip3 install --upgrade pip && \
pip3 install --no-cache-dir tflite-runtime
COPY ${TF_LITE_MODEL} ${WORKDIR_PATH}/
COPY ${LABELMAP_FILE} ${WORKDIR_PATH}
COPY ${IMAGE_DATASET} ${WORKDIR_PATH}/${IMAGE_DATASET}/
COPY example.py ${WORKDIR_PATH}/example.py
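Note that the virtual environment is created with --system-site-packages so that the apt-installed python3-numpy and python3-pil packages remain visible inside it, while tflite-runtime is installed with pip into the venv itself.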
- In the same directory where your Dockerfile is, create an example Python script to run your .tflite model. The example below falls back to the CPU if the VX delegate is unavailable. Note that model quantization is required for acceptable performance in most cases.
example.py
#!/usr/bin/env python3
import os, glob, time

import numpy as np
from PIL import Image
import tflite_runtime.interpreter as tfl


def find_vx_delegate():
    # Look for the VX delegate library in the usual locations.
    for p in (
        os.getenv("VX_DELEGATE", "/usr/lib/libvx_delegate.so"),
        "/usr/lib/libvx_delegate.so.2",
        "/usr/lib/libvx_delegate.so.1",
    ):
        if os.path.exists(p):
            return p
    return None


def preprocess(path, size, dtype):
    # Pad the image to a square canvas, then resize to the model input size.
    img = Image.open(path).convert("RGB")
    w, h = img.size
    s = max(w, h)
    canvas = Image.new("RGB", (s, s))
    canvas.paste(img, ((s - w) // 2, (s - h) // 2))
    canvas = canvas.resize((size, size))
    x = np.asarray(canvas, dtype=dtype)
    return np.expand_dims(x, 0)


def main():
    root = os.getenv("APP_ROOT", "/app")
    data_dir = os.getenv("DATA_DIR", f"{root}/images/validation")
    model = os.getenv("MODEL", f"{root}/ssd_detect_quant_only_som.tflite")
    labels_fn = os.getenv("LABELS", f"{root}/labelmap.txt")
    limit = int(os.getenv("LIMIT", "0"))

    files = sorted(glob.glob(f"{data_dir}/*.jpg"))
    if limit > 0:
        files = files[:limit]
    if not files:
        raise SystemExit(f"no images under {data_dir}")

    labels = [l.strip() for l in open(labels_fn, "r", encoding="utf-8") if l.strip()]

    # Try to load the VX delegate (NPU); fall back to the CPU on failure.
    delegates = []
    vx = find_vx_delegate()
    if vx:
        try:
            delegates.append(tfl.load_delegate(vx))
        except Exception:
            delegates = []

    itp = (
        tfl.Interpreter(model_path=model, experimental_delegates=delegates)
        if delegates
        else tfl.Interpreter(model_path=model)
    )
    itp.allocate_tensors()

    in0 = itp.get_input_details()[0]
    out0 = itp.get_output_details()[0]
    size = int(in0["shape"][1])
    dtype = in0["dtype"]

    times = []
    for i, p in enumerate(files):
        itp.set_tensor(in0["index"], preprocess(p, size, dtype))
        t1 = time.time()
        itp.invoke()
        t2 = time.time()
        y = itp.get_tensor(out0["index"])[0]
        top = int(np.argmax(y))
        print(
            f"{p}: {labels[top] if top < len(labels) else top} ({t2-t1:.4f}s)",
            flush=True,
        )
        # Skip the first (warm-up) inference when computing statistics.
        if i != 0:
            times.append(t2 - t1)

    if times:
        a = np.array(times, dtype=np.float64)
        m = float(a.mean())
        print("\nImages processed:", len(a))
        print("Mean inference time:", m)
        print("Images/s:", (1.0 / m) if m > 0 else 0.0)
        print("Std deviation:", float(a.std()))


if __name__ == "__main__":
    main()
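The script reads its configuration from the DATA_DIR, MODEL, LABELS, and VX_DELEGATE environment variables set in the Dockerfile, plus an optional LIMIT. Later, once you are inside the running container (see the steps below), a quick smoke test over only the first few images looks like this:
# . .venv/bin/activate && LIMIT=5 python3 example.py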
- Build the image with the command below, replacing <username>/<image-name> with your Docker Hub username and an image name of your choice:
$ docker build -t <username>/<image-name> \
--build-arg TF_LITE_MODEL=<user-model-file> \
--build-arg IMAGE_DATASET=<user-images-folder> \
--build-arg LABELMAP_FILE=<user-label-file> .
$ docker push <username>/<image-name>
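If your development host is not an Arm machine, the image must be built for linux/arm64 under emulation or via cross-building. One possible approach, assuming Docker buildx is set up on your host, is:
$ docker buildx build --platform linux/arm64 \
    -t <username>/<image-name> \
    --build-arg TF_LITE_MODEL=<user-model-file> \
    --build-arg IMAGE_DATASET=<user-images-folder> \
    --build-arg LABELMAP_FILE=<user-label-file> \
    --push .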
- Run the container on your target device (VX_DELEGATE tells the script where to find the delegate library):
# docker run -it \
-v /dev:/dev \
-e VX_DELEGATE=/usr/lib/libvx_delegate.so \
--device-cgroup-rule "c 199:0 rmw" \
<username>/<image-name>
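Optionally, before running the script, confirm from inside the container that the delegate library is present and that the NPU device node (typically /dev/galcore on i.MX 8M Plus, matching the c 199:0 cgroup rule) is visible:
# ls -l /usr/lib/libvx_delegate.so*
# ls -l /dev/galcore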
- Inside the container on your target, run the example script:
# . .venv/bin/activate && python3 example.py
Expected output:
images/validation/colibri_it7_jpg.rf.51b04122bd907e7eb255ac1e9387b6c4.jpg: 663 (0.0034s)
INFO: Delegate::Invoke node: 0x35921270
INFO: Copying input 88: input
INFO: Invoking graph
INFO: Copying output 87, MobilenetV1/Predictions/Reshape_1
images/validation/colibri_wifi4_jpg.rf.5374189653b8a1ed91b6225899687528.jpg: 845 (0.0037s)
INFO: Delegate::Invoke node: 0x35921270
INFO: Copying input 88: input
INFO: Invoking graph
INFO: Copying output 87, MobilenetV1/Predictions/Reshape_1
images/validation/colibrinowifi7_jpg.rf.f92a43a199f8593af6175bfaffd424e8.jpg: 482 (0.0033s)
INFO: Delegate::Invoke node: 0x35921270
INFO: Copying input 88: input
INFO: Invoking graph
INFO: Copying output 87, MobilenetV1/Predictions/Reshape_1
images/validation/verdin11_jpg.rf.40e36cc5e58bac016920773d25c4402d.jpg: 663 (0.0034s)
INFO: Delegate::Invoke node: 0x35921270
INFO: Copying input 88: input
INFO: Invoking graph
INFO: Copying output 87, MobilenetV1/Predictions/Reshape_1
images/validation/verdin5_jpg.rf.6c565df4d8282284fff93a7ea1322ed2.jpg: 663 (0.0034s)
Images processed: 34
Mean inference time: 0.0036329311483046588
Images/s: 275.2598271692155
Std deviation: 0.0003194358571149672
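If the output instead shows INFO: Created TensorFlow Lite XNNPACK delegate for CPU. and inference times close to the CPU numbers from the previous section, the VX delegate was not loaded and the script silently fell back to the CPU. In that case, check that libvx_delegate.so exists in the container and that the container was started with the --device-cgroup-rule shown above.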