How to use OpenCL 1.2 in iMX8 on Torizon
Introduction
Torizon features a container runtime, Debian container images, and a deb package repository that greatly ease the development process for embedded applications.
In this article, we show how to install and test the OpenCL libraries optimized for the GPUs on NXP i.MX8/8X/8M Plus SoCs, so that you can integrate them into your application. We also obtain, build, and run OpenCL info and benchmarking tools to check that the software is set up correctly and to observe the GPU performance.
Prerequisites
- A Toradex Apalis iMX8, Colibri iMX8X or Verdin iMX8M Plus SoM with Torizon installed.
- Basic knowledge of containers.
- Toradex provides a list of related articles
- You can also refer to the Docker documentation.
Demo explained
The full Dockerfile implementation is available in our samples repository on GitHub. In this section, some key aspects of it are explained.
Dockerfile
Toradex provides Debian Containers for Torizon, which come with a Toradex-specific Debian package feed where you can find packages for the GPU that are not available in the Debian community feeds.
You can choose the container variant that best suits your project's needs. For example, the sample is configured to use the Debian base image, the smallest container variant we provide, which is well suited to headless applications. A comment in the Dockerfile shows how to switch to a variant that comes with the Weston compositor installed, recommended if you want to run OpenCL together with a GUI application.
The container image is built in two stages:
- Build: the stage that cross-compiles the clpeak application. The build-time dependencies, such as CMake and the OpenCL headers, are installed here.
- Deploy: the stage where the final binary, cross-compiled in the Build stage, is installed together with its runtime dependencies, such as OpenCL and other libraries. It also installs the clinfo application provided by the Debian feeds.
clpeak
clpeak is a benchmarking tool to measure the peak capabilities of OpenCL devices. It is cross-compiled from source, as explained in the previous section.
clinfo
clinfo is a debugging tool that prints OpenCL information for all available platforms and devices. It is installed from the Debian feeds.
How to use
This section explains how to use the sample container, including how to build it and deploy it to the board. For more detailed instructions and deployment alternatives, read the article Deploying Container Images to Torizon OS.
Build and deploy
Clone the source from our samples GitHub repository into your PC:
$ git clone -b bookworm https://github.com/toradex/torizon-samples.git
$ cd torizon-samples/opencl
On the host PC, inside the opencl directory, which contains the Dockerfile, build the image:
$ docker build -t <your-dockerhub-username>/opencl-image .
After the build, push the image to your Docker Hub account:
$ docker push <your-dockerhub-username>/opencl-image
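The build and push steps can be wrapped in a small script. The following is a minimal sketch, not part of the sample: the names DOCKERHUB_USER, DRY_RUN, image_ref, and run are hypothetical, introduced here only for illustration. By default it only prints the commands; it runs them when DRY_RUN=0.

```shell
#!/bin/sh
# Sketch of the host-side steps. DOCKERHUB_USER, DRY_RUN, image_ref and run
# are hypothetical names introduced for illustration only.
set -e

image_ref() {
    # Compose "<user>/opencl-image" from a Docker Hub username.
    printf '%s/opencl-image\n' "$1"
}

run() {
    # Print the command; execute it only when DRY_RUN=0.
    printf '+ %s\n' "$*"
    if [ "${DRY_RUN:-1}" = "0" ]; then
        "$@"
    fi
}

IMAGE=$(image_ref "${DOCKERHUB_USER:-your-dockerhub-username}")
run docker build -t "$IMAGE" .
run docker push "$IMAGE"
```

Because of set -e, a failed build stops the script before the push is attempted.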
Pull the image from your Docker Hub account to the board. These instructions assume that the Docker Hub credentials are already set up on the board. If you have not set up your credentials yet, execute docker login first.
In the terminal of your board:
# docker pull <your-dockerhub-username>/opencl-image
Run clpeak
The clpeak application is configured as the entrypoint. In other words, it is the command run by default when you start the container. To run it, execute the following command:
Note that by executing the following line you accept the terms and conditions of NXP's End-User License Agreement (EULA).
# docker run -e ACCEPT_FSL_EULA=1 -it --rm --name=clpeak-container --device /dev/galcore:/dev/galcore <your-dockerhub-username>/opencl-image
The above command gives the container access to only the GPU device /dev/galcore from the host system. If you want to run a graphical application alongside it, you will need to grant additional permissions. You can find them in our Debian With Weston Wayland Compositor example, and you can learn more about granting hardware access and other types of permission in the Torizon Best Practices Guide.
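Since the container depends on /dev/galcore being passed through, it can be useful to confirm on the board that the device node exists before starting the container. The following is a minimal sketch; has_char_device is a hypothetical helper name, and the check assumes the GPU driver exposes /dev/galcore as a character device.

```shell
#!/bin/sh
# Sketch: confirm the GPU device node exists before starting the container.
# has_char_device is a hypothetical helper name used only in this example.
has_char_device() {
    # -c is true only when the path exists and is a character device node.
    [ -c "$1" ]
}

if has_char_device /dev/galcore; then
    echo "/dev/galcore found"
else
    echo "/dev/galcore not found; the --device option would fail" >&2
fi
```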
This is the expected output from an Apalis iMX8 board:
Platform: Vivante OpenCL Platform
Device: Vivante OpenCL Device GC7000XSVX.6009.0000
Driver version : OpenCL 1.2 V6.2.4.p4.190076 (Linux ARM64)
Compute units : 1
Clock frequency : 996 MHz
Global memory bandwidth (GBPS)
float : 5.81
float2 : 9.74
float4 : 10.63
float8 : 9.36
float16 : 8.00
Single-precision compute (GFLOPS)
float : 14.14
float2 : 28.18
float4 : 55.87
float8 : 62.15
float16 : 61.45
No half precision support! Skipped
No double precision support! Skipped
Integer compute (GIOPS)
int : 14.13
int2 : 14.09
int4 : 15.84
int8 : 15.73
int16 : 14.54
Transfer bandwidth (GBPS)
enqueueWriteBuffer : 1.43
enqueueReadBuffer : 0.08
enqueueMapBuffer(for read) : 301.68
memcpy from mapped ptr : 0.08
enqueueUnmap(after write) : 269.08
memcpy to mapped ptr : 1.43
Kernel launch latency : 97.56 us
Device: Vivante OpenCL Device GC7000XSVX.6009.0000
Driver version : OpenCL 1.2 V6.2.4.p4.190076 (Linux ARM64)
Compute units : 1
Clock frequency : 996 MHz
Global memory bandwidth (GBPS)
float : 5.59
float2 : 9.38
float4 : 10.33
float8 : 9.15
float16 : 7.85
Single-precision compute (GFLOPS)
float : 14.14
float2 : 28.18
float4 : 55.87
float8 : 62.15
float16 : 61.45
No half precision support! Skipped
No double precision support! Skipped
Integer compute (GIOPS)
int : 14.13
int2 : 14.09
int4 : 15.84
int8 : 15.73
int16 : 14.54
Transfer bandwidth (GBPS)
enqueueWriteBuffer : 1.42
enqueueReadBuffer : 0.08
enqueueMapBuffer(for read) : 238.44
memcpy from mapped ptr : 0.08
enqueueUnmap(after write) : 207.03
memcpy to mapped ptr : 1.44
Kernel launch latency : 126.82 us
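To compare runs, individual values can be extracted from saved clpeak output. The following is a minimal sketch that reads clpeak text on standard input and prints the kernel launch latency figures; clpeak_latency is a hypothetical helper name, and the sample lines are copied from the output above.

```shell
#!/bin/sh
# Sketch: pull the kernel launch latency values out of saved clpeak output.
# clpeak_latency is a hypothetical helper; it reads clpeak text on stdin.
clpeak_latency() {
    # Split each matching line on ":" and print the number before the unit.
    awk -F':' '/Kernel launch latency/ { split($2, f, " "); print f[1] }'
}

# Sample lines copied from the Apalis iMX8 output above (one per device).
clpeak_latency <<'EOF'
Kernel launch latency : 97.56 us
Kernel launch latency : 126.82 us
EOF
```

The same pattern works for other single-value lines, as long as the label in the awk match is adjusted.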
Run clinfo
To run the clinfo application, you must set it as the entrypoint in the docker run command. If you ran clpeak in the previous section, only a small change is required:
Note that by executing the following line you accept the terms and conditions of NXP's End-User License Agreement (EULA).
# docker run -e ACCEPT_FSL_EULA=1 -it --rm --name=clpeak-container --device /dev/galcore:/dev/galcore --entrypoint=clinfo <your-dockerhub-username>/opencl-image
This is the expected output from a Colibri iMX8X board:
clinfo: /usr/lib/aarch64-linux-gnu/libOpenCL.so.1: no version information available (required by clinfo)
Number of platforms 1
Platform Name Vivante OpenCL Platform
Platform Vendor Vivante Corporation
Platform Version OpenCL 1.2 V6.4.0.p2.234062
Platform Profile FULL_PROFILE
Platform Extensions cl_khr_icd
Platform Extensions function suffix viv
Platform Name Vivante OpenCL Platform
Number of devices 1
Device Name Vivante OpenCL Device GC7000L.6214.0000
Device Vendor Vivante Corporation
Device Vendor ID 0x564956
Device Version OpenCL 1.2
Driver Version OpenCL 1.2 V6.4.0.p2.234062
Device OpenCL C Version OpenCL C 1.2
Device Type GPU
Device Profile FULL_PROFILE
Device Available Yes
Compiler Available Yes
Linker Available Yes
Max compute units 1
Max clock frequency 850MHz
Device Partition (core)
Max number of sub-devices 0
Supported partition types (n/a)
Supported affinity domains (n/a)
Max work item dimensions 3
Max work item sizes 1024x1024x1024
Max work group size 1024
Preferred work group size multiple (kernel) 16
Preferred / native vector sizes
char 4 / 4
short 4 / 4
int 4 / 4
long 4 / 4
half 0 / 0 (cl_khr_fp16)
float 4 / 4
double 0 / 0 (n/a)
Half-precision Floating-point support <printDeviceInfo:86: get CL_DEVICE_HALF_FP_CONFIG : error -30>
Single-precision Floating-point support (core)
Denormals No
Infinity and NANs Yes
Round to nearest Yes
Round to zero Yes
Round to infinity No
IEEE754-2008 fused multiply-add No
Support is emulated in software No
Correctly-rounded divide and sqrt operations No
Double-precision Floating-point support (n/a)
Address bits 32, Little-Endian
Global memory size 268435456 (256MiB)
Error Correction support Yes
Max memory allocation 134217728 (128MiB)
Unified memory for Host and Device Yes
Minimum alignment for any data type 128 bytes
Alignment of base address 2048 bits (256 bytes)
Global Memory cache type Read/Write
Global Memory cache size 8192 (8KiB)
Global Memory cache line size 64 bytes
Image support Yes
Max number of samplers per kernel 16
Max size for 1D images from buffer 65536 pixels
Max 1D or 2D image array size 8192 images
Max 2D image size 8192x8192 pixels
Max 3D image size 8192x8192x8192 pixels
Max number of read image args 128
Max number of write image args 8
Local memory type Global
Local memory size 32768 (32KiB)
Max number of constant args 9
Max constant buffer size 65536 (64KiB)
Max size of kernel argument 1024
Queue properties
Out-of-order execution Yes
Profiling Yes
Prefer user sync for interop Yes
Profiling timer resolution 1000ns
Execution capabilities
Run OpenCL kernels Yes
Run native kernels No
printf() buffer size 1048576 (1024KiB)
Built-in kernels (n/a)
Device Extensions cl_khr_byte_addressable_store cl_khr_gl_sharing cl_khr_fp16 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics
NULL platform behavior
clGetPlatformInfo(NULL, CL_PLATFORM_NAME, ...) No platform
clGetDeviceIDs(NULL, CL_DEVICE_TYPE_ALL, ...) No platform
clCreateContext(NULL, ...) [default] No platform
clCreateContext(NULL, ...) [other] Success [viv]
clCreateContextFromType(NULL, CL_DEVICE_TYPE_DEFAULT) Success (1)
Platform Name Vivante OpenCL Platform
Device Name Vivante OpenCL Device GC7000L.6214.0000
clCreateContextFromType(NULL, CL_DEVICE_TYPE_CPU) No devices found in platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_GPU) Success (1)
Platform Name Vivante OpenCL Platform
Device Name Vivante OpenCL Device GC7000L.6214.0000
clCreateContextFromType(NULL, CL_DEVICE_TYPE_ACCELERATOR) No devices found in platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_CUSTOM) No devices found in platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_ALL) Success (1)
Platform Name Vivante OpenCL Platform
Device Name Vivante OpenCL Device GC7000L.6214.0000
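The clinfo output is long, so a quick way to confirm which GPU was detected is to filter it for the device name. The following is a minimal sketch that reads saved clinfo text on standard input; clinfo_device_names is a hypothetical helper name, and the sample lines are taken from the output above.

```shell
#!/bin/sh
# Sketch: list the device names reported in saved clinfo output.
# clinfo_device_names is a hypothetical helper; it reads clinfo text on stdin.
clinfo_device_names() {
    # Strip the "Device Name" label and surrounding spaces, then de-duplicate.
    sed -n 's/^[[:space:]]*Device Name[[:space:]]*//p' | sort -u
}

# Sample lines taken from the Colibri iMX8X output above.
clinfo_device_names <<'EOF'
Device Name Vivante OpenCL Device GC7000L.6214.0000
Device Vendor Vivante Corporation
Device Name Vivante OpenCL Device GC7000L.6214.0000
EOF
```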