Version: Torizon OS 6.x.y

How to use OpenCL 1.2 in iMX8 on Torizon

Introduction

Torizon features a container runtime, Debian container images, and a deb package repository that together greatly ease the development of embedded applications.

This article shows how to install and test the OpenCL libraries optimized for the GPUs on NXP i.MX8/8X/8M Plus SoCs so that you can integrate them into your application. It also shows how to obtain, build, and run OpenCL information and benchmarking tools to check that the software is set up correctly and to observe the GPU performance.

Prerequisites

A Toradex Apalis iMX8, Colibri iMX8X or Verdin iMX8M Plus SoM with Torizon installed.
Basic knowledge of containers.
- Toradex provides a list of related articles
- You can also refer to the Docker documentation.

Demo explained

The full Dockerfile implementation is available in our samples repository on GitHub. In this section, some key aspects of it are explained.

Dockerfile

Toradex provides Debian Containers for Torizon, which come with a Toradex-specific Debian package feed containing GPU packages that are not available in the Debian community feeds.

You can choose the container image that best suits your project's needs. For example, the sample is configured to use the Debian base image, the smallest container variant we provide, which is well suited to headless applications. A comment in the Dockerfile shows how to switch to a variant that comes with the Weston compositor installed, recommended if you want to run OpenCL together with a GUI application.

The container image is built in two stages:

Build: the stage that cross-compiles the clpeak application. The build-time dependencies, such as CMake and the OpenCL headers, are installed here.
Deploy: the stage where the final binary, cross-compiled in the Build stage, is installed together with its runtime dependencies, such as OpenCL and other libraries. It also installs the clinfo application provided by the Debian feeds.

clpeak

clpeak is a benchmarking tool to measure the peak capabilities of OpenCL devices. It is cross-compiled from source, as explained in the previous section.

clinfo

clinfo is a debugging tool that prints OpenCL information for all available platforms and devices. It is installed from the Debian feeds.

How to use

This section explains how to use the sample container, including how to build it and deploy it to the board. For more detailed instructions and deployment alternatives, read the article Deploying Container Images to Torizon OS.

Build and deploy

Clone the source from our samples GitHub repository onto your PC:

$ git clone -b bookworm https://github.com/toradex/torizon-samples.git
$ cd torizon-samples/opencl

Inside the opencl directory that contains the Dockerfile, on the host PC, build the image:

$ docker build -t <your-dockerhub-username>/opencl-image .

After the build, push the image to your Docker Hub account:

$ docker push <your-dockerhub-username>/opencl-image

Pull it from your Docker Hub account to the board. In the terminal of your board:

caution

These instructions assume that the Docker Hub credentials are already set up on the board. If you have not set up your credentials yet, execute docker login.

# docker pull <your-dockerhub-username>/opencl-image

Run clpeak

The clpeak application is configured as the entrypoint. In other words, it is the command run by default when you start the container. To run it, execute the following command:

danger

Note that by executing the following command, you accept the terms and conditions of NXP's End-User License Agreement (EULA).

# docker run -e ACCEPT_FSL_EULA=1 -it --rm --name=clpeak-container --device /dev/galcore:/dev/galcore <your-dockerhub-username>/opencl-image

The above command gives the container access only to the GPU device /dev/galcore from the host system. If you want to run a graphical application alongside it, you will need to grant additional permissions. You can find them in our Debian With Weston Wayland Compositor example, and you can learn more about granting hardware access and other types of permissions in the Torizon Best Practices Guide.
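
Because the container only sees the GPU through /dev/galcore, it can help to confirm that the device node exists on the board before starting the container. The following is a minimal sketch, not part of the Toradex sample: the function name gpu_node_available is hypothetical, and it assumes the node is a character device, as is usual for GPU kernel drivers.

```shell
#!/bin/sh
# Hypothetical helper (not part of the Toradex sample): succeed only if the
# given path exists and is a character device node.
# Assumption: /dev/galcore is exposed as a character device.
gpu_node_available() {
    node="${1:-/dev/galcore}"
    if [ -c "$node" ]; then
        echo "found $node"
    else
        echo "missing $node: the container will not see the GPU" >&2
        return 1
    fi
}

# Example use on the board (commented out so the snippet has no side effects):
# gpu_node_available && echo "safe to start the container"
```

If the check fails, the docker run command above would most likely fail as well, because Docker cannot pass through a device that does not exist on the host.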

This is the expected output from an Apalis iMX8 board:

Output
Platform: Vivante OpenCL Platform
  Device: Vivante OpenCL Device GC7000XSVX.6009.0000
    Driver version  : OpenCL 1.2 V6.2.4.p4.190076 (Linux ARM64)
    Compute units   : 1
    Clock frequency : 996 MHz

    Global memory bandwidth (GBPS)
      float   : 5.81
      float2  : 9.74
      float4  : 10.63
      float8  : 9.36
      float16 : 8.00

    Single-precision compute (GFLOPS)
      float   : 14.14
      float2  : 28.18
      float4  : 55.87
      float8  : 62.15
      float16 : 61.45

    No half precision support! Skipped

    No double precision support! Skipped

    Integer compute (GIOPS)
      int   : 14.13
      int2  : 14.09
      int4  : 15.84
      int8  : 15.73
      int16 : 14.54

    Transfer bandwidth (GBPS)
      enqueueWriteBuffer         : 1.43
      enqueueReadBuffer          : 0.08
      enqueueMapBuffer(for read) : 301.68
        memcpy from mapped ptr   : 0.08
      enqueueUnmap(after write)  : 269.08
        memcpy to mapped ptr     : 1.43

    Kernel launch latency : 97.56 us

  Device: Vivante OpenCL Device GC7000XSVX.6009.0000
    Driver version  : OpenCL 1.2 V6.2.4.p4.190076 (Linux ARM64)
    Compute units   : 1
    Clock frequency : 996 MHz

    Global memory bandwidth (GBPS)
      float   : 5.59
      float2  : 9.38
      float4  : 10.33
      float8  : 9.15
      float16 : 7.85

    Single-precision compute (GFLOPS)
      float   : 14.14
      float2  : 28.18
      float4  : 55.87
      float8  : 62.15
      float16 : 61.45

    No half precision support! Skipped

    No double precision support! Skipped

    Integer compute (GIOPS)
      int   : 14.13
      int2  : 14.09
      int4  : 15.84
      int8  : 15.73
      int16 : 14.54

    Transfer bandwidth (GBPS)
      enqueueWriteBuffer         : 1.42
      enqueueReadBuffer          : 0.08
      enqueueMapBuffer(for read) : 238.44
        memcpy from mapped ptr   : 0.08
      enqueueUnmap(after write)  : 207.03
        memcpy to mapped ptr     : 1.44

    Kernel launch latency : 126.82 us
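
To compare boards or track regressions, you may want to pull a single number out of this report. The sketch below assumes the layout shown above (a section header, indented "label : value" rows, and a blank line closing each section). The function name clpeak_metric and the log file name clpeak.log are hypothetical and not part of clpeak or the Toradex sample.

```shell
#!/bin/sh
# Hypothetical helper: print the value of one row (for example float4) from
# one section (for example "Single-precision compute") of clpeak output read
# on stdin. One value is printed per device found in the output.
# Limitation: only single-word row labels are matched.
clpeak_metric() {
    awk -v section="$1" -v label="$2" '
        { sub(/\r$/, "") }                          # tolerate CRLF line endings
        index($0, section) { in_section = 1; next } # section header line
        NF == 0            { in_section = 0; next } # blank line closes the section
        in_section && $1 == label { print $NF }     # row: "<label> : <value>"
    '
}

# Example use, assuming the report was saved to clpeak.log:
# clpeak_metric "Single-precision compute" float4 < clpeak.log
```

On the Apalis iMX8 report above, this would print one value per listed device, since the same section appears once for each of them.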

Run clinfo

To run the clinfo application, you must set it as the entrypoint in the docker run command. If you ran clpeak in the previous section, a small change is required:

danger

Note that by executing the following command, you accept the terms and conditions of NXP's End-User License Agreement (EULA).

# docker run -e ACCEPT_FSL_EULA=1 -it --rm --name=clpeak-container --device /dev/galcore:/dev/galcore --entrypoint=clinfo <your-dockerhub-username>/opencl-image
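
The clpeak and clinfo invocations differ only in the --entrypoint flag. As an illustration, the sketch below composes the command line for either tool; the function name opencl_run_cmd is hypothetical, it omits the optional --name flag, and it only prints the command instead of running it. Running the printed command still means accepting the NXP EULA.

```shell
#!/bin/sh
# Hypothetical helper: print (do not run) the docker run command for the
# sample image, optionally overriding the entrypoint.
# Note: the printed command sets ACCEPT_FSL_EULA=1, so running it means
# accepting the NXP EULA.
opencl_run_cmd() {
    image="$1"
    entrypoint="${2:-}"
    cmd="docker run -e ACCEPT_FSL_EULA=1 -it --rm --device /dev/galcore:/dev/galcore"
    if [ -n "$entrypoint" ]; then
        # Override the image default entrypoint (clpeak in this sample).
        cmd="$cmd --entrypoint=$entrypoint"
    fi
    echo "$cmd $image"
}

# Example use:
# opencl_run_cmd "<your-dockerhub-username>/opencl-image"          # clpeak (default)
# opencl_run_cmd "<your-dockerhub-username>/opencl-image" clinfo   # clinfo
```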

This is the expected output from a Colibri iMX8X board:

Output
clinfo: /usr/lib/aarch64-linux-gnu/libOpenCL.so.1: no version information available (required by clinfo)
Number of platforms                               1
  Platform Name                                   Vivante OpenCL Platform
  Platform Vendor                                 Vivante Corporation
  Platform Version                                OpenCL 1.2 V6.4.0.p2.234062
  Platform Profile                                FULL_PROFILE
  Platform Extensions                             cl_khr_icd
  Platform Extensions function suffix             viv

  Platform Name                                   Vivante OpenCL Platform
Number of devices                                 1
  Device Name                                     Vivante OpenCL Device GC7000L.6214.0000
  Device Vendor                                   Vivante Corporation
  Device Vendor ID                                0x564956
  Device Version                                  OpenCL 1.2 
  Driver Version                                  OpenCL 1.2 V6.4.0.p2.234062
  Device OpenCL C Version                         OpenCL C 1.2 
  Device Type                                     GPU
  Device Profile                                  FULL_PROFILE
  Device Available                                Yes
  Compiler Available                              Yes
  Linker Available                                Yes
  Max compute units                               1
  Max clock frequency                             850MHz
  Device Partition                                (core)
    Max number of sub-devices                     0
    Supported partition types                     (n/a)
    Supported affinity domains                    (n/a)
  Max work item dimensions                        3
  Max work item sizes                             1024x1024x1024
  Max work group size                             1024
  Preferred work group size multiple (kernel)     16
  Preferred / native vector sizes                 
    char                                                 4 / 4       
    short                                                4 / 4       
    int                                                  4 / 4       
    long                                                 4 / 4       
    half                                                 0 / 0        (cl_khr_fp16)
    float                                                4 / 4       
    double                                               0 / 0        (n/a)
  Half-precision Floating-point support           <printDeviceInfo:86: get CL_DEVICE_HALF_FP_CONFIG : error -30>
  Single-precision Floating-point support         (core)
    Denormals                                     No
    Infinity and NANs                             Yes
    Round to nearest                              Yes
    Round to zero                                 Yes
    Round to infinity                             No
    IEEE754-2008 fused multiply-add               No
    Support is emulated in software               No
    Correctly-rounded divide and sqrt operations  No
  Double-precision Floating-point support         (n/a)
  Address bits                                    32, Little-Endian
  Global memory size                              268435456 (256MiB)
  Error Correction support                        Yes
  Max memory allocation                           134217728 (128MiB)
  Unified memory for Host and Device              Yes
  Minimum alignment for any data type             128 bytes
  Alignment of base address                       2048 bits (256 bytes)
  Global Memory cache type                        Read/Write
  Global Memory cache size                        8192 (8KiB)
  Global Memory cache line size                   64 bytes
  Image support                                   Yes
    Max number of samplers per kernel             16
    Max size for 1D images from buffer            65536 pixels
    Max 1D or 2D image array size                 8192 images
    Max 2D image size                             8192x8192 pixels
    Max 3D image size                             8192x8192x8192 pixels
    Max number of read image args                 128
    Max number of write image args                8
  Local memory type                               Global
  Local memory size                               32768 (32KiB)
  Max number of constant args                     9
  Max constant buffer size                        65536 (64KiB)
  Max size of kernel argument                     1024
  Queue properties                                
    Out-of-order execution                        Yes
    Profiling                                     Yes
  Prefer user sync for interop                    Yes
  Profiling timer resolution                      1000ns
  Execution capabilities                          
    Run OpenCL kernels                            Yes
    Run native kernels                            No
  printf() buffer size                            1048576 (1024KiB)
  Built-in kernels                                (n/a)
  Device Extensions                               cl_khr_byte_addressable_store cl_khr_gl_sharing cl_khr_fp16 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics 

NULL platform behavior
  clGetPlatformInfo(NULL, CL_PLATFORM_NAME, ...)  No platform
  clGetDeviceIDs(NULL, CL_DEVICE_TYPE_ALL, ...)   No platform
  clCreateContext(NULL, ...) [default]            No platform
  clCreateContext(NULL, ...) [other]              Success [viv]
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_DEFAULT)  Success (1)
    Platform Name                                 Vivante OpenCL Platform
    Device Name                                   Vivante OpenCL Device GC7000L.6214.0000
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_CPU)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_GPU)  Success (1)
    Platform Name                                 Vivante OpenCL Platform
    Device Name                                   Vivante OpenCL Device GC7000L.6214.0000
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_ACCELERATOR)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_CUSTOM)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_ALL)  Success (1)
    Platform Name                                 Vivante OpenCL Platform
    Device Name                                   Vivante OpenCL Device GC7000L.6214.0000
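
When sizing kernels for this GPU, values such as Max work group size or Local memory size are the ones you will usually look up in this listing. The sketch below extracts one field from the human-readable report. It assumes the layout shown above, where the field name and its value are separated by at least two spaces. The function name clinfo_field and the log file name clinfo.log are hypothetical and not part of clinfo.

```shell
#!/bin/sh
# Hypothetical helper: print the value(s) of one named field from clinfo
# output read on stdin. A field that appears several times (for example
# "Platform Name") is printed once per occurrence.
clinfo_field() {
    awk -v key="$1" '
        {
            line = $0
            sub(/^[ \t]+/, "", line)                 # drop indentation
            if (index(line, key) == 1) {             # line starts with the key
                rest = substr(line, length(key) + 1)
                if (rest ~ /^[ \t][ \t]+/) {         # two or more blanks before the value
                    sub(/^[ \t]+/, "", rest)
                    sub(/[ \t\r]+$/, "", rest)       # trim trailing blanks and CR
                    print rest
                }
            }
        }'
}

# Example use, assuming the report was saved to clinfo.log:
# clinfo_field "Max work group size" < clinfo.log
```

Requiring at least two blanks after the key keeps a short key such as "Device" from matching longer field names like "Device Name".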
