2022-09-20

Write and Run OpenCL Code Using CLSimpleWrapper

After setting up OpenCL on my Raspberry Pi, I decided to write a simple C++ class that wraps the complexity of OpenCL behind an easy-to-use interface. The result is CLSimpleWrapper, which you can find on GitHub.

For example, listing and printing all OpenCL platforms and devices is as simple as the following code:

#include "CLSimpleWrapper.h"

int main(int argc, char* argv[]) {
    CLSimpleWrapper cl_wrapper;
    // Lists and prints all available OpenCL platforms and devices.
    cl_wrapper.initOpenCL(-1, -1, true);
    return 0;
}

You can find the code in the example directory of CLSimpleWrapper. At the time of writing, there are only two OpenCL examples: CLListDevices and CLMatrixMultiply. The first, CLListDevices, is just a simple program that lists all OpenCL platforms and devices on the computer. The second, CLMatrixMultiply, performs matrix multiplication using OpenCL. I created four variants of matrix multiplication: a regular matrix multiplication and one with the matrix transpose optimization, each with the option to use the integer or double data type. I think these two examples provide a good starting point for writing other OpenCL code, and I may add more examples to the GitHub repository in the future.

The OpenCL code written here works on my Raspberry Pi with OpenCL, but it can also run on any other computer with OpenCL. Below, I compare the results of CLMatrixMultiply on a Raspberry Pi 3 with the same code running on my desktop PC. For comparison, I also include the result of matrix multiplication using C++ multithreading with 8 threads. The matrix size below indicates the number of rows and columns; all tests use square matrices.

| Platform/Device | 1K Integer Matrix | 1K Integer Matrix (transpose) | 1K Double Matrix | 1K Double Matrix (transpose) |
| --- | --- | --- | --- | --- |
| Raspberry Pi 3 VC4CL | ~118.5 s | ~84.1 s | N/A | N/A |
| Raspberry Pi 3 PoCL | ~16.3 s | ~2.3 s | ~21.8 s | ~4.8 s |
| CPU-CPP-Multi, Raspberry Pi 3 | N/A | ~2.6 s | N/A | ~4.9 s |
| OpenCL Intel Iris Pro Graphics 580, Windows 10 | ~1.3 s | ~1.0 s | ~1.8 s | ~1.2 s |
| OpenCL Intel i7-6770HQ CPU @ 2.60GHz, Windows 10 | ~1.5 s | ~0.5 s | ~1.6 s | ~0.5 s |
| CPU-CPP-Multi, Intel i7-6770HQ CPU @ 2.60GHz, Windows 10 | N/A | ~1.3 s | N/A | ~1.3 s |

| Platform/Device | 2K Integer Matrix | 2K Integer Matrix (transpose) | 2K Double Matrix | 2K Double Matrix (transpose) |
| --- | --- | --- | --- | --- |
| Raspberry Pi 3 VC4CL | N/A | N/A | N/A | N/A |
| Raspberry Pi 3 PoCL | ~157.8 s | ~18.5 s | ~234.5 s | ~41.5 s |
| CPU-CPP-Multi, Raspberry Pi 3 | N/A | ~22.1 s | N/A | ~43.8 s |
| OpenCL Intel Iris Pro Graphics 580, Windows 10 | ~1.9 s | ~2.3 s | ~2.3 s | ~2.6 s |
| OpenCL Intel i7-6770HQ CPU @ 2.60GHz, Windows 10 | ~5.2 s | ~2.6 s | ~5.2 s | ~2.9 s |
| CPU-CPP-Multi, Intel i7-6770HQ CPU @ 2.60GHz, Windows 10 | N/A | ~7.9 s | N/A | ~8.2 s |

| Platform/Device | 5K Integer Matrix | 5K Integer Matrix (transpose) | 5K Double Matrix | 5K Double Matrix (transpose) |
| --- | --- | --- | --- | --- |
| Raspberry Pi 3 VC4CL | N/A | N/A | N/A | N/A |
| Raspberry Pi 3 PoCL | * | ~337.8 s | N/A | N/A |
| CPU-CPP-Multi, Raspberry Pi 3 | N/A | ~364.4 s | N/A | ~708.2 s |
| OpenCL Intel Iris Pro Graphics 580, Windows 10 | ~8.1 s | ~15.5 s | ~18.1 s | ~31.0 s |
| OpenCL Intel i7-6770HQ CPU @ 2.60GHz, Windows 10 | ~86.8 s | ~34.8 s | ~93.0 s | ~32.5 s |
| CPU-CPP-Multi, Intel i7-6770HQ CPU @ 2.60GHz, Windows 10 | N/A | ~121.8 s | N/A | ~124.9 s |

NOTE:

  1. Compiling the OpenCL program (kernel) fails on VC4CL when the matrix uses the double data type. A further check with clinfo shows that VC4CL does not support double-precision floating-point.
  2. I only ran the C++ multithreading matrix multiplication with the matrix transpose optimization.
  3. Since I only allocated 64 MB of GPU memory on the Raspberry Pi, there is a limit on the data size that VC4CL can handle. In the test above, a 2000x2000 matrix cannot run on VC4CL, even with the integer data type.
  4. Matrix multiplication at 5000x5000 takes too long without the transpose optimization; I gave up waiting (marked * above).
  5. For some reason, the 5000x5000 matrix with double-precision floating-point cannot run with PoCL, possibly because of the limited memory on the Raspberry Pi 3.
  6. My Raspberry Pi always runs in "low power" mode, so this is not its best possible performance. Even under the very high workload of this test, in my tropical home, the Raspberry Pi 3 never reached 50 degrees Celsius.

You may notice that the tests above include a matrix transpose optimization. If you are not familiar with it, I encourage you to read up on this topic. In short, transposing the second matrix does not change the computational complexity at all, yet on real hardware it can produce a significant speedup. Why is that? Because after the transpose, the inner loop reads both matrices sequentially in memory. Since modern computers rely on prefetching and caching, laying the data out sequentially makes it much more likely that each value is already in cache when the computation needs it. This effect can be observed in most of the results above. Interestingly, the Intel Iris Pro Graphics 580 results show that it can be faster without the matrix transpose optimization; this GPU probably has some special mechanism for handling that access pattern.

Some observation on the result above:

  • VC4CL is far slower than PoCL or even the C++ multithreaded program, and it cannot handle double-precision floating-point.
  • OpenCL on the CPU can be as fast as or faster than my multithreaded C++ implementation. Either my C++ implementation is not very optimized and can be further improved, or OpenCL does a very good job of running code in parallel.
  • As mentioned above, it is interesting to see that the Intel Iris Pro Graphics 580 achieves better performance without the matrix transpose optimization.
  • On the Raspberry Pi 3, the CPU OpenCL (PoCL) results are not much different from the regular C++ multithreaded code.

I originally intended to explore OpenCL as a way to utilize both the CPU and GPU of the Raspberry Pis in my cluster. Looking at the results above, I will probably not continue using OpenCL on my Raspberry Pi 3 / Pi Zero 2. Besides the limited memory available on these devices (especially GPU memory), OpenCL does not seem to offer a significant performance benefit. I am probably better off allocating more memory to the CPU (allocating more memory to the GPU means less memory for the CPU) and using C++ or OpenMPI to develop applications for my Raspberry Pi cluster.

That being said, it is still interesting to be able to use the GPU on the Raspberry Pi 3 / Zero 2. For some types of applications or workloads that can utilize both the CPU and GPU, this could mean roughly a 10% increase in performance.

The OpenCL results using Intel integrated graphics look very interesting; I will probably look into this further.