2022-09-20
After setting up OpenCL on my Raspberry Pi, I decided to write simple code wrapping the complexity of OpenCL in a C++ class that is easy to use. The result is CLSimpleWrapper, which you can find on GitHub.
For example, the code to list and print all the OpenCL platforms and devices is as simple as the following:
```cpp
#include "CLSimpleWrapper.h"

int main(int argc, char* argv[]) {
    CLSimpleWrapper cl_wrapper;
    cl_wrapper.initOpenCL(-1, -1, true);
    return 0;
}
```
You can find the code in the example directory of CLSimpleWrapper. At the time of writing, I only have two examples of OpenCL code: CLListDevices and CLMatrixMultiply. The first example, CLListDevices, is simple code that lists all the OpenCL platforms and devices on the computer. The second example, CLMatrixMultiply, performs matrix multiplication using OpenCL.
I have created four different matrix multiplication examples: a regular matrix multiplication and another with the matrix transpose optimization, each with the option to use the integer or double data type.
I think these two examples provide a good starting point for writing other OpenCL code.
I may add more examples to the GitHub repository in the future.
The OpenCL code written here works on my Raspberry Pi with OpenCL, and it should also run on any other computer with OpenCL.
I will run CLMatrixMultiply on a Raspberry Pi 3 and compare the result with running the same code on my desktop PC.
As a comparison, I also include the result of matrix multiplication using C++ multithreading with 8 threads.
The matrix size below indicates the number of rows and columns of the square matrix; all tests use square matrices.
Platform/Device | 1K Integer Matrix | 1K Integer Matrix (transpose) | 1K Double Matrix | 1K Double Matrix (transpose) |
---|---|---|---|---|
Raspberry Pi 3 VC4VL | ~118.5 s | ~84.1047 s | N/A | N/A |
Raspberry Pi 3 PoCL | ~16.3 s | ~2.3 s | ~21.8 s | ~4.8 s |
CPU-CPP-Multi Raspberry Pi 3 | N/A | ~2.6 s | N/A | ~4.9 s |
OpenCL Intel Iris Pro Graphics 580, Windows 10 | ~1.3 s | ~1.0 s | ~1.8 s | ~1.2 s |
OpenCL Intel i7-6770HQ CPU @ 2.60GHz, Windows 10 | ~1.5 s | ~0.5 s | ~1.6 s | ~0.5 s |
CPU-CPP-Multi Intel i7-6770HQ CPU @ 2.60GHz, Win10 | N/A | ~1.3 s | N/A | ~1.3 s |
Platform/Device | 2K Integer Matrix | 2K Integer Matrix (transpose) | 2K Double Matrix | 2K Double Matrix (transpose) |
---|---|---|---|---|
Raspberry Pi 3 VC4VL | N/A | N/A | N/A | N/A |
Raspberry Pi 3 PoCL | ~157.8 s | ~18.5 s | ~234.5 s | ~41.5 s |
CPU-CPP-Multi Raspberry Pi 3 | N/A | ~22.1 s | N/A | ~43.8 s |
OpenCL Intel Iris Pro Graphics 580, Windows 10 | ~1.9 s | ~2.3 s | ~2.3 s | ~2.6 s |
OpenCL Intel i7-6770HQ CPU @ 2.60GHz, Windows 10 | ~5.2 s | ~2.6 s | ~5.2 s | ~2.9 s |
CPU-CPP-Multi Intel i7-6770HQ CPU @ 2.60GHz, Win10 | N/A | ~7.9 s | N/A | ~8.2 s |
Platform/Device | 5K Integer Matrix | 5K Integer Matrix (transpose) | 5K Double Matrix | 5K Double Matrix (transpose) |
---|---|---|---|---|
Raspberry Pi 3 VC4VL | N/A | N/A | N/A | N/A |
Raspberry Pi 3 PoCL | * | ~337.8 s | N/A | N/A |
CPU-CPP-Multi Raspberry Pi 3 | N/A | ~364.4 s | N/A | ~708.2 s |
OpenCL Intel Iris Pro Graphics 580, Windows 10 | ~8.1 s | ~15.5 s | ~18.1 s | ~31.0 s |
OpenCL Intel i7-6770HQ CPU @ 2.60GHz, Windows 10 | ~86.8 s | ~34.8 s | ~93.0 s | ~32.5 s |
CPU-CPP-Multi Intel i7-6770HQ CPU @ 2.60GHz, Win10 | N/A | ~121.8 s | N/A | ~124.9 s |
NOTE: The VC4CL results for the double tests are N/A because VC4CL cannot run the code that uses `double`. A further check with `clinfo` shows that VC4CL does not support double-precision floating point.

You may notice that I include the matrix transpose optimization in the tests above. If you are not familiar with this optimization, I encourage you to read up on the topic. In short, there is no difference in computational complexity when multiplying with a transposed matrix. For a computer, however, it can make a significant difference in computation time. Why is that? Because transposing the second matrix arranges the data to be read sequentially in memory. Since modern computers use prefetching and caching mechanisms, having the data laid out sequentially makes it much more likely that it can be fetched quickly for the computation. This can be observed in most of the results above. Interestingly, the Intel Iris Pro Graphics 580 results show that the version without the matrix transpose optimization can be faster; this graphics chip probably has some special mechanism for handling this kind of access pattern.
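To make the idea concrete, here is a minimal sketch in plain C++ (not the OpenCL kernels from the repository; the function names are my own) of the same multiplication with and without the transpose. In the transposed version the inner loop reads both inputs sequentially:

```cpp
#include <cstddef>
#include <vector>

// Naive multiplication: the inner loop reads b with stride n
// (b[k*n + j]), jumping around in memory on every iteration.
void matmul_naive(const std::vector<int>& a, const std::vector<int>& b,
                  std::vector<int>& c, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j) {
            int sum = 0;
            for (std::size_t k = 0; k < n; ++k)
                sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;
        }
}

// Transpose-optimized: b is transposed once up front, so the inner
// loop reads both a and bt with stride 1, which is far friendlier
// to caches and hardware prefetchers.
void matmul_transposed(const std::vector<int>& a, const std::vector<int>& b,
                       std::vector<int>& c, std::size_t n) {
    std::vector<int> bt(n * n);
    for (std::size_t k = 0; k < n; ++k)
        for (std::size_t j = 0; j < n; ++j)
            bt[j * n + k] = b[k * n + j];
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j) {
            int sum = 0;
            for (std::size_t k = 0; k < n; ++k)
                sum += a[i * n + k] * bt[j * n + k];
            c[i * n + j] = sum;
        }
}
```

Both functions compute exactly the same result; only the memory access pattern differs, and that difference is where the large timing gaps in the tables above come from.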
Some observations on the results above:
I originally intended to explore OpenCL as a way to utilize both the CPU and GPU of the Raspberry Pis in my cluster. Looking at the results above, I will probably not continue using OpenCL on my Raspberry Pi 3 / Pi Zero 2. In addition to the limited memory available on these devices (especially GPU memory), OpenCL does not seem to offer a significant performance benefit. I am probably better off allocating more memory to the CPU (allocating more memory to the GPU means less memory for the CPU) and using C++ or OpenMPI to develop applications for my Raspberry Pi cluster.
That being said, it is still interesting to be able to use the GPU on the Raspberry Pi 3 / Zero 2. For some types of applications/workloads that can utilize both the CPU and GPU, this probably means about a 10% increase in performance.
The OpenCL results when using the Intel integrated graphics look very interesting. I will probably look into this further.
Under Construction.