
Executing threads on a device
We have seen that, while configuring kernel launch parameters, we can start multiple blocks and multiple threads in parallel. So, in which order do these blocks and threads start and finish their execution? This matters if we want to use the output of one thread in other threads. To understand it, we have modified the kernel from the Hello, CUDA! program we saw in the first chapter by adding a print statement to the kernel, which prints the block number. The modified code is as follows:
#include <stdio.h>

__global__ void myfirstkernel(void)
{
    // blockIdx.x gives the block number of the currently executing copy of the kernel
    printf("Hello! I'm a thread in block: %d\n", blockIdx.x);
}

int main(void)
{
    // A kernel launch with 16 blocks and 1 thread per block
    myfirstkernel<<<16, 1>>>();
    // Wait for all launched device work to finish before exiting
    cudaDeviceSynchronize();
    printf("All threads are finished!\n");
    return 0;
}
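If the source is saved in a file named, for example, hello_blocks.cu (the file name here is only for illustration), it can be compiled and run with NVIDIA's nvcc compiler:

nvcc hello_blocks.cu -o hello_blocks
./hello_blocks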
As can be seen from the code, we launch a kernel with 16 blocks in parallel, each block containing a single thread. In the kernel code, we print the block ID of the executing block. You can think of it as 16 copies of the same myfirstkernel starting execution in parallel. Each copy has a unique block ID, which can be accessed through the built-in blockIdx.x variable, and a unique thread ID within its block, which can be accessed through threadIdx.x. These IDs tell us which block and thread are executing the kernel. If you run the program many times, you will find that the blocks execute in a different order each time.

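One possible output from a single run is shown below; this listing is illustrative only, since block scheduling is nondeterministic and the ordering you observe will almost certainly differ:

Hello! I'm a thread in block: 6
Hello! I'm a thread in block: 7
Hello! I'm a thread in block: 8
Hello! I'm a thread in block: 9
Hello! I'm a thread in block: 10
Hello! I'm a thread in block: 11
Hello! I'm a thread in block: 12
Hello! I'm a thread in block: 13
Hello! I'm a thread in block: 14
Hello! I'm a thread in block: 15
Hello! I'm a thread in block: 0
Hello! I'm a thread in block: 1
Hello! I'm a thread in block: 2
Hello! I'm a thread in block: 3
Hello! I'm a thread in block: 4
Hello! I'm a thread in block: 5
All threads are finished!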
One question you should ask is: how many different output orderings can the previous program produce? The correct answer is 16 factorial (16! ≈ 2 × 10^13). In general, a launch of n blocks can produce n! different orderings, where n is the number of blocks started in parallel. So, whenever you write a CUDA program, you must not rely on the blocks executing in any particular order.
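Although this program uses only one thread per block, the same idea extends to launches with multiple threads per block. As a minimal sketch (not part of the original program), the following kernel shows the common idiom of combining blockIdx.x, threadIdx.x, and the built-in blockDim.x variable (the number of threads per block) to give every thread a globally unique index; it assumes the same stdio.h include as the program above:

__global__ void myindexkernel(void)
{
    // Each thread computes an index that is unique across all blocks:
    // blockDim.x threads per block, so block b starts at index b * blockDim.x
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    printf("Block %d, thread %d, global index %d\n",
           blockIdx.x, threadIdx.x, tid);
}

It could be launched, for example, as myindexkernel<<<4, 4>>>(); to start 4 blocks of 4 threads each.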
This program also contains one more CUDA API call: cudaDeviceSynchronize(). Why is it used? A kernel launch is an asynchronous operation: it returns control to the CPU thread immediately after starting the work on the GPU, before the kernel has finished executing. In the previous code, the next steps in the CPU thread are the printf call and the application's exit, which would terminate the process before the kernel finishes. If we did not include this call, the output generated later by the kernel would have nowhere to go, and you would not see it. cudaDeviceSynchronize() blocks the CPU thread until all previously launched device work has completed, so the kernel finishes before the application is allowed to exit and its output reaches a waiting standard output queue.
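Because cudaDeviceSynchronize() returns an error code, it is also a convenient place to detect failures in previously launched kernels. The following is a minimal sketch of that common error-checking idiom, not part of the original program:

// Returns cudaSuccess only if all preceding device work completed without error
cudaError_t err = cudaDeviceSynchronize();
if (err != cudaSuccess)
{
    // cudaGetErrorString() converts the error code into a readable message
    printf("CUDA error: %s\n", cudaGetErrorString(err));
    return 1;
}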