
Executing threads on a device
We have seen that, while configuring kernel launch parameters, we can start multiple blocks and multiple threads in parallel. So, in which order do these blocks and threads start and finish their execution? This matters if we want to use the output of one thread in other threads. To understand it, we have modified the kernel from the Hello, CUDA! program we saw in the first chapter by adding a print statement to the kernel, which prints the block number. The modified code is as follows:
#include <stdio.h>

__global__ void myfirstkernel(void)
{
    // blockIdx.x gives the block number of the currently executing copy of the kernel
    printf("Hello! I'm a thread in block: %d\n", blockIdx.x);
}

int main(void)
{
    // A kernel launch with 16 blocks and 1 thread per block
    myfirstkernel<<<16, 1>>>();
    // Wait for all launched device work to finish before exiting
    cudaDeviceSynchronize();
    printf("All threads are finished!\n");
    return 0;
}
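If the source is saved in a file named, for example, hello_blocks.cu (the file name here is only for illustration), it can be compiled and run with NVIDIA's nvcc compiler:

nvcc hello_blocks.cu -o hello_blocks
./hello_blocks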
As can be seen from the code, we launch a kernel with 16 blocks in parallel, each block containing a single thread. In the kernel code, we print the block ID of the executing block. You can think of it as 16 copies of the same myfirstkernel starting execution in parallel. Each copy has a unique block ID, which can be accessed through the built-in blockIdx.x variable, and a unique thread ID within its block, which can be accessed through threadIdx.x. These IDs tell us which block and thread are executing the kernel. If you run the program many times, you will find that the blocks execute in a different order each time.

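One possible output from a single run is shown below; this listing is illustrative only, since block scheduling is nondeterministic and the ordering you observe will almost certainly differ:

Hello! I'm a thread in block: 6
Hello! I'm a thread in block: 7
Hello! I'm a thread in block: 8
Hello! I'm a thread in block: 9
Hello! I'm a thread in block: 10
Hello! I'm a thread in block: 11
Hello! I'm a thread in block: 12
Hello! I'm a thread in block: 13
Hello! I'm a thread in block: 14
Hello! I'm a thread in block: 15
Hello! I'm a thread in block: 0
Hello! I'm a thread in block: 1
Hello! I'm a thread in block: 2
Hello! I'm a thread in block: 3
Hello! I'm a thread in block: 4
Hello! I'm a thread in block: 5
All threads are finished!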
One question you should ask is: how many different output orderings can the previous program produce? The correct answer is 16 factorial (16! ≈ 2 × 10^13). In general, a launch of n blocks can produce n! different orderings, where n is the number of blocks started in parallel. So, whenever you write a CUDA program, you must not rely on the blocks executing in any particular order.
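Although this program uses only one thread per block, the same idea extends to launches with multiple threads per block. As a minimal sketch (not part of the original program), the following kernel shows the common idiom of combining blockIdx.x, threadIdx.x, and the built-in blockDim.x variable (the number of threads per block) to give every thread a globally unique index; it assumes the same stdio.h include as the program above:

__global__ void myindexkernel(void)
{
    // Each thread computes an index that is unique across all blocks:
    // blockDim.x threads per block, so block b starts at index b * blockDim.x
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    printf("Block %d, thread %d, global index %d\n",
           blockIdx.x, threadIdx.x, tid);
}

It could be launched, for example, as myindexkernel<<<4, 4>>>(); to start 4 blocks of 4 threads each.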
This program also contains one more CUDA API call: cudaDeviceSynchronize(). Why is it used? A kernel launch is an asynchronous operation: it returns control to the CPU thread immediately after starting the work on the GPU, before the kernel has finished executing. In the previous code, the next steps in the CPU thread are the printf call and the application's exit, which would terminate the process before the kernel finishes. If we did not include this call, the output generated later by the kernel would have nowhere to go, and you would not see it. cudaDeviceSynchronize() blocks the CPU thread until all previously launched device work has completed, so the kernel finishes before the application is allowed to exit and its output reaches a waiting standard output queue.
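Because cudaDeviceSynchronize() returns an error code, it is also a convenient place to detect failures in previously launched kernels. The following is a minimal sketch of that common error-checking idiom, not part of the original program:

// Returns cudaSuccess only if all preceding device work completed without error
cudaError_t err = cudaDeviceSynchronize();
if (err != cudaSuccess)
{
    // cudaGetErrorString() converts the error code into a readable message
    printf("CUDA error: %s\n", cudaGetErrorString(err));
    return 1;
}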