
Elementwise squaring of vectors in CUDA
Now that we have launched N blocks in parallel with one thread in each block, a natural question is whether we can work the other way around. The answer is yes: we can launch a single block with N threads in parallel. To demonstrate this, and to make you more familiar with working with vectors in CUDA, we take a second example: the element-wise squaring of the numbers in an array. We take one array of numbers and return an array that contains the squares of these numbers. The kernel function to find the element-wise square is shown here:
#include <stdio.h>
#include <iostream>
#include <cuda.h>
#include <cuda_runtime.h>
//Defining number of elements in Array
#define N 5
//Kernel function for squaring numbers
__global__ void gpuSquare(float *d_in, float *d_out)
{
    //Get the thread index for the current kernel
    int tid = threadIdx.x; // handle the data at this index
    float temp = d_in[tid];
    d_out[tid] = temp * temp;
}
The gpuSquare kernel function takes pointers to two arrays as arguments. The first pointer, d_in, points to the memory location where the input array is stored, while the second pointer, d_out, points to the memory location where the output will be stored. In this program, instead of launching multiple blocks in parallel, we want to launch multiple threads in parallel, so tid is initialized with a particular thread ID using threadIdx.x. The main function for this program is as follows:
int main(void)
{
    //Define arrays for the host
    float h_in[N], h_out[N];
    //Define pointers for device memory
    float *d_in, *d_out;
    //Allocate memory on the device (GPU)
    cudaMalloc((void**)&d_in, N * sizeof(float));
    cudaMalloc((void**)&d_out, N * sizeof(float));
    //Initialize the host array
    for (int i = 0; i < N; i++)
    {
        h_in[i] = i;
    }
    //Copy the array from host to device
    cudaMemcpy(d_in, h_in, N * sizeof(float), cudaMemcpyHostToDevice);
    //Call the square kernel with one block and N threads per block
    gpuSquare <<<1, N>>>(d_in, d_out);
    //Copy the result back from device memory to the host
    cudaMemcpy(h_out, d_out, N * sizeof(float), cudaMemcpyDeviceToHost);
    //Print the result on the console
    printf("Square of Number on GPU \n");
    for (int i = 0; i < N; i++)
    {
        printf("The square of %f is %f\n", h_in[i], h_out[i]);
    }
    //Free up device memory
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
This main function follows a structure similar to that of the vector addition program. The one difference you will notice is that here we launch a single block with N threads in parallel. The output of the program is as follows:

Whenever you launch N threads in parallel in this way, you should take care that the maximum number of threads per block is limited, to 512 on older devices and 1,024 on most current ones, so the value of N must not exceed this limit. If N is 2,000 and the maximum number of threads per block for your device is 512, then you can't write <<<1, 2000>>>. Instead, you should use something such as <<<4, 500>>>, and the kernel must then compute its index from both the block ID and the thread ID rather than from threadIdx.x alone. The number of blocks and the number of threads per block should be chosen judiciously.
To summarize, we have learned how to work with vectors and how to launch multiple blocks and multiple threads in parallel. We have also seen that performing vector operations on the GPU improves throughput compared to the same operations on the CPU. In the last section of this chapter, we discuss the various parallel communication patterns followed by threads executing in parallel.