
Dot product

The dot product of two vectors is an important mathematical operation. It will also be used to explain one important concept in CUDA programming, called the reduction operation. The dot product of two vectors can be defined as follows:

(x1, x2, x3) . (y1, y2, y3) = x1y1 + x2y2 + x3y3
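On the CPU, this is a simple sequential loop that multiplies the vectors element by element and accumulates the result. Here is a minimal reference sketch (the function name cpu_dot is our own choice for illustration):

// Sequential dot product on the CPU, shown for reference
float cpu_dot(const float *a, const float *b, int n)
{
    float sum = 0;
    for (int i = 0; i < n; i++)
    {
        sum += a[i] * b[i];
    }
    return sum;
}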

Now, if you look at this operation, it is very similar to the element-wise addition of two vectors, except that you have to perform element-wise multiplication instead of addition. All threads also have to keep a running sum of the multiplications they have performed, because all the individual products need to be summed up to get the final answer of the dot product. That final answer is a single number. This kind of operation, where the final answer is a reduced version of the original two arrays, is called a reduce operation in CUDA, and it is useful in many applications. To perform this operation in CUDA, we will start by writing a kernel function for it, as follows:

#include <stdio.h>
#include <iostream>
#include <cuda.h>
#include <cuda_runtime.h>
#define N 1024
#define threadsPerBlock 512


__global__ void gpu_dot(float *d_a, float *d_b, float *d_c)
{
    // Define shared memory: one slot per thread in the block
    __shared__ float partial_sum[threadsPerBlock];
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    int index = threadIdx.x;

    float sum = 0;
    while (tid < N)
    {
        sum += d_a[tid] * d_b[tid];
        tid += blockDim.x * gridDim.x;
    }

    // Set the partial sum in shared memory
    partial_sum[index] = sum;

    // Synchronize threads in this block
    __syncthreads();

    // Calculate the partial sum for the current block using data in shared memory
    int i = blockDim.x / 2;
    while (i != 0)
    {
        if (index < i)
        {
            partial_sum[index] += partial_sum[index + i];
        }
        __syncthreads();
        i /= 2;
    }

    // Store the result of the partial sum for this block in global memory
    if (index == 0)
        d_c[blockIdx.x] = partial_sum[0];
}


The kernel function takes two arrays as input and stores the final partial sum of each block in the third array. Shared memory is defined to store the intermediate partial answers. The size of the shared memory is equal to the number of threads per block, as every block has its own separate copy of this shared memory. After that, two indexes are calculated; the first one, which calculates the unique thread ID, is similar to what we did in the vector addition example. The second index is used to store the partial answer in shared memory. Because every block has a separate copy of shared memory, only the thread index within the block (threadIdx.x) is used to index it.
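For instance, with the launch configuration used later in this example (2 blocks of 512 threads), thread 5 of block 1 computes tid = 5 + 1 * 512 = 517 but index = 5, so it reads d_a[517] and d_b[517] while writing its running sum to partial_sum[5] in block 1's private copy of the shared array.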

The while loop performs element-wise multiplication of the elements indexed by the thread ID. It also multiplies the elements that are offset from the current thread ID by the total number of threads in the grid, so the kernel still works when the arrays are larger than the total thread count. (Here, with 1,024 total threads and N equal to 1,024, each thread executes the loop exactly once.) The running sum of these products is stored in shared memory. We are going to use these results from shared memory to calculate the partial sum for a single block. So, before reading from this shared memory, we must ensure that all threads have finished writing to it. This is ensured by using the __syncthreads() directive.

Now, one method to get the answer to the dot product is for a single thread to iterate over all these partial sums and accumulate the final answer; that is, one thread performs the reduce operation. This takes N operations to complete, where N is the number of partial sums to be added (equal to the number of threads per block).
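That naive approach would look something like the following fragment at the end of the kernel, in place of the parallel reduction shown above (it is included only for illustration and is not part of the final program):

// Naive reduction: thread 0 alone walks over all the partial sums,
// while the other threads in the block sit idle
if (index == 0)
{
    float block_sum = 0;
    for (int i = 0; i < blockDim.x; i++)
    {
        block_sum += partial_sum[i];
    }
    d_c[blockIdx.x] = block_sum;
}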

The question is, can we do this reduce operation in parallel? The answer is yes. The idea is that every thread adds two elements of the partial sum array and stores the answer in the location of the first element. Since each thread combines two entries into one, each pass halves the number of entries. We repeat this operation on the remaining half until we get the final answer, which is the partial sum for the entire block. The complexity of this operation is log2(N), which is far better than the complexity of N when one thread performs the reduce operation.
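For example, with eight partial sums, the reduction completes in log2(8) = 3 passes (writing s as shorthand for the partial_sum array):

pass 1 (i = 4): s[0] += s[4]; s[1] += s[5]; s[2] += s[6]; s[3] += s[7];
pass 2 (i = 2): s[0] += s[2]; s[1] += s[3];
pass 3 (i = 1): s[0] += s[1]; // s[0] now holds the sum of all eight entries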

The operation just explained is implemented by the code block starting with while (i != 0). That block adds the partial answer of the current thread to the entry offset by i, where i starts at half the block size (blockDim.x / 2) and is halved on every iteration. It continues this addition until we get a single final answer, which is the sum of all partial products in the given block. The final answer is stored in global memory, indexed by the block ID, which is unique for each block, so every block stores its answer separately. Still, we have not got the final answer yet. The remaining summation can be performed in a device function or in the main function.

Normally, the last few additions of the reduce operation need very few resources, leaving much of the GPU idle, which is not an optimal use of the GPU. So, the final addition of the partial sums of the individual blocks is done in the main function. The main function is as follows:

int main(void)
{
    // h_c is a scalar on the host that will hold the final dot product
    float *h_a, *h_b, h_c, *partial_sum;
    float *d_a, *d_b, *d_partial_sum;

    // Calculate the number of blocks and the number of threads
    int block_calc = (N + threadsPerBlock - 1) / threadsPerBlock;
    int blocksPerGrid = (32 < block_calc ? 32 : block_calc);

    // Allocate memory on the CPU side
    h_a = (float*)malloc(N * sizeof(float));
    h_b = (float*)malloc(N * sizeof(float));
    partial_sum = (float*)malloc(blocksPerGrid * sizeof(float));

    // Allocate the memory on the GPU
    cudaMalloc((void**)&d_a, N * sizeof(float));
    cudaMalloc((void**)&d_b, N * sizeof(float));
    cudaMalloc((void**)&d_partial_sum, blocksPerGrid * sizeof(float));

    // Fill in the host memory with data
    for (int i = 0; i < N; i++)
    {
        h_a[i] = i;
        h_b[i] = 2;
    }

    // Copy the input arrays to the device
    cudaMemcpy(d_a, h_a, N * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, N * sizeof(float), cudaMemcpyHostToDevice);

    gpu_dot <<<blocksPerGrid, threadsPerBlock>>>(d_a, d_b, d_partial_sum);

    // Copy the per-block partial sums back to the host
    cudaMemcpy(partial_sum, d_partial_sum, blocksPerGrid * sizeof(float), cudaMemcpyDeviceToHost);

    // Calculate the final dot product on the host
    h_c = 0;
    for (int i = 0; i < blocksPerGrid; i++)
    {
        h_c += partial_sum[i];
    }
}

Three arrays are defined, and memory is allocated on both the host and the device to store the inputs and the output. The two host input arrays are initialized inside a for loop: one array is initialized with the values 0 to N-1 and the second with a constant value of 2. The number of blocks in the grid and the number of threads in a block are also calculated, in the same way as at the start of this chapter. Bear in mind that you can also keep these values as constants, as we did in the first program of this chapter, to avoid complexity.
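With the constants used in this program (N = 1024 and threadsPerBlock = 512), block_calc = (1024 + 512 - 1) / 512 = 2 and blocksPerGrid = min(32, 2) = 2, so the kernel launches 2 * 512 = 1,024 threads, exactly one per element.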

These arrays are copied to device memory and passed as parameters to the kernel function. The kernel function produces an array that holds the partial sums of the individual blocks, indexed by their block IDs. This array is copied back to the host in the partial_sum array. The final answer of the dot product is calculated by iterating over this partial_sum array, using a for loop that runs from zero to the number of blocks per grid. The final dot product is stored in h_c. To check whether the calculated dot product is correct, the following code can be added to the main function:

printf("The computed dot product is: %f\n", h_c);
#define cpu_sum(x) (x*(x+1))
if (h_c == cpu_sum((float)(N - 1)))
{
printf("The dot product computed by GPU is correct\n");
}
else
{
printf("Error in dot product computation");
}
// free memory on the gpu side
cudaFree(d_a);
cudaFree(d_b);
cudaFree(d_partial_sum);
// free memory on the cpu side
free(h_a);
free(h_b);
free(partial_sum);

The computed answer is verified against one calculated mathematically. For the two input arrays used here, where one array has the values from 0 to N-1 and the second has the constant value 2, the dot product is 2 * (0 + 1 + ... + (N-1)) = N*(N-1), which is exactly what the cpu_sum macro evaluates to for x = N - 1. For N equal to 1,024, this gives 1024 * 1023 = 1,047,552. We print the dot product computed on the GPU, along with a message indicating whether it was calculated correctly or not. The host and device memory is freed up in the end. The output of the program is as follows: