Hands-On GPU-Accelerated Computer Vision with OpenCV and CUDA

Constant memory

The CUDA language makes another type of memory available to the programmer, known as constant memory. NVIDIA hardware provides 64 KB of constant memory, which is used to store data that remains constant throughout the execution of a kernel. Constant memory is cached on-chip, so using it instead of global memory can speed up execution and reduce the required bandwidth to the device's global memory. In this section, we will see how to use constant memory in CUDA programs. As an example, we take a program that evaluates the simple expression a*x + b, where a and b are constants. The kernel function code for this program is shown as follows:

#include "stdio.h"
#include<iostream>
#include <cuda.h>
#include <cuda_runtime.h>

//Defining two constants
__constant__ int constant_f;
__constant__ int constant_g;
#define N 5

//Kernel function for using constant memory
__global__ void gpu_constant_memory(float *d_in, float *d_out)
{
  //Getting thread index for current kernel
  int tid = threadIdx.x;
  d_out[tid] = constant_f * d_in[tid] + constant_g;
}

Constant memory variables are defined using the __constant__ keyword. In the preceding code, two int variables, constant_f and constant_g, are defined as constants that will not change during the kernel's execution. The second thing to note is that constant variables must be defined at file scope, outside the kernel function; the kernel then reads them directly, without receiving them as arguments. The kernel function computes a simple mathematical operation using these two constants. There is a special way in which constant variables are copied to constant memory from the main function. This is shown in the following code:

int main(void)
{
  //Defining Arrays for host
  float h_in[N], h_out[N];
  //Defining Pointers for device
  float *d_in, *d_out;
  int h_f = 2;
  int h_g = 20;

  //Allocate the memory on the device
  cudaMalloc((void**)&d_in, N * sizeof(float));
  cudaMalloc((void**)&d_out, N * sizeof(float));

  //Initializing Array
  for (int i = 0; i < N; i++)
  {
    h_in[i] = i;
  }

  //Copy Array from host to device
  cudaMemcpy(d_in, h_in, N * sizeof(float), cudaMemcpyHostToDevice);
  //Copy constants to constant memory
  cudaMemcpyToSymbol(constant_f, &h_f, sizeof(int), 0, cudaMemcpyHostToDevice);
  cudaMemcpyToSymbol(constant_g, &h_g, sizeof(int));

  //Calling kernel with one block and N threads per block
  gpu_constant_memory <<<1, N>>> (d_in, d_out);

  //Copying result back to host from device memory
  cudaMemcpy(h_out, d_out, N * sizeof(float), cudaMemcpyDeviceToHost);

  //Printing result on console
  printf("Use of Constant memory on GPU \n");
  for (int i = 0; i < N; i++)
  {
    printf("The expression for input %f is %f\n", h_in[i], h_out[i]);
  }

  //Freeing device memory
  cudaFree(d_in);
  cudaFree(d_out);
  return 0;
}

In the main function, the h_f and h_g constants are defined and initialized on the host, and then copied to constant memory. The cudaMemcpyToSymbol instruction is used to copy these constants to constant memory for kernel execution. It takes five arguments: the first is the destination symbol, which was defined using the __constant__ keyword; the second is the host address; the third is the size of the transfer; the fourth is the memory offset, which is taken as zero here; and the fifth is the direction of the transfer, which is host to device. The last two arguments are optional, which is why they are omitted in the second call to cudaMemcpyToSymbol.
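The CUDA runtime also provides a cudaMemcpyFromSymbol function for the reverse direction. As a minimal sketch (not part of the book's example), the uploaded value could be read back on the host to verify the transfer:

//Reading a constant back to the host to verify the transfer
int check_f = 0;
cudaMemcpyFromSymbol(&check_f, constant_f, sizeof(int));
printf("constant_f on the device is %d\n", check_f);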

With a = 2 and b = 20, every output element is 2*x + 20, so the output of the code is as follows:
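Use of Constant memory on GPU
The expression for input 0.000000 is 20.000000
The expression for input 1.000000 is 22.000000
The expression for input 2.000000 is 24.000000
The expression for input 3.000000 is 26.000000
The expression for input 4.000000 is 28.000000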

One thing to note is that constant memory is read-only from the device: a kernel cannot write to it, although the host can update it between kernel launches with cudaMemcpyToSymbol. This example is meant only to demonstrate how constant memory is used from a CUDA program; it is not an optimal use of constant memory. As discussed earlier, constant memory helps in conserving memory bandwidth to global memory. To understand why, you have to understand the concept of a warp. A warp is a group of 32 threads that execute in lockstep. A single read from constant memory can be broadcast to a half-warp of 16 threads, saving up to 15 separate memory transactions. Also, constant memory is cached, so subsequent accesses to the same or nearby locations do not incur additional memory transactions. When the 16 threads of a half-warp read the same memory location, the use of constant memory saves a lot of execution time. However, if the threads of a half-warp read completely different locations, the reads are serialized, and the use of constant memory may actually increase execution time. So, constant memory should be used with proper care.
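To make the access-pattern point concrete, the following two hypothetical kernels (they are not part of the book's example, and the constant array c_table is assumed here purely for illustration) show the two extremes:

//A constant array, assumed here for illustration
__constant__ float c_table[N];

//Fast pattern: every thread reads the same element, so one read
//from constant memory is broadcast to the whole half-warp
__global__ void broadcast_read(float *d_out)
{
  int tid = threadIdx.x;
  d_out[tid] = c_table[0];
}

//Slow pattern: every thread reads a different element, so the
//half-warp's reads from constant memory are serialized
__global__ void divergent_read(float *d_out)
{
  int tid = threadIdx.x;
  d_out[tid] = c_table[tid];
}

For an access pattern like the one in divergent_read, ordinary global memory would usually be the better choice.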