
Global memory

All blocks have read and write access to global memory. This memory is slow, but it can be accessed from anywhere in your device code. Caching is used to speed up access to global memory. All memory allocated using cudaMalloc is global memory. The following simple example demonstrates how you can use global memory from your program:

#include <stdio.h>
#define N 5

// Kernel: each thread writes its own thread ID into global memory.
__global__ void gpu_global_memory(int *d_a)
{
    d_a[threadIdx.x] = threadIdx.x;
}

int main(int argc, char **argv)
{
    int h_a[N];
    int *d_a;

    // Allocate N integers in global memory on the device.
    cudaMalloc((void **)&d_a, sizeof(int) * N);
    cudaMemcpy(d_a, h_a, sizeof(int) * N, cudaMemcpyHostToDevice);

    // Launch one block of N threads; each thread fills one element.
    gpu_global_memory<<<1, N>>>(d_a);
    cudaMemcpy(h_a, d_a, sizeof(int) * N, cudaMemcpyDeviceToHost);

    printf("Array in Global Memory is: \n");
    for (int i = 0; i < N; i++)
    {
        printf("At Index: %d --> %d \n", i, h_a[i]);
    }

    cudaFree(d_a); // Release the global-memory allocation.
    return 0;
}
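
In a real application, you would also check the status codes returned by the CUDA runtime calls, since cudaMalloc and cudaMemcpy can fail at runtime (for example, when device memory is exhausted). A minimal sketch of such checking is shown below; the CUDA_CHECK macro is a helper defined here for illustration and is not part of the CUDA toolkit:

#include <stdio.h>
#include <stdlib.h>

// Illustrative helper macro (not part of the CUDA toolkit): aborts with a
// readable message if a CUDA runtime call does not return cudaSuccess.
#define CUDA_CHECK(call)                                  \
    do {                                                  \
        cudaError_t err = (call);                         \
        if (err != cudaSuccess)                           \
        {                                                 \
            fprintf(stderr, "CUDA error: %s\n",           \
                    cudaGetErrorString(err));             \
            exit(EXIT_FAILURE);                           \
        }                                                 \
    } while (0)

// Usage with the calls from the example above:
// CUDA_CHECK(cudaMalloc((void **)&d_a, sizeof(int) * N));
// CUDA_CHECK(cudaMemcpy(d_a, h_a, sizeof(int) * N, cudaMemcpyHostToDevice));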

This code demonstrates how you can write to global memory from your device code. The memory is allocated using cudaMalloc from the host code, and a pointer to this array is passed as a parameter to the kernel function. The kernel function populates this memory chunk with the thread ID values, and the array is copied back to host memory for printing. Since thread i writes the value i into element i, the program prints each index paired with its own value:

At Index: 0 --> 0
At Index: 1 --> 1
At Index: 2 --> 2
At Index: 3 --> 3
At Index: 4 --> 4

As we are using global memory, this operation will be slow. Advanced concepts for speeding it up will be explained later on; one of them, coalesced access, is previewed in the sketch below.
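
Global memory performs best when the threads of a warp read or write consecutive addresses, because the hardware can then combine, or coalesce, those accesses into a few wide transactions. The two kernels below are a minimal sketch of this contrast; the stride parameter is illustrative, and both kernels assume the arrays are large enough for every computed index:

// Coalesced: consecutive threads access consecutive elements,
// so a warp's loads and stores combine into few transactions.
__global__ void copy_coalesced(const int *in, int *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i];
}

// Strided: consecutive threads access elements far apart, which
// typically splits a warp's accesses into many transactions.
__global__ void copy_strided(const int *in, int *out, int stride)
{
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    out[i] = in[i];
}

In the next section, we will explain local memory and registers, which are private to each thread.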