
Two-variable addition program in CUDA C
In the simple Hello, CUDA! code seen in Chapter 1, Introducing CUDA and Getting Started with CUDA, the device function was empty. It had nothing to do. This section explains a simple addition program that adds two variables on the device. Though it does not exploit any data parallelism of the device, it is very useful for demonstrating important programming concepts of CUDA C. First, we will see how to write a kernel function for adding two variables.
The code for the kernel function is shown here:
#include <stdio.h>
#include <iostream>
#include <cuda.h>
#include <cuda_runtime.h>
//Definition of kernel function to add two variables
__global__ void gpuAdd(int d_a, int d_b, int *d_c)
{
*d_c = d_a + d_b;
}
The gpuAdd function looks very similar to a normal add function implemented in ANSI C. It takes two integer variables, d_a and d_b, as inputs and stores the sum at the memory location indicated by the third argument, the integer pointer d_c. The return type of the device function is void because it stores the answer in the memory location pointed to by the device pointer rather than explicitly returning a value. Now we will see how to write the main function for this code. The code for the main function is shown here:
int main(void)
{
//Defining host variable to store answer
int h_c;
//Defining device pointer
int *d_c;
//Allocating memory for device pointer
cudaMalloc((void**)&d_c, sizeof(int));
//Kernel call by passing 1 and 4 as inputs and storing answer in d_c
//<<<1,1>>> means 1 block is executed with 1 thread per block
gpuAdd <<<1, 1>>> (1, 4, d_c);
//Copy result from device memory to host memory
cudaMemcpy(&h_c, d_c, sizeof(int), cudaMemcpyDeviceToHost);
printf("1 + 4 = %d\n", h_c);
//Free up memory
cudaFree(d_c);
return 0;
}
In the main function, the first two lines define the host and device variables. The third line allocates memory for the d_c variable on the device using the cudaMalloc function, which is similar to the malloc function in C. In the fourth line of the main function, gpuAdd is called with 1 and 4 as the two input values and d_c, a device memory pointer, as the output pointer. The peculiar syntax of this call to gpuAdd, also known as a kernel call, is explained in the next section. If the answer computed by gpuAdd needs to be used on the host, then it must be copied from the device's memory to the host's memory, which is done by the cudaMemcpy function. This answer is then printed using the printf function. The penultimate line frees the memory used on the device by calling the cudaFree function. It is very important to explicitly free all memory used on the device from the program; otherwise, you might run out of device memory at some point. The lines that start with // are comments for better code readability and are ignored by the compiler.
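The sample above ignores the status codes that CUDA API functions return. Every runtime call such as cudaMalloc, cudaMemcpy, and cudaFree returns a cudaError_t value, which can be turned into a readable message with cudaGetErrorString. The following is a hedged sketch of the same program with basic error checking; the CHECK macro is our own illustrative helper, not part of the CUDA API:

```cuda
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// Illustrative helper macro (our own, not part of CUDA):
// prints the CUDA error string and aborts if a call fails
#define CHECK(call)                                                   \
do {                                                                  \
    cudaError_t err = (call);                                         \
    if (err != cudaSuccess) {                                         \
        fprintf(stderr, "CUDA error %s at %s:%d\n",                   \
                cudaGetErrorString(err), __FILE__, __LINE__);         \
        exit(EXIT_FAILURE);                                           \
    }                                                                 \
} while (0)

__global__ void gpuAdd(int d_a, int d_b, int *d_c)
{
    *d_c = d_a + d_b;
}

int main(void)
{
    int h_c;
    int *d_c;
    CHECK(cudaMalloc((void**)&d_c, sizeof(int)));
    gpuAdd <<<1, 1>>> (1, 4, d_c);
    CHECK(cudaGetLastError()); // reports kernel launch failures
    CHECK(cudaMemcpy(&h_c, d_c, sizeof(int), cudaMemcpyDeviceToHost));
    printf("1 + 4 = %d\n", h_c);
    CHECK(cudaFree(d_c));
    return 0;
}
```

A kernel launch itself returns nothing, which is why cudaGetLastError is called right after it to pick up any launch error.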
The two-variable addition program has two functions, main and gpuAdd. As you can see, gpuAdd is defined using the __global__ keyword, and hence is meant for execution on the device, while the main function is executed on the host. The program adds two variables on the device and prints the following output on the command line:

1 + 4 = 5
We will use a convention in this book that host variables will be prefixed with h_ and device variables will be prefixed with d_. This is not compulsory; it is just done so that readers can understand the concepts easily without any confusion between host and device.
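To see this naming convention with more variables, here is a hedged variation of our own (not from the chapter) in which the two inputs also live in both host and device memory, instead of being passed to the kernel by value:

```cuda
#include <stdio.h>
#include <cuda.h>
#include <cuda_runtime.h>

// Kernel reading its inputs through device pointers
__global__ void gpuAdd(int *d_a, int *d_b, int *d_c)
{
    *d_c = *d_a + *d_b;
}

int main(void)
{
    int h_a = 1, h_b = 4, h_c;   // host variables (h_ prefix)
    int *d_a, *d_b, *d_c;        // device pointers (d_ prefix)
    cudaMalloc((void**)&d_a, sizeof(int));
    cudaMalloc((void**)&d_b, sizeof(int));
    cudaMalloc((void**)&d_c, sizeof(int));
    // Copy the inputs from host memory to device memory
    cudaMemcpy(d_a, &h_a, sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, &h_b, sizeof(int), cudaMemcpyHostToDevice);
    gpuAdd <<<1, 1>>> (d_a, d_b, d_c);
    // Copy the result back from device memory to host memory
    cudaMemcpy(&h_c, d_c, sizeof(int), cudaMemcpyDeviceToHost);
    printf("%d + %d = %d\n", h_a, h_b, h_c);
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);
    return 0;
}
```

The prefixes make it easy to see at a glance which pointers may only be dereferenced on the device and which variables live in ordinary host memory.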
All of the CUDA API functions used here, such as cudaMalloc, cudaMemcpy, and cudaFree, along with other important CUDA programming concepts such as kernel calls, passing parameters to kernels, and memory allocation issues, are discussed in the upcoming sections.
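If you want to try the program yourself, it can be built with NVIDIA's nvcc compiler from the CUDA toolkit. Assuming the source file is saved as gpu_add.cu (the filename is our choice for illustration):

```shell
# Compile with the NVIDIA CUDA compiler (requires the CUDA toolkit)
nvcc gpu_add.cu -o gpu_add
# Run on a machine with a CUDA-capable GPU
./gpu_add
```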