
Elementwise squaring of vectors in CUDA
Now that we have launched N blocks in parallel with one thread in each block, a natural question is whether we can work the other way around. The answer is yes: we can launch a single block with N threads in parallel. To demonstrate this, and to make you more familiar with working with vectors in CUDA, we take a second example: the element-wise squaring of the numbers in an array. We take one array of numbers and return an array that contains the squares of these numbers. The kernel function to find the element-wise square is shown here:
#include <stdio.h>
#include <iostream>
#include <cuda.h>
#include <cuda_runtime.h>
//Defining number of elements in Array
#define N 5
//Kernel function for squaring numbers
__global__ void gpuSquare(float *d_in, float *d_out)
{
    //Get the thread index for the current kernel
    int tid = threadIdx.x; // handle the data at this index
    float temp = d_in[tid];
    d_out[tid] = temp * temp;
}
The gpuSquare kernel function takes pointers to two arrays as arguments. The first pointer, d_in, points to the memory location where the input array is stored, while the second pointer, d_out, points to the memory location where the output will be stored. In this program, instead of launching multiple blocks in parallel, we want to launch multiple threads in parallel, so tid is initialized with a particular thread ID using threadIdx.x. The main function for this program is as follows:
int main(void)
{
    //Define arrays for the host
    float h_in[N], h_out[N];
    //Define pointers for device memory
    float *d_in, *d_out;
    //Allocate memory on the device (GPU)
    cudaMalloc((void**)&d_in, N * sizeof(float));
    cudaMalloc((void**)&d_out, N * sizeof(float));
    //Initialize the host array
    for (int i = 0; i < N; i++)
    {
        h_in[i] = i;
    }
    //Copy the array from host to device
    cudaMemcpy(d_in, h_in, N * sizeof(float), cudaMemcpyHostToDevice);
    //Call the square kernel with one block and N threads per block
    gpuSquare <<<1, N>>>(d_in, d_out);
    //Copy the result back from device memory to the host
    cudaMemcpy(h_out, d_out, N * sizeof(float), cudaMemcpyDeviceToHost);
    //Print the result on the console
    printf("Square of Number on GPU \n");
    for (int i = 0; i < N; i++)
    {
        printf("The square of %f is %f\n", h_in[i], h_out[i]);
    }
    //Free up device memory
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
This main function follows a structure similar to that of the vector addition program. The one difference you will notice is that here we launch a single block with N threads in parallel. The output of the program is as follows:

Whenever you launch N threads in parallel in this way, you should take care that the maximum number of threads per block is limited, to 512 on older devices and 1,024 on most current ones, so the value of N must not exceed this limit. If N is 2,000 and the maximum number of threads per block for your device is 512, then you can't write <<<1, 2000>>>. Instead, you should use something such as <<<4, 500>>>, and the kernel must then compute its index from both the block ID and the thread ID rather than from threadIdx.x alone. The number of blocks and the number of threads per block should be chosen judiciously.
To summarize, we have learned how to work with vectors and how to launch multiple blocks and multiple threads in parallel. We have also seen that performing vector operations on the GPU improves throughput compared to the same operations on the CPU. In the last section of this chapter, we discuss the various parallel communication patterns followed by threads executing in parallel.