
Matrix multiplication

The second most important mathematical operation performed on a GPU using CUDA is matrix multiplication. It is a computationally expensive operation when the matrices are very large. Keep in mind that for matrix multiplication, the number of columns in the first matrix must be equal to the number of rows in the second matrix, and that matrix multiplication is not a commutative operation. To avoid complexity, in this example we take two square matrices of the same size. If you are familiar with the mathematics of matrix multiplication, you may recall that each row of the first matrix is multiplied with every column of the second matrix, and this is repeated for all rows of the first matrix. It is shown as follows:
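In symbols, with zero-based indices matching the kernels below, each element of the product C = A x B of two size x size matrices is the dot product of one row of A and one column of B:

$$C_{row,\,col} = \sum_{k=0}^{size-1} A_{row,\,k}\; B_{k,\,col}$$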

The same data is reused many times, so this is an ideal case for using shared memory. In this section, we will write two separate kernel functions, with and without shared memory. You can compare the execution times of the two kernels to get an idea of how much shared memory improves the performance of the program. We will start by writing a kernel function that does not use shared memory:


#include <stdio.h>
#include <iostream>
#include <cuda.h>
#include <cuda_runtime.h>
#include <math.h>

//This defines the dimensions of a small square tile, or the thread dimensions in one block
#define TILE_SIZE 2

//Matrix multiplication kernel without shared memory
__global__ void gpu_Matrix_Mul_nonshared(float *d_a, float *d_b, float *d_c, const int size)
{
  //Calculate the row and column of the output element handled by this thread
  int col = TILE_SIZE * blockIdx.x + threadIdx.x;
  int row = TILE_SIZE * blockIdx.y + threadIdx.y;

  //Dot product of one row of d_a with one column of d_b
  float sum = 0.0f;
  for (int k = 0; k < size; k++)
  {
    sum += d_a[row * size + k] * d_b[k * size + col];
  }
  d_c[row * size + col] = sum;
}

Matrix multiplication is performed using two-dimensional threads. If we launched a single two-dimensional block with one thread per element of the output matrix, the matrix size would be limited by the maximum number of threads per block, which is 512 on older GPUs and 1,024 on current ones. So, for larger matrices, we need to launch multiple blocks, each containing fewer threads than this limit. To accomplish this, the output matrix is divided into small square blocks (tiles) with dimensions of TILE_SIZE in both directions. Each thread in a block calculates one element of such a tile. The number of blocks in each direction is calculated by dividing the size of the matrix by TILE_SIZE; for example, a 64 x 64 matrix with TILE_SIZE equal to 2 needs a 32 x 32 grid of 2 x 2 blocks.

If you understand this, then calculating the row and column index of the output element is easy. It is similar to what we have done up to now, with blockDim.x being equal to TILE_SIZE. Every element of the output is the dot product of one row of the first matrix and one column of the second matrix. Both matrices have the same size, so the dot product has to be performed over a number of elements equal to the size variable, which is why the for loop in the kernel function runs from 0 to size.

To calculate the individual indexes of both matrices, consider that each matrix is stored as a linear array in memory in row-major fashion. This means that all elements of the first row are placed in consecutive memory locations, and then the rows are placed one after the other, as follows:

The index of an element in the linear array is its row ID multiplied by the size of the matrix, plus its column ID. For example, the index of M1,0 in a 2 x 2 matrix is 2, as its row ID is 1, the matrix size is 2, and the column ID is 0. This method is used to calculate element indexes in both matrices.
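As a quick illustration, here is a plain CPU version of the same computation on row-major linear arrays. This function (we call it cpu_matrix_mul here; it is not part of the CUDA listings) can also be used to verify the results returned by the kernels:

//Reference CPU implementation on row-major linear arrays (for illustration and verification only)
void cpu_matrix_mul(const float *h_a, const float *h_b, float *h_result, const int size)
{
  for (int row = 0; row < size; row++)
  {
    for (int col = 0; col < size; col++)
    {
      float sum = 0.0f;
      for (int k = 0; k < size; k++)
      {
        //Element [row][k] of the first matrix and [k][col] of the second matrix
        sum += h_a[row * size + k] * h_b[k * size + col];
      }
      //Linear index = row ID * size + column ID
      h_result[row * size + col] = sum;
    }
  }
}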

To calculate the element at [row, col] of the resultant matrix, the index into the first matrix will be equal to row*size + k, and for the second matrix it will be k*size + col. This is a very simple kernel function, but there is a large amount of data reuse in matrix multiplication, and this function does not take advantage of shared memory. So, we will now modify the kernel function to make use of shared memory. The modified kernel function is shown as follows:

//Matrix multiplication kernel using shared memory
__global__ void gpu_Matrix_Mul_shared(float *d_a, float *d_b, float *d_c, const int size)
{
  //Shared memory tiles holding sub-blocks of the two input matrices
  __shared__ float shared_a[TILE_SIZE][TILE_SIZE];
  __shared__ float shared_b[TILE_SIZE][TILE_SIZE];

  //Calculate the row and column of the output element handled by this thread
  int col = TILE_SIZE * blockIdx.x + threadIdx.x;
  int row = TILE_SIZE * blockIdx.y + threadIdx.y;

  float sum = 0.0f;
  //Loop over all the tiles needed to compute one output element
  for (int i = 0; i < size / TILE_SIZE; i++)
  {
    //Each thread loads one element of each tile into shared memory
    shared_a[threadIdx.y][threadIdx.x] = d_a[row * size + (i * TILE_SIZE + threadIdx.x)];
    shared_b[threadIdx.y][threadIdx.x] = d_b[(i * TILE_SIZE + threadIdx.y) * size + col];
    __syncthreads(); //wait until both tiles are completely loaded

    //Partial dot product using the data held in shared memory
    for (int j = 0; j < TILE_SIZE; j++)
    {
      sum += shared_a[threadIdx.y][j] * shared_b[j][threadIdx.x];
    }
    __syncthreads(); //wait before the tiles are overwritten in the next iteration
  }
  d_c[row * size + col] = sum;
}

Two shared memory arrays, each with a size equal to the small square tile (TILE_SIZE x TILE_SIZE), are defined for storing data that is reused. Row and column indexes are calculated in the same way as before. Inside the outer for loop, each thread first loads one element of each tile into shared memory. A __syncthreads() call is then included so that reads from shared memory happen only after all threads have finished writing to it. The inner for loop accumulates the partial dot product from the tiles, and a second __syncthreads() prevents any thread from overwriting the tiles before the others have finished using them. Because most reads are served from shared memory, this considerably reduces traffic to global memory, which improves the performance of the program for larger matrix dimensions. The main function of this program is shown as follows:

int main()
{
  //Define the size of the matrix
  const int size = 4;
  //Define host and device arrays
  float h_a[size][size], h_b[size][size], h_result[size][size];
  float *d_a, *d_b, *d_result; //device arrays
  //Fill the host arrays
  for (int i = 0; i < size; i++)
  {
    for (int j = 0; j < size; j++)
    {
      h_a[i][j] = i;
      h_b[i][j] = j;
    }
  }

  //Allocate device memory
  cudaMalloc((void **)&d_a, size * size * sizeof(float));
  cudaMalloc((void **)&d_b, size * size * sizeof(float));
  cudaMalloc((void **)&d_result, size * size * sizeof(float));
  //Copy host arrays to device arrays
  cudaMemcpy(d_a, h_a, size * size * sizeof(float), cudaMemcpyHostToDevice);
  cudaMemcpy(d_b, h_b, size * size * sizeof(float), cudaMemcpyHostToDevice);
  //Define the grid and block dimensions and call the kernel
  dim3 dimGrid(size / TILE_SIZE, size / TILE_SIZE, 1);
  dim3 dimBlock(TILE_SIZE, TILE_SIZE, 1);

  gpu_Matrix_Mul_nonshared <<<dimGrid, dimBlock>>> (d_a, d_b, d_result, size);
  //gpu_Matrix_Mul_shared <<<dimGrid, dimBlock>>> (d_a, d_b, d_result, size);

  //Copy the result back to the host
  cudaMemcpy(h_result, d_result, size * size * sizeof(float), cudaMemcpyDeviceToHost);

  return 0;
}

After defining and allocating memory for the host and device arrays, the host arrays are filled with simple values: the row index for h_a and the column index for h_b. These arrays are copied to device memory so that they can be passed to the kernel functions. The number of blocks in the grid and the number of threads per block are defined using the dim3 structure, with the dimensions calculated as explained earlier. You can call either of the two kernels. The calculated answer is copied back to host memory. To display the output on the console, the following code is added to the main function before the return statement:

printf("The result of Matrix multiplication is: \n");

for (int i = 0; i< size; i++)
{
for (int j = 0; j < size; j++)
{
printf("%f ", h_result[i][j]);
}
printf("\n");
}
cudaFree(d_a)
cudaFree(d_b)
cudaFree(d_result)

The memory used to store the matrices in device memory is also freed. With the values used here (h_a[i][j] = i and h_b[i][j] = j), every element of the result equals size * i * j. The output on the console is as follows:
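If you want to measure how much the shared memory version helps, as suggested at the beginning of this section, one common approach is to time each kernel with CUDA events. The following is only a sketch to illustrate the idea; it assumes the arrays and launch configuration already defined in the main function above:

//Illustrative timing sketch using CUDA events (not part of the original listing)
cudaEvent_t start, stop;
float ms = 0.0f;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
gpu_Matrix_Mul_nonshared <<<dimGrid, dimBlock>>> (d_a, d_b, d_result, size);
cudaEventRecord(stop);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&ms, start, stop);
printf("Non-shared kernel time: %f ms\n", ms);

cudaEventRecord(start);
gpu_Matrix_Mul_shared <<<dimGrid, dimBlock>>> (d_a, d_b, d_result, size);
cudaEventRecord(stop);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&ms, start, stop);
printf("Shared kernel time: %f ms\n", ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);

For a tiny 4 x 4 matrix the difference will be negligible; the benefit of shared memory shows up as the matrix size grows.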

This section demonstrated CUDA programs for two important mathematical operations used in a wide range of applications. It also explained the use of shared memory and multidimensional threads.