
Thread-related properties
As seen in earlier sections, blocks and threads can be multidimensional, so it is useful to know how many threads and blocks can be launched in each dimension. There are also limits on the number of threads per multiprocessor and the number of threads per block, which can be found from the maxThreadsPerMultiProcessor and maxThreadsPerBlock properties. These limits are very important when configuring kernel launch parameters: if you launch more threads per block than the maximum allowed for your device, your program can crash. The maximum number of threads per block in each dimension is given by maxThreadsDim and, in the same way, the maximum number of blocks per grid in each dimension is given by maxGridSize. Both of them return an array of three values, which show the maximum value in the x, y, and z dimensions, respectively. The following code snippet shows how to use thread-related properties from the CUDA code:
printf(" Maximum number of threads per multiprocessor: %d\n", device_Property.maxThreadsPerMultiProcessor);
printf(" Maximum number of threads per block: %d\n", device_Property.maxThreadsPerBlock);
printf(" Max dimension size of a thread block (x,y,z): (%d, %d, %d)\n",
device_Property.maxThreadsDim[0],
device_Property.maxThreadsDim[1],
device_Property.maxThreadsDim[2]);
printf(" Max dimension size of a grid size (x,y,z): (%d, %d, %d)\n",
device_Property.maxGridSize[0],
device_Property.maxGridSize[1],
device_Property.maxGridSize[2]);
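These limits are typically applied when choosing the kernel launch configuration. The following minimal sketch clamps the block size to maxThreadsPerBlock before launching; the kernel name myKernel and the problem size N are assumed purely for illustration and are not part of the listing above:
// Hypothetical example: choose a safe launch configuration from the device limits
int N = 1 << 20;              // problem size, assumed for illustration
int threads_per_block = 256;  // desired block size
if (threads_per_block > device_Property.maxThreadsPerBlock)
  threads_per_block = device_Property.maxThreadsPerBlock;  // never exceed the device limit
int blocks_per_grid = (N + threads_per_block - 1) / threads_per_block;  // enough blocks to cover N elements
myKernel <<<blocks_per_grid, threads_per_block>>> (/* kernel arguments */);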
There are many other properties available in the cudaDeviceProp structure. You can check the CUDA programming guide for details of the other properties. The output from all the preceding code sections combined, executed on an NVIDIA GeForce 940MX GPU with CUDA 9.0, is as follows:

One question you might ask is why you should be interested in knowing the device properties. The answer is that they help you choose the GPU with the most multiprocessors when multiple GPU devices are present. If the kernel in your application needs close interaction with the CPU, you might want it to run on an integrated GPU that shares system memory with the CPU. These properties also help you find the number of blocks and the number of threads per block available on your device, which helps with the configuration of kernel launch parameters. To show one use of device properties, suppose you have an application that requires double-precision floating-point operations. Not all GPU devices support them. To find out whether your device supports double-precision floating-point operations, and to set that device for your application, the following code can be used:
#include <stdio.h>
#include <string.h>
#include <cuda_runtime.h>
// Main Program
int main(void)
{
  int device;
  cudaDeviceProp device_property;
  cudaGetDevice(&device);
  printf("ID of device: %d\n", device);
  // Fill a property structure with the minimum compute capability required (1.3)
  memset(&device_property, 0, sizeof(cudaDeviceProp));
  device_property.major = 1;
  device_property.minor = 3;
  // Ask CUDA for the device that best matches these properties
  cudaChooseDevice(&device, &device_property);
  printf("ID of device which supports double precision is: %d\n", device);
  cudaSetDevice(device);
  return 0;
}
This code uses two properties available in the cudaDeviceProp structure that help in identifying whether the device supports double-precision operations. These two properties are major and minor. The CUDA documentation tells us that devices with compute capability 1.3 or higher, that is, with major greater than 1, or with major equal to 1 and minor at least 3, support double-precision operations. So, the program's device_property structure is filled with these two values. CUDA also provides the cudaChooseDevice API, which returns the device that most closely matches the given properties. That device is then selected for your application using the cudaSetDevice API. If more than one device is present in the system, you can also iterate over all of the devices in a for loop and check the properties of each one individually.
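As an illustration of iterating over the devices yourself, for example, to pick the device with the most multiprocessors as mentioned earlier, the cudaGetDeviceCount and cudaGetDeviceProperties APIs can be used. The following is a minimal sketch of this idea, not code from this chapter:
#include <stdio.h>
#include <cuda_runtime.h>
// Sketch: select the device with the largest number of multiprocessors
int main(void)
{
  int count = 0, best_device = 0, max_sm = 0;
  cudaGetDeviceCount(&count);
  for (int i = 0; i < count; i++)
  {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, i);
    if (prop.multiProcessorCount > max_sm)
    {
      max_sm = prop.multiProcessorCount;
      best_device = i;
    }
  }
  printf("Device with the most multiprocessors: %d (%d multiprocessors)\n", best_device, max_sm);
  cudaSetDevice(best_device);
  return 0;
}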
Though simple, this section is very important for finding out which applications your GPU device can support and which it cannot.