Concurrent Kernels and Multiple GPUs
====================================

Page Locked Host Memory
-----------------------

In contrast to regular pageable host memory,
the runtime provides functions to allocate (and free)
*page locked* memory.  Another name for page locked memory
is *pinned* memory.  Using page locked memory has several benefits:

* Copies between page locked memory and device memory
  can be performed concurrently with kernel execution.

* Page locked host memory can be mapped into the address space
  of the device, eliminating the need to copy;
  this is called *zero copy*.

* Bandwidth between page locked host memory and the device
  may be higher.

Page locked host memory is a scarce resource.
The NVIDIA CUDA Best Practices Guide assigns a low priority
to zero-copy operations (i.e., mapping host memory to the device).

To allocate page locked memory, we use ``cudaHostAlloc()``
and to free the memory, we call ``cudaFreeHost()``.

To map host memory on the device:

* The flag ``cudaHostAllocMapped`` must be given to
  ``cudaHostAlloc()`` when allocating host memory.

* A call to ``cudaHostGetDevicePointer()`` maps the host memory
  to the device.

If all goes well, then no copies from host to device memory
and from device to host memory are needed.
Not all devices support pinned memory;
it is recommended practice to check the device properties
(see the ``deviceQuery`` program in the SDK).

Next we illustrate how a programmer may use pinned memory
with a simple program.  A run of this program is below.

::

   $ /tmp/pinnedmemoryuse
   Tesla K20c supports mapping host memory.

   The error code of cudaHostAlloc : 0

   Squaring 32 numbers 1 2 3 4 5 6 7 8 9 10 11 12 13 \
   14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32...

   The fail code of cudaHostGetDevicePointer : 0

   After squaring 32 numbers 1 4 9 16 25 36 49 64 81 \
   100 121 144 169 196 225 256 289 324 361 400 441 484 \
   529 576 625 676 729 784 841 900 961 1024...
   $

The execution of the program is defined by the code
in ``pinnedmemoryuse.cu``, listed below.
First we check whether the device supports pinned memory.

::

   #include <stdio.h>

   int checkDeviceProp ( cudaDeviceProp p );
   /*
    * Returns 0 if the device does not support mapping
    * host memory, returns 1 otherwise. */

   int checkDeviceProp ( cudaDeviceProp p )
   {
      int support = p.canMapHostMemory;

      if(support == 0)
         printf("%s does not support mapping host memory.\n",
                p.name);
      else
         printf("%s supports mapping host memory.\n",p.name);

      return support;
   }

To illustrate pinned memory, we use the following simple kernel.

::

   __global__ void Square ( float *x )
   /*
    * A kernel where the i-th thread squares x[i]
    * and stores the result in x[i]. */
   {
      int i = blockIdx.x*blockDim.x + threadIdx.x;

      x[i] = x[i]*x[i];
   }

This kernel is launched in the main program below.

::

   void square_with_pinned_memory ( int n );
   /*
    * Illustrates the use of pinned memory to square
    * a sequence of n numbers. */

   int main ( int argc, char* argv[] )
   {
      cudaDeviceProp dev;
      cudaGetDeviceProperties(&dev,0);
      int success = checkDeviceProp(dev);

      if(success != 0) square_with_pinned_memory(32);

      return 0;
   }
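For contrast, the sketch below squares a sequence of numbers with
ordinary pageable host memory and explicit copies; it is not part of
``pinnedmemoryuse.cu`` and the name ``square_with_copies`` is chosen
only for this illustration.  It performs the host-to-device and
device-to-host transfers that the mapped, zero copy version
which follows avoids.

::

   #include <stdlib.h>

   void square_with_copies ( int n )
   /*
    * Sketch for contrast with the zero copy version:
    * squares a sequence of n numbers with pageable host
    * memory and explicit copies to and from the device. */
   {
      size_t sz = n*sizeof(float);
      float *xhost = (float*) malloc(sz);
      for(int i=0; i<n; i++) xhost[i] = (float) (i+1);

      float *xdevice;
      cudaMalloc((void**)&xdevice,sz);
      cudaMemcpy(xdevice,xhost,sz,cudaMemcpyHostToDevice);

      Square<<<1,n>>>(xdevice);
      cudaDeviceSynchronize();

      cudaMemcpy(xhost,xdevice,sz,cudaMemcpyDeviceToHost);

      cudaFree(xdevice);
      free(xhost);
   }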
The function which allocates the pinned memory is below.

::

   void square_with_pinned_memory ( int n )
   {
      float *xhost;
      size_t sz = n*sizeof(float);
      int error = cudaHostAlloc((void**)&xhost,
                                sz,cudaHostAllocMapped);
      printf("\nThe error code of cudaHostAlloc : %d\n",
             error);

      for(int i=0; i<n; i++) xhost[i] = (float) (i+1);
      printf("\nSquaring %d numbers",n);
      for(int i=0; i<n; i++) printf(" %d",(int) xhost[i]);
      printf("...\n");

      float *xdevice;
      int fail = cudaHostGetDevicePointer
                    ((void**)&xdevice,(void*)xhost,0);
      printf("\nThe fail code of cudaHostGetDevicePointer : %d\n",
             fail);

      Square<<<1,n>>>(xdevice);
      cudaDeviceSynchronize();

      printf("\nAfter squaring %d numbers",n);
      for(int i=0; i<n; i++) printf(" %d",(int) xhost[i]);
      printf("...\n");

      cudaFreeHost(xhost);
   }

Concurrent Kernels
------------------

Devices of compute capability 2.0 and higher can execute multiple
kernels concurrently, provided the kernels are launched in different
streams.  A stream is a sequence of commands that execute in order;
commands issued in different streams may run concurrently.
In outline, launching one kernel per stream on successive chunks of
the arrays ``xdevice`` and ``ydevice`` takes the following form,
where the names ``nstreams``, ``chunk``, and ``kernel`` and the
launch configuration stand in for the particular application.

::

   cudaStream_t s[nstreams];
   for(int i=0; i<nstreams; i++) cudaStreamCreate(&s[i]);

   for(int i=0; i<nstreams; i++)
      kernel<<<1,chunk,0,s[i]>>>
         (&xdevice[i*chunk],&ydevice[i*chunk]);

   for(int i=0; i<nstreams; i++) cudaStreamDestroy(s[i]);

Multiple GPUs
-------------

The program below counts the devices in the system
and prints some of their properties.

::

   #include <stdio.h>

   void printDeviceProp ( cudaDeviceProp p )
   /*
    * prints some device properties */
   {
      printf(" name : %s \n",p.name);
      printf(" number of multiprocessors : %d \n",
             p.multiProcessorCount);
   }

   int main ( int argc, char* argv[] )
   {
      int deviceCount;
      cudaGetDeviceCount(&deviceCount);
      printf("number of devices : %d\n",deviceCount);

      for(int d = 0; d < deviceCount; d++)
      {
         cudaDeviceProp dev;
         cudaGetDeviceProperties(&dev,d);
         printf("graphics card %d :\n",d);
         printDeviceProp(dev);
      }
      return 0;
   }

Chapter 8 of the NVIDIA CUDA Best Practices Guide
describes multi-GPU programming.
To work with *p* GPUs concurrently, the CPU can use

* *p* lightweight threads (Pthreads, OpenMP, etc.); or

* *p* heavyweight threads (or processes) with MPI.

The command to select a GPU is ``cudaSetDevice()``.
All inter-GPU communication happens through the host.
See the ``simpleMultiGPU`` example of the GPU Computing SDK.
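As a minimal sketch of the lightweight-thread approach (an
illustration assumed here, not the ``simpleMultiGPU`` code), the
program below gives every available GPU its own OpenMP host thread,
which selects its device with ``cudaSetDevice()`` and launches a
kernel on it.  The ``Square`` kernel from the pinned memory example
is assumed to be defined in the same source file, and the copies of
data to and from each GPU are omitted.

::

   #include <stdio.h>
   #include <omp.h>

   int main ( int argc, char* argv[] )
   {
      int deviceCount;
      cudaGetDeviceCount(&deviceCount);
      omp_set_num_threads(deviceCount);

      /* one OpenMP host thread drives one GPU */
      #pragma omp parallel
      {
         int d = omp_get_thread_num();
         cudaSetDevice(d);  /* bind this host thread to GPU d */

         /* each GPU works on its own buffer;
            host copies are omitted in this sketch */
         float *xdevice;
         size_t sz = 32*sizeof(float);
         cudaMalloc((void**)&xdevice,sz);
         cudaMemset(xdevice,0,sz);

         Square<<<1,32>>>(xdevice);
         cudaDeviceSynchronize();

         printf("thread %d ran a kernel on GPU %d\n",d,d);
         cudaFree(xdevice);
      }
      return 0;
   }

Such a file can be compiled with ``nvcc`` by passing
``-Xcompiler -fopenmp``, so the host compiler accepts
the OpenMP pragmas.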