Performance Considerations¶
Our goal is to fully occupy the GPU. When launching a kernel, we set the number of blocks and the number of threads per block. For full occupancy, we want the largest possible number of resident blocks and threads. However, the number of threads ready for execution may be limited by the available registers and shared memory.
Dynamic Partitioning of Resources¶
Table 28 compares the capabilities of a Streaming Multiprocessor (SM) for graphics cards of compute capabilities 1.1, 2.0, 3.5, and 6.0: the GeForce 9400M, Tesla C2050/C2070, K20C, and P100, respectively.
compute capability | 1.1 | 2.0 | 3.5 | 6.0
---|---|---|---|---
maximum number of threads per block | 512 | 1,024 | 1,024 | 1,024
maximum number of blocks per SM | 8 | 8 | 16 | 32
warp size | 32 | 32 | 32 | 32
maximum number of warps per SM | 24 | 48 | 64 | 64
maximum number of threads per SM | 768 | 1,536 | 2,048 | 2,048
During runtime, thread slots are partitioned and assigned to thread blocks. Streaming multiprocessors are versatile because they dynamically partition the thread slots among thread blocks: they can either execute many thread blocks of few threads each, or execute a few thread blocks of many threads each. In contrast, a fixed partitioning, where the number of blocks and the number of threads per block are fixed, would lead to wasted thread slots.
We consider the interactions between resource limitations on the C2050. The Tesla C2050/C2070 has 1,536 thread slots per streaming multiprocessor, and \(1,536 = 32 \times 48\).
With 32 threads per block, we would need \(1,536/32 = 48\) blocks. However, we can have at most 8 blocks per streaming multiprocessor. Therefore, to fully utilize both the block slots and the thread slots with 8 blocks, we should have
\(1,536/8 = 192\) threads per block, or
\(192/32 = 6\) warps per block.
On the K20C, the interaction between resource limitations differs. The K20C has 2,048 thread slots per streaming multiprocessor, and \(2,048 = 32 \times 64\). With 32 threads per block, we would need \(2,048/32 = 64\) blocks. However, we can have at most 16 blocks per streaming multiprocessor. Therefore, to fully utilize both the block slots and the thread slots with 16 blocks, we should have
\(2,048/16 = 128\) threads per block, or
\(128/32 = 4\) warps per block.
On the P100, there is another slight difference in the resource limitations, which leads to another outcome: we can now have at most 32 blocks per streaming multiprocessor. To fully utilize the block slots with 32 blocks, we should have
\(2,048/32 = 64\) threads per block, or
\(64/32 = 2\) warps per block.
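These calculations can be scripted. The short host program below is only an illustration: the limits of Table 28 are hard coded and are not queried from the device.

#include <stdio.h>

/* Given the limits of Table 28, compute the number of threads and warps
   per block needed to fill both the block slots and the thread slots
   of one streaming multiprocessor.  The limits are hard coded below;
   on an actual device they can be queried with cudaGetDeviceProperties. */
void full_occupancy ( const char *name, int threadSlots, int maxBlocks )
{
   int threadsPerBlock = threadSlots/maxBlocks;
   int warpsPerBlock = threadsPerBlock/32;   /* the warp size is 32 */

   printf("%s : %d threads or %d warps per block\n",
          name, threadsPerBlock, warpsPerBlock);
}

int main ( void )
{
   full_occupancy("C2050", 1536, 8);    /* 192 threads, 6 warps */
   full_occupancy("K20C ", 2048, 16);   /* 128 threads, 4 warps */
   full_occupancy("P100 ", 2048, 32);   /*  64 threads, 2 warps */

   return 0;
}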
The memory resources of a streaming multiprocessor are compared in Table 29, for the graphics cards with respective compute capabilities 1.1, 2.0, 3.5, and 6.0: GeForce 9400M, Tesla C2050/C2070, K20C, and P100.
compute capability | 1.1 | 2.0 | 3.5 | 6.0
---|---|---|---|---
number of 32-bit registers per SM | 8K | 32K | 64K | 64K
maximum amount of shared memory per SM | 16KB | 48KB | 48KB | 64KB
number of shared memory banks | 16 | 32 | 32 | 32
amount of local memory per thread | 16KB | 512KB | 512KB | 512KB
constant memory size | 64KB | 64KB | 64KB | 64KB
cache working set for constant memory per SM | 8KB | 8KB | 8KB | 10KB
Local memory resides in device memory, so local memory accesses have the same high latency and low bandwidth as global memory.
Registers hold frequently used programmer-declared and compiler-generated variables, to reduce access latency and to conserve memory bandwidth. Scalar variables declared in a kernel, that is, variables that are not arrays, are automatically placed in registers.
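As an illustration of where variables end up, consider the hypothetical kernel fragment below; the names are made up for this example, and the compiler may still promote a small array to registers.

__global__ void placement_example ( const float *x, float *y, int n )
{
   int i = blockIdx.x*blockDim.x + threadIdx.x; /* scalar, in a register */
   float sum = 0.0f;                            /* scalar, in a register */
   float work[8];   /* per-thread array, may be placed in local memory,
                       which resides in device memory */
   if(i < n)
   {
      for(int k=0; k<8; k++) work[k] = x[i] + k;
      for(int k=0; k<8; k++) sum = sum + work[k];
      y[i] = sum;
   }
}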
By dynamically partitioning the registers among blocks, a streaming multiprocessor can accommodate more blocks if they require few registers, and fewer blocks if they require many registers. As with block and thread slots, there is a potential interaction between register limitations and other resource limitations.
Consider the matrix-matrix multiplication example. Assume
the kernel uses 21 registers per thread, and
we have 16-by-16 thread blocks.
How many threads can run on each streaming multiprocessor?
We calculate the number of registers for each block: \(16 \times 16 \times 21 = 5,376\) registers.
We have \(32 \times 1,024 = 32,768\) registers per SM, so \(32,768/5,376 = 6\) blocks fit (rounding down); and \(6 < 8 =\) the maximum number of blocks per SM.
We calculate the number of threads per SM: \(16 \times 16 \times 6 = 1,536\) threads; and we can have at most 1,536 threads per SM.
We now illustrate the performance cliff: a slight increase in the use of one resource can cause a large drop in parallelism. Suppose the kernel uses one extra register, 22 instead of 21. To answer how many threads can now run on each SM, we follow the same calculations.
We calculate the number of registers for each block: \(16 \times 16 \times 22 = 5,632\) registers.
We have \(32 \times 1,024 = 32,768\) registers per SM, so \(32,768/5,632 = 5\) blocks fit (rounding down).
We calculate the number of threads per SM: \(16 \times 16 \times 5 = 1,280\) threads; and with 21 registers we could use all 1,536 threads per SM.
Adding one register reduced the parallelism by 17%, from 1,536 to 1,280 threads.
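The cliff can be checked with a short calculation. The function below hard codes the limits of the C2050 (32K registers, at most 8 blocks, and 1,536 threads per SM); it is only a sketch of the integer arithmetic, not a general occupancy calculator.

#include <stdio.h>

/* number of resident threads per SM, limited by registers,
   block slots, and thread slots (C2050 limits are hard coded) */
int threads_per_sm ( int threadsPerBlock, int regsPerThread )
{
   const int regsPerSM = 32*1024;
   const int maxBlocksPerSM = 8;
   const int maxThreadsPerSM = 1536;

   int blocks = regsPerSM/(threadsPerBlock*regsPerThread);
   if(blocks > maxBlocksPerSM) blocks = maxBlocksPerSM;
   if(blocks*threadsPerBlock > maxThreadsPerSM)
      blocks = maxThreadsPerSM/threadsPerBlock;

   return blocks*threadsPerBlock;
}

int main ( void )
{
   printf("21 registers : %d threads\n", threads_per_sm(256, 21)); /* 1536 */
   printf("22 registers : %d threads\n", threads_per_sm(256, 22)); /* 1280 */

   return 0;
}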
The CUDA toolkit contains a spreadsheet, the CUDA Occupancy Calculator, to compute the occupancy of a kernel on the GPU, as shown in Fig. 120.

Fig. 120 The CUDA occupancy calculator.¶
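Alternatively, the occupancy can be queried at runtime with cudaOccupancyMaxActiveBlocksPerMultiprocessor. In the sketch below, the kernel myKernel and the block size of 256 are placeholders.

#include <cstdio>

__global__ void myKernel ( const float *x, float *y, int n )
{  /* placeholder kernel; any __global__ function can be queried */
   int i = blockIdx.x*blockDim.x + threadIdx.x;
   if(i < n) y[i] = 2.0f*x[i];
}

int main ( void )
{
   const int blockSize = 256;
   int numBlocks;   /* number of resident blocks per SM for myKernel */

   cudaOccupancyMaxActiveBlocksPerMultiprocessor
      (&numBlocks, myKernel, blockSize, 0);

   cudaDeviceProp prop;
   cudaGetDeviceProperties(&prop, 0);

   double occupancy = (double)(numBlocks*blockSize)
                    / prop.maxThreadsPerMultiProcessor;
   printf("occupancy of myKernel : %.2f\n", occupancy);

   return 0;
}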
The Compute Visual Profiler¶
The Compute Visual Profiler is a profiling tool with a graphical user interface to measure performance and to find potential opportunities for optimization.
We look at one of the example projects, matrixMul. The analysis of the kernel matrixMul is displayed in Fig. 121, Fig. 122, Fig. 123, Fig. 124, and Fig. 125.

Fig. 121 GPU time summary of the matrixMul kernel.¶

Fig. 122 Limiting factor identification of the matrixMul kernel, IPC = Instructions Per Cycle.¶

Fig. 123 Memory throughput analysis of the matrixMul kernel.¶

Fig. 124 Instruction throughput analysis of the matrixMul kernel, IPC = Instructions Per Cycle.¶

Fig. 125 Occupancy analysis of the matrixMul kernel.¶
Data Prefetching and Instruction Mix¶
One of the most important resource limitations is access to global memory, with its long latencies. Scheduling other warps while a warp waits for its memory access is powerful, but often not sufficient. A solution complementary to warp scheduling is to prefetch the next data elements while processing the current data elements. Combined with tiling, data prefetching provides extra independent instructions, which enables the scheduling of more warps to tolerate long memory access latencies.
For the tiled matrix-matrix multiplication, the pseudo code below combines prefetching with tiling:
load first tile from global memory into registers;
loop
{
deposit tile from registers to shared memory;
__syncthreads();
load next tile from global memory into registers;
process current tile;
__syncthreads();
}
The prefetching adds independent instructions between loading the data from global memory and processing the data.
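One possible realization of this pseudo code in CUDA C is sketched below, for square n-by-n matrices with n a multiple of the tile size; it is an illustration, not the only way to prefetch.

#define TILE 16

/* C = A*B for n-by-n matrices, with n a multiple of TILE;
   each thread prefetches its next elements of A and B into registers
   while the current tile is processed from shared memory */
__global__ void matmulPrefetch
 ( const float *A, const float *B, float *C, int n )
{
   __shared__ float As[TILE][TILE];
   __shared__ float Bs[TILE][TILE];

   int row = blockIdx.y*TILE + threadIdx.y;
   int col = blockIdx.x*TILE + threadIdx.x;

   float a = A[row*n + threadIdx.x];   /* load first tile of A */
   float b = B[threadIdx.y*n + col];   /* load first tile of B */
   float sum = 0.0f;

   for(int t=0; t<n/TILE; t++)
   {
      As[threadIdx.y][threadIdx.x] = a;   /* deposit tile from registers */
      Bs[threadIdx.y][threadIdx.x] = b;   /* into shared memory */
      __syncthreads();

      if(t < n/TILE - 1)                  /* load next tile into registers */
      {
         a = A[row*n + (t+1)*TILE + threadIdx.x];
         b = B[((t+1)*TILE + threadIdx.y)*n + col];
      }
      for(int k=0; k<TILE; k++)           /* process the current tile */
         sum = sum + As[threadIdx.y][k]*Bs[k][threadIdx.x];
      __syncthreads();
   }
   C[row*n + col] = sum;
}

The launch configuration matches the tiling: dim3 blocks(n/TILE, n/TILE) and dim3 threads(TILE, TILE).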
The data in Table 30 is copied from Table 2 of the CUDA C Programming Guide and lists the throughput of the native arithmetic instructions, in operations per clock cycle per streaming multiprocessor. The fpt in Table 30 stands for floating-point and int for integer.
compute capability | 1.x | 2.0 | 3.5 | 6.0
---|---|---|---|---
32-bit fpt add, multiply, multiply-add | 8 | 32 | 192 | 64
64-bit fpt add, multiply, multiply-add | 1 | 16 | 64 | 4
32-bit int add, logical operation, shift, compare | 8 | 32 | 160 | 128
32-bit fpt reciprocal, sqrt, log, exp, sin, cos | 2 | 4 | 32 | 32
Consider the following code snippet:
for(int k = 0; k < m; k++)
C[i][j] += A[i][k]*B[k][j];
Counting all instructions:
1 loop branch instruction (k < m);
1 loop counter update instruction (k++);
3 address arithmetic instructions ([i][j], [i][k], [k][j]);
2 floating-point arithmetic instructions (+ and *).
Of the 7 instructions, only 2 are floating point.
Loop unrolling reduces the number of loop branch instructions, loop counter updates, and address arithmetic instructions.
Note: with gcc, loop unrolling is requested with the flag -funroll-loops, for example: gcc -O2 -funroll-loops.
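For illustration, the inner product loop can be unrolled by hand as in the fragment below, assuming m is a multiple of 4; in CUDA kernels the same effect can be requested with #pragma unroll placed just before the loop.

/* inner product loop unrolled by a factor of 4, assuming m % 4 == 0;
   one loop branch and one counter update now serve
   four floating-point multiply-adds */
for(int k = 0; k < m; k += 4)
{
   C[i][j] += A[i][k  ]*B[k  ][j];
   C[i][j] += A[i][k+1]*B[k+1][j];
   C[i][j] += A[i][k+2]*B[k+2][j];
   C[i][j] += A[i][k+3]*B[k+3][j];
}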
Thread Coarsening¶
Acceleration by GPUs applies fine-grained parallelism, often at the instruction level, following the single instruction, multiple data model.
One typical situation occurs with the block size limitation: the maximum number of threads per block may be insufficient to cover all the work assigned to a block. With thread coarsening, each thread processes several data elements, so fewer threads per block are needed, which overcomes the block size limitation.
The application of thread coarsening to tiled matrix-matrix multiplication is illustrated in a sequence of three pictures: in Fig. 126, in Fig. 127, and in Fig. 128.

Fig. 126 Tiled matrix matrix multiplication.¶

Fig. 127 In tiled matrix matrix multiplication, one block of threads computes one output tile.¶

Fig. 128 In tiled matrix matrix multiplication, with thread coarsening, adjacent tiles of the second matrix are loaded by the same block of threads.¶
In the matrix matrix multiplication with shared memory, one output tile is computed by one block of threads:
Each block loads one tile of \(A\) and one tile of \(B\).
Shared memory is not shared among the blocks.
Each output tile is processed by a different block, so the same input tiles of \(A\) are loaded repeatedly for different output tiles.
With thread coarsening, one block of threads loads one tile of \(A\), and several vertically adjacent tiles of \(B\). The coarse factor equals the number of tiles of \(B\) that are multiplied in the inner loop of the new kernel.
To clarify, pseudo code is below for the tiled matrix multiplication, to multiply matrices A and B to make C:
block of threads loads a tile of A
block of threads loads a tile of B
block of threads updates a tile of C
With thread coarsening, the code is expanded into:
block of threads loads a tile of A
for k in 1, 2, ..., coarse factor do
block of threads loads the next tile of B
block of threads updates the next tile of C
The fourth edition of Programming Massively Parallel Processors by Wen-mei Hwu, David B. Kirk, and Izzat El Hajj contains explicit C code.
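The coarsened pseudo code can be turned into a kernel along the following lines. The sketch below is not the code from the book: it assumes square n-by-n matrices, a tile size of 16, and a coarsening factor of 4, with each block of threads computing four adjacent output tiles of C while reusing the same tile of A.

#define TILE 16
#define COARSE 4   /* coarsening factor, an assumed value */

/* C = A*B for n-by-n matrices, with n a multiple of TILE*COARSE;
   each block computes COARSE adjacent output tiles of C,
   reusing the same tile of A for all of them */
__global__ void matmulCoarsened
 ( const float *A, const float *B, float *C, int n )
{
   __shared__ float As[TILE][TILE];
   __shared__ float Bs[TILE][TILE];

   int row = blockIdx.y*TILE + threadIdx.y;
   int colStart = blockIdx.x*TILE*COARSE + threadIdx.x;

   float sum[COARSE];
   for(int c=0; c<COARSE; c++) sum[c] = 0.0f;

   for(int t=0; t<n/TILE; t++)
   {
      /* the block loads one tile of A ... */
      As[threadIdx.y][threadIdx.x] = A[row*n + t*TILE + threadIdx.x];

      for(int c=0; c<COARSE; c++)   /* ... and reuses it COARSE times */
      {
         int col = colStart + c*TILE;
         /* the block loads the next tile of B */
         Bs[threadIdx.y][threadIdx.x] = B[(t*TILE + threadIdx.y)*n + col];
         __syncthreads();
         /* the block updates the next tile of C */
         for(int k=0; k<TILE; k++)
            sum[c] += As[threadIdx.y][k]*Bs[k][threadIdx.x];
         __syncthreads();
      }
   }
   for(int c=0; c<COARSE; c++)
      C[row*n + colStart + c*TILE] = sum[c];
}

A launch then uses dim3 grid(n/(TILE*COARSE), n/TILE) and dim3 block(TILE, TILE), so the coarsened kernel creates four times fewer blocks than the uncoarsened one.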
Thread coarsening is related to the topic of granularity. While it is a powerful optimization, there are pitfalls:
Do not apply thread coarsening when it is not needed. Example: vector addition, where the threads do no redundant work that coarsening could eliminate.
Thread coarsening may lead to underutilization. Coarsening factors depend on the type of device and/or on the specifics of the data that is processed.
Thread coarsening may reduce the occupancy. After thread coarsening, threads may use more registers and/or too much shared memory, reducing the occupancy of the device.
It is important to know the performance bottleneck of a computation.
If an optimization does not target the performance bottleneck, then the attempt may even hurt performance. Ask the following questions: Is the computation compute bound or memory bound? Is the performance limited by occupancy? In answering those questions, understand the GPU architecture and familiarize yourself with profiling tools.
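As a very rough way to answer the first question, one can compare the arithmetic intensity of a computation with the flops-per-byte ratio of the device. The numbers in the sketch below are assumed values, for illustration only.

#include <stdio.h>

/* Compare the arithmetic intensity (flops per byte of global memory
   traffic) with the flops-per-byte ratio of the device.
   The device numbers below are assumed values, not measured ones. */
int main ( void )
{
   const double peak_gflops = 4000.0;  /* assumed peak, in Gflops */
   const double peak_gbytes = 500.0;   /* assumed bandwidth, in GB/s */
   const double balance = peak_gflops/peak_gbytes;

   /* untiled matmul: 2 flops per 2 loaded floats, or 8 bytes */
   const double intensity = 2.0/8.0;

   if(intensity < balance)
      printf("memory bound : %.2f < %.2f flops/byte\n", intensity, balance);
   else
      printf("compute bound : %.2f >= %.2f flops/byte\n", intensity, balance);

   return 0;
}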
At this point in the course, we have covered the fundamental topics of GPU acceleration.
Exercises¶
Consider a GPU with 2048 threads/SM, 32 blocks/SM, 64K registers/SM, and 96KB of shared memory/SM.
Kernel \(A\) uses 64 threads/block, 27 registers per thread, and 4KB of shared memory/block.
Kernel \(B\) uses 256 threads/block, 31 registers per thread, and 8KB of shared memory/block.
Determine if the kernels achieve full occupancy. If not, specify the limiting factor(s).
Read the user guide of the compute visual profiler and perform a run on GPU code you wrote (of some previous exercise or your code for the third project). Explain the analysis of the kernel.
Redo the first calculation of the interactions between resource limitations in this lecture, using the specifications for compute capability 1.1.