Local Padding

codegen.test_1_local_padding.test_local_padding()

This test case shows the performance improvement brought by local padding. We compare the compute throughput of two schedules: one without local padding (the baseline) and one with local padding.

We use the dense layer as an example workload:

\[Y = XW^T, X : [B\times T, I], W : [H, I], Y : [B\times T, H]\]

In tvm.te form:

for (i = 0; i < B * T; ++i)
  for (j = 0; j < H; ++j)
    for (k = 0; k < I; ++k)
      Y[i, j] += X[i, k] * W[j, k];

where \((B, T, I, H)\) stand for (batch size, sequence length, input size, hidden dimension), respectively (naming adopted from the [BERT] model).
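For reference, the workload can be declared with the tvm.te API roughly as follows. This is only a minimal sketch: the shape values are placeholders for illustration (the concrete values used by the test are discussed below, and the hidden dimension \(H\) is never stated), so the actual test source may define the operator differently.

import tvm
from tvm import te

# Placeholder shapes for illustration only; H = 768 is an assumption,
# not a value taken from the test.
BT, I, H = 960, 770, 768

X = te.placeholder((BT, I), name="X")
W = te.placeholder((H, I), name="W")
k = te.reduce_axis((0, I), name="k")
# Dense with transposed weight: Y[i, j] = sum_k X[i, k] * W[j, k]
Y = te.compute((BT, H), lambda i, j: te.sum(X[i, k] * W[j, k], axis=k), name="Y")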

We adopt a sample schedule dense_128x128x4 for this workload. The schedule, as its name suggests, uses a tile size of \((128, 128, 4)\) along the \((i, j, k)\) dimensions, respectively. Furthermore, we deliberately set the sequence length \(T\) to \(60\) and the input size \(I\) to \(770\), so that

\[(B\times T = 16\times 60 = 960) \% 128 \ne 0, (I = 770) \% 4 \ne 0\]

which is common in the case of handling dynamic-shape workloads.
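The imperfect tiling can be checked with a couple of lines of arithmetic (a plain Python sketch, not part of the test):

BT, I = 16 * 60, 770        # B*T = 960, input size I = 770
TILE_I, TILE_K = 128, 4     # tile sizes along the i and k dimensions

print(BT % TILE_I)          # 64 -> the last tile along i covers only 64 of 128 rows
print(I % TILE_K)           # 2  -> the last tile along k covers only 2 of 4 elements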

Due to the imperfect tiling, out-of-boundary checks (a.k.a. predicates) have to be injected into the loop body, which can greatly hurt performance. However, we make the following key observations:

  • The schedule generated by the auto-scheduler usually consists of 3 main stages, namely

    • Fetch: Load the input data from global memory into shared memory.

      X_shared[...] = X[...]; W_shared[...] = W[...];
      
    • Compute: Compute the output results using the shared memory variables and write them to registers.

      Y_local[...] = X_shared[...] * W_shared[...];
      
    • Writeback: Write the results from the registers back to global memory.

      Y[...] = Y_local[...];
      
  • The predicates at the Compute stage have the biggest impact on runtime performance. The good news is that they duplicate those at the Fetch and Writeback stages, and hence can be safely removed. This is, in essence, padding the compute by the size of the local workspace, hence the name Local Padding (see the sketch after this list).
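The observation above can be illustrated with a small NumPy analogy (a hedged sketch, not the actual generated code): assuming the out-of-bounds region of the local workspace is filled with zeros at the Fetch stage, the Compute stage can sweep the full padded tile without predicates and still produce correct results for the valid region.

import numpy as np

TILE_I, TILE_J, TILE_K = 128, 128, 4
valid_i, valid_k = 64, 2    # the ragged boundary tile: 960 % 128 = 64, 770 % 4 = 2

rng = np.random.default_rng(0)

# Fetch stage (predicated): only the in-bounds region of the local tile is
# written; the rest stays zero, emulating a zero-padded local workspace.
X_tile = np.zeros((TILE_I, TILE_K), dtype=np.float32)
W_tile = np.zeros((TILE_J, TILE_K), dtype=np.float32)
X_tile[:valid_i, :valid_k] = rng.random((valid_i, valid_k), dtype=np.float32)
W_tile[:, :valid_k] = rng.random((TILE_J, valid_k), dtype=np.float32)

# Compute stage WITHOUT predicates: sweep the full padded tile.
Y_padded = X_tile @ W_tile.T

# Compute stage WITH predicates, emulated by slicing to the valid region only.
Y_ref = X_tile[:valid_i, :valid_k] @ W_tile[:, :valid_k].T

# The zero padding contributes nothing to the accumulation.
assert np.allclose(Y_padded[:valid_i], Y_ref)

The rows of Y_padded beyond the valid region are simply never written back, since the predicates at the Writeback stage remain in place.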

Our evaluations show that local padding can significantly boost the performance of the generated CUDA kernel by as much as \(10\times\) in this case (on modern NVIDIA RTX GPUs). The table below illustrates the results we get from the CI workflow:

GPU           Baseline    Local Padding
RTX 3090      ~0.98       ~11.6
RTX 2080 Ti   ~0.43       ~5.27

where the numbers denote compute throughput (in TFLOPs/sec); higher is better.
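For reference, throughput numbers of this kind can be reproduced from a measured kernel time as follows (a sketch only; the hidden dimension and the elapsed time below are hypothetical placeholders, not values from the CI run):

B, T, I, H = 16, 60, 770, 768       # H = 768 is an assumed placeholder
flops = 2 * (B * T) * H * I         # one multiply and one add per MAC
elapsed_sec = 1e-4                  # hypothetical measured kernel time
print(flops / elapsed_sec / 1e12, "TFLOPs/sec")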

codegen.test_1_local_padding.test_local_padding_ii()

This test case shows why the local workspace must NOT be shrunk, since local padding requires access to the full workspace.

In TVM, the size of the local workspace (e.g., shared memory or registers) is automatically shrunk if the code generator detects that the workspace is not fully utilized by the tensor program. However, this optimization is not always desirable.

For example, consider the batched matrix multiply workload:

\[Y = \operatorname{BatchMatmulNT}(X, W), X : \left[B\times NH, T, \frac{H}{NH}\right], W : \left[B\times NH, T, \frac{H}{NH}\right], Y : [B\times NH, T, T]\]

for (i = 0; i < B * NH; ++i)
  for (j = 0; j < T; ++j)
    for (k = 0; k < T; ++k)
      for (l = 0; l < H / NH; ++l)
        Y[i, j, k] += X[i, j, l] * W[i, k, l];

where \(NH\) stands for the number of attention heads (also adopted from the [BERT] model; the other parameters are the same as above).
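A rough tvm.te declaration of this workload might look as follows (again a sketch; the values of \(B\), \(NH\), \(T\), and \(H/NH\) below are placeholders, not the ones fixed by the test):

import tvm
from tvm import te

BNH, T, HDIV = 16 * 12, 120, 64     # placeholder values for B*NH, T, and H/NH

X = te.placeholder((BNH, T, HDIV), name="X")
W = te.placeholder((BNH, T, HDIV), name="W")
l = te.reduce_axis((0, HDIV), name="l")
# Batched matmul with the second operand transposed (NT layout).
Y = te.compute(
    (BNH, T, T),
    lambda i, j, k: te.sum(X[i, j, l] * W[i, k, l], axis=l),
    name="Y",
)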

We adopt a sample schedule batch_matmul_nt_1x128x128x8 for this workload. The schedule uses a tile size of \((1, 128, 128, 8)\) along the \((i, j, k, l)\) dimensions, respectively. Given these tile sizes, the local workspaces (i.e., shared memory allocations) for \(X\) and \(W\) are each of size \(1\times 128\times 8\).

The allocations work fine when \(T=128\), but if \(T\) is just slightly smaller (e.g., \(T=120\)), the TVM code generator detects that the thread blocks do not fully utilize the shared memory variables and shrinks them, which prevents local padding from taking place (since local padding requests access to the whole workspace).
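The shrinking can be seen with a line or two of arithmetic (an illustrative Python sketch):

TILE_J, TILE_L = 128, 8
T = 120
full_tile = TILE_J * TILE_L             # 1024 elements requested by the schedule
used_extent = min(T, TILE_J) * TILE_L   # 960 elements actually touched when T = 120
print(full_tile, used_extent)           # local padding needs the full 1024-element tile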

Our evaluations show that preserving the local workspace can boost the performance of the generated CUDA kernel by as much as \(2.5\times\) in this case. The table below illustrates the results we get from the CI workflow:

GPU           Baseline    Local Padding
RTX 3090      ~3.5        ~8.7
RTX 2080 Ti   ~1.8        ~5.5

where the numbers again denote compute throughput (in TFLOPs/sec); higher is better.

[BERT] J. Devlin et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. NAACL-HLT 2019.