Loop Partitioning

codegen.test_2_loop_partitioning.test_loop_partitioning()

This test case shows the performance improvement brought by loop partitioning, an optimization technique used in the [Nimble] paper. Similar to local padding, loop partitioning aims to mitigate the performance overhead caused by out-of-bound checks (i.e., predicates).

Given below is an example that illustrates how loop partitioning works:

for (i = 0; i < ceil(T / t); ++i)
  for (j = 0; j < t; ++j)
    if (i * t + j < T) // do something

The above code snippet can be transformed into

for (i = 0; i < floor(T / t); ++i)
  for (j = 0; j < t; ++j)
    // do something
for (j = 0; j < T - floor(T / t) * t; ++j)
  // do something

which does not have any predicates.
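
To make the transformation concrete, here is a minimal, self-contained C sketch (the values T = 100 and t = 32 are hypothetical and chosen only for illustration, not taken from the benchmark). It checks that the predicated form and the partitioned form touch exactly the same elements:

#include <assert.h>
#include <string.h>

#define T 100  /* hypothetical total extent */
#define t 32   /* hypothetical tile size    */

int main(void) {
  int a[T], b[T];
  memset(a, 0, sizeof(a));
  memset(b, 0, sizeof(b));

  /* Predicated form: every iteration carries an out-of-bound check. */
  for (int i = 0; i < (T + t - 1) / t; ++i)     /* ceil(T / t)  */
    for (int j = 0; j < t; ++j)
      if (i * t + j < T)
        a[i * t + j] = 1;                       /* "do something" */

  /* Partitioned form: full tiles first, then the remainder, no checks. */
  for (int i = 0; i < T / t; ++i)               /* floor(T / t) */
    for (int j = 0; j < t; ++j)
      b[i * t + j] = 1;
  for (int j = 0; j < T - (T / t) * t; ++j)     /* remainder    */
    b[(T / t) * t + j] = 1;

  /* Both forms must write exactly the same elements. */
  for (int k = 0; k < T; ++k)
    assert(a[k] == b[k]);
  return 0;
}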

We compare the compute throughput of four schedules:

  1. without any optimization (baseline)

  2. loop partitioning

  3. local padding

  4. local padding + loop partitioning

The benchmark we use is the same as the one in local padding (codegen.test_1_local_padding.test_local_padding()). Our evaluations show that

  1. Loop partitioning can also significantly boost the performance of the generated CUDA kernel by as much as \(10\times\) (the same order of speedup as local padding).

  2. Although local padding and loop partitioning are orthogonal to each other, applying them jointly does NOT further boost performance in this case (each of them already eliminates the predicate overhead on its own).

The table below illustrates the results we get from the CI workflow:

GPU         | Baseline | Loop Partitioning | Local Padding | Local Padding + Loop Partitioning
RTX 3090    | ~0.98    | ~11.2             | ~11.6         | ~11.0
RTX 2080 Ti | ~0.43    | ~5.98             | ~5.27         |

where the numbers denote the compute throughput in TFLOPs/sec (hence, higher is better).

codegen.test_2_loop_partitioning.test_loop_partitioning_ii()

Therefore, one natural question to ask is:

What is the difference between local padding and loop partitioning, given that they share the same objective?

Although the two techniques are indeed similar in what they aim to achieve, we pick local padding primarily for the following reasons:

  • Due to its partitioning nature, loop partitioning needs to duplicate the loop body statements (as can be seen in the simple example illustrated before, where the comment // do something appears twice). Depending on the number of spatial axes being partitioned, this duplication can happen multiple times (see the sketch after this list). It can significantly lengthen the CUDA kernel body, and further lead to a gigantic kernel when the original kernel body is already long (e.g., because of a large unrolling factor), which eventually increases the compilation time (up to several minutes for a single kernel by our measurements).

  • There are cases that can be handled by local padding but NOT by loop partitioning. We refer to the example below, which is again the same as the one in local padding (codegen.test_1_local_padding.test_local_padding_ii()). In that example, all the thread blocks share the same predicates, which prevents loop partitioning from taking place (since there is no region that can be partitioned out without predicates).
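
To make the duplication issue from the first bullet concrete, below is a hedged, self-contained C sketch with two partitioned spatial axes; the extents M, N and tile sizes m, n are hypothetical and unrelated to the benchmark. Partitioning both axes forces the body statement to be replicated once per full/remainder region, i.e., 4 copies for 2 axes (and up to \(2^k\) copies for \(k\) partitioned axes), whereas the predicated reference keeps a single copy:

#include <assert.h>

/* Hypothetical extents and tile sizes, for illustration only. */
#define M 50
#define N 70
#define m 16
#define n 32

int main(void) {
  static int ref[M][N], part[M][N];   /* zero-initialized */

  /* Reference: a single predicated loop nest over both tiled axes. */
  for (int io = 0; io < (M + m - 1) / m; ++io)
    for (int jo = 0; jo < (N + n - 1) / n; ++jo)
      for (int ii = 0; ii < m; ++ii)
        for (int ji = 0; ji < n; ++ji)
          if (io * m + ii < M && jo * n + ji < N)
            ref[io * m + ii][jo * n + ji] = 1;   /* one copy of the body */

  /* Partitioned: the body is replicated once per full/remainder region. */
  for (int i = 0; i < (M / m) * m; ++i)          /* copy 1: full i, full j */
    for (int j = 0; j < (N / n) * n; ++j)
      part[i][j] = 1;
  for (int i = 0; i < (M / m) * m; ++i)          /* copy 2: full i, remainder j */
    for (int j = (N / n) * n; j < N; ++j)
      part[i][j] = 1;
  for (int i = (M / m) * m; i < M; ++i)          /* copy 3: remainder i, full j */
    for (int j = 0; j < (N / n) * n; ++j)
      part[i][j] = 1;
  for (int i = (M / m) * m; i < M; ++i)          /* copy 4: remainder i, remainder j */
    for (int j = (N / n) * n; j < N; ++j)
      part[i][j] = 1;

  /* Both versions must cover exactly the same elements. */
  for (int i = 0; i < M; ++i)
    for (int j = 0; j < N; ++j)
      assert(ref[i][j] == part[i][j]);
  return 0;
}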

Nevertheless, loop partitioning still has one advantage that local padding does not: since it does not involve padding, it performs no redundant computation, which explains why it can outperform local padding on the RTX 2080 Ti (as shown in the table above). However, this is NOT a problem for the DietCode auto-scheduling framework, since we expect the auto-scheduling outcomes to have a relatively low padding ratio.
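
As a back-of-the-envelope illustration of the padding-ratio argument (the numbers below are made up for illustration and are not taken from the benchmark): if a dynamic axis has extent \(T = 100\) and a schedule picks tile size \(t = 32\), local padding rounds the axis up to \(\lceil 100 / 32 \rceil \times 32 = 128\), so roughly \((128 - 100) / 128 \approx 22\%\) of the computation is redundant; a schedule with \(t = 25\) instead pads to \(\lceil 100 / 25 \rceil \times 25 = 100\), i.e., a padding ratio of \(0\%\). The expectation above is that the auto-scheduler tends to select schedules of the latter kind, keeping the redundant computation introduced by padding small.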

Nimble

H. Shen et al. Nimble: Efficiently Compiling Dynamic Neural Networks for Model Inference. MLSys 2021.