From CUDA to CUTLASS: A Gentle Introduction to CuTe

Introduction

In this article I'll detail how to use CuTe, explain its benefits, and show examples of how it is used in a GEMM kernel.

Along with this article I recommend reading these:

  1. CuTe Layout Intro
  2. CuTe Layout Algebra
  3. CuTe Tensors

After completing this, you will be more than ready to use CuTe in your kernels!

Why use CuTe?

*(image: a cute barracuda)*

CuTe was introduced in CUTLASS 3.0 and was created by Cris Cecka. It's a C++ template library that lets you easily operate on memory using CuTe Layouts and Tensors.

Accessing memory is critical for any program, but in GPU programming it's often the major bottleneck. To mitigate this, a range of access patterns are used: global memory coalescing, shared memory transfers that avoid bank conflicts, register caching, swizzling, and so on. Because of this, and the independent nature of each thread, the indexing math for memory accesses often becomes extremely complicated.

For example, check out the indexing math needed for this complex kernel, taken from Lei Mao's blog:

// 2D Block Tiling and 2D Warp Tiling and
// 2D Thread Tiling and Vectorized Memory Access
size_t const C_row_idx{
    blockIdx.y * BLOCK_TILE_SIZE_Y +
    warp_row_idx * WARP_TILE_SIZE_Y +
    thread_tile_repeat_row_idx *
    (WARP_TILE_SIZE_Y / NUM_THREAD_TILES_PER_WARP_Y) +
    thread_linear_row_idx_in_warp * THREAD_TILE_SIZE_Y +
    thread_tile_y_idx};

size_t const C_col_idx{
    blockIdx.x * BLOCK_TILE_SIZE_X +
    warp_col_idx * WARP_TILE_SIZE_X +
    thread_tile_repeat_col_idx *
    (WARP_TILE_SIZE_X / NUM_THREAD_TILES_PER_WARP_X) +
    thread_linear_col_idx_in_warp * THREAD_TILE_SIZE_X +
    thread_tile_x_vector_idx * NUM_VECTOR_UNITS};
CuTe tries to solve this by introducing two abstractions: the Layout and the Tensor. A Layout consists of two values: a shape and a stride. When working with CUDA kernels we generally represent tensors as contiguous data, like so:

*(image: contiguous array)*

Index 0 gives you 0.0, index 1 gives you 0.1, and so on. However, even though the data is laid out like this, we internally represent it as a matrix (assume row major):

*(image: contiguous matrix)*

You tell the kernel the dimensions of the matrices when you pass in parameters such as M (matrix A rows), N (matrix B columns), and K (matrix A columns and matrix B rows). For this matrix, M and K are (2, 3); this is also known as the shape of the matrix.

Now let's say our matrix and array looked like this:

*(image: padded matrix)*

This is an example of a padded matrix: a matrix with padding values at the end of each row (in this case 0.0f). This matrix still has a shape of (2, 3), but accessing all of its elements requires indexing over 8 elements, not 6. So along with M, N, and K, many GEMM kernels also take a leading dimension. The leading dimension tells the GEMM algorithm how far to stride to reach the same column in the next row. For this example the leading dimension would be 4.

CuTe Layout

CuTe applies these same concepts but packs all of this information into a Layout. Here are some ways you can declare a Layout:

// vector layout with dynamic numbers
make_layout(
    make_shape(8),
    make_stride(1)
)

// matrix layout with static numbers
// defaults to LayoutLeft stride
// for this example that is (_1{}, _8{})
make_layout(
    make_shape(_8{}, Int<8>{})
)

// high dimensional matrix layout
// Shape ((_8, _8), _8) : Stride ((_16, _16), _1)
make_layout(
    make_shape(
        make_shape(_8{}, _8{}),
        _8{}
    ),
    make_stride(
        make_stride(_16{}, _16{}),
        _1{}
    )
)

// high dimensional dynamic tensor layout
// Shape ((8, 8), 8) : Stride ((16, 16), 1)
make_layout(
    make_shape(
        make_shape(8, 8), 8
    ),
    make_stride(
        make_stride(16, 16), 1
    )
)

// high dimensional static Tensor layout
// Shape (_10, _8, _3) : Stride (_24, _3, _1)
constexpr Layout<Shape<_10, _8, _3>, Stride<_24, _3, _1>> my_layout{};

Static vs. Dynamic Integers

First we need to discuss the difference between static and dynamic numbers. Static numbers are known at compile time and are expressed in CuTe with the Int<> template; for common values a shorthand exists, such as _1. Integers declared any other way are considered dynamic. In general you want your layouts to be static: this allows the compiler to perform aggressive loop unrolling and precompute any index math. Static and dynamic integers can also be mixed within a single layout.