
Numba

Numba provides the module numba.cuda for creating JIT (just-in-time) compiled functions that run on the GPU. Much as standard Numba lets you write Python code with performance comparable to that of C, numba.cuda lets you write Python functions that are compiled into CUDA kernels.
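
As a minimal sketch of the workflow (the kernel and array here are illustrative, not part of the examples below), a function decorated with @cuda.jit is launched over a grid of thread blocks, and each thread typically handles one array element:

import numpy as np
import numba.cuda as cuda

@cuda.jit
def scale(arr, factor):
    i = cuda.grid(1)           # absolute index of this thread in a 1D launch grid
    if i < arr.size:           # guard threads that fall past the end of the array
        arr[i] *= factor

x = np.arange(1_000_000, dtype=np.float32)
x_gpu = cuda.to_device(x)                    # copy the array to the GPU
threads = 256
blocks = (x.size + threads - 1) // threads   # enough blocks to cover the array
scale[blocks, threads](x_gpu, 2.0)           # launch: kernel[blocks, threads](args)
x = x_gpu.copy_to_host()                     # copy the result back to the host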

Generally, using Numba to write efficient code requires fairly advanced CUDA programming knowledge: you must pay attention to shared memory, array slicing, registers, and so forth. It is typically difficult to write code that is competitive with standard libraries for, e.g., matrix multiplication, as we can see in this example of the same matrix-multiplication reduction, which borrows some lightly optimized code from the Numba documentation:

Bad example: Matrix multiplication and reduction in Numba

import numba.cuda as cuda
from numba import float32
import numpy as np
from timeit import timeit


m, k, n = (10000, 5000, 7000)
r = np.random.default_rng(987654321)
A = r.random(size=(m, k), dtype=np.float32)
B = r.random(size=(k, n), dtype=np.float32)
out = np.zeros(n, dtype=np.float32)

TPB = 16
threadsperblock = (TPB, TPB)
blockspergrid = (int(np.ceil(A.shape[0] / (TPB))),
                 int(np.ceil(B.shape[1] / (TPB))))


@cuda.jit
def matmul_reduce(A, B, out):
    # Define an array in the shared memory
    # The size and type of the arrays must be known at compile time
    sA = cuda.shared.array(shape=(TPB, TPB), dtype=float32)
    sB = cuda.shared.array(shape=(TPB, TPB), dtype=float32)
    sC = cuda.shared.array(shape=(TPB, TPB), dtype=float32)
    x, y = cuda.grid(2)
    tx = cuda.threadIdx.x
    ty = cuda.threadIdx.y
    bpg = cuda.gridDim.x    # blocks per grid
    # Each thread computes one element in the result matrix.
    # The dot product is chunked into dot products of TPB-long vectors.
    tmp = 0.
    for i in range(bpg):
        # Preload one tile of A and one tile of B into shared memory,
        # padding with zeros where a tile runs past the edge of the matrix
        if x < A.shape[0] and ty + i * TPB < A.shape[1]:
            sA[tx, ty] = A[x, ty + i * TPB]
        else:
            sA[tx, ty] = 0
        if tx + i * TPB < B.shape[0] and y < B.shape[1]:
            sB[tx, ty] = B[tx + i * TPB, y]
        else:
            sB[tx, ty] = 0
        # Wait until all threads finish preloading
        cuda.syncthreads()

        # Computes partial product on the shared memory
        for j in range(TPB):
            tmp += sA[tx, j] * sB[j, ty]

        # Wait until all threads finish computing
        cuda.syncthreads()

    sC[tx, ty] = tmp
    cuda.syncthreads()
    s = 1
    while s < TPB:
        if tx % (2 * s) == 0:
            # Stride by `s` and add
            sC[tx, ty] += sC[tx + s, ty]
        s *= 2
        cuda.syncthreads()

    # After the loop, sC[0, ty] holds this block's partial sum for column y
    if tx == 0 and y < out.size:
        cuda.atomic.add(out, y, sC[tx, ty])

# Move arrays to device (GPU)
A_gpu = cuda.to_device(A)
B_gpu = cuda.to_device(B)
out_gpu = cuda.to_device(out)
ntimes = 3
def init_numba():
    # Warm-up call so that JIT compilation is not included in the timing
    matmul_reduce[blockspergrid, threadsperblock](
        A_gpu, B_gpu, cuda.device_array_like(out))
def test_numba():
    for i in range(ntimes):
        matmul_reduce[blockspergrid, threadsperblock](A_gpu, B_gpu, out_gpu)
    # Copy result back to host
    out[...] = out_gpu.copy_to_host()


time_gpu = timeit(test_numba, setup=init_numba, number=1)

print(f'Time: {time_gpu/ntimes:.2f}')

On an A40 GPU:

>>> Time: 7.26

This is not competitive with even NumPy, let alone CuPy. In fact, writing code that is competitive with the cuBLAS and cuTensor libraries requires many, many layers of optimization, as described in this excellent illustrated blog post. For this reason, it is always best to check whether a standard CUDA library already provides the operation you need before attempting to write a kernel.
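
For instance, the column sums of the product A @ B (the quantity accumulated in out above) take only a few lines of CuPy, which dispatches the matrix product to cuBLAS. This is a sketch that assumes CuPy is installed and reuses the A and B arrays defined above:

import cupy as cp

A_cp = cp.asarray(A)                    # copy the NumPy arrays to the device
B_cp = cp.asarray(B)
out_cp = (A_cp @ B_cp).sum(axis=0)      # matmul via cuBLAS, then reduce over rows
out_check = cp.asnumpy(out_cp)          # same shape and meaning as `out` above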

Numba interoperates with CuPy: a cupy.ndarray can be passed directly to a Numba kernel with no wrapping needed. It is therefore best to use CuPy for linear algebra and tensor operations and Numba for more general CUDA programming, since Numba does not have bindings to the CUDA linear algebra or tensor libraries.
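
As a minimal interoperability sketch (the kernel and array here are illustrative), a CuPy array can be handed straight to a Numba kernel:

import cupy as cp
import numba.cuda as cuda

@cuda.jit
def add_one(arr):
    i = cuda.grid(1)
    if i < arr.size:
        arr[i] += 1.0

x = cp.zeros(1024, dtype=cp.float32)   # a CuPy array, already resident on the GPU
add_one[4, 256](x)                     # the Numba kernel accepts it directly
print(float(x.sum()))                  # 1024.0, reduced with CuPy afterwards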

Good example: 2D diffusion

A good example of a relatively simple CUDA kernel that performs well is a diffusion kernel. First, we generate a jagged image:

import numpy as np
from numpy import fft

j, k = (2000, 2000)
r = np.random.default_rng(3966789)
field = r.random(size=(j, k), dtype=np.float32)
field = fft.fftshift(fft.fft2(field))
radius = 1 + np.sqrt((np.arange(j).reshape(-1, 1) - j // 2) ** 2 +
                     (np.arange(k).reshape(1, -1) - k // 2) ** 2)
field = field * (radius ** -1.7)
field = fft.ifft2(fft.ifftshift(field)) * (radius < 0.45 * j)
field = field.real * (field.real > np.percentile(field.real, 70))

Jagged image, suitable for illustrating diffusion

Then we write a kernel which loads each block, plus an extra row and column of neighboring values at each edge, into shared memory and calculates the two-dimensional discrete Laplacian. The kernel writes out the Laplacian (which we suppose we might want to inspect at any given time) and advances the diffusion by one step, scaled by a coefficient.

from numba import float32
import numba.cuda as cuda
TPB = 8
threadsperblock = (TPB, TPB)
blockspergrid = (int(np.ceil(field.shape[0] / (TPB))),
                 int(np.ceil(field.shape[1] / (TPB))))
# Compile-time constants.
shasize = TPB + 2
coeff = 0.25
@cuda.jit
def diffusion(field, lapl):
    x, y = cuda.grid(2)
    # Shared array has two extra rows and columns to store edge values
    sharedarray = cuda.shared.array(shape=(shasize, shasize), dtype=float32)
    tx = cuda.threadIdx.x
    ty = cuda.threadIdx.y
    # Don't try to load edge values from outside the array
    if tx == 0 and x > 0:
        sharedarray[0, ty + 1] = field[x - 1, y]
    if ty == 0 and y > 0:
        sharedarray[tx + 1, 0] = field[x, y - 1]
    if tx == TPB - 1 and x < field.shape[0] - 1:
        sharedarray[-1, ty + 1] = field[x + 1, y]
    if ty == TPB - 1 and y < field.shape[1] - 1:
        sharedarray[tx + 1, -1] = field[x, y + 1]
    # Load all inner values into tx + 1, ty + 1.
    sharedarray[tx + 1, ty + 1] = field[x, y]
    val = 4 * sharedarray[tx + 1, ty + 1]
    # Synchronize after loading
    cuda.syncthreads()
    # Check that we are not at the very edges of the array.
    if x > 0:
        val -= sharedarray[tx, ty + 1]
    if y > 0:
        val -= sharedarray[tx + 1, ty]
    if x < field.shape[0] - 1:
        val -= sharedarray[tx + 2, ty + 1]
    if y < field.shape[1] - 1:
        val -= sharedarray[tx + 1, ty + 2]
    # Check that we are within boundaries when saving.
    if x < field.shape[0] and y < field.shape[1]:
        lapl[x, y] = val
        field[x, y] -= val * coeff

Suppose we want to iterate this 10000 times. How long will it take on an A40 GPU?

orig_field = field.copy()
field_gpu = cuda.to_device(field)
lapl = np.zeros_like(field)
lapl_gpu = cuda.to_device(lapl)
ntimes = 10000
def init_numba():
    # Warm-up call so that JIT compilation is not included in the timing
    diffusion[blockspergrid, threadsperblock](field_gpu, lapl_gpu)
def test_numba():
    for i in range(ntimes):
        diffusion[blockspergrid, threadsperblock](field_gpu, lapl_gpu)
    lapl[...] = lapl_gpu.copy_to_host()
    field[...] = field_gpu.copy_to_host()
time_gpu = timeit(test_numba, setup=init_numba, number=1)
print(f'Time per iteration: {time_gpu / ntimes:.6f}')

>>> Time per iteration: 0.000272

Finally, we plot the original image, the diffused image, and the Laplacian:

import matplotlib.pyplot as plt
f, ax = plt.subplots(1, 3, figsize=(12, 3.5))
im1 = ax[0].imshow(orig_field)
ax[0].set_title('Original image')
im2 = ax[1].imshow(field)
ax[1].set_title('10000 iterations')
im3 = ax[2].imshow(lapl, cmap='inferno',
                   vmin=np.percentile(lapl, 5),
                   vmax=np.percentile(lapl, 95))
ax[2].set_title('Laplacian')

The result of the calculation

In the end, 10000 diffusion iterations took us only about 2.7 seconds! If we do not care about knowing the Laplacian after every invocation, the time goes down to about 2.4 seconds. This type of problem, which can be broken down into repeated operations on localized parts of an array, is a typical case where a GPU implementation can be relatively easy to draft.