Numba¶
Numba provides the module numba.cuda for writing JIT (just-in-time) compiled CUDA kernels. Much like the way standard Numba lets you write Python code with performance comparable to that of C, numba.cuda allows you to write Python code that is compiled into CUDA kernels.
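As a minimal, illustrative sketch of what this looks like (the kernel and array names here are invented for this example, not taken from the Numba documentation), an elementwise-addition kernel can be written and launched as follows:
import numpy as np
import numba.cuda as cuda

@cuda.jit
def add_kernel(a, b, out):
    # Illustrative kernel: each thread adds one pair of elements.
    i = cuda.grid(1)           # absolute index of this thread
    if i < out.size:           # guard threads past the end of the array
        out[i] = a[i] + b[i]

a = np.arange(1_000_000, dtype=np.float32)
b = 2 * a
out = np.zeros_like(a)

threads = 256
blocks = (a.size + threads - 1) // threads
# Passing NumPy arrays triggers implicit host-device copies; for repeated
# launches it is faster to move data explicitly with cuda.to_device.
add_kernel[blocks, threads](a, b, out)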
Generally, using Numba to write efficient code can require fairly advanced CUDA programming knowledge: this means paying attention to shared memory, array slicing, registers, and so forth. It is typically difficult to write code that is competitive with standard libraries for e.g. matrix multiplication, as we can see in this example of the same matrix-multiplication-and-reduction operation, which borrows some lightly optimized code from the Numba documentation:
Bad example: Matrix multiplication and reduction in Numba¶
import numba.cuda as cuda
from numba import float32
import numpy as np
from timeit import timeit
m, k, n = (10000, 5000, 7000)
r = np.random.default_rng(987654321)
A = r.random(size=(m, k), dtype=np.float32)
B = r.random(size=(k, n), dtype=np.float32)
out = np.zeros(n, dtype=np.float32)
TPB = 16
threadsperblock = (TPB, TPB)
blockspergrid = (int(np.ceil(A.shape[0] / TPB)),
                 int(np.ceil(B.shape[1] / TPB)))
@cuda.jit
def matmul_reduce(A, B, out):
    # Define arrays in shared memory.
    # The size and type of the arrays must be known at compile time.
    sA = cuda.shared.array(shape=(TPB, TPB), dtype=float32)
    sB = cuda.shared.array(shape=(TPB, TPB), dtype=float32)
    sC = cuda.shared.array(shape=(TPB, TPB), dtype=float32)

    x, y = cuda.grid(2)
    tx = cuda.threadIdx.x
    ty = cuda.threadIdx.y
    bpg = cuda.gridDim.x    # blocks per grid

    # Each thread computes one element in the result matrix.
    # The dot product is chunked into dot products of TPB-long vectors.
    tmp = 0.
    for i in range(bpg):
        # Preload data into shared memory
        if x < A.shape[0] and ty + i * TPB < A.shape[1]:
            sA[tx, ty] = A[x, ty + i * TPB]
        else:
            sA[tx, ty] = 0
        if tx + i * TPB < B.shape[0] and y < B.shape[1]:
            sB[tx, ty] = B[tx + i * TPB, y]
        else:
            sB[tx, ty] = 0

        # Wait until all threads finish preloading
        cuda.syncthreads()

        # Compute the partial product on the shared memory
        for j in range(TPB):
            tmp += sA[tx, j] * sB[j, ty]

        # Wait until all threads finish computing
        cuda.syncthreads()

    sC[tx, ty] = tmp
    cuda.syncthreads()

    # Reduce the rows of this block's tile of the product
    s = 1
    while s < TPB:
        if tx % (2 * s) == 0:
            # Stride by `s` and add
            sC[tx, ty] += sC[tx + s, ty]
        s *= 2
        cuda.syncthreads()

    # After the loop, the zeroth element contains the sum over the tile's rows
    if tx == 0 and y < out.size:
        cuda.atomic.add(out, y, sC[tx, ty])
# Move arrays to device (GPU)
A_gpu = cuda.to_device(A)
B_gpu = cuda.to_device(B)
out_gpu = cuda.to_device(out)
ntimes = 3
def init_numba():
    matmul_reduce[blockspergrid, threadsperblock](
        A_gpu, B_gpu, cuda.device_array_like(out))
def test_numba():
    for i in range(ntimes):
        matmul_reduce[blockspergrid, threadsperblock](A_gpu, B_gpu, out_gpu)
    # Copy result back to host
    out[...] = out_gpu.copy_to_host()
time_gpu = timeit(test_numba, setup=init_numba, number=1)
print(f'Time: {time_gpu / ntimes:.2f} s')
On an A40 GPU, this is not competitive with even NumPy, let alone CuPy. In fact, writing code that is competitive with the cuBLAS and cuTensor libraries requires many, many layers of optimization, as described in this excellent illustrated blog post. For this reason, it is always best to check whether a standard CUDA library already performs a similar function before attempting to write a kernel.
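For comparison, here is a minimal sketch of the same operation written with CuPy's standard array operations, which dispatch the product to cuBLAS (this assumes the arrays A and B defined above and a working CuPy installation; the names A_cp, B_cp and out_cp are only illustrative):
import cupy as cp

A_cp = cp.asarray(A)                # copy the NumPy arrays to the GPU
B_cp = cp.asarray(B)
out_cp = (A_cp @ B_cp).sum(axis=0)  # column sums of the product, shape (n,)
out_host = cp.asnumpy(out_cp)       # copy the result back to the host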
Numba can interact with CuPy; a cupy.ndarray can be used directly by a Numba kernel with no need for wrapping. Thus, it is best to use cupy for linear algebra and tensor operations, and Numba for more general CUDA programming, as Numba does not have any bindings to CUDA linear algebra or tensor libraries.
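As a small sketch of this interoperability (reusing A_cp and B_cp from the CuPy snippet above, and the matmul_reduce kernel defined earlier), CuPy arrays implement the CUDA array interface, so they can be passed straight into a Numba kernel launch:
out_cp = cp.zeros(n, dtype=cp.float32)

# Launch the Numba kernel on CuPy-owned device memory; no wrapping is needed.
matmul_reduce[blockspergrid, threadsperblock](A_cp, B_cp, out_cp)

# The result is an ordinary cupy.ndarray, usable by further CuPy operations.
print(float(out_cp.sum()))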
Good example: 2D diffusion¶
A good example of a relatively simple CUDA kernel that performs well is a diffusion kernel. First, we generate a jagged image:
import numpy as np
from numpy import fft
j, k = (2000, 2000)
r = np.random.default_rng(3966789)
field = r.random(size=(j, k), dtype=np.float32)
field = fft.fftshift(fft.fft2(field))
radius = 1 + np.sqrt((np.arange(j).reshape(-1, 1) - j // 2) ** 2 +
                     (np.arange(k).reshape(1, -1) - k // 2) ** 2)
field = field * (radius ** -1.7)
field = fft.ifft2(fft.ifftshift(field)) * (radius < 0.45 * j)
field = field.real * (field.real > np.percentile(field.real, 70))
Then we write a kernel which loads each block, plus extra rows and columns at the edges, into shared memory, and calculates the two-dimensional discrete Laplacian. It outputs the Laplacian (which we suppose we might want to know at any given time) and moves the diffusion forward by one step, scaled by a coefficient.
from numba import float32
import numba.cuda as cuda
TPB = 8
threadsperblock = (TPB, TPB)
blockspergrid = (int(np.ceil(field.shape[0] / TPB)),
                 int(np.ceil(field.shape[1] / TPB)))
# Compile-time constants.
shasize = TPB + 2
coeff = 0.25
@cuda.jit
def diffusion(field, lapl):
    x, y = cuda.grid(2)
    # Shared array has two extra rows and columns to store edge values
    sharedarray = cuda.shared.array(shape=(shasize, shasize), dtype=float32)
    tx = cuda.threadIdx.x
    ty = cuda.threadIdx.y

    # Don't try to load edge values from outside the array
    if tx == 0 and x > 0:
        sharedarray[0, ty + 1] = field[x - 1, y]
    if ty == 0 and y > 0:
        sharedarray[tx + 1, 0] = field[x, y - 1]
    if tx == TPB - 1 and x < field.shape[0] - 1:
        sharedarray[-1, ty + 1] = field[x + 1, y]
    if ty == TPB - 1 and y < field.shape[1] - 1:
        sharedarray[tx + 1, -1] = field[x, y + 1]

    # Load all inner values into tx + 1, ty + 1. (Here the field dimensions
    # are exact multiples of TPB, so x and y are always in bounds.)
    sharedarray[tx + 1, ty + 1] = field[x, y]
    val = 4 * sharedarray[tx + 1, ty + 1]

    # Synchronize after loading
    cuda.syncthreads()

    # Check that we are not at the very edges of the array.
    if x > 0:
        val -= sharedarray[tx, ty + 1]
    if y > 0:
        val -= sharedarray[tx + 1, ty]
    if x < field.shape[0] - 1:
        val -= sharedarray[tx + 2, ty + 1]
    if y < field.shape[1] - 1:
        val -= sharedarray[tx + 1, ty + 2]

    # Check that we are within boundaries when saving.
    if x < field.shape[0] and y < field.shape[1]:
        lapl[x, y] = val
        field[x, y] -= val * coeff
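For reference, here is a plain NumPy sketch of what a single update does on the interior of the array (the kernel above additionally handles the outer edges and works tile by tile in shared memory). Note that val holds four times the centre value minus the four neighbours, i.e. the Laplacian up to a sign (the negative of the usual convention), so subtracting coeff * val steps the diffusion forward:
def diffusion_reference(field, lapl, coeff=0.25):
    # CPU reference for the interior points only; names mirror the kernel.
    center = field[1:-1, 1:-1]
    val = (4 * center
           - field[:-2, 1:-1] - field[2:, 1:-1]
           - field[1:-1, :-2] - field[1:-1, 2:])
    lapl[1:-1, 1:-1] = val
    field[1:-1, 1:-1] -= coeff * val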
Suppose we want to iterate this 10000 times. How long will it take on an A40 GPU?
orig_field = field.copy()
field_gpu = cuda.to_device(field)
lapl = np.zeros_like(field)
lapl_gpu = cuda.to_device(lapl)
ntimes = 10000
def init_numba():
    diffusion[blockspergrid, threadsperblock](field_gpu, lapl_gpu)
def test_numba():
    for i in range(ntimes):
        diffusion[blockspergrid, threadsperblock](field_gpu, lapl_gpu)
    lapl[...] = lapl_gpu.copy_to_host()
    field[...] = field_gpu.copy_to_host()
time_gpu = timeit(test_numba, setup=init_numba, number=1)
print(f'Time per iteration: {time_gpu / ntimes} s')
import matplotlib.pyplot as plt
f, ax = plt.subplots(1, 3, figsize=(12, 3.5))
im1 = ax[0].imshow(orig_field)
ax[0].set_title('Original image')
im2 = ax[1].imshow(field)
ax[1].set_title('10000 iterations')
im3 = ax[2].imshow(lapl, cmap='inferno',
                   vmin=np.percentile(lapl, 5),
                   vmax=np.percentile(lapl, 95))
ax[2].set_title('Laplacian')
In the end, 10000 diffusion iterations took us only about 2.7 seconds! If we suppose that we do not care about knowing the Laplacian after every invocation, the time goes down to 2.4 seconds. This type of problem, which can be broken down into repeated action on a localized part of an array, is a typical case where a GPU implementation can be relatively easy to draft.