CuPy

CuPy is a Python package that provides CUDA implementations of many functions in NumPy and SciPy. It makes porting code written with NumPy and SciPy very easy, and gives access to cuTENSOR, cuBLAS, and cuSPARSE, often by simply replacing import numpy as np with import cupy as np.
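As a minimal sketch of this drop-in pattern (not part of the benchmarks below), the same function can run on either library, and cp.asnumpy copies the result back to the host:

import numpy as np
import cupy as cp

def normalize(xp, x):
    # Works with either numpy or cupy passed in as `xp`.
    return (x - xp.mean(x)) / xp.std(x)

x_cpu = np.random.default_rng(0).random(1000, dtype=np.float32)
x_gpu = cp.asarray(x_cpu)           # copy to the GPU
y_gpu = normalize(cp, x_gpu)        # runs on the GPU
y_cpu = cp.asnumpy(y_gpu)           # copy back to the host
print(np.allclose(normalize(np, x_cpu), y_cpu, atol=1e-6))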

The following examples were run on Vera nodes with 8 CPU cores in a Jupyter notebook, with the CuPy, matplotlib, and IPython modules loaded. We show tests with A40 and T4 GPUs. Note that if your calculations depend heavily on float64, you may be better off using V100 or A100 GPUs.
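If you want to confirm which GPU your job was allocated before benchmarking, something like the following sketch, using CuPy's CUDA runtime API, should work:

import cupy as cp

# Query the properties of the GPU visible to this process.
props = cp.cuda.runtime.getDeviceProperties(cp.cuda.Device().id)
print(props['name'])                                  # e.g. b'NVIDIA A40'
print(f"Compute capability: {props['major']}.{props['minor']}")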

Example: Matrix multiplication and reduction in NumPy and CuPy

import numpy as np
import cupy as cp
from timeit import timeit

dtype = np.float32
m, k, n = (10000, 5000, 7000)
# These matrices are quite large, needing about 800 MB RAM
r = np.random.default_rng(987654321)
matrix_1 = r.random(size=(m, k), dtype=dtype)
matrix_2 = r.random(size=(k, n), dtype=dtype)
result_matrix_cpu = np.zeros(n, dtype=dtype)
temp_cpu = np.zeros((m, n), dtype=dtype)

# `cp.array` will copy the matrices to the GPU.
matrix_1_gpu = cp.array(matrix_1, dtype=dtype)
matrix_2_gpu = cp.array(matrix_2, dtype=dtype)
temp_gpu = cp.zeros((m, n), dtype=dtype)
# Host (NumPy) array that will receive results copied back from the GPU.
result_matrix_gpu = np.zeros_like(result_matrix_cpu, dtype=dtype)

# Small functions for testing
def cpu_test_1():
    for i in range(10):
        np.matmul(matrix_1, matrix_2, out=temp_cpu)
        np.sum(temp_cpu, axis=0, out=result_matrix_cpu)

def cpu_test_2():
    for i in range(10):
        np.einsum('ij,jk->...k', matrix_1, matrix_2,
                  out=result_matrix_cpu, optimize='optimal')

# To avoid synchronizing on every iteration, we put the .get() outside the loop.
def gpu_test_1():
    for i in range(10):
        cp.matmul(matrix_1_gpu, matrix_2_gpu, out=temp_gpu)
    result_matrix_gpu[...] = cp.sum(temp_gpu, axis=0).get()

def gpu_test_2():
    for i in range(10):
        temp = cp.einsum('ij,jk->...k', matrix_1_gpu, matrix_2_gpu)
    result_matrix_gpu[...] = temp.get()

# We time a single call of each function (each performs 10 iterations internally).
# Each function is also run once as `setup`: for CuPy, the first call is
# typically slower than subsequent ones (initialization, kernel compilation).
time_cpu_1 = timeit(cpu_test_1, setup=cpu_test_1, number=1)
time_gpu_1 = timeit(gpu_test_1, setup=gpu_test_1, number=1)

# Print results.
print(f'Data type: {matrix_1.dtype}\n'
      '---------------------------------------------\n')

print('====TEST MATMUL -> SUM====\n'
      '==========================\n'
      f'CPU time: {time_cpu_1:.4f}\n'
      f'GPU time: {time_gpu_1:.4f}\n'
      f'Speedup: {time_cpu_1 / time_gpu_1:.2f}\n'
      '==========================\n')
check_elements = np.allclose(
    result_matrix_cpu, result_matrix_gpu, atol=1e-8, rtol=1e-8)
check_elements_looser = np.allclose(
    result_matrix_cpu, result_matrix_gpu, atol=1e-4, rtol=1e-4)

print(f'All elements within allowance of 1e-8: {check_elements}')
print(f'All elements within allowance of 1e-4: {check_elements_looser}\n')
# Skip the CPU einsum when the problem is large; NumPy does not
# parallelize this path, so it would take very long (see below).
if m*k*n < 1e10:
    time_cpu_2 = timeit(cpu_test_2, setup=cpu_test_2, number=1)
else:
    time_cpu_2 = np.inf
time_gpu_2 = timeit(gpu_test_2, setup=gpu_test_2, number=1)

# Print results.
print('=======TEST EINSUM========\n'
      '==========================\n'
      f'CPU time: {time_cpu_2:.4f}\n'
      f'GPU time: {time_gpu_2:.4f}\n'
      f'Speedup: {time_cpu_2 / time_gpu_2:.2f}\n'
      '===========================\n')
check_elements = np.allclose(
    result_matrix_cpu, result_matrix_gpu, atol=1e-8, rtol=1e-8)
check_elements_looser = np.allclose(
    result_matrix_cpu, result_matrix_gpu, atol=1e-4, rtol=1e-4)

print(f'All elements within allowance of 1e-8: {check_elements}')
print(f'All elements within allowance of 1e-4: {check_elements_looser}\n')

cpu_best = min(time_cpu_1, time_cpu_2)
gpu_best = min(time_gpu_1, time_gpu_2)
print('=======BEST TIMES========\n'
      '==========================\n'
      f'CPU time: {cpu_best:.4f}\n'
      f'GPU time: {gpu_best:.4f}\n'
      f'Speedup: {cpu_best / gpu_best:.2f}\n'
      '==========================\n')
print('--------------------------------------------\n')

Results for A40 GPU

>>> Data type: float32
>>> ---------------------------------------------
>>>
>>> ====TEST MATMUL -> SUM====
>>> ==========================
>>> CPU time: 5.4996
>>> GPU time: 0.3111
>>> Speedup: 17.68
>>> ==========================
>>>
>>> All elements within allowance of 1e-8: False
>>> All elements within allowance of 1e-4: True
>>>
>>> =======TEST EINSUM========
>>> ==========================
>>> CPU time: inf
>>> GPU time: 0.0168
>>> Speedup: inf
>>> ===========================
>>>
>>> All elements within allowance of 1e-8: False
>>> All elements within allowance of 1e-4: True
>>>
>>> =======BEST TIMES========
>>> ==========================
>>> CPU time: 5.4996
>>> GPU time: 0.0168
>>> Speedup: 327.41
>>> ==========================
>>>
>>> --------------------------------------------
>>> --------------------------------------------
>>>
>>> Data type: float64
>>> ---------------------------------------------
>>>
>>> ====TEST MATMUL -> SUM====
>>> ==========================
>>> CPU time: 10.1334
>>> GPU time: 14.1453
>>> Speedup: 0.72
>>> ==========================
>>>
>>> All elements within allowance of 1e-8: True
>>> All elements within allowance of 1e-4: True
>>>
>>> =======TEST EINSUM========
>>> ==========================
>>> CPU time: inf
>>> GPU time: 0.0289
>>> Speedup: inf
>>> ===========================
>>>
>>> All elements within allowance of 1e-8: True
>>> All elements within allowance of 1e-4: True
>>>
>>> =======BEST TIMES========
>>> ==========================
>>> CPU time: 10.1334
>>> GPU time: 0.0289
>>> Speedup: 350.36
>>> ==========================
>>>
>>> --------------------------------------------

Note that for the matrix multiplication test, the CuPy version is only faster for the data type float32; for float64 it is actually somewhat slower than NumPy. On the other hand, NumPy refuses to parallelize this particular np.einsum path (more restrictive paths that do not require a reduction can be parallelized), which makes it prohibitively slow for large matrices, whereas CuPy can compile it into a single, efficiently executed CUDA kernel and thereby obtain a large speedup.
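To see how NumPy plans this contraction on the CPU, np.einsum_path prints the chosen path and its estimated cost. A quick check might look like the following sketch (small hypothetical sizes, and the subscripts written without the ellipsis for clarity):

import numpy as np

a = np.random.default_rng(0).random((100, 50), dtype=np.float32)
b = np.random.default_rng(1).random((50, 70), dtype=np.float32)
# Returns the contraction order and a human-readable cost breakdown.
path, info = np.einsum_path('ij,jk->k', a, b, optimize='optimal')
print(info)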

Additionally, note that if we set the environment variable CUPY_TF32 to 1, the float32 matrix multiplication sees a speedup of nearly 3 times, to 0.1168 seconds. This is useful if your code relies heavily on matrix multiplications. Another option is float16, which speeds up the multiplication further still, to 0.0668 seconds (roughly another factor of 2), and the einsum statement to 0.0125 seconds (a speedup of about 1.35 times).
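As far as we know, CUPY_TF32 is read when CuPy initializes, so it should be set before CuPy is first imported; a sketch of one way to do this from Python:

import os
# Alternatively, `export CUPY_TF32=1` in the job script before launching Python.
os.environ['CUPY_TF32'] = '1'

import cupy as cp  # float32 matmuls may now use TF32 tensor cores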

The get() method of CuPy arrays forces synchronization at the end of each test, which penalizes CuPy somewhat; in a longer computation where most of the work stays on the GPU, CuPy would see larger benefits.
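For more careful GPU timing, CuPy also ships a benchmarking helper that inserts the required synchronization for you and reports CPU-side and GPU-side times separately. A sketch, continuing from the arrays defined in the example above:

from cupyx.profiler import benchmark

def matmul_once():
    cp.matmul(matrix_1_gpu, matrix_2_gpu, out=temp_gpu)

# Runs the function repeatedly, synchronizing correctly, and prints
# both the CPU time and the GPU kernel time per repetition.
print(benchmark(matmul_once, n_repeat=10, n_warmup=1))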

Results for T4 GPU

>>> Data type: float32
>>> ---------------------------------------------
>>>
>>> ====TEST MATMUL -> SUM====
>>> ==========================
>>> CPU time: 6.5629
>>> GPU time: 1.6036
>>> Speedup: 4.09
>>> ==========================
>>>
>>> All elements within allowance of 1e-8: False
>>> All elements within allowance of 1e-4: True
>>>
>>> =======TEST EINSUM========
>>> ==========================
>>> CPU time: inf
>>> GPU time: 0.0417
>>> Speedup: inf
>>> ===========================
>>>
>>> All elements within allowance of 1e-8: False
>>> All elements within allowance of 1e-4: True
>>>
>>> =======BEST TIMES========
>>> ==========================
>>> CPU time: 6.5629
>>> GPU time: 0.0417
>>> Speedup: 157.38
>>> ==========================
>>>
>>> --------------------------------------------
>>> --------------------------------------------
>>>
>>> Data type: float64
>>> --------------------------------------------
>>>
>>> ====TEST MATMUL -> SUM====
>>> ==========================
>>> CPU time: 13.0603
>>> GPU time: 27.8300
>>> Speedup: 0.47
>>> ==========================
>>>
>>> All elements within allowance of 1e-8: True
>>> All elements within allowance of 1e-4: True
>>>
>>> =======TEST EINSUM========
>>> ==========================
>>> CPU time: inf
>>> GPU time: 0.0730
>>> Speedup: inf
>>> ===========================
>>>
>>> All elements within allowance of 1e-8: True
>>> All elements within allowance of 1e-4: True
>>>
>>> =======BEST TIMES========
>>> ==========================
>>> CPU time: 13.0603
>>> GPU time: 0.0730
>>> Speedup: 178.97
>>> ==========================
>>>
>>> --------------------------------------------

The results for the T4 GPU are broadly similar, except that the A40 is about 2 (for float64) to 5 (for float32) times as fast as the T4, which matches roughly with their stated capabilities on the hardware details page. The T4 has much more to gain from float16 than the A40, reducing the matrix multiplication time to 0.3290 seconds (a speedup of about 5 times) and the einsum time to 0.0317 seconds (a speedup of about 1.3 times). However, caution is needed with float16, since its limited precision can significantly affect numerical accuracy.
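As a quick, standalone illustration of the precision concern (run on the CPU with NumPy, not part of the benchmark above), we can compare float16 and float32 matrix products against a float64 reference:

import numpy as np

r = np.random.default_rng(0)
a = r.random((1000, 1000))
b = r.random((1000, 1000))
ref = a @ b  # float64 reference
# Maximum absolute deviation from the reference for each reduced precision.
err_16 = np.max(np.abs(ref - (a.astype(np.float16) @ b.astype(np.float16))))
err_32 = np.max(np.abs(ref - (a.astype(np.float32) @ b.astype(np.float32))))
print(f'float16 error: {err_16:.3g}, float32 error: {err_32:.3g}')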

Example: Regularized least squares with sparse matrices

This is a more complicated example, using a common type of optimization: ridge regression, i.e. least squares with $L_2$ regularization, applied to a sparse matrix equation.
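Concretely, for each regularization coefficient $\lambda$ the solver below minimizes the damped least-squares objective, which is equivalent to an ordinary least-squares problem on an augmented system; this is why the code stacks a scaled identity block under the system matrix and zeros under the data:

$$
\min_x \,\lVert A x - b \rVert_2^2 + \lambda^2 \lVert x \rVert_2^2
\quad\Longleftrightarrow\quad
\min_x \,\left\lVert
\begin{pmatrix} A \\ \lambda I \end{pmatrix} x
- \begin{pmatrix} b \\ 0 \end{pmatrix}
\right\rVert_2^2
$$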

import cupyx.scipy.sparse as sp
import cupyx.scipy.sparse.linalg as splg
import numpy as np
from scipy.sparse import random as sparse_random
import cupy as cp
from time import time

r = np.random.default_rng(1234)
rows = 15000
columns = 15000
# We create a random system matrix with a density of 0.1 (thus, fairly dense).
system_matrix = sparse_random(
    m=rows, n=columns, density=0.1, dtype=np.float32, random_state=r)
system_matrix = sp.csr_matrix(system_matrix)

# We create random "data" using a normal distribution. We keep the
# vectors one-dimensional so that later arithmetic does not broadcast.
data = cp.array(r.normal(loc=3., scale=1., size=rows))
extended_data = cp.concatenate([data, cp.zeros(columns)])

# Our solver, which will behave like an iterator.
def solve_reg_system(coeff_range, coeff_number):
    coeffs = np.geomspace(*coeff_range, coeff_number)
    for reg_coeff in coeffs:
        # Our regularization can be represented by simply appending a
        # scaled identity matrix to the system matrix.
        reg_matrix = reg_coeff * sp.eye(columns, format="csr")
        reg_system_matrix = sp.vstack([system_matrix, reg_matrix])
        result = splg.lsmr(A=reg_system_matrix, b=extended_data)
        model = result[0]
        yield (reg_coeff, cp.linalg.norm(system_matrix @ model - data).get(),
               cp.linalg.norm(reg_matrix @ model).get() / reg_coeff)

wall_time = time()
# We hand the range of regularization coefficients to the solver.
solver = solve_reg_system((1e-1, 5e3), 10)
l_curve = np.array([t for t in solver])
coeff, residual, reg_norm = l_curve[:, 0], l_curve[:, 1], l_curve[:, 2]
wall_time = time() - wall_time
print(f'Wall time: {wall_time:.2f} s')
# A40
>>> Wall time: 2.54 s
# T4
>>> Wall time: 4.46 s

For comparison, running the nearly identical NumPy/SciPy code (replacing cp with np and removing the .get() calls) yields a wall time of around 90 s, for a speedup of about 20-40 times. This code is far from optimal for the problem, but the speedup is still substantial, although it will depend greatly on the parameters of the system matrix.
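As a sketch of those substitutions (reusing the names from the example above, not a full rerun), the CPU variant only swaps the imports and drops the device transfers:

import scipy.sparse as sp
import scipy.sparse.linalg as splg
import numpy as np

# ...same setup as above, but with cp.array / cp.concatenate replaced by
# plain NumPy arrays and every .get() removed, e.g.:
data = r.normal(loc=3., scale=1., size=rows)
extended_data = np.concatenate([data, np.zeros(columns)])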

We plot the residual as a function of the regularization norm to get an "L-curve".

import matplotlib.pyplot as plt
fig, ax = plt.subplots()
ax.set_xscale('log')
ax.set_yscale('log')
ax.set_ylabel('2-norm of vector')
ax.set_xlabel('2-norm of residual')
ax.plot(residual, reg_norm, '--', marker='v')
for i, c in enumerate(coeff):
    ax.annotate(f'{c:.1e}', (residual[i], reg_norm[i]))

L-curve plot for the regularized optimization

CuPy is somewhat limited in scope, focusing exclusively on replicating the functionality of NumPy and SciPy. This means that it is not necessarily optimal in terms of performance, and as of this writing it does not provide any convenient tools for handling common limitations of GPU programming, such as running out of VRAM. It can be particularly ungraceful when handling e.g. sparse matrix-dense vector multiplication, requesting copious amounts of VRAM rather than attempting to slice the problem.
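Since CuPy will not slice an oversized problem for you, it can pay to check available memory before allocating. A minimal sketch using CuPy's device and memory pool APIs (the 30 GB cap below is a hypothetical value):

import cupy as cp

# Free and total VRAM on the current device, in bytes.
free, total = cp.cuda.Device().mem_info
print(f'VRAM free: {free / 1e9:.1f} GB of {total / 1e9:.1f} GB')

# The default memory pool tracks what CuPy has currently allocated;
# a limit makes allocations fail early instead of grabbing all VRAM.
pool = cp.get_default_memory_pool()
print(f'Pool in use: {pool.used_bytes() / 1e9:.2f} GB')
pool.set_limit(size=30 * 1024**3)  # hypothetical 30 GB cap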

However, it is very useful for speeding up existing code, especially if the code is also modified to best leverage CuPy. In particular, as we have seen, float32 operations improve across the board, and in cases where the GPU hardware is used effectively, float64 operations can also benefit greatly. Moreover, CuPy is one of the most convenient ways to use the cuTENSOR, cuBLAS, and cuSPARSE libraries.