CuPy¶
CuPy is a Python package which provides CUDA wrappers for many functions in NumPy and SciPy. It allows for very easy porting of code written in NumPy and SciPy, and gives access to cuTensor, cuBLAS and cuSparse, often by simply replacing import numpy as np with import cupy as np.
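As a minimal sketch of this drop-in workflow (the array names here are illustrative, and we import CuPy under its own alias rather than shadowing np):

import numpy as np
import cupy as cp

x_cpu = np.arange(6, dtype=np.float32).reshape(2, 3)
x_gpu = cp.asarray(x_cpu)           # copy the array to the GPU
y_gpu = cp.exp(x_gpu).sum(axis=1)   # same API as NumPy, executed on the GPU
y_cpu = cp.asnumpy(y_gpu)           # copy the result back to the host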
The following examples were run on Vera nodes with 8 CPUs in a Jupyter notebook, with the modules CuPy, matplotlib and IPython loaded. We will show tests with A40 and T4 GPUs. Note that if your calculations particularly depend on using float64, you may be best off using V100 or A100 GPUs.
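If you are unsure which GPU your job was allocated, CuPy can query the CUDA runtime; a small sketch (depending on the CuPy version, the name field may be returned as bytes):

import cupy as cp

for i in range(cp.cuda.runtime.getDeviceCount()):
    props = cp.cuda.runtime.getDeviceProperties(i)
    name = props['name']
    print(i, name.decode() if isinstance(name, bytes) else name)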
Example: Matrix multiplication and reduction in NumPy and CuPy¶
import numpy as np
import cupy as cp
from timeit import timeit
dtype = np.float32
m, k, n = (10000, 5000, 7000)
# These matrices are quite large, needing about 800 MB RAM
r = np.random.default_rng(987654321)
matrix_1 = r.random(size=(m, k), dtype=dtype)
matrix_2 = r.random(size=(k, n), dtype=dtype)
result_matrix_cpu = np.zeros((n), dtype=dtype)
temp_cpu = np.zeros((m, n), dtype=dtype)
# `cp.array` will copy the matrices to the GPU.
matrix_1_gpu = cp.array(matrix_1, dtype=dtype)
matrix_2_gpu = cp.array(matrix_2, dtype=dtype)
temp_gpu = cp.zeros((m, n), dtype=dtype)
result_matrix_gpu = np.zeros_like(result_matrix_cpu, dtype=dtype)
# Small functions for testing
def cpu_test_1():
    for i in range(10):
        np.matmul(matrix_1, matrix_2, out=temp_cpu)
        np.sum(temp_cpu, axis=0, out=result_matrix_cpu)

def cpu_test_2():
    for i in range(10):
        np.einsum('ij,jk->...k', matrix_1, matrix_2,
                  out=result_matrix_cpu, optimize='optimal')

# To avoid over-syncing, we put the .get() outside the loop.
def gpu_test_1():
    for i in range(10):
        cp.matmul(matrix_1_gpu, matrix_2_gpu, out=temp_gpu)
    result_matrix_gpu[...] = cp.sum(temp_gpu, axis=0).get()

def gpu_test_2():
    for i in range(10):
        temp = cp.einsum('ij,jk->...k', matrix_1_gpu, matrix_2_gpu)
    result_matrix_gpu[...] = temp.get()
# Each test is timed once; every call performs 10 iterations of the
# computation. Note that we run each function once as setup before timing:
# for CuPy, the first function call is typically slower than subsequent ones.
time_cpu_1 = timeit(cpu_test_1, setup=cpu_test_1, number=1)
time_gpu_1 = timeit(gpu_test_1, setup=gpu_test_1, number=1)
# Print results.
print(f'Data type: {matrix_1.dtype}\n'
      '---------------------------------------------\n')
print('====TEST MATMUL -> SUM====\n'
      '==========================\n'
      f'CPU time: {time_cpu_1:.4f}\n'
      f'GPU time: {time_gpu_1:.4f}\n'
      f'Speedup: {time_cpu_1 / time_gpu_1:.2f}\n'
      '==========================\n')
check_elements = np.allclose(
    result_matrix_cpu, result_matrix_gpu, atol=1e-8, rtol=1e-8)
check_elements_looser = np.allclose(
    result_matrix_cpu, result_matrix_gpu, atol=1e-4, rtol=1e-4)
print(f'All elements within allowance of 1e-8: {check_elements}')
print(f'All elements within allowance of 1e-4: {check_elements_looser}\n')
if m*k*n < 1e10:
    time_cpu_2 = timeit(cpu_test_2, setup=cpu_test_2, number=1)
else:
    time_cpu_2 = np.inf
time_gpu_2 = timeit(gpu_test_2, setup=gpu_test_2, number=1)
# Print results.
print('=======TEST EINSUM========\n'
      '==========================\n'
      f'CPU time: {time_cpu_2:.4f}\n'
      f'GPU time: {time_gpu_2:.4f}\n'
      f'Speedup: {time_cpu_2 / time_gpu_2:.2f}\n'
      '===========================\n')
check_elements = np.allclose(
    result_matrix_cpu, result_matrix_gpu, atol=1e-8, rtol=1e-8)
check_elements_looser = np.allclose(
    result_matrix_cpu, result_matrix_gpu, atol=1e-4, rtol=1e-4)
print(f'All elements within allowance of 1e-8: {check_elements}')
print(f'All elements within allowance of 1e-4: {check_elements_looser}\n')
cpu_best = min(time_cpu_1, time_cpu_2)
gpu_best = min(time_gpu_1, time_gpu_2)
print('=======BEST TIMES========\n'
      '==========================\n'
      f'CPU time: {cpu_best:.4f}\n'
      f'GPU time: {gpu_best:.4f}\n'
      f'Speedup: {cpu_best / gpu_best:.2f}\n'
      '==========================\n')
print('--------------------------------------------\n')
Results for A40 GPU¶
>>> Data type: float32
>>> ---------------------------------------------
>>>
>>> ====TEST MATMUL -> SUM====
>>> ==========================
>>> CPU time: 5.4996
>>> GPU time: 0.3111
>>> Speedup: 17.68
>>> ==========================
>>>
>>> All elements within allowance of 1e-8: False
>>> All elements within allowance of 1e-4: True
>>>
>>> =======TEST EINSUM========
>>> ==========================
>>> CPU time: inf
>>> GPU time: 0.0168
>>> Speedup: inf
>>> ===========================
>>>
>>> All elements within allowance of 1e-8: False
>>> All elements within allowance of 1e-4: True
>>>
>>> =======BEST TIMES========
>>> ==========================
>>> CPU time: 5.4996
>>> GPU time: 0.0168
>>> Speedup: 327.41
>>> ==========================
>>>
>>> --------------------------------------------
>>> --------------------------------------------
>>>
>>> Data type: float64
>>> ---------------------------------------------
>>>
>>> ====TEST MATMUL -> SUM====
>>> ==========================
>>> CPU time: 10.1334
>>> GPU time: 14.1453
>>> Speedup: 0.72
>>> ==========================
>>>
>>> All elements within allowance of 1e-8: True
>>> All elements within allowance of 1e-4: True
>>>
>>> =======TEST EINSUM========
>>> ==========================
>>> CPU time: inf
>>> GPU time: 0.0289
>>> Speedup: inf
>>> ===========================
>>>
>>> All elements within allowance of 1e-8: True
>>> All elements within allowance of 1e-4: True
>>>
>>> =======BEST TIMES========
>>> ==========================
>>> CPU time: 10.1334
>>> GPU time: 0.0289
>>> Speedup: 350.36
>>> ==========================
>>>
>>> --------------------------------------------
Note that for the matrix multiplication test, the CuPy version is only faster for the data type float32, and slower than NumPy for float64. On the other hand, NumPy refuses to parallelize this particular np.einsum path (more restrictive paths which do not require reduction can be parallelized), which makes it nearly impossible to run with large matrices, whereas CuPy can compile it into a single, efficiently executed CUDA kernel and thereby obtain a large speedup.
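To see which contraction path NumPy chooses for an einsum expression, np.einsum_path can be inspected; a small sketch with toy matrices (using the equivalent subscripts 'ij,jk->k'):

import numpy as np

a = np.random.default_rng(0).random((100, 50), dtype=np.float32)
b = np.random.default_rng(1).random((50, 70), dtype=np.float32)
path, report = np.einsum_path('ij,jk->k', a, b, optimize='optimal')
print(report)  # prints the chosen contraction order and estimated cost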
Additionally, note that if we set the environment variable CUPY_TF32 to 1, the float32 matrix multiplication will see a speedup of nearly 3 times, to 0.1168 seconds. This is useful if your code relies heavily on matrix multiplications. Another option is to use float16, which speeds up the multiplication further still, to 0.0668 seconds (a further speedup of about 2 times), and the einsum statement to 0.0125 seconds (a speedup of about 1.35 times).
The get() method of CuPy arrays forces synchronization at the end of each test, which penalizes CuPy somewhat; in a longer computation where most of the work is done on the GPU, CuPy would see larger benefits.
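For timing that properly accounts for synchronization, recent CuPy versions provide cupyx.profiler.benchmark, which times the CPU and GPU sides separately and synchronizes the device for each repeat; a sketch, reusing gpu_test_2 from above:

from cupyx.profiler import benchmark

print(benchmark(gpu_test_2, n_repeat=3, n_warmup=1))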
Results for T4 GPU¶
>>> Data type: float32
>>> ---------------------------------------------
>>>
>>> ====TEST MATMUL -> SUM====
>>> ==========================
>>> CPU time: 6.5629
>>> GPU time: 1.6036
>>> Speedup: 4.09
>>> ==========================
>>>
>>> All elements within allowance of 1e-8: False
>>> All elements within allowance of 1e-4: True
>>>
>>> =======TEST EINSUM========
>>> ==========================
>>> CPU time: inf
>>> GPU time: 0.0417
>>> Speedup: inf
>>> ===========================
>>>
>>> All elements within allowance of 1e-8: False
>>> All elements within allowance of 1e-4: True
>>>
>>> =======BEST TIMES========
>>> ==========================
>>> CPU time: 6.5629
>>> GPU time: 0.0417
>>> Speedup: 157.38
>>> ==========================
>>>
>>> --------------------------------------------
>>> --------------------------------------------
>>>
>>> Data type: float64
>>> --------------------------------------------
>>>
>>> ====TEST MATMUL -> SUM====
>>> ==========================
>>> CPU time: 13.0603
>>> GPU time: 27.8300
>>> Speedup: 0.47
>>> ==========================
>>>
>>> All elements within allowance of 1e-8: True
>>> All elements within allowance of 1e-4: True
>>>
>>> =======TEST EINSUM========
>>> ==========================
>>> CPU time: inf
>>> GPU time: 0.0730
>>> Speedup: inf
>>> ===========================
>>>
>>> All elements within allowance of 1e-8: True
>>> All elements within allowance of 1e-4: True
>>>
>>> =======BEST TIMES========
>>> ==========================
>>> CPU time: 13.0603
>>> GPU time: 0.0730
>>> Speedup: 178.97
>>> ==========================
>>>
>>> --------------------------------------------
The results for the T4 GPU are broadly similar, except that the A40 is about 2 (for float64) to 5 (for float32) times as fast as the T4, which matches roughly with their stated capabilities on the hardware details page.
The T4 has much more to gain from using float16 than the A40, reducing the matrix multiplication time to 0.3290 seconds (a speedup of about 5 times) and the einsum operation to 0.0317 seconds (a speedup of 2 times). However, caution is needed with float16, since this format can incur significant precision losses and thereby affect numerical accuracy.
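The limited precision of float16 is easy to demonstrate on the CPU side:

import numpy as np

print(np.finfo(np.float16).eps)   # ~9.77e-04, versus ~1.19e-07 for float32
print(np.float32(np.float16(0.1)))  # 0.1 is stored as roughly 0.09998
print(np.float16(1000.1) - np.float16(1000))  # 0.0: spacing near 1000 is 0.5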
Example: Regularized least squares with sparse matrices¶
This is a more complicated example, using a common type of optimization: ridge regression, i.e. least squares with $L_2$ regularization, on a sparse matrix equation.
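The implementation below relies on the fact that the regularized problem is equivalent to an ordinary least-squares problem on a vertically stacked system:

$$\min_x \left( \|Ax - b\|_2^2 + \lambda^2 \|x\|_2^2 \right) = \min_x \left\| \begin{pmatrix} A \\ \lambda I \end{pmatrix} x - \begin{pmatrix} b \\ 0 \end{pmatrix} \right\|_2^2$$

This is why the code appends a scaled identity matrix to the system matrix and zeros to the data vector.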
import cupyx.scipy.sparse as sp
import cupyx.scipy.sparse.linalg as splg
import numpy as np
from scipy.sparse import random as sparse_random
import cupy as cp
from time import time
r = np.random.default_rng(1234)
rows = 15000
columns = 15000
# We create a random system matrix with a density of 0.1 (thus, fairly dense)
system_matrix = sparse_random(
    m=rows, n=columns, density=0.1, dtype=np.float32, random_state=r)
system_matrix = sp.csr_matrix(system_matrix)
# We create random "data" using a normal distribution.
data = cp.array(r.normal(loc=3., scale=1., size=(rows, 1)))
extended_data = cp.vstack([data, cp.zeros((columns, 1))])
# Our solver, written as a generator.
def solve_reg_system(coeff_range, coeff_number):
    coeffs = np.geomspace(*coeff_range, coeff_number)
    for reg_coeff in coeffs:
        # Our regularization can be represented by simply appending a scaled
        # identity matrix to the system matrix.
        reg_matrix = reg_coeff * sp.eye(rows, format="csr")
        reg_system_matrix = sp.vstack([system_matrix, reg_matrix])
        result = splg.lsmr(A=reg_system_matrix, b=extended_data)
        model = result[0][:rows]
        yield (reg_coeff,
               cp.linalg.norm(system_matrix @ model - data.ravel()).get(),
               cp.linalg.norm(reg_matrix @ model).get() / reg_coeff)
wall_time = time()
# We hand the range of regularization coefficients to the solver.
solver = solve_reg_system((1e-1, 5e3), 10)
l_curve = np.array([t for t in solver])
coeff, residual, reg_norm = l_curve[:, 0], l_curve[:, 1], l_curve[:, 2]
wall_time = time() - wall_time
print(f'Wall time: {wall_time} s')
For comparison, running the nearly identical NumPy/SciPy code (replacing cp with np and removing the .get() methods) yields a wall time of around 90 s, for a speedup of about 20-40. This code is far from optimal for this problem, but the speedup is still substantial, although it will depend greatly on the parameters of the system matrix.
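A sketch of what changes for the CPU comparison (the rest of the script is unchanged):

# CPU variant of the imports; everything else stays the same.
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as splg
from scipy.sparse import random as sparse_random

# The arrays stay as plain NumPy arrays (no cp.array), and the norms in the
# generator lose their .get() calls:
#     np.linalg.norm(system_matrix @ model - data.ravel())
#     np.linalg.norm(reg_matrix @ model) / reg_coeff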
We plot the norm of the solution vector against the norm of the residual to get an "L-curve".
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
ax.set_xscale('log')
ax.set_yscale('log')
ax.set_ylabel('2-norm of vector')
ax.set_xlabel('2-norm of residual')
ax.plot(residual, reg_norm, '--', marker='v')
for i, c in enumerate(coeff):
    ax.annotate(f'{c:.1e}', (residual[i], reg_norm[i]))
CuPy is somewhat limited in scope, focusing exclusively on replicating the functionality of these libraries. This means that it is not necessarily optimal in terms of performance, and as of this writing it does not provide any convenient tools for handling common limitations of GPU-based programming, such as running out of VRAM. It can be particularly ungraceful when handling e.g. sparse matrix-dense vector multiplication, requesting copious amounts of VRAM rather than attempting to slice the problem.
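If you need to keep an eye on VRAM usage yourself, CuPy does expose the allocation state; a small sketch:

import cupy as cp

free_b, total_b = cp.cuda.Device().mem_info  # free/total device memory, bytes
print(f'Free VRAM: {free_b / 1e9:.2f} of {total_b / 1e9:.2f} GB')

pool = cp.get_default_memory_pool()
print(f'Memory pool holds: {pool.total_bytes() / 1e9:.2f} GB')
pool.free_all_blocks()  # return cached, unused blocks to the device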
However, it is very useful for speeding up already existing code, especially if the code is also modified to best leverage CuPy. In particular, as we have seen, across-the-board improvements are seen for float32 operations, but in cases where the GPU hardware is utilized most effectively, float64 operations can also benefit greatly. Moreover, CuPy is one of the most convenient ways to use the cuTensor, cuBLAS, and cuSparse libraries.