# Profiling
C3SE provides several tools to assist with profiling your applications.
## Questions profiling may answer
- How well is the GPU utilised?
- Where in my code is most time spent?
- Where is the greatest opportunity for optimisation? Memory transfers, communication, floating-point operations, or something else?
- What should I tweak in my model to lower training time?
## General profiling advice
A big part of profiling is picking the right tool for the job. Using a low-level tool for system analysis will most likely overwhelm you with graphs and traces in such detail that you miss the forest for the trees. The opposite, using a system-level tool for low-level analysis, is of course not ideal either, as the bottlenecks might be abstracted away.
- Start at the top - Start with a high-level (or "system") profiler and see how far you get. A high-level tool should highlight "hotspots", i.e. sections of your code eligible for optimisation, which you might then need to investigate further with a medium- or low-level tool.
- Know your application - Check if your application, library or framework provides its own profiling facilities. Large and popular code bases often include built-in profiling features, and spending just a few minutes reading the documentation can save you a lot of time!
- Profile your environment - If you see unexplained performance drops and you have not changed your code, something else in your environment has likely changed. Check the C3SE status page for maintenance work, service updates or other system news. System, Slurm, kernel or driver updates are applied regularly to fix bugs and security problems. These impact performance to varying degrees and may make your jobs slower. If you suspect an unannounced system change has made your code slower, please contact the support at support@c3se.chalmers.se.
- Remember where you run - Not all compute nodes have the same set of hardware. For instance, Alvis is equipped with several different types of GPUs. Make sure you profile your application on the same hardware on which you intend to run your production workloads. Note that once Slurm has started your job you can SSH into the compute nodes running it. On the compute node you can use the standard Linux utilities (`top`, `ps`, `perf`, etc.) as well as the tools described on this page to profile your job; see the sketch after this list. Read more about how jobs are run in the Running jobs guide.
- Remember what you run - Always specify which version of a module you load, e.g. prefer `module load X/1.0` to `module load X`. Do not rely on the default versions as they might get updated!
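As a minimal sketch of the two points above (the job script name and node name are hypothetical placeholders), you could locate a running job and attach to its node like this:

```bash
# Submit the job as usual; note which hardware you request
sbatch my_job.sh                      # hypothetical job script

# Find out which node(s) the running job was allocated
squeue -u $USER --format="%.10i %.20j %.8T %N"

# SSH into one of the listed nodes while the job is still running
ssh alvis2-02                         # hypothetical node name taken from the squeue output

# On the node, inspect your processes with the standard tools
top
nvidia-smi

# And always load modules with an explicit version
module load X/1.0
```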
## High-level tools - job_stats.py and DCGM
The first tool in your profiling arsenal should be the `job_stats.py` tool and the second, if you are running on Alvis, the DCGM report that gets automatically generated for each completed job. These high-level tools are readily available, easy to use, and provide a nice overview of your job metrics with plenty of details to explore if needed.
### job_stats.py
The `job_stats.py` tool is an in-house written tool available on all C3SE computing systems. `job_stats.py` takes a Slurm job ID and returns a link to a Grafana dashboard.
```
$ job_stats.py 39375
https://scruffy.c3se.chalmers.se/d/alvis-job/alvis-job?var-jobid=39375&from=1612571366000&to=1612571370000
```
Note: Your job must have been started by Slurm before you run `job_stats.py`.
The URL will take you to an interactive dashboard with real-time metrics of your running job.
In the top right corner you can adjust the time interval. You can click and interact with each graph and highlight specific points in time.
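If you no longer have the job ID at hand, a minimal sketch for looking it up and opening the dashboard could be (the job ID below is just the one from the example above):

```bash
# List your recent jobs (allocations only) to find the job ID
sacct -X --format=JobID,JobName,Start,Elapsed,State

# Ask job_stats.py for the Grafana dashboard link of that job
job_stats.py 39375
```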
### DCGM statistics report - how did your job run on the GPUs
The NVIDIA Data Center GPU Manager (DCGM) is an API with many different features, including performance profiling. DCGM profiles processes by continuously collecting low-level metrics from the device (i.e. the GPU) itself by polling hardware counters. This data can be quite tricky to extract and interpret yourself, as it requires information about how the administrators have assigned and configured the GPUs in the system, as well as some knowledge of the hardware architecture. Alvis runs the command-line tool `dcgmi` in the background for each job and automatically generates a statistics report for each of your jobs as they complete. For multi-node jobs you get one report per node. The report (or reports) shows up inside the job working directory (the directory from which you submitted your job).
```
$ ls
job_script.sh dcgm-gpu-stats-alvis2-02-jobid-39378.log slurm-39378.out
```
The filename consists of the compute node hostname and the job ID. In the above example the report was generated on compute node `alvis2-02` for job `39378`. The report contains information summarised in Execution (total execution time), Performance (energy, SM and memory utilisation), and Events (ECC errors) sections. You can explore the `dcgmi dmon` command yourself to generate similar reports.
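A minimal sketch of exploring `dcgmi dmon` yourself is shown below. The exact flags and field IDs are assumptions based on the DCGM documentation; run `dcgmi dmon --help` to verify them on the system.

```bash
# List the field IDs that dcgmi dmon can report (assumption: -l prints the list)
dcgmi dmon -l

# Sample GPU utilisation (203), framebuffer memory used (252) and power draw (155)
# once per second for 30 samples; the field IDs are assumptions, verify with -l
dcgmi dmon -e 203,252,155 -d 1000 -c 30
```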
### nvidia-smi and nvtop - top-like tools for GPUs
The NVIDIA System Management Interface tool, or `nvidia-smi`, is in NVIDIA's own words "a command line utility aimed at managing and monitoring NVIDIA GPU devices [...]" (link). `nvidia-smi` is just one of several tools that run on top of the NVIDIA Management Library (NVML) (see also `nvtop` below), hence you can find plenty of tools with similar reporting abilities online (and you can even create your own). `nvidia-smi` provides similar data to `dcgmi`, but `nvidia-smi` is in general easier to use. The `nvidia-smi` tool comes bundled with the CUDA Toolkit, and if you have dabbled with CUDA on your private workstation chances are you already have it installed. `nvidia-smi` can be used to show GPU and memory utilisation, fan speed and temperatures, and works across multiple GPUs. Although `nvidia-smi` comes bundled with CUDA, it does not provide kernel profiling. For this use case we recommend you look at the NVIDIA Nsight suite.
```
$ nvidia-smi
Wed Feb 3 00:08:27 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03 Driver Version: 460.32.03 CUDA Version: 11.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla T4 On | 00000000:06:00.0 Off | 0 |
| N/A 63C P0 64W / 70W | 14726MiB / 15109MiB | 48% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 Tesla T4 On | 00000000:07:00.0 Off | 0 |
| N/A 28C P8 9W / 70W | 0MiB / 15109MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 2 Tesla T4 On | 00000000:2F:00.0 Off | 0 |
| N/A 75C P0 67W / 70W | 748MiB / 15109MiB | 77% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 3 Tesla T4 On | 00000000:30:00.0 Off | 0 |
| N/A 29C P8 9W / 70W | 0MiB / 15109MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 4 Tesla T4 On | 00000000:86:00.0 Off | 0 |
| N/A 61C P0 60W / 70W | 14726MiB / 15109MiB | 51% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 5 Tesla T4 On | 00000000:87:00.0 Off | 0 |
| N/A 25C P8 9W / 70W | 0MiB / 15109MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 6 Tesla T4 On | 00000000:D8:00.0 Off | 0 |
| N/A 24C P8 9W / 70W | 0MiB / 15109MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 7 Tesla T4 On | 00000000:D9:00.0 Off | 0 |
| N/A 24C P8 9W / 70W | 0MiB / 15109MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 98549 C python3 14723MiB |
| 2 N/A N/A 183345 C /opt/miniconda3/bin/python 745MiB |
| 4 N/A N/A 154624 C python3 14723MiB |
+-----------------------------------------------------------------------------+
```
As seen above, `nvidia-smi` found and collected data from eight GPUs. Reading from the rightmost column we observe that not all devices are in fact utilised, as only GPUs 0, 2 and 4 show non-zero utilisation. The middle column shows high memory usage on GPUs 0 and 4, while GPU 2 only uses ~750 MiB out of 15109 MiB. The bottom table shows which processes are running. In this case it only tells us that the `python` interpreter is being run. The type "C" stands for Compute ("G" for Graphics also appears if you are running a supported graphics API, such as OpenGL). If we want to examine the processes using the standard Linux tools we can run `ps` etc., but we might also pick a medium-level or application-specific tool to find out more about what is happening inside the processes. In summary, `nvidia-smi` works well at the system level.
`nvidia-smi` also supports real-time monitoring using the `dmon` ("device monitor") option:
```
$ nvidia-smi dmon -o DT
#Date Time gpu pwr gtemp mtemp sm mem enc dec mclk pclk
#YYYYMMDD HH:MM:SS Idx W C C % % % % MHz MHz
20210203 00:16:45 0 36 63 - 40 26 0 0 5000 1575
20210203 00:16:45 1 9 28 - 0 0 0 0 405 300
20210203 00:16:45 2 58 76 - 78 40 0 0 5000 1515
20210203 00:16:45 3 9 29 - 0 0 0 0 405 300
20210203 00:16:45 4 55 61 - 37 19 0 0 5000 1545
20210203 00:16:45 5 9 25 - 0 0 0 0 405 300
20210203 00:16:45 6 9 24 - 0 0 0 0 405 300
20210203 00:16:45 7 9 24 - 0 0 0 0 405 300
```
We include date and timestamps using `-o DT`. You can select specific metrics using `-s`. Tip: save the output to a file using `-f` instead of writing to stdout. You can monitor processes equivalently by replacing `dmon` with `pmon` in the above example.
If you are only interested in a few metrics, such as memory utilisation, and want to sample a run, you can pick the columns using `--query-compute-apps` and write the statistics as a CSV file using `--format` and `-f`. `--loop` can be used to set the poll frequency (in seconds).
```
$ nvidia-smi --format=noheader,csv --query-compute-apps=timestamp,gpu_name,pid,name,used_memory --loop=1 -f sample_run.log
2021/02/16 23:28:37.301, Tesla T4, 82839, /nix/store/cpzs1hpwzs23c41haa4dap0zjfx6xych-python3-3.7.9/bin/python3, 8821 MiB
2021/02/16 23:28:37.301, Tesla T4, 83434, /nix/store/cpzs1hpwzs23c41haa4dap0zjfx6xych-python3-3.7.9/bin/python3, 8821 MiB
2021/02/16 23:28:37.302, Tesla T4, 228746, /opt/miniconda3/bin/python, 13451 MiB
2021/02/16 23:28:37.303, Tesla T4, 82578, /nix/store/cpzs1hpwzs23c41haa4dap0zjfx6xych-python3-3.7.9/bin/python3, 1125 MiB
2021/02/16 23:28:37.303, Tesla T4, 83153, /nix/store/cpzs1hpwzs23c41haa4dap0zjfx6xych-python3-3.7.9/bin/python3, 8821 MiB
[...]
```
#### nvtop
If you want something very similar to the Linux `top` and `htop` for NVIDIA GPUs you should try `nvtop`.
```
$ module spider nvtop
```
`nvtop` is a terminal-based process and resource monitor for NVIDIA GPUs. Like `nvidia-smi`, `nvtop` is based on NVML and can also report GPU and memory utilisation. The advantage of `nvtop` is that it provides live graphs.
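A minimal sketch of running it on a compute node (the module version is an assumption; use `module spider nvtop` to see which versions are actually installed):

```bash
# Load nvtop from the module tree (the version is an assumption)
module load nvtop/3.0.1

# Start the live, top-like monitor on the node where your job runs
nvtop
```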
## Medium-level tools - application and domain specific profiling
If your application or framework, such as TensorFlow, provides built-in facilities for profiling, this is a good place to start. Low-level tools such as standalone profilers are available and can be used if the application-specific analysis shows a particular area to be a hotspot but lacks detail.
### TensorBoard
TensorBoard is a visualisation toolkit bundled together with TensorFlow. TensorBoard is not only useful for understanding network topology but can also be used for profiling by leveraging the TensorFlow Profiler. TensorBoard can be used for debugging and profiling not only TensorFlow but also other ML libraries such as PyTorch and XGBoost. TensorBoard is available when you load TensorFlow from the module system or run TensorFlow as a container (`/apps/containers/TensorFlow`).
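As a minimal sketch of getting at the TensorBoard interface from the cluster (the module version, log directory, port and login node address are assumptions; the TensorBoard guide linked at the end of this section describes the recommended workflow):

```bash
# On a login node: load TensorFlow (which bundles TensorBoard) and start the server,
# pointing it at the directory where your training run wrote its logs/profiles
module load TensorFlow/2.6.0
tensorboard --logdir ./logs --port 6006

# On your local machine: forward the port over SSH and browse to http://localhost:6006
ssh -L 6006:localhost:6006 <username>@alvis1.c3se.chalmers.se
```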
To illustrate what information you can get from TensorBoard, have a look at the screen captures below.
[Screenshot: TensorBoard Profiler overview page]
This is the overview page. You can use the drop-down menu on the right to toggle between different views depending on what type of profiling information you are interested in (e.g. traces, memory or kernel statistics, etc.). In the capture below we focus on the TensorFlow statistics view, with execution time broken down per compute device.
[Screenshot: TensorFlow statistics view with execution time broken down per compute device]
The insight you can gather from the above is not easily obtained from high-level tools such as `nvidia-smi` or low-level standalone compute kernel profilers such as NVIDIA Nsight. These tools, while valuable, do not present optimisation opportunities in the "language of the application domain", which can be very useful for non-experts. If you are running TensorFlow you lose very little by getting to know TensorBoard - try it!
To read more about TensorBoard, please see our TensorBoard guide.
## Low-level tools - standalone profilers, NVIDIA Nsight
Collecting profiling metrics using a standalone profiler provides the most detail but requires more experience to use effectively. The process of profiling code on the GPU is in principle similar to profiling applications on a CPU. You submit a Slurm job using either `sbatch` or `srun` as normal, but you ask a profiler to start the application for you (and you will likely want to tell the profiler what to trace). The profiler will run your application, collect trace data and store it in files for later analysis (most often using a graphical tool).
As an example of the above using the `nsys` tool from NVIDIA Nsight:

```
$ srun -A NAISS2023-123-45 nsys profile -t cuda --stats=true ./my_prog -n 1024 -l 3
```
The above example uses the profiler `nsys` to trace CUDA calls (`-t cuda`), print a performance summary at the end (`--stats=true`), and profile `my_prog` with the application-specific arguments `-n 1024 -l 3`.
Warning: Profiling can incur a large overhead. Be prepared to increase your wall time!
Collecting metrics should be done on the compute node. Visualising the metrics often involves launching a GUI and can be done from a login node. We recommend you read our Remote graphics guide for information on how you can improve remote graphics performance. Note that launching GUIs using SSH and X11 forwarding will almost always result in poor performance.
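A minimal job-script sketch for collecting an Nsight Systems report on a compute node could look like the following; the project, GPU request, module version and program name are placeholders or assumptions.

```bash
#!/bin/bash
#SBATCH -A NAISS2023-123-45           # project (placeholder, as in the example above)
#SBATCH --gpus-per-node=T4:1          # GPU type and count (placeholder)
#SBATCH -t 01:00:00                   # remember that profiling adds overhead

# Load the profiler and your own software stack (the version is an assumption)
module load Nsight-Systems/2023.2.1

# Let nsys start the application; the report is written to my_prog_report.nsys-rep
nsys profile -t cuda,nvtx --stats=true -o my_prog_report ./my_prog -n 1024 -l 3
```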
### NVIDIA Nsight
NVIDIA provides its own profiling tools, which we make available as regular modules in the module system. This guide only gives a brief introduction to the NVIDIA family of profiling tools, how to access them on Alvis, and how you can get started.
Looking for information about nvprof or nvvp?
NVIDIA has marked the NVIDIA Profiler (`nvprof`) and the NVIDIA Visual Profiler (`nvvp`) as deprecated in favour of the NVIDIA Nsight suite, and they do not support the newer GPU architectures. Alvis is based on the Volta (V100), Turing (T4) and Ampere (A100) architectures, of which the Turing and Ampere GPUs are only fully supported by NVIDIA Nsight [2]. We recommend you update and future-proof your workflow by switching to NVIDIA Nsight. Please see the following blog post from NVIDIA about migrating from `nvprof` and `nvvp` to Nsight.
The NVIDIA Nsight suite of profilers is the successor of `nvprof` and the Visual Profiler, starting at CUDA 10. Nsight allows for application- and kernel-level profiling using either a graphical user interface or a command line interface (CLI). NVIDIA Nsight is split into three tools, of which two are provided on Alvis:
- NVIDIA Nsight Systems - for system-wide profiling across CPUs and GPUs.
- NVIDIA Nsight Compute - an interactive CUDA kernel-level profiling tool.
If we follow our general recommendation to begin profiling at the system level, the NVIDIA Nsight Systems tool is a good place to start. Put more practically: if you are looking to profile your entire algorithm or program, start with Nsight Systems. If you have written your own compute kernel you might get better information using Nsight Compute.
[Diagram: overview of the NVIDIA Nsight tools]
Image source: https://developer.nvidia.com/tools-overview
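As a rough sketch of how the two tools are typically invoked (the program and report names are placeholders; check `nsys profile --help` and `ncu --help` for the full set of options):

```bash
# System-wide timeline of CPU and GPU activity (Nsight Systems)
nsys profile -t cuda,nvtx -o system_timeline ./my_prog

# Detailed analysis of the individual CUDA kernels (Nsight Compute)
ncu -o kernel_report ./my_prog
```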
### Where to run?
The first decision is where to start Nsight. If you are a member of the NVIDIA Developer Program (registration is "free" - you pay with your data) NVIDIA Nsight can be downloaded from NVIDIA and run on your laptop or workstation.
For non-interactive profiling you can generate the profiling data on Alvis using the pre-installed Nsight versions we provide. You can then copy back the report for inspection on your local machine. While slightly cumbersome this solution will likely provide a good user experience as you will not need to use remote graphics. It however does not work if you want to profile or develop interactively.
How about remotely attaching Nsight (Systems|Compute) to Alvis?
NVIDIA Nsight supports remote profiling, where you run a local instance of Nsight and attach to a daemon running on a target system (i.e. Alvis). The profiling daemon samples statistics for your application and is polled for data. We do not, however, recommend this solution as currently provided by NVIDIA, for the following reasons:
- The connection from your system to the remote system is unencrypted (port 45555). SSH is only used for initialising the connection (link).
- The password for SSH is stored in plain text in the configuration file on the host.
- The daemon on the target system listens for incoming connections on a range of ports (starting at 45555 and 49152 for Nsight Systems and Nsight Compute, respectively). On Alvis, as on most HPC clusters, incoming traffic is only allowed on port 22 (SSH).
We recommend you run Nsight on the login nodes using remote graphics.
You can launch either NVIDIA Nsight Systems and/or Nsight Compute from the module system by loading `Nsight-Systems` or `Nsight-Compute`.
| Tool | Find the module | GUI binary |
| --- | --- | --- |
| Nsight Systems | `module spider Nsight-Systems` | `nsys-ui` |
| Nsight Compute | `module spider Nsight-Compute` | `ncu-ui` |

The binaries `nsys-ui` and `ncu-ui` open the graphical user interfaces.
The shorter names `ncu` (CLI) and `ncu-ui` (GUI) were introduced for `nv-nsight-cu-cli` and `nv-nsight-cu` in Nsight Compute 2020.1 [1]; the old names continue to work for backwards compatibility. If you prefer to work with the command line you should launch `ncu` (Nsight Compute) or `nsys` (Nsight Systems) instead of the graphical tools.
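A minimal sketch of opening a previously collected report in the Nsight Systems GUI on a login node (the module version and report file name are assumptions):

```bash
# Load Nsight Systems (the version is an assumption, check module spider Nsight-Systems)
module load Nsight-Systems/2023.2.1

# Open a report collected earlier in the graphical user interface
nsys-ui my_prog_report.nsys-rep
```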
[1] https://docs.nvidia.com/nsight-compute/pdf/ReleaseNotes.pdf, p. 5