Profiling

C3SE provides several tools to assist with profiling your applications.

Questions profiling may answer

  • How well is the GPU utilised?
  • Where in my code is most time spent?
  • Where is the greatest opportunity for optimisation? Memory transfers, communication, floating-point operations, or something else?
  • What should I tweak in my model to lower training time?

General profiling advice

A big part of profiling is picking the right tool for the right job. Picking a low-level tool for system analysis will most likely overwhelm you with graphs and traces of such detail that you miss the forest for the trees. The opposite situation, using a system-level tool for low-level analysis, is of course also not ideal, as the bottlenecks might get abstracted away.

  • Start at the top - Begin with a high-level (or "system") profiler and see how far you get. A high-level tool should highlight "hotspots", i.e. sections of your code eligible for optimisation, which you might then need a medium- or low-level tool to investigate further.
  • Know your application - Check if your application, library or framework provides its own profiling facilities. Large and popular code bases often include built-in profiling features, and spending just a few minutes reading the documentation can save you a lot of time!
  • Profile your environment - If you see unexplained performance drops and you have not changed your code, something else in your environment has likely changed. Check the C3SE status page for maintenance work, service updates or other system news. System, Slurm, kernel or driver updates are applied regularly to fix bugs and security problems. These impact performance to varying degrees and may make your jobs slower. If you suspect an unannounced system change has made your code slower, please contact the support at support@c3se.chalmers.se.
  • Remember where you run - Not all compute nodes have the same hardware. For instance, Alvis is equipped with several different types of GPUs. Make sure you profile your application on the same hardware on which you intend to run your production workloads. Note that once Slurm starts your job you can SSH into the compute nodes running it. On the compute node you can use the standard Linux utilities (top, ps, perf, etc.) as well as the tools described on this page to profile your job. Read more about how jobs are run in the Running jobs guide.
  • Remember what you run - Always specify what version of a module you load, e.g. prefer module load X/1.0 to module load X. Do not rely on the default versions as they might get updated!
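Putting the last two pieces of advice together, a job script might look like the sketch below. The project ID, GPU type, walltime and module version are placeholders; substitute your own.

```shell
#!/bin/bash
#SBATCH -A SNIC2021-123-45        # your project ID (placeholder)
#SBATCH --gpus-per-node=T4:1      # pin the GPU type you intend to profile on
#SBATCH -t 01:00:00               # profiling adds overhead; leave headroom

# Pin an explicit module version instead of relying on the default
# (the version string below is a placeholder)
module load TensorFlow/2.5.0

python3 train.py
```

This way a re-run next month profiles the same software stack on the same hardware, which makes before/after comparisons meaningful.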

High-level tools - job_stats.py and DCGM

The first tool in your profiling arsenal should be the job_stats.py tool and the second, if you are running on Alvis, the DCGM report that gets automatically generated for each completed job. These high-level tools are readily available, easy to use, and provide a nice overview of your job metrics with plenty of details to explore if needed.

job_stats.py

The job_stats.py tool is an in-house written tool available for all C3SE computing systems. job_stats.py takes a Slurm job ID and returns a link to a Grafana dashboard.

$ job_stats.py 39375
https://scruffy.c3se.chalmers.se/d/alvis-job/alvis-job?var-jobid=39375&from=1612571366000&to=1612571370000

Note: Your job must have been started by Slurm before you run job_stats.py

The URL will take you to an interactive dashboard with real-time metrics of your running job.

Grafana dashboard.

In the top right corner you can adjust the time interval. You can click and interact with each graph and highlight specific points in time.

DCGM statistics report - how did your job run on the GPUs

The NVIDIA Data Center GPU Manager (DCGM) is a suite of tools and APIs with many different features, including performance profiling. DCGM profiles processes by continuously collecting low-level metrics from the device (i.e. the GPU) itself by polling hardware counters. This data can be quite tricky to extract and interpret on your own, as it requires information about how the administrators assigned and configured the GPUs in the system, as well as some knowledge of the hardware architecture. Alvis runs the command-line tool dcgmi in the background for each job and automatically generates a statistics report for each of your jobs as they complete. For multi-node jobs you get one report per node. The report (or reports) shows up inside the job working directory (the directory from which you submitted your job).

$ ls
job_script.sh dcgm-gpu-stats-alvis2-02-jobid-39378.log slurm-39378.out

The filename contains the compute node hostname and the job ID. In the above example the report was generated on compute node alvis2-02 for job 39378. The report contains information summarised in Execution (total execution time), Performance (energy, SM and memory utilisation), and Events (ECC errors) sections. You can explore the dcgmi dmon command yourself to generate similar reports.
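If you want to experiment with dcgmi dmon on a compute node, a minimal sketch could look like the following. The field IDs are assumptions on my part (203 should be GPU utilisation and 252 framebuffer memory used); list the available fields with dcgmi dmon -l to verify.

```shell
# Sample GPU utilisation (field 203) and framebuffer memory used (field 252)
# every 1000 ms, ten times, on the compute node running your job.
dcgmi dmon -e 203,252 -d 1000 -c 10
```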

nvidia-smi and nvtop - top like tools for GPUs

The NVIDIA System Management Interface tool, or nvidia-smi, is in NVIDIA's own words

a command line utility aimed at managing and monitoring NVIDIA GPU devices [...]

(link).

nvidia-smi is just one of several tools that run on top of the NVIDIA Management Library (NVML) (see also nvtop below), so you can find plenty of tools with similar reporting abilities online (and you can even create your own). nvidia-smi provides similar data to dcgmi, but nvidia-smi is in general easier to use. The nvidia-smi tool comes bundled with the CUDA Toolkit, and if you have dabbled with CUDA on your private workstation chances are you already have it installed. nvidia-smi can be used to show GPU and memory utilisation, fan speed and temperatures, and works across multiple GPUs.

Although nvidia-smi comes bundled with CUDA, it does not provide kernel profiling. For this use case we recommend you look at the NVIDIA Nsight suite.

$ nvidia-smi
Wed Feb  3 00:08:27 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            On   | 00000000:06:00.0 Off |                    0 |
| N/A   63C    P0    64W /  70W |  14726MiB / 15109MiB |     48%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla T4            On   | 00000000:07:00.0 Off |                    0 |
| N/A   28C    P8     9W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla T4            On   | 00000000:2F:00.0 Off |                    0 |
| N/A   75C    P0    67W /  70W |    748MiB / 15109MiB |     77%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla T4            On   | 00000000:30:00.0 Off |                    0 |
| N/A   29C    P8     9W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   4  Tesla T4            On   | 00000000:86:00.0 Off |                    0 |
| N/A   61C    P0    60W /  70W |  14726MiB / 15109MiB |     51%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   5  Tesla T4            On   | 00000000:87:00.0 Off |                    0 |
| N/A   25C    P8     9W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   6  Tesla T4            On   | 00000000:D8:00.0 Off |                    0 |
| N/A   24C    P8     9W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   7  Tesla T4            On   | 00000000:D9:00.0 Off |                    0 |
| N/A   24C    P8     9W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     98549      C   python3                         14723MiB |
|    2   N/A  N/A    183345      C   /opt/miniconda3/bin/python        745MiB |
|    4   N/A  N/A    154624      C   python3                         14723MiB |
+-----------------------------------------------------------------------------+

As seen above, nvidia-smi found and collected data from eight GPUs. Reading the rightmost column we observe that not all devices are in fact utilised, as only GPUs 0, 2 and 4 show non-zero utilisation. The middle column shows high memory usage on GPUs 0 and 4, while GPU 2 only uses ~750 MiB out of 15109 MiB. The bottom table shows which processes are running; in this case it only tells us that the python interpreter is being run. The Type "C" stands for Compute ("G" for Graphics also appears if you are running a supported graphics API, such as OpenGL). If we want to examine the processes we can use standard Linux tools such as ps, but we might also pick a medium-level or application-specific tool to find out more about what is happening inside the processes. In summary, nvidia-smi works well at the system level.

nvidia-smi also supports real-time profiling using the dmon ("device monitor") option:

nvidia-smi dmon -o DT
#Date       Time        gpu   pwr gtemp mtemp    sm   mem   enc   dec  mclk  pclk
#YYYYMMDD   HH:MM:SS    Idx     W     C     C     %     %     %     %   MHz   MHz
 20210203   00:16:45      0    36    63     -    40    26     0     0  5000  1575
 20210203   00:16:45      1     9    28     -     0     0     0     0   405   300
 20210203   00:16:45      2    58    76     -    78    40     0     0  5000  1515
 20210203   00:16:45      3     9    29     -     0     0     0     0   405   300
 20210203   00:16:45      4    55    61     -    37    19     0     0  5000  1545
 20210203   00:16:45      5     9    25     -     0     0     0     0   405   300
 20210203   00:16:45      6     9    24     -     0     0     0     0   405   300
 20210203   00:16:45      7     9    24     -     0     0     0     0   405   300

We include date and timestamps using -o DT. You can target specific metric groups using -s. Tip: save the output to a file using -f instead of writing to stdout. You can monitor processes instead of devices by replacing dmon with pmon in the above example.

If you are interested in just a few metrics, such as memory utilisation, and want to sample a run, you can pick the columns using --query-compute-apps, write the statistics as a CSV file using --format and -f, and set the poll interval (in seconds) using --loop.

$ nvidia-smi --format=noheader,csv --query-compute-apps=timestamp,gpu_name,pid,name,used_memory --loop=1 -f sample_run.log
2021/02/16 23:28:37.301, Tesla T4, 82839, /nix/store/cpzs1hpwzs23c41haa4dap0zjfx6xych-python3-3.7.9/bin/python3, 8821 MiB
2021/02/16 23:28:37.301, Tesla T4, 83434, /nix/store/cpzs1hpwzs23c41haa4dap0zjfx6xych-python3-3.7.9/bin/python3, 8821 MiB
2021/02/16 23:28:37.302, Tesla T4, 228746, /opt/miniconda3/bin/python, 13451 MiB
2021/02/16 23:28:37.303, Tesla T4, 82578, /nix/store/cpzs1hpwzs23c41haa4dap0zjfx6xych-python3-3.7.9/bin/python3, 1125 MiB
2021/02/16 23:28:37.303, Tesla T4, 83153, /nix/store/cpzs1hpwzs23c41haa4dap0zjfx6xych-python3-3.7.9/bin/python3, 8821 MiB
[...]
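Once you have such a CSV log, even a short awk one-liner can summarise it. The sketch below fabricates a few log lines in the same format (the values are illustrative, not from a real run) and reports the peak GPU memory per process:

```shell
# Fabricated sample in the same CSV format as the nvidia-smi log above
cat > sample_run.log <<'EOF'
2021/02/16 23:28:37.301, Tesla T4, 82839, python3, 8821 MiB
2021/02/16 23:28:38.301, Tesla T4, 82839, python3, 9100 MiB
2021/02/16 23:28:37.302, Tesla T4, 228746, python, 13451 MiB
EOF

# PID is the third field and used memory the fifth; track the maximum per PID
awk -F', ' '{ if ($5 + 0 > peak[$3]) peak[$3] = $5 + 0 }
            END { for (pid in peak) printf "PID %s peak: %d MiB\n", pid, peak[pid] }' sample_run.log
```

The same pattern works for any per-process metric you included in --query-compute-apps.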

nvtop

If you want something very similar to Linux top and htop for NVIDIA GPUs you should try nvtop.

$ module spider nvtop

nvtop is a terminal-based process and resource monitor for NVIDIA GPUs. nvtop, like nvidia-smi, is based on NVML and can also report GPU and memory utilisation. The advantage of nvtop is that it provides live graphs.

nvtop graph.

Medium-level tools - application and domain specific profiling

If your application, such as TensorFlow, provides built-in facilities for profiling, this is a good place to start. Low-level standalone profilers are also available, and can be used when the application-specific analysis shows a particular area to be a hotspot but lacks the detail to explain why.

TensorBoard

TensorBoard is a visualisation toolkit bundled with TensorFlow. TensorBoard is not only useful for understanding network topology but can also be used for profiling by leveraging the TensorFlow Profiler. TensorBoard can be used for debugging and profiling not only TensorFlow but also other ML libraries such as PyTorch and XGBoost. TensorBoard is available when you load TensorFlow from the module system or run TensorFlow as a Singularity container (/apps/containers/TensorFlow).

To illustrate what information you can get from TensorBoard, have a look at the screen captures below.

TensorBoard overview.

This is the overview page. You can use the drop-down menu on the right to toggle different views depending on what type of profiling information you are interested in (e.g. traces, memory or kernel statistics, etc.). In the capture below we focus on the TensorFlow statistics: execution time broken down per compute device.

TensorBoard stats.

The insight you can gather from the above is not easily obtained with high-level tools such as nvidia-smi or low-level standalone compute kernel profilers such as NVIDIA Nsight. Those tools, while valuable, do not present optimisation opportunities in the "language of the application domain", which can be very useful for non-experts. If you are running TensorFlow you lose very little by getting to know TensorBoard - try it!

To read more about TensorBoard, please read our TensorBoard guide.
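A common way to view a TensorBoard instance running on the cluster is to tunnel its port over SSH. The log directory, port and login node below are examples; adapt them to your own session.

```shell
# On the cluster: start TensorBoard against your profiling logs (example path)
tensorboard --logdir=logs/profile --port=6006

# On your local machine: forward the port, then browse http://localhost:6006
ssh -L 6006:localhost:6006 YOUR_USERNAME@alvis1.c3se.chalmers.se
```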

Low-level tools - standalone profilers, NVIDIA Nsight

Collecting profiling metrics using a standalone profiler provides the most detail but requires more experience to use effectively. The process of profiling code on the GPU is in principle similar to profiling applications on a CPU. You submit a Slurm job using either sbatch or srun as normal, but you ask a profiler to start the application for you (and you will likely want to tell the profiler what to trace). The profiler runs your application, collects trace data and stores it in files for later analysis (most often using a graphical tool).

As an example of the above using the nsys tool from NVIDIA Nsight:

$ srun -A SNIC2021-123-45 nsys profile -t cuda --stats=true ./my_prog -n 1024 -l 3

The above example runs my_prog under the nsys profiler, tracing CUDA calls (-t cuda), printing a performance summary at the end (--stats=true), and passing the application-specific arguments -n 1024 -l 3 through to my_prog.
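A slightly fuller sketch writes the trace to a named report file that you can open later in the GUI, instead of only printing the summary. The report name and NVTX tracing are my additions; depending on the nsys version the report file ends in .qdrep or .nsys-rep.

```shell
# Trace CUDA (and NVTX ranges, if your code emits them) and save a report
srun -A SNIC2021-123-45 nsys profile -t cuda,nvtx -o my_prog_report --stats=true ./my_prog -n 1024 -l 3

# Later, regenerate summaries from the saved report without re-running the job
nsys stats my_prog_report.nsys-rep
```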

Warning: Profiling can incur a large overhead. Be prepared to increase your wall time!

Collecting metrics should be done on a compute node. Visualising the metrics often involves launching a GUI and can be done from a login node. We recommend you read our Remote graphics guide for information on how you can improve remote graphics performance. Note that launching GUIs using SSH and X11 forwarding will almost always result in poor performance.

NVIDIA Nsight

NVIDIA provides their own profiling tools, which we provide as regular modules in the module system. This guide only gives a brief introduction to the NVIDIA family of profiling tools, how to access them on Alvis, and how you can get started.


Looking for information about nvprof or nvvp?

NVIDIA has deprecated the NVIDIA Profiler (nvprof) and the NVIDIA Visual Profiler (nvvp) starting with CUDA 10, and they do not support the newest GPU architectures. NVIDIA replaces nvprof and nvvp with the NVIDIA Nsight suite. Alvis contains GPUs of the Volta (V100), Turing (T4) and Ampere (A100) architectures, and only NVIDIA Nsight fully supports them all. We recommend you update and future-proof your workflow by switching to NVIDIA Nsight. Please see the following blog post from NVIDIA about migrating from nvprof and nvvp to Nsight.


The NVIDIA Nsight suite of profilers is the successor of nvprof and the Visual Profiler, starting at CUDA 10. Nsight allows for application- and kernel-level profiling using either a graphical user interface (GUI) or a command-line interface (CLI). NVIDIA Nsight is split into three tools, of which two are provided on Alvis:

  • NVIDIA Nsight Systems - for system-wide profiling across CPUs and GPUs.
  • NVIDIA Nsight Compute - an interactive CUDA kernel-level profiling tool.

If we follow our general recommendation to begin profiling at the system level, Nsight Systems is a good tool to start with. Put more practically: if you are looking to profile your entire algorithm or program, start with Nsight Systems. If you have written your own compute kernel, you may get better information from Nsight Compute.

Nsight Overview.

Image source: https://developer.nvidia.com/tools-overview

Where to run?

The first decision is where to start Nsight. If you are a member of the NVIDIA Developer Program (registration is "free" - you pay with your data), NVIDIA Nsight can be downloaded from NVIDIA and run on your laptop or workstation.

For non-interactive profiling you can generate the profiling data on Alvis using the pre-installed Nsight versions we provide, then copy the report back for inspection on your local machine. While slightly cumbersome, this solution will likely provide a good user experience as you will not need to use remote graphics. It does not, however, work if you want to profile or develop interactively.


How about remotely attaching Nsight (Systems|Compute) to Alvis?

NVIDIA Nsight supports remote profiling, where you run a local instance of Nsight and attach to a daemon running on a target system (i.e. Alvis). The profiling daemon samples statistics for your application and is polled for data. We do not, however, recommend this solution as currently provided by NVIDIA, for the following reasons:

  • The connection from your system to the remote system is unencrypted (port 45555). SSH is only used for initialising the connection (link).
  • The password for SSH is stored in plain text in the configuration file on the host.
  • The daemon on the target system listens for incoming connections on a range of ports (starting at 45555 for Nsight Systems and 49152 for Nsight Compute). On Alvis, as on most HPC clusters, incoming traffic is only allowed on port 22/SSH.

We recommend you run Nsight on the login nodes using remote graphics.


You can launch either NVIDIA Nsight Systems or Nsight Compute from the module system by loading Nsight-Systems or Nsight-Compute.

Nsight Systems

$ module spider Nsight-Systems
nsys-ui

Nsight Compute

$ module spider Nsight-Compute
ncu

The binaries nsys-ui and ncu-ui open the graphical user interfaces.

The binaries were renamed in Nsight Compute 2020.1 [1] (for instance, the command-line binary nv-nsight-cu-cli became ncu), but the old names continue to work for backwards compatibility. If you prefer to work with the command line, launch ncu (Nsight Compute) or nsys (Nsight Systems) instead.
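As a sketch of a command-line kernel profiling run (the kernel name, report name and program below are illustrative), Nsight Compute can filter on specific kernels and save a report for the GUI:

```shell
# Profile only kernels whose name matches "my_kernel" and save a report
# (my_kernel_report.ncu-rep) that can be opened later in ncu-ui
srun -A SNIC2021-123-45 ncu -k my_kernel -o my_kernel_report ./my_prog
```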

[1] https://docs.nvidia.com/nsight-compute/pdf/ReleaseNotes.pdf, p. 5