Several tools exist for high-level monitoring of your jobs and should be used to verify that a job is running as intended. These tools give hints about problem areas, bottlenecks, and underutilized resources (e.g. did you use all the GPUs you requested?).
After monitoring a job, you can take a deeper dive and profile the application to find out more.
The job_stats.py command takes a Slurm job ID and returns a link to a custom Grafana dashboard:
$ job_stats.py 39375
Note: Your job must have been started by Slurm before you run job_stats.py.
The URL will take you to an interactive dashboard with real-time metrics of your running job.
In the top-right corner, you can adjust the time interval. You can click and interact with each graph and highlight specific points in time.
Take special care to verify that you are actually utilizing all of the GPUs when you allocate several!
If you ssh into the nodes on which your job is running, you can use the htop and nvtop commands to view CPU and GPU load, respectively.
For monitoring GPU usage, you'll find recent versions of nvtop in the module system:
$ module spider nvtop
You can also run nvidia-smi to view the current state of the GPUs, and power users can use it for sampling metrics, e.g.:
$ nvidia-smi --format=noheader,csv --query-compute-apps=timestamp,gpu_name,pid,name,used_memory --loop=1 -f sample_run.log
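The log written by the sampling command above is plain CSV, so it can be post-processed with a few lines of Python. Here is a minimal sketch that reports the peak memory use per GPU and process; the field order follows the --query-compute-apps list above, but the sample log lines are invented for illustration — point the reader at your real sample_run.log instead.

```python
# Summarize per-GPU, per-process peak memory from an nvidia-smi sampling log.
# Field order matches --query-compute-apps=timestamp,gpu_name,pid,name,used_memory.
# The lines in LOG below are made-up examples; replace io.StringIO(...) with
# open("sample_run.log") to process a real log.
import csv
import io
from collections import defaultdict

LOG = io.StringIO(
    "2024/05/01 10:00:01.000, NVIDIA A100-SXM4-40GB, 12345, python, 8210 MiB\n"
    "2024/05/01 10:00:02.000, NVIDIA A100-SXM4-40GB, 12345, python, 8474 MiB\n"
    "2024/05/01 10:00:02.000, NVIDIA A100-SXM4-40GB, 12399, python, 512 MiB\n"
)

peak = defaultdict(int)  # (gpu_name, pid) -> peak used memory in MiB
for row in csv.reader(LOG, skipinitialspace=True):
    timestamp, gpu_name, pid, name, used_memory = row
    mib = int(used_memory.split()[0])  # strip the " MiB" unit
    peak[(gpu_name, pid)] = max(peak[(gpu_name, pid)], mib)

for (gpu_name, pid), mib in sorted(peak.items()):
    print(f"{gpu_name} pid {pid}: peak {mib} MiB")
```

A summary like this makes it easy to spot a GPU that never appears in the log at all, i.e. one you allocated but never used.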