Tensorboard - visualise, debug and profile your model🔗

Tensorboard is a visualisation toolkit bundled with TensorFlow. It can be used for tracking ML metrics, such as loss and accuracy, and provides facilities for debugging and profiling. Tensorboard can also be used with other libraries, such as PyTorch and XGBoost. You can find more information about Tensorboard at the official website.

Tensorboard is accessed from a web browser and you start it by launching a web server on either a login or compute node. We recommend you start Tensorboard on a login node and access it using an SSH-tunnel or ThinLinc.

Generating Tensorboard data in TensorFlow🔗

The first step in visualising your experiments is of course to generate, or log, the appropriate data. In Keras this is easily done by adding the Tensorboard callback keras.callbacks.TensorBoard to your model callbacks, e.g.:

import os
import tensorflow as tf
from tensorflow.keras.callbacks import TensorBoard
[...]
jobid = os.getenv('SLURM_JOB_ID')
tensorboard_callback = TensorBoard(log_dir="tensorboard-log-{}".format(jobid), histogram_freq=1)
model.fit(x_train, y_train, epochs=5, callbacks=[tensorboard_callback])

The TensorBoard callback requires the mandatory log_dir argument and also takes several optional arguments (e.g. histogram_freq as seen above). Please consult the official documentation for more information about customising your Tensorboard callback. The above snippet illustrates how the Slurm environment variable $SLURM_JOB_ID can be used to create a unique log_dir for each Slurm job. You are free to modify this, but each run must have a unique directory for Tensorboard to parse the logs correctly.
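If you run the same script both inside and outside Slurm, $SLURM_JOB_ID may be unset. A minimal sketch of keeping log directories unique in both cases (the helper name and the timestamp fallback are our own convention, not part of any API):

```python
import os
from datetime import datetime

def unique_log_dir(prefix="tensorboard-log"):
    """Return a per-run log directory name: the Slurm job ID when
    running under Slurm, otherwise a timestamp so interactive runs
    stay unique too."""
    run_id = os.getenv("SLURM_JOB_ID")
    if run_id is None:
        run_id = datetime.now().strftime("%Y%m%d-%H%M%S")
    return "{}-{}".format(prefix, run_id)

print(unique_log_dir())
```

The returned name can then be passed straight to the log_dir argument of the TensorBoard callback.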

Generating Tensorboard data in PyTorch🔗

In PyTorch you can generate profiling data for Tensorboard using torch.profiler, e.g.:

import os
import torch
[...]
jobid = os.getenv('SLURM_JOB_ID')

with torch.profiler.profile(
    schedule=torch.profiler.schedule(wait=0, warmup=1, active=3, repeat=1),
    on_trace_ready=torch.profiler.tensorboard_trace_handler(f'torch-log-{jobid}'),
) as prof:
    for batch in dataloader:
        train(model, batch)

        # Notify the profiler of the step boundary
        prof.step()

        # Stop after (wait + warmup + active) * repeat steps
        if prof.step_num >= (0 + 1 + 3) * 1:
            break

# You can access profiling data in the code as well
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))

There are several options regarding what to profile and how to set up the schedule. You'll find more information in PyTorch's official documentation.
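The break condition in the loop above depends on how many steps the profiler cycles through, which follows directly from the schedule parameters. A small sketch of that arithmetic (the helper name is our own):

```python
def profiled_steps(wait, warmup, active, repeat):
    """Total number of steps a torch.profiler schedule cycles through:
    each of the `repeat` cycles waits, warms up, then records."""
    return (wait + warmup + active) * repeat

# The schedule used above: no waiting, 1 warmup step, 3 recorded steps, 1 cycle
print(profiled_steps(wait=0, warmup=1, active=3, repeat=1))  # 4
```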

Note that you'll need the torch_tb_profiler Python package to visualise the profiling results in Tensorboard. That package is available in, for example, the PyTorch-bundle module.
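A quick way to check whether the package is available in your current environment, using only the standard library:

```python
import importlib.util

# find_spec returns None if the package cannot be imported
if importlib.util.find_spec("torch_tb_profiler") is None:
    print("torch_tb_profiler is missing - load a suitable module first")
else:
    print("torch_tb_profiler found")
```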

Starting Tensorboard🔗

Begin by locating the appropriate TensorFlow version. Tensorboard is typically available in most containers with TensorFlow, as well as in any of the TensorFlow modules that we provide. Tensorboard can then be started from the command line simply as tensorboard; you need to specify the log directory using --logdir.

tensorboard --logdir=./tensorboard-log-1234

By default Tensorboard attaches to localhost and port 6006 (unless the port is already taken, in which case it will try 6007, 6008, etc.). This is visible in the output when you start Tensorboard.

Tip: You close the server with Ctrl+C.

Accessing Tensorboard🔗

Tensorboard by default does not bind to an external network interface (it binds to localhost), meaning it is not accessible over the network (i.e. the internet). To access the web UI you can either create an SSH-tunnel from your own computer and use your web browser of choice, or you can run ThinLinc and start the browser on the login node.

On Linux and macOS you can create an SSH-tunnel by forwarding local port 8080 on your computer to port 6006 on alvis1.c3se.chalmers.se. Run the following command in a terminal on your own computer:

ssh -L 8080:localhost:6006 CID@alvis1.c3se.chalmers.se

Remember to update port 6006 if Tensorboard listens on a different port.

Next, on your computer open a web browser and visit http://localhost:8080. You should now see the Tensorboard UI.

For more information about SSH-tunnelling, and how you can use PuTTY to achieve the same thing, please see the following guide.

Using ThinLinc🔗

If you for some reason cannot use an SSH-tunnel, you can launch a web browser directly on the login node using ThinLinc.

  • Login to Alvis using ThinLinc and launch a web browser.
  • Visit http://localhost:6006 (remember to update the port if you were not assigned the default)

You should see the Tensorboard UI.

Running and accessing Tensorboard on a compute host🔗

If you need to run Tensorboard on a compute host you can use the C3SE proxy to reach the web UI. First make sure Tensorboard attaches to an external interface using the --bind_all flag and listens on an available port in the range 8888-8988, as allowed by the proxy:

#!/usr/bin/env bash
#SBATCH -A NAISS2023-Y-X -p alvis
#SBATCH -t 01:00:00
[...]
# Select a random port from 8888 to 8988:
FREE_PORT=`comm -23 <(seq "8888" "8988" | sort) <(ss -Htan | awk '{print $4}' | cut -d':' -f2 | sort -u) | shuf -n 1`
echo "Tensorboard URL: https://proxy.c3se.chalmers.se:${FREE_PORT}/`hostname`/"
tensorboard --path_prefix /`hostname`/ --bind_all --port $FREE_PORT --logdir=./tensorboard-log-$SLURM_JOB_ID &

python3 my_tensorflow_project.py
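The comm/ss pipeline in the job script picks a random unused port in the allowed range. The same idea can be expressed with Python's standard library (the function name is our own), which may be easier to adapt; note that, like the shell version, it is subject to a small race if another process grabs the port before Tensorboard does:

```python
import random
import socket

def free_port_in_range(low=8888, high=8988):
    """Return a random port in [low, high] that is currently free,
    by attempting to bind to each candidate in random order."""
    candidates = list(range(low, high + 1))
    random.shuffle(candidates)
    for port in candidates:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            try:
                s.bind(("", port))
            except OSError:
                continue  # port is taken, try the next one
            return port
    raise RuntimeError("no free port in range {}-{}".format(low, high))

print(free_port_in_range())
```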

Your Tensorboard is now completely open to anyone on the internet. There is no security here.

A note on security🔗

Tensorboard does not provide authentication. If you run Tensorboard on a shared multi-user system (such as Alvis) other Alvis users can access your instance of Tensorboard (and thus visualise/analyse the data in your --logdir). You need to take this into consideration before using Tensorboard.