If you run your training on a Slurm cluster, chances are that all ports except SSH are blocked and cannot be reached from outside the cluster. This also affects TensorBoard, which visualizes various training metrics.

The solution is to create an SSH tunnel that forwards requests from a specific TCP port on the client to the port on the Slurm cluster where TensorBoard runs.

Let’s create a Slurm script tensorboard.sbatch that starts TensorBoard as a job:

#!/bin/bash
#SBATCH --job-name="tensorboard"
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1

# Load TensorFlow (adjust the module name to what your cluster provides)
module load TensorFlow/2.5.0-fosscuda-2020b

# Activate the virtual environment that has TensorBoard installed
source .env-tf2.5/bin/activate

# logdir and port are passed in via sbatch --export (see below)
echo "Starting TensorBoard for ${logdir} on port ${port}."

tensorboard --logdir="${logdir}" --port="${port}"

Then we submit the job, passing the log directory and the port as environment variables:

sbatch --time=1-00:00:00 --export=logdir="/path/to/logdir",port=16006 tensorboard.sbatch
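Slurm writes the job’s output, including the echo line above, by default to a file such as slurm-<jobid>.out. Before setting up the tunnel it is worth checking that the job is running and on which node it landed; a quick check, assuming the job name tensorboard from the script above:

# Show job id, name, state, and the node the TensorBoard job runs on
squeue -u "$USER" -n tensorboard -o "%.10i %.12j %.8T %R"

If TensorBoard ends up on a compute node rather than the login host, that node name is what the tunnel below needs to target.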

On the client we forward a local port to the remote port TensorBoard listens on. If the job runs on a compute node rather than on the host we SSH into, replace 127.0.0.1 with the node name reported by squeue:

ssh -N -L 16006:127.0.0.1:16006 user@slurm

Now we can access TensorBoard at localhost:16006.
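The tunnel stays open as long as the ssh command is running. To verify that requests actually reach TensorBoard, a simple check from another local terminal (curl is only used here for illustration):

# Print the HTTP status code returned through the tunnel
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:16006

A response of 200 means the tunnel and TensorBoard are both up; a connection error usually means the tunnel target or port does not match the job’s settings.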