TensorBoard on a Slurm cluster
If you run your training on a Slurm cluster, chances are that all ports except SSH are blocked and cannot be reached from outside the cluster. This also affects TensorBoard, which visualizes various training metrics.
The solution to this problem is to create an SSH tunnel that forwards requests from a local TCP port on the client to the port on the cluster where TensorBoard is listening.
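For reference, OpenSSH local port forwarding has the following general form; the names in angle brackets are placeholders, not values used later in this post:
ssh -N -L <local_port>:<target_host>:<target_port> <user>@<login_node>
-N keeps the connection open without running a remote command, and every connection to <local_port> on the client is forwarded through the login node to <target_host>:<target_port>.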
Let’s create a Slurm script tensorboard.sbatch that starts the process as a job:
#!/bin/bash
#SBATCH --job-name="tensorboard"
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1

# Site-specific: load a TensorFlow module and activate the virtualenv
# that provides the tensorboard executable.
module load TensorFlow/2.5.0-fosscuda-2020b
source .env-tf2.5/bin/activate

# logdir and port are passed in via --export when the job is submitted.
echo "Starting TensorBoard for ${logdir} on port ${port}."
# --bind_all makes TensorBoard reachable from other nodes in the cluster;
# by default it only listens on localhost of the compute node.
tensorboard --logdir="${logdir}" --port="${port}" --bind_all
Then we submit the job, passing the log directory and the port via --export:
sbatch --time=1-00:00:00 --export=logdir="/path/to/logdir",port=16006 tensorboard.sbatch
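Once the job is running, squeue tells us which compute node it was scheduled on; we will need that hostname for the SSH tunnel below (the job name tensorboard comes from the #SBATCH directive in the script):
squeue --user=$USER --name=tensorboard
The NODELIST column shows the node the job runs on.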
On the client we forward a local port through the login node to the TensorBoard port on that compute node (replace <node> with the hostname reported by squeue):
ssh -N -L 16006:<node>:16006 user@slurm
Now we can access TensorBoard at localhost:16006 in a local browser.
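To check that the tunnel works before opening a browser, we can ask curl for the HTTP status code (assuming curl is available on the client):
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:16006
This should print 200 once TensorBoard is up and the tunnel is in place.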