Using Fox's GPUs
A GPU, or Graphics Processing Unit, is a computational unit which, as the name suggests, is optimized for graphics tasks. Nearly every computing device we interact with contains a GPU of some sort, responsible for transforming the information we want to display into actual pixels on our screens.
One question that might immediately present itself is: if GPUs are optimized for graphics, why are they interesting as computational resources? The full answer is complicated, but the short explanation is that many computational tasks have a lot in common with graphics computations. GPUs are built to work on large numbers of pixels at once, and since the operations on these pixels are almost identical, mainly involving floating point values, they can be run in parallel on dedicated hardware tailored and optimized for this particular task (i.e. the GPU). Working with a grid of pixels might sound familiar if one already works with a discrete grid in e.g. atmospheric simulation, which hints at why GPUs can be interesting in a computational context.
Since GPUs are optimized for working on grids of data and transforming that data, they are well suited for matrix calculations. As an indication of this, we can compare the theoretical performance of one GPU with one CPU.
| | AMD Epyc 7552 (CPU) | Nvidia A100 (GPU) |
|---|---|---|
| Half Precision | N/A | 78 TFLOPS |
| Single Precision | 1.5 TFLOPS | 19.5 TFLOPS |
| Double Precision | N/A | 9.7 TFLOPS |
Based on this it is no wonder that tensor libraries such as TensorFlow and PyTorch report speedups on accelerators of between 23x and 190x compared to using only a CPU.
Getting started
Of the resources provided on Fox, only the accel job type currently has GPUs available. To access these one has to select the correct partition as well as request one or more GPUs to utilize. To select the correct partition, use the --partition=accel flag with either srun or salloc, or in your Slurm script. This flag ensures that your job is only run on machines in the accel partition, which have attached GPUs. However, to be able to actually interact with one or more GPUs we also have to add --gpus=N, which tells Slurm that we would like to use N GPUs (read more about available flags in the official Slurm documentation). Each accel node in Fox contains four GPUs, i.e. N above can be set to {1, 2, 3, 4}.
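As a quick illustration (a minimal sketch: the account is a placeholder and the resource values are example numbers, not recommendations), requesting two GPUs for a single command could look like this:
$ srun --account=ec<XX> --partition=accel --gpus=2 --time=00:02:00 --mem-per-cpu=1G nvidia-smi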
Note: Research groups are already contributing to the growth of Fox, and as such Fox already contains nodes with different GPU configurations. To learn how to select different GPUs, see our documentation on selecting GPUs in Slurm.
Connecting to the cluster
To get started we first have to SSH into Fox:
$ ssh <username>@fox.educloud.no
Interactive testing
All projects should have access to GPU resources, and to that end we will start by simply testing that we can get access to a single GPU. To do this we will run an interactive job on the accel partition, asking for a single GPU.
$ salloc --account=ec<XX> --ntasks=1 --mem-per-cpu=1G --time=00:05:00 --qos=devel --partition=accel --gpus=1
$ nvidia-smi
The two commands above should result in something like:
Thu Jun 17 08:49:01 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 465.19.01 Driver Version: 465.19.01 CUDA Version: 11.3 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A100-PCI... Off | 00000000:24:00.0 Off | 0 |
| N/A 28C P0 33W / 250W | 0MiB / 40536MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
Note: In the above Slurm specification we combined --qos=devel with GPUs and interactive operations so that we can experiment with commands interactively. This can be a good way to perform short tests to ensure that libraries correctly pick up GPUs when developing your experiments. Read more about --qos=devel in our guide on interactive jobs.
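Once the interactive session has started, a quick sanity check is to load a GPU-enabled library and ask it which devices it can see. The sketch below assumes the TensorFlow/2.4.1-fosscuda-2020b module that is also used in the Slurm examples further down:
$ module load TensorFlow/2.4.1-fosscuda-2020b
$ python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
If the allocation includes a GPU, the printed list should contain one physical GPU device; an empty list means the library did not detect any.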
Slurm script testing
The next thing we will try is to use the TensorFlow/2.4.1-fosscuda-2020b library to execute a very simple computation on the GPU. We could do the following interactively in Python, but to introduce Slurm scripts we will now make a quick transition (which can also make it a bit easier, since we don't have to sit and wait for the interactive session to start).
We will use the following simple calculation in Python and TensorFlow to test the GPUs of Fox:
#!/usr/bin/env python3
import tensorflow as tf
# Test if there are any GPUs available
print(f"Num GPUs Available: {len(tf.config.list_physical_devices('GPU'))}")
# Have Tensorflow output where computations are run
tf.debugging.set_log_device_placement(True)
# Create some tensors
a = tf.constant([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
b = tf.constant([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
c = tf.matmul(a, b)
# Print result
print(c)
Save the above as gpu_intro.py on Fox.
To run this we first have to create a Slurm script in which we request resources. A good place to start is with a basic job script (see Job Scripts). Use the following to create submit_gpu.sh (remember to substitute your project number under --account):
#!/bin/bash
#SBATCH --job-name=TestGPUOnFox
#SBATCH --account=ec<XX>
#SBATCH --time=05:00
#SBATCH --mem-per-cpu=512M
#SBATCH --qos=devel
## Set up job environment:
set -o errexit # Exit the script on any error
set -o nounset # Treat any unset variables as an error
module --quiet purge # Reset the modules to the system default
module load TensorFlow/2.4.1-fosscuda-2020b
module list
python gpu_intro.py
If we just run the above Slurm script with sbatch submit_gpu.sh, the output (found in the same directory as you executed the sbatch command, with a name like slurm-<job-id>.out) will contain several errors as TensorFlow attempts to communicate with the GPU; however, the program will still run and give the following successful output:
Num GPUs Available: 0
tf.Tensor(
[[22. 28.]
[49. 64.]], shape=(2, 2), dtype=float32)
So the above eventually ran fine, but did not report any GPUs. The reason for this is of course that we never asked for any GPUs in the first place. To remedy this we will change the Slurm script to include --partition=accel and --gpus=1, as follows:
#!/bin/bash
#SBATCH --job-name=TestGPUOnFox
#SBATCH --account=ec<XX>
#SBATCH --time=05:00
#SBATCH --mem-per-cpu=512M
#SBATCH --qos=devel
#SBATCH --partition=accel
#SBATCH --gpus=1
## Set up job environment:
set -o errexit # Exit the script on any error
set -o nounset # Treat any unset variables as an error
module --quiet purge # Reset the modules to the system default
module load TensorFlow/2.4.1-fosscuda-2020b
module list
python gpu_intro.py
We should now see the following output:
Num GPUs Available: 1
tf.Tensor(
[[22. 28.]
[49. 64.]], shape=(2, 2), dtype=float32)
However, with complicated libraries such as TensorFlow we are still not guaranteed that the above actually ran on the GPU. There is some output to verify this, but we will instead check it manually, as that approach can be applied more generally.
Monitoring the GPUs
To monitor the GPU(s), we will start nvidia-smi before our job and let it run while we use the GPU. We will change the submit_gpu.sh Slurm script above into submit_monitor.sh, shown below:
#!/bin/bash
#SBATCH --job-name=TestGPUOnFox
#SBATCH --account=ec<XX>
#SBATCH --time=05:00
#SBATCH --mem-per-cpu=512M
#SBATCH --qos=devel
#SBATCH --partition=accel
#SBATCH --gpus=1
## Set up job environment:
set -o errexit # Exit the script on any error
set -o nounset # Treat any unset variables as an error
module --quiet purge # Reset the modules to the system default
module load TensorFlow/2.4.1-fosscuda-2020b
module list
# Setup monitoring
nvidia-smi --query-gpu=timestamp,utilization.gpu,utilization.memory \
--format=csv --loop=1 > "gpu_util-$SLURM_JOB_ID.csv" &
NVIDIA_MONITOR_PID=$! # Capture PID of monitoring process
# Run our computation
python gpu_intro.py
# After computation stop monitoring
kill -SIGINT "$NVIDIA_MONITOR_PID"
Note: The query used to monitor the GPU can be further extended by adding additional parameters to the --query-gpu flag. Check available options here.
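As a sketch of such an extension (the extra fields are assumptions on our part; verify them with nvidia-smi --help-query-gpu on the node you run on), the monitoring line in submit_monitor.sh could also log memory use and power draw:
# Extended monitoring: additionally log GPU memory use and power draw
nvidia-smi --query-gpu=timestamp,utilization.gpu,utilization.memory,memory.used,power.draw \
--format=csv --loop=1 > "gpu_util-$SLURM_JOB_ID.csv" &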
Run this script with sbatch submit_monitor.sh to test if the output gpu_util-<job id>.csv actually contains some data. We can then use this data to ensure that we are actually using the GPU as intended. Pay specific attention to utilization.gpu, which shows the percentage of how much processing the GPU is doing. It is not expected that this will always be 100%, since we also need to transfer data, but the average should be quite high.
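As an optional, rough way to summarize the CSV, the following awk one-liner averages the utilization.gpu column (the second field with the query used above); treat it as a sketch, since lines reporting unsupported values would need extra handling:
$ awk -F',' 'NR > 1 { gsub(/[ %]/, "", $2); sum += $2; n++ } END { if (n > 0) print sum / n " % average GPU utilization" }' gpu_util-<job id>.csv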
Note: We are working on automating the above monitoring solution so that all GPU jobs output similar statistics. In the meantime, the above solution can help indicate your job's resource utilization.
CC Attribution: This page is maintained by the University of Oslo IT FFU-BT group. It has either been modified from, or is a derivative of, "Introduction to using GPU compute" by NRIS under CC-BY-4.0. Changes: Removed "Next steps" section.