Using TensorFlow on GPU's in Fox

The intention of this guide is to teach students and researchers with access to Fox how to use machine learning libraries on CPU and GPU. The guide is optimized for TensorFlow 2, however, we hope that if you utilize other libraries this guide still holds some value. Do not hesitate to {ref}contact us <support-line> for additional assistance.

For the rest of this guide many of the examples ask for Fox resources with GPU. This is achieved with the --partition=accel --gres=gpu:1 ({ref}job_scripts_saga_accel), however, TensorFlow does not require the use of a GPU so for testing it is recommended to not ask for GPU resources (to be scheduled quicker) and then, once your experiments are ready to run for longer, add back inn the request for GPU.

A complete example, both python and Slurm file, can be found at {download}files/mnist.py and {download}files/run_mnist.sh.

Installing python libraries

The preferred way to "install" the necessary machine learning libraries is to load one of the pre-built {ref}modules <module-scheme> below. By using the built-in modules any required third-party module is automatically loaded and ready for use, minimizing the amount of packages to load.

When loading modules pressing `<tab>` gives you
autocomplete options.
It can be useful to do an initial `module purge` to
ensure nothing from previous experiments is loaded before
loading modules for the first time.
Modules are regularly updated so if you would like a newer
version, than what is listed above, use `module avail |
less` to browse all available packages.

If you need additional python libraries it is recommended to create a virtualenv environment and install packages there. This increases reproducibility and makes it easy to test different packages without needing to install libraries globally.

Once the desired module is loaded, create a new virtualenv environment as follows.

# The following will create a new folder called 'tensor_env' which will hold
# our 'virtualenv' and installed packages
$ virtualenv -p python3 tensor_env
# Next we need to activate the new environment
# NOTE: The 'Python' module loaded above must be loaded for activation to
# function, this is important when logging in and out or doing a 'module purge'
$ source tensor_env/bin/activate
# If you need to do other python related stuff outside the virtualenv you will
# need to 'deactivate' the environment with the following
$ deactivate

Once the environment is activated, new packages can be installed by using pip install <package>. If you end up using additional packages make sure that the virtualenv is activated in your {ref}job-scripts.

# Often useful to purge modules before running experiments
module purge

# Load machine learning library along with python
module load TensorFlow/2.2.0-fosscuda-2019b-Python-3.7.4
# Activate virtualenv
source $SLURM_SUBMIT_DIR/tensor_env/bin/activate

Manual route

If you still would like to install packages through pip the following will guide you through how to install the latest version of TensorFlow and load the necessary packages for GPU compute.

To start, load the desired python version - here we will use the newest as of writing.

$ module load Python/3.8.2-GCCcore-9.3.0
This has to be done on the login node so that we have access to
the internet and can download `pip` packages.

Then create a virtual environment which we will install packages into.

# The following will create a new folder called 'tensor_env' which will hold
# our 'virtualenv' and installed packages
$ virtualenv -p python3 tensor_env
# Next we need to activate the new environment
# NOTE: The 'Python' module loaded above must be loaded for activation to
# function, this is important when logging in and out or doing a 'module purge'
$ source tensor_env/bin/activate
# If you need to do other python related stuff outside the virtualenv you will
# need to 'deactivate' the environment with the following
$ deactivate

Next we will install the latest version of TensorFlow 2 which fortunately should support GPU compute directly without any other prerequisites.

$ pip install tensorflow

To ensure that the above install worked, start an interactive session with python and run the following:

>>> import tensorflow as tf
>>> tf.test.is_built_with_cuda()
# Should respond with 'True' if it worked

The import might show some error messages related to loading CUDA, however, for now we just wanted to see that TensorFlow was installed and pre-compiled with CUDA (i.e. GPU) support.

This should be it for installing necessary libraries. Below we have listed the modules which will have to be loaded for GPU compute to work. The next code snippet should be in your {ref}job-scripts so that the correct modules are loaded on worker nodes and the virtual environment is activated.

# Often useful to purge modules before running experiments
module purge

# Load desired modules (replace with the exact modules you used above)
module load Python/3.8.2-GCCcore-9.3.0
module load CUDA/10.1.243
module load cuDNN/7.6.4.38
# Activate virtualenv
source $SLURM_SUBMIT_DIR/tensor_env/bin/activate

Loading data

For data that you intend to work on it is simplest to upload to your home area and if the dataset is small enough simply load from there onto the worker node.

To upload data to use rsync to transfer data:

# On your own machine, upload the dataset to your home folder
$ rsync -zh --info=progress2 -r /path/to/dataset/folder <username>@fox.educloud.no:~/.

For large amounts of data it is recommended to load into your {ref}project-area to avoid filling your home area.

To retrieving the path to the dataset, we can utilize python's os module to access the variable, like so:

# ...Somewhere in your experiment python file...
# Load 'os' module to get access to environment and path functions
import os

# Path to dataset
dataset_path = os.path.join(os.environ['SLURM_SUBMIT_DIR'], 'dataset')

Loading built-in datasets

First we will need to download the dataset on the login node. Ensure that the correct modules are loaded. Next open up an interactive python session with python, then:

>>> tf.keras.datasets.mnist.load_data()

This will download and cache the MNIST dataset which we can use for training models. Load the data in your training file like so:

(train_images, train_labels), (test_images, test_labels) = tf.keras.datasets.mnist.load_data()

Saving model data

For saving model data and weights we suggest the TensorFlow built-in checkpointing and save functions.

The following code snippet is a more or less complete example of how to load built-in data and save weights.

#!/usr/bin/env python

# Assumed to be 'mnist.py'

import tensorflow as tf
import os

# Access storage path for '$SLURM_SUBMIT_DIR'
storage_path = os.path.join(os.environ['SLURM_SUBMIT_DIR'],
			    os.environ['SLURM_JOB_ID'])

# Load dataset
mnist = tf.keras.datasets.mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255., x_test / 255.

def create_model():
        model = tf.keras.models.Sequential([
                tf.keras.layers.Flatten(input_shape=(28, 28)),
                tf.keras.layers.Dense(512, activation='relu'),
                tf.keras.layers.Dropout(0.2),
                tf.keras.layers.Dense(10, activation='softmax')
                ])
        model.compile(optimizer='adam',
                      loss=tf.losses.SparseCategoricalCrossentropy(from_logits=True),
                      metrics=['accuracy'])
        return model

# Create and display summary of model
model = create_model()
# Output, such as from the following command, is outputed into the '.out' file
# produced by 'sbatch'
model.summary()

# Save model in TensorFlow format
model.save(os.path.join(storage_path, "model"))

# Create checkpointing of weights
ckpt_path = os.path.join(storage_path, "checkpoints", "mnist-{epoch:04d}.ckpt")
ckpt_callback = tf.keras.callbacks.ModelCheckpoint(
        filepath=ckpt_path,
        save_weights_only=True,
        verbose=1)

# Save initial weights
model.save_weights(ckpt_path.format(epoch=0))

# Train model with checkpointing
model.fit(x_train[:1000], y_train[:1000],
          epochs=50,
          callbacks=[ckpt_callback],
          validation_data=(x_test[:1000], y_test[:1000]),
          verbose=0)

The above file can be run with the following {ref}job-scripts which will ensure that correct modules are loaded and results are copied back into your home directory.

#!/usr/bin/bash

# Assumed to be 'mnist_test.sh'

#SBATCH --account=<your_account>
#SBATCH --job-name=<creative_job_name>
#SBATCH --ntasks=1
#SBATCH --mem-per-cpu=8G
## The following line can be omitted to run on CPU alone
#SBATCH --partition=accel --gres=gpu:1
#SBATCH --time=00:30:00

# Purge modules and load tensorflow
module purge
module load TensorFlow/2.2.0-fosscuda-2019b-Python-3.7.4
# List loaded modules for reproducibility
module list

# Run python script
python $SLURM_SUBMIT_DIR/mnist.py

Once these two files are located on a Fox resource we can run it with:

$ sbatch mnist_test.sh

And remember, in your code it is important to load the latest checkpoint if available, which can be retrieved with:

# Load weights from a previous run
ckpt_dir = os.path.join(os.environ['SLURM_SUBMIT_DIR'],
			"<job_id to load checkpoints from>",
			"checkpoints")
latest = tf.train.latest_checkpoint(ckpt_dir)

# Create a new model instance
model = create_model()

# Load the previously saved weights if they exist
if latest:
	model.load_weights(latest)

Using TensorBoard

TensorBoard is a nice utility for comparing different runs and viewing progress during optimization. To enable this on Fox resources we will need to write data into our home area and some steps are necessary for connecting and viewing the board.

We will continue to use the MNIST example from above. The following changes are needed to enable TensorBoard.

# In the 'mnist.py' script
import datetime

# We will store the 'TensorBoard' logs in the folder where the 'mnist_test.sh'
# file was launched and create a folder like 'logs/fit'. In your own code we
# recommended that you give these folders names that you will recognize,
# the last folder uses the time when the program was started to separate related
# runs
log_dir = os.path.join(os.environ['SLURM_SUBMIT_DIR'],
		       "logs",
		       "fit",
		       datetime.datetime.now().strftime("%Y%m%d-%H%M%S"))
tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir=log_dir, histogram_freq=1)

# Change the last line, where we fit our data in the example above, to also
# include the TensorBoard callback
model.fit(x_train[:1000], y_train[:1000],
          epochs=50,
	  # Change here:
          callbacks=[ckpt_callback, tensorboard_callback],
          validation_data=(x_test[:1000], y_test[:1000]),
          verbose=0)

Once you have started a job with the above code embedded, or have a previous run which created a TensorBoard log, it can be viewed as follows.

  1. Open up a terminal and connect to Fox as usual.
  2. In the new terminal on Fox, load TensorFlow and run tensorboard --logdir=/path/to/logs/fit --port=0.
  3. In the output from the above command note which port TensorBoard has started on, the last line should look something like: TensorBoard 2.1.0 at http://localhost:44124/ (Press CTRL+C to quit).
  4. Open up another terminal and this time connect to Fox using the following: ssh -L 6006:localhost:<port> <username>@fox.educloud.no where <port> is the port reported from step 3 (e.g. 44124 in our case).
  5. Open your browser and go to localhost:6006.

Advance topics

Using multiple GPUs

Since all of the GPU machines on Fox have four GPUs it can be beneficial for some workloads to distribute the work over more than one device at a time. This can be accomplished with the tf.distribute.MirroredStrategy.

As of writing, only the `MirroredStrategy` is fully
supported by `TensorFlow` which is limited to one
node at a time.

We will, again, continue to use the MNIST example from above. However, as we need some larger changes to the example we will recreate the whole example and try to highlight changes.

#!/usr/bin/env python

# Assumed to be 'mnist.py'

import datetime
import os
import tensorflow as tf

# Access storage path for '$SLURM_SUBMIT_DIR'
storage_path = os.path.join(os.environ['SLURM_SUBMIT_DIR'],
			    os.environ['SLURM_JOB_ID'])

## --- NEW ---
strategy = tf.distribute.MirroredStrategy()
print(f"Number of devices: {strategy.num_replicas_in_sync}")

# Calculate batch size
# For your own experiments you will likely need to adjust this based on testing
# on GPUs to find the 'optimal' size
BATCH_SIZE_PER_REPLICA = 64
BATCH_SIZE = BATCH_SIZE_PER_REPLICA * strategy.num_replicas_in_sync

# Load dataset
mnist = tf.keras.datasets.mnist
(x_train, y_train), _ = mnist.load_data()
x_train = x_train / 255.
## --- NEW ---
# NOTE: We need to create a 'Dataset' so that we can process the data in
# batches
train_dataset = tf.data.Dataset.from_tensor_slices((x_train, y_train)).shuffle(60000).repeat().batch(BATCH_SIZE)

def create_model():
        model = tf.keras.models.Sequential([
                tf.keras.layers.Flatten(input_shape=(28, 28)),
                tf.keras.layers.Dense(512, activation='relu'),
                tf.keras.layers.Dropout(0.2),
                tf.keras.layers.Dense(10, activation='softmax')
                ])
        model.compile(optimizer='adam',
                      loss=tf.losses.SparseCategoricalCrossentropy(from_logits=True),
                      metrics=['accuracy'])
        return model

# Create and display summary of model
## --- NEW ---
with strategy.scope():
	model = create_model()
# Output, such as from the following command, is outputed into the '.out' file
# produced by 'sbatch'
model.summary()
log_dir = os.path.join(os.environ['SLURM_SUBMIT_DIR'],
		       "logs",
		       "fit",
		       datetime.datetime.now().strftime("%Y%m%d-%H%M%S"))
tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir=log_dir, histogram_freq=1)

# Save model in TensorFlow format
model.save(os.path.join(storage_path, "model"))

# Create checkpointing of weights
ckpt_path = os.path.join(storage_path, "checkpoints", "mnist-{epoch:04d}.ckpt")
ckpt_callback = tf.keras.callbacks.ModelCheckpoint(
        filepath=ckpt_path,
        save_weights_only=True,
        verbose=1)

# Save initial weights
model.save_weights(ckpt_path.format(epoch=0))

# Train model with checkpointing
model.fit(train_dataset,
          epochs=50,
	  steps_per_epoch=70,
          callbacks=[ckpt_callback, tensorboard_callback])

Next we will use a slightly altered job script to ask for two GPUs to see if the above works.

#!/usr/bin/bash

# Assumed to be 'mnist_test.sh'

#SBATCH --account=<your_account>
#SBATCH --job-name=<creative_job_name>
#SBATCH --ntasks=1
#SBATCH --mem-per-cpu=8G
#SBATCH --partition=accel --gres=gpu:2
#SBATCH --time=00:30:00

# Purge modules and load tensorflow
module purge
module load TensorFlow/2.2.0-fosscuda-2019b-Python-3.7.4
# List loaded modules for reproducibility
module list

# Run python script
python $SLURM_SUBMIT_DIR/mnist.py

Distributed training on multiple nodes

To utilize more than four GPUs we will turn to the Horovod project which supports several different machine learning libraries and is capable of utilizing MPI. Horovod is responsible for communicating between different nodes and perform gradient computation, averaged over the different nodes.

Utilizing this library together with TensorFlow 2 requires minimal changes, however, there are a few things to be aware of in regards to scheduling with Slurm. The following example is based on the official TensorFlow example.

To install Horovod you will need to create a virtualenv as described above. Then once activated install the Horovod package with support for NCCL.

# This assumes that you have activated a 'virtualenv'
# $ source tensor_env/bin/activate
$ HOROVOD_GPU_OPERATIONS=NCCL pip install horovod

Then we can run our training using just a few modifications:

#!/usr/bin/env python

# Assumed to be 'mnist_hvd.py'

import datetime
import os
import tensorflow as tf
import horovod.tensorflow.keras as hvd

# Initialize Horovod.
hvd.init()

# Extract number of visible GPUs in order to pin them to MPI process
gpus = tf.config.experimental.list_physical_devices('GPU')
if hvd.rank == 0:
    print(f"Found the following GPUs: '{gpus}'")
# Allow memory growth on GPU, required by Horovod
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)
# Since multiple GPUs might be visible to multiple ranks it is important to
# bind the rank to a given GPU
if gpus:
    print(f"Rank '{hvd.local_rank()}/{hvd.rank()}' using GPU: '{gpus[hvd.local_rank()]}'")
    tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], 'GPU')
else:
    print(f"No GPU(s) configured for ({hvd.local_rank()}/{hvd.rank()})!")

# Access storage path for '$SLURM_SUBMIT_DIR'
storage_path = os.path.join(os.environ['SLURM_SUBMIT_DIR'],
                            os.environ['SLURM_JOB_ID'])

# Load dataset
mnist = tf.keras.datasets.mnist
(x_train, y_train), _ = mnist.load_data()
x_train = x_train / 255.

# Create dataset for batching
dataset = tf.data.Dataset.from_tensor_slices((x_train, y_train))
dataset = dataset.repeat().shuffle(10000).batch(128)

# Define learning rate as a function of number of GPUs
scaled_lr = 0.001 * hvd.size()


def create_model():
    model = tf.keras.models.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28)),
        tf.keras.layers.Dense(512, activation='relu'),
        tf.keras.layers.Dropout(0.2),
        tf.keras.layers.Dense(10, activation='softmax')
    ])
    # Horovod: adjust learning rate based on number of GPUs.
    opt = tf.optimizers.Adam(scaled_lr)
    model.compile(optimizer=opt,
                  loss=tf.losses.SparseCategoricalCrossentropy(from_logits=True),
                  metrics=['accuracy'],
                  experimental_run_tf_function=False)
    return model


# Create and display summary of model
model = create_model()
# Output, such as from the following command, is outputed into the '.out' file
# produced by 'sbatch'
if hvd.rank() == 0:
    model.summary()

# Create list of callback so we can separate callbacks based on rank
callbacks = [
    # Horovod: broadcast initial variable states from rank 0 to all other
    # processes.  This is necessary to ensure consistent initialization of all
    # workers when training is started with random weights or restored from a
    # checkpoint.
    hvd.callbacks.BroadcastGlobalVariablesCallback(0),

    # Horovod: average metrics among workers at the end of every epoch.
    #
    # Note: This callback must be in the list before the ReduceLROnPlateau,
    # TensorBoard or other metrics-based callbacks.
    hvd.callbacks.MetricAverageCallback(),

    # Horovod: using `lr = 1.0 * hvd.size()` from the very beginning leads to
    # worse final accuracy. Scale the learning rate `lr = 1.0` ---> `lr = 1.0 *
    # hvd.size()` during the first three epochs. See
    # https://arxiv.org/abs/1706.02677 for details.
    hvd.callbacks.LearningRateWarmupCallback(warmup_epochs=3,
                                             initial_lr=scaled_lr,
                                             verbose=1),
]

# Only perform the following actions on rank 0 to avoid all workers clash
if hvd.rank() == 0:
    # Tensorboard support
    log_dir = os.path.join(os.environ['SLURM_SUBMIT_DIR'],
                           "logs",
                           "fit",
                           datetime.datetime.now().strftime("%Y%m%d-%H%M%S"))
    tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir=log_dir,
                                                          histogram_freq=1)
    # Save model in TensorFlow format
    model.save(os.path.join(storage_path, "model"))
    # Create checkpointing of weights
    ckpt_path = os.path.join(storage_path,
                             "checkpoints",
                             "mnist-{epoch:04d}.ckpt")
    ckpt_callback = tf.keras.callbacks.ModelCheckpoint(
        filepath=ckpt_path,
        save_weights_only=True,
        verbose=0)
    # Save initial weights
    model.save_weights(ckpt_path.format(epoch=0))
    callbacks.extend([tensorboard_callback, ckpt_callback])

verbose = 1 if hvd.rank() == 0 else 0
# Train model with checkpointing
model.fit(x_train, y_train,
          steps_per_epoch=500 // hvd.size(),
          epochs=100,
          callbacks=callbacks,
          verbose=verbose)
When printing information it can be useful to use the `if
hvd.rank() == 0` idiom to avoid the same thing being
printed from every process.

This can then be scheduled with the following:

#!/usr/bin/bash

#SBATCH --account=<your project account>
#SBATCH --job-name=<fancy name>
#SBATCH --partition=accel --gres=gpu:4
#SBATCH --ntasks=6
#SBATCH --ntasks-per-node=4
#SBATCH --mem-per-cpu=8G
#SBATCH --time=00:30:00

# Purge modules and load tensorflow
module purge
module load TensorFlow/2.2.0-fosscuda-2019b-Python-3.7.4
source $SLURM_SUBMIT_DIR/tensor_env/bin/activate
# List loaded modules for reproducibility
module list

# Export settings expected by Horovod and mpirun
export OMPI_MCA_pml="ob1"
export HOROVOD_MPI_THREADS_DISABLE=1

# Run python script
srun python $SLURM_SUBMIT_DIR/mnist_hvd.py

Note especially the use of --gress=gpu:4 which means that each node allocated for the job will get four GPUs. The --ntasks-per-node=4 is necessary to force Slurm to allocate at most four task per available node, which corresponds to the number of GPUs per node on Fox. Also note that this means the example above will require two nodes without using all available compute - in effect oversubscribing. Due to this potential to oversubscribe it is recommended that you ask for a multiple of 4 for --ntasks to avoid oversubscribing and increase scheduling speed.

**Tip:**

In the future it should be possible to use
`--gpus-per-task=1` instead to simplify and not
oversubscribe on GPUs.

CC Attribution: This page is maintained by the University of Oslo IT FFU-BT group. It has either been modified from, or is a derivative of, "TensorFlow on GPU" by NRIS under CC-BY-4.0.