Platform R: Using the GPUs in Slurm

The Platform R cluster has a mix of GPUs from Nvidia: L40Ss and H200s. You can see what's available with sinfo -o "%5D %35N %30G"

[Output from sinfo showing GPU information]

In Slurm terminology a GPU is a kind of generic resource (GRES). The output format describing a GPU resource is gpu:<GPU type>:<GPU count>. You can see, for example, that there are eight nodes, each with four Nvidia L40S GPUs.
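For example, a node with four L40S GPUs will report a GRES field along the lines of gpu:nvidia_l40s:4 (the type string matches the GPU names listed later in this document).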

Updating Your Environment to Use GPUs

CUDA is an API that gives programs access to Nvidia GPUs. Your software's documentation will typically say which versions of CUDA it runs with. Some tools can discover the installed version for themselves, but it is not uncommon to need to pick a library version that matches the version of CUDA installed on the Platform R nodes.

To help your software find CUDA, you need to make a few changes to your shell environment. You can set these in an interactive session, in your .bashrc file, or in a Slurm batch submit file.

export CUDA_HOME=/usr/local/cuda
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
export PATH=$PATH:/usr/local/cuda/bin

Not every application will need all of these settings, but all of them together should cover the majority of situations you will encounter. Some libraries and software may require additional environment settings, which will be in the software documentation.
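A quick way to confirm the PATH change took effect, on a node where CUDA is installed (for example a GPU node), is to ask the shell where it finds the Nvidia compiler:

which nvcc    # should print /usr/local/cuda/bin/nvcc if the settings above are in place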

Discover the CUDA Version

First, connect to a node with a GPU: srun --gres=gpu:1 --pty bash -i -l

Next, ask the Nvidia compiler its version (make sure you have updated your PATH as given above): nvcc --version

[Output of nvcc --version showing the CUDA version]

This shows that version 12.8 is the default version of CUDA at the time this command was run.

See GPU Activity

The command nvidia-smi shows which GPUs are present, whether they are busy, and which processes are using them.
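If you want to keep an eye on GPU activity while a job runs, one common approach is to refresh the display every couple of seconds from a second shell on the same node (assuming the standard watch utility is available):

watch -n 2 nvidia-smi    # press Ctrl-C to stop watching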

Slurm GPU Options

GPU resources are requested with the --gres command line option to sbatch or srun.

Command Line Option          Meaning
--gres=gpu:1                 one GPU of any kind
--gres=gpu:2                 two GPUs of any kind
--gres=gpu:<named-gpu>:1     one GPU of the named kind
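As a sketch, a batch submit file that requests a GPU and sets the environment variables from earlier might look like the following (the job name, output file, and application are placeholders):

#!/bin/bash
#SBATCH --job-name=gpu-example      # hypothetical job name
#SBATCH --gres=gpu:1                # one GPU of any kind
#SBATCH --output=gpu-example.out    # hypothetical output file

export CUDA_HOME=/usr/local/cuda
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
export PATH=$PATH:/usr/local/cuda/bin

nvidia-smi                          # record which GPU the job was given
./my_gpu_program                    # placeholder for your application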

GPUs currently available on Platform R are:

GPU Name          Use Cases
nvidia_l40s       graphics, general compute, smaller machine learning workloads
nvidia_h200       image processing, generative AI, and LLM training and inference
nvidia_h200_nvl   multi-GPU jobs, or jobs where the memory of a single GPU is insufficient
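For example, to start an interactive session on a node with one H200 GPU, combine the named form with the srun command shown earlier:

srun --gres=gpu:nvidia_h200:1 --pty bash -i -l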

Container Options

To enable a container to use the GPU, you need to include the --nv command line option with sbatch or srun (in addition to the environment variables listed above).

See Platform R: Using Containers in Slurm for examples of interactive and batch workflows with containers using GPUs.
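As a minimal sketch, assuming containers on Platform R are run with Apptainer (the containers article above has the exact workflow), a GPU-enabled container run might look like:

srun --gres=gpu:1 apptainer exec --nv my_image.sif nvidia-smi    # my_image.sif is a placeholder image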

Applications

Some applications need specific instructions during installation to get the correct GPU information and support libraries.

ollama

If you install Ollama using Conda, the default is to grab the newest version of Ollama with support for the newest CUDA libraries. You may need to adjust the CUDA library version to match the Platform R version of CUDA: if Ollama is built against a CUDA release whose major version differs from the one installed on Platform R, it will likely not discover the GPU.

Running conda search ollama will give a long list of versions and builds of Ollama. The build string encodes the CUDA version the package was built against. For example, cuda_129h63bc95c_0 is built against CUDA 12.9. There are also Ollama builds using CUDA 13.0, which will not see the GPUs on Platform R because their major version does not match the installed CUDA version, which is 12.8, as we saw above.

To select the right version and build, run this:

conda install ollama=0.13.4=cuda_129h63bc95c_0
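To confirm which build was actually installed, you can list the package afterwards; the Build column should show the CUDA 12.x build you selected:

conda list ollama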

tensorflow

When installing TensorFlow from Conda, use the tensorflow-gpu package instead of plain tensorflow. This ensures that the GPU library dependencies are installed.
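Once installed, a quick check that TensorFlow can see a GPU (run on a node with a GPU allocated, and assuming a TensorFlow 2.x build) is:

python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"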

Related Documentation

Platform R: Using Containers in Slurm