
Using GPUs on a Slurm Cluster

With the appropriate Slurm, hpc-client, and compute engine configuration, gears can be scheduled on a GPU node by the Slurm scheduler.

Configuration

Both Slurm and fw-cast must be configured appropriately to enable GPU execution of gears on a Slurm Cluster.

Slurm Configuration

To execute gears on GPU nodes of a Slurm cluster, Slurm itself must be configured correctly. Your system administrator will most likely manage these settings. Below are examples taken from a working configuration. If you do not see settings like these on the nodes of your Slurm cluster, it is likely not set up for GPU execution.

Detailed instructions for configuring Slurm, including a Slurm Configuration Tool, can be found at https://slurm.schedmd.com/.
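
Before changing anything, it can help to see whether GPUs are already advertised to Slurm as generic resources. The following is a quick sketch using standard Slurm commands; node names and output will differ on your cluster.

    # List every node together with the generic resources (GRES) it advertises.
    # Nodes already set up for GPU scheduling will show something like gpu:tesla:1.
    sinfo -N -o "%N %G"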

slurm.conf

The slurm.conf file is typically found in /etc/slurm/. Below is an example of a node definition that enables GPUs to be scheduled.

NodeName=scien-hpc-gpu Gres=gpu:tesla:1 CPUs=4 Boards=1 SocketsPerBoard=1 CoresPerSocket=2 ThreadsPerCore=2 RealMemory=14978

The Generic RESource (GRES) flag (Gres=gpu:tesla:1) must be present to indicate the resource name (e.g. "gpu"), the resource type (e.g. "tesla"), and the number of such resources on the node ("1"). Execution on more than one GPU per node has not yet been explored.

If desired, the remainder of the node configuration (e.g. CPUs, RealMemory) can be obtained by running the following command on the compute node itself:

slurmd -C

gres.conf

The Generic RESource (GRES) configuration, gres.conf, needs to have an entry for each resource named in slurm.conf.

NodeName=scien-hpc-gpu Name=gpu Type=tesla File=/dev/nvidia0

Here File=/dev/nvidia0 refers to the device file through which the GPU is accessed.
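
After both files are updated and the Slurm daemons have been restarted, it is worth verifying that the scheduler can actually see and allocate the GPU. The commands below are a sketch: they reuse the example node name scien-hpc-gpu and assume the NVIDIA driver (and therefore nvidia-smi) is installed on the GPU node.

    # Confirm the controller associates the GPU GRES with the node.
    scontrol show node scien-hpc-gpu | grep -i gres

    # Explicitly request the GRES and run nvidia-smi on the allocated node;
    # if the configuration is correct, this runs on the GPU node and lists the GPU.
    srun --gres=gpu:tesla:1 nvidia-smi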

Updating the fw-cast settings

It is recommended that you replace the script section of your settings/cast.yml file with the script section of the examples/settings/gpu_cast.yml file. This is also shown below.

    script: |+
      #!/bin/bash
      #SBATCH --job-name=fw-{{job.fw_id}}
      #SBATCH --ntasks=1
      #SBATCH --cpus-per-task={{job.cpu}}
      #SBATCH --mem-per-cpu={{job.ram}}
      {% if job.gpu %}#SBATCH --gpus-per-node={{job.gpu}}{% endif %}
      #SBATCH --output {{script_log_path}}

      set -euo pipefail

      source "{{cast_path}}/settings/credentials.sh"
      cd "{{engine_run_path}}"

      set -x
      srun ./engine run --single-job {{job.fw_id}}
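
For reference, this is roughly what the rendered batch script looks like once the template variables are filled in. The Flywheel job id, resource values, and paths below are purely illustrative placeholders, not values your deployment will produce.

    #!/bin/bash
    # Illustrative rendering of the template above; all ids, sizes, and paths are placeholders.
    #SBATCH --job-name=fw-12345
    #SBATCH --ntasks=1
    #SBATCH --cpus-per-task=2
    #SBATCH --mem-per-cpu=4096
    #SBATCH --gpus-per-node=1
    #SBATCH --output /path/to/logs/fw-12345.log

    set -euo pipefail

    source "/path/to/fw-cast/settings/credentials.sh"
    cd "/path/to/engine"

    set -x
    srun ./engine run --single-job 12345

Note that the --gpus-per-node line is emitted only when job.gpu is set; CPU-only jobs simply omit it.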

Compute Engine

Ensure that you have a Compute Engine installed that has been compiled after 2024-02-01. Please contact Flywheel staff to get an updated Flywheel engine.

After receiving the updated Flywheel engine, install it as per the instructions found in this document.

Gear Execution

With the rest of the workflow configured, adding a gpu tag (in addition to the hpc tag) when launching the gear will cause it to be scheduled on a GPU node of the Slurm cluster.
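
Once such a job has been submitted to Slurm by the hpc-client, you can confirm from the cluster side that the GPU request was actually passed through. This is a sketch using standard Slurm client commands; the fw-<id> job name comes from the cast template above, and <slurm_job_id> is a placeholder.

    # Show your jobs with their state and the generic resources each one requested
    # (%b prints the GRES requested by the job).
    squeue -u "$USER" -o "%i %j %T %b"

    # For a single job, scontrol shows the full submission, including the GPU/TRES request.
    scontrol show job <slurm_job_id>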

Note: If your site already uses the gpu tag for launching another engine on Flywheel and those jobs are not routed through the HPC Hold engine, please contact Flywheel staff.

Potential Problems

  • Without the gpu tag present on gear launch, any node meeting the remaining resource criteria can be scheduled, not necessarily a GPU node.
  • If cast.yml does not include the line with --gpus-per-node, jobs will not request a GPU and will be scheduled as CPU-only jobs.
  • If no GPU nodes are available on the cluster, the job will remain in a pending state until one becomes available (see the diagnostic sketch below).
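
If a job seems stuck, the following sketch can help narrow down the cause; it assumes only standard Slurm client commands, and <slurm_job_id> is a placeholder.

    # The Reason column (e.g. Resources, Priority) explains what a pending job is waiting for.
    squeue -j <slurm_job_id> -o "%i %T %r"

    # Check whether any GPU nodes are currently in a state (idle, mixed) that could accept the job.
    sinfo -N -o "%N %G %t"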