High Performance Computing
Our group makes frequent use of distributed, highly parallel computing resources on campus. In particular, researchers should be fluent in interfacing with three computers during their time at UMD. These are:
- your local machine
- our in-house GPU workstation (Curiosity)
- UMD's supercomputing cluster (Zaratan).
While you're encouraged to configure / run your machine however you like, Curiosity and Zaratan have more formal requirements that are defined through SLURM.
Simple Linux Utility for Resource Management (SLURM)
Slurm is a tool that is used to schedule when different programs (jobs) should execute on shared hardware. Slurm takes information provided by the user (the expected amount of time the script will take to run, how many cores are needed, which GPUs, if any, to use, past usage data, etc.) and produces a job queue that distributes the resources equitably among users.
Example: Imagine you have three users of a single computer: Sally, Jose, and Sierra. Sally is an experienced HPC user who has expended most of her compute credits. She wants to execute a program that requires 128 CPU cores and will likely take 3 days to run. Jose has also used most of his credits, but has a simple script that requires 4 cores and will finish in approximately 15 minutes. Sierra hasn't used any credits this year, but has a 10 hour job that requires 2 GPUs.
Slurm will take these factors and produce a queue that would likely prioritize Jose, then Sierra, then Sally. You can imagine this scaling to dozens of users, each with their own specific computing needs, all relying on the same system.
Quick Start
Foremost, configure your local machine with the right tools to connect to these remote servers. I strongly recommend installing VS Code and its associated Remote-SSH extension. This will allow you to connect to the server right within your IDE, gaining access to your directory structure, debugging tools, and IntelliSense.
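If you go the Remote-SSH route, a host alias in your local SSH config saves a lot of typing. The sketch below assumes a hostname of `login.zaratan.umd.edu` and a placeholder username; check the Zaratan documentation for the current login address.
```bash
# Append a host alias to ~/.ssh/config. The HostName and User values are
# assumptions / placeholders -- substitute the address from the Zaratan docs
# and your own directory ID.
cat >> ~/.ssh/config <<'EOF'
Host zaratan
    HostName login.zaratan.umd.edu
    User your_directory_id
EOF

# Afterwards, connecting (from a terminal or VS Code's Remote-SSH prompt) is just:
ssh zaratan
```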
When you first SSH into the remote server, you'll be taken to something called a login node. The name is somewhat misleading, as you have already successfully logged in. Rather, this is a very, very low resource compute node that you'll use to navigate to more powerful nodes. In particular, from the login node, you have two options: 1) transition to a compile node via `acompile`, or 2) transition into an interactive node through `sinteractive`.
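Note that both of these are wrapper scripts provided by the cluster rather than core Slurm commands, so the options they accept vary from system to system; the calls below are just a sketch.
```bash
# Run from the login node.
acompile        # hop to a small compile node for editing, compiling, and submitting jobs
sinteractive    # request an interactive allocation on a compute node
```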
Compile Node
The compile node is slightly more powerful than a login node. It has 2 cores and slightly more memory. The purpose of a compile node, for our purposes, is to submit HPC jobs. In particular, you will write a batch script (`.sh`) that contains all of the necessary initialization (configuring your anaconda environment, compiling code) and the execution command for your script (e.g. `python my_script.py keyword_1 keyword_2`).
To actually execute the program, you'll need to submit your batch script to the Slurm queue through `sbatch your_bash_script.sh`. Note that for `sbatch` to work properly, you'll need to add Slurm directives to the top of your bash script. These directives specify your actual need for compute resources (i.e. number of CPUs, GPUs, anticipated compute time, etc.).
Example
```bash
#!/bin/bash
#SBATCH --nodes=1            # How many nodes you need (request 1 unless you've configured MPI)
#SBATCH --ntasks=1           # How many tasks (processes) srun will launch
#SBATCH --cpus-per-task=2    # How many cores you need
#SBATCH --time=01:30:00      # How long to run the script before terminating
#SBATCH --gres=gpu:1         # Generic resources (used here to request a GPU)
#SBATCH --partition=aa100    # What type of compute node you want (CPU or GPU?)
#SBATCH --output=SlurmFiles/test-%j-%a.out
module purge
echo "== Load Anaconda =="
module load slurm/alpine
module load anaconda
module load gcc
module load cudnn/8.6
export PATH=$PATH:/curc/sw/cuda/11.8/bin:/curc/sw/cuda/11.8/lib64
export XLA_FLAGS=--xla_gpu_cuda_data_dir=/curc/sw/cuda/11.8/
echo "== Activate Env =="
conda activate research
echo "== Running Script =="
srun python /projects/joma5012/GravNN/Scripts/Train/train_ablation_study.py
wait # Necessary to wait for all processes to finish
echo "== End of Job =="
exit 0
```
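Once the batch script is written, submission and monitoring use standard Slurm commands (the job ID below is just an illustrative value):
```bash
sbatch your_bash_script.sh   # prints something like "Submitted batch job 1234567"
squeue -u $USER              # see where your jobs sit in the queue
scancel 1234567              # cancel a job by ID if something went wrong
```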
Note that Slurm systems typically have prebuilt programs or libraries, called modules, that you can load before executing your script. In this case, we are loading the necessary CUDA libraries to ensure that our code can make full use of the GPU hardware.
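A few module commands are worth knowing. These are standard in Lmod / Environment Modules installations, though the module names and versions available will differ on your cluster:
```bash
module avail            # list every module the system provides
module spider cuda      # search for CUDA-related modules (Lmod systems only)
module load cudnn/8.6   # load a specific version
module list             # show what is currently loaded
module purge            # unload everything for a clean slate
```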
Interactive Node
The alternative to a compile node is an interactive node. Interactive nodes allow you to SSH directly into the hardware you are requesting and execute your code as if it were your local machine. This is typically best when you need to interact with your data on the fly. While this would seem universally desirable, depending on how many resources you request, it may take a long time before your request is granted. Therefore, it is typically advised that most work be submitted as batch jobs from a compile node, and that interactive nodes be reserved for quick debugging in low-compute settings.
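If a `sinteractive` wrapper isn't available, plain Slurm can do the same job; the partition name and limits below are placeholder values for illustration only.
```bash
# Request a small interactive shell directly from Slurm. Swap in a real
# partition name and sensible limits for your cluster.
srun --partition=debug --ntasks=1 --cpus-per-task=2 --time=00:30:00 --pty bash
```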
Job Arrays
When running large jobs, consider whether it's possible to break the job into smaller chunks and use job arrays. Job arrays are best explained through an example. Let's say you want to run a hyperparameter search: you have 1000 networks that you want to train with HPC resources, and each network would take 1 minute to train.
You have two options for how to proceed.
- Option 1: Request 1000 CPU cores for 1 minute.
- Option 2: Request 1 CPU core for 1 minute, 1,000 times.
Option 1 will yield considerably longer queue times than Option 2. This is because it's rare for 1000 cores to be available at any given moment on an HPC system. There are thousands of other jobs waiting in the queue, and all of them would have to clear out of the way before your little script could run. Alternatively, if you only need 1 core at a time, you're hardly making a dent in the queue system.
Most HPC systems never run at 100% capacity. There are always stray cores that need work to do, and by splitting your job into single-core pieces, you will have a steady stream of work getting done with relatively high priority. The way you accomplish this is by using job arrays.
In your bash script, add the following to your Slurm directives:
```bash
#SBATCH --array=0-100
#SBATCH --output=SlurmFiles/test-%j-%a.out

srun python /projects/joma5012/GravNN/Scripts/Train/train_ablation_study.py $SLURM_ARRAY_TASK_ID
```
By specifying the array directive, this script gets run once for every task ID from 0 to 100, each time with a different value of `$SLURM_ARRAY_TASK_ID`. This value can be used as a command line argument to select a particular index or hyperparameter configuration within your script. You can also separate the output of each job using the `%a` placeholder in the output directive.
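As one hedged example of wiring this up, the batch script itself can translate the task ID into a concrete configuration before calling Python. The script name, flag, and learning-rate values below are hypothetical placeholders, not the actual ablation-study interface.
```bash
# Pair this with "#SBATCH --array=0-3" so every task ID maps to a valid index.
LEARNING_RATES=(0.01 0.003 0.001 0.0003)   # placeholder hyperparameter values
LR=${LEARNING_RATES[$SLURM_ARRAY_TASK_ID]}

# Hypothetical script and flag, shown only to illustrate the pattern.
srun python train.py --learning_rate "$LR"
```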
Advanced Configuration
Make sure to configure your conda environments and packages to install to the larger `Projects/` directory. If you install to your home (login) directory, you'll likely run out of storage quota, which will make it extremely difficult to connect via SSH moving forward.
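A minimal sketch of pointing conda at a project directory (the `/projects` path below is a placeholder; use your own project space):
```bash
# Keep environments and the package cache out of your home directory.
conda config --add envs_dirs /projects/your_username/conda/envs
conda config --add pkgs_dirs /projects/your_username/conda/pkgs

# New environments will now be created under the project directory.
conda create --name research python
```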
Additionally, when configuring VS Code, be sure to move the extensions directory to `Projects/.vscode-server` for similar reasons.
!!! note "Configure VS Code extensions to the /Projects/ directory"
    You'll have to ensure that your python