One of the bigger productivity improvements for me when tweaking and benchmarking algorithms was the ability to run Julia code on my university’s HPC nodes. This allowed me to test different versions of my code in parallel while still being able to do other work on my local machine. This is a quick tutorial on how to set up Julia (sidenote: This tutorial is intended for a Julia project, but most of the content applies if you want to run jobs in C++/R/Python/etc. ) on a computing cluster and use SLURM to manage compute jobs.
I am writing this based on my experience with the University of Oxford HPC cluster (Arcus-HTC), which runs CentOS Linux 7 and uses SLURM for job scheduling. This means that, depending on your university / company setup, your experience may differ. In particular, the ARC login nodes are connected to the internet, which makes installing Julia packages a lot easier.
Login and installing Julia
Once you have an account for Oxford ARC, connect to the University VPN and log in via ssh:
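The exact login host depends on your setup; the hostname below is a placeholder, so use the one from the ARC documentation:

```bash
# Placeholder hostname - replace with the ARC login host from the documentation
ssh your_username@arc-login.arc.ox.ac.uk
```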
You’ll likely connect to one of the cluster’s login nodes. Oxford ARC has different partitions you can choose for your jobs:
- Arcus-B for multi-node parallel computation
- Arcus-HTC for high-throughput, lower core-count jobs.
I want to use the HTC nodes, so I will again connect to them via ssh:
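Assuming the HTC login nodes are reachable from the first login node under a short internal hostname (an assumption; check the ARC documentation for the exact name):

```bash
# Internal hostname is an assumption - check the ARC documentation
ssh arcus-htc
```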
On the login node, two important folders are created for each user:
- `$HOME/` - points to the home directory associated with your user account
- `$DATA/` - a folder to store larger files
To install Julia (sidenote: Some version of Julia/Python/R/Matlab might already be available on ARC. Check with `module avail`. ) we simply download the Julia binaries to our `$DATA/` folder and unzip them:
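A sketch of the download step, using the official 1.5.3 Linux binaries referenced later in this post (adjust the version as needed):

```bash
cd $DATA
# Download and unpack the official Julia binaries
wget https://julialang-s3.julialang.org/bin/linux/x64/1.5/julia-1.5.3-linux-x86_64.tar.gz
tar -xzf julia-1.5.3-linux-x86_64.tar.gz
```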
To run Julia on the login node we can type:
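Assuming the binaries were unpacked into `$DATA/` as above:

```bash
$DATA/julia-1.5.3-linux-x86_64/bin/julia
```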
(You should generally not run any significant computations on the login nodes, but it is a good way to check that everything works. It is also a good idea to create a symlink or alias to `$DATA/julia-1.5.3-linux-x86_64/bin/julia`.)
Installing Julia packages
You’ll find your Julia home folder at `$HOME/.julia/`. Since the ARC nodes are connected to the internet, you can simply install any packages using the normal workflow. I am assuming here that we want to set up some scripts in the project folder `$DATA/myproject/`:
1. Run Julia in the project folder
2. Type `]` to enter the package manager and create a new project environment: `pkg> activate .`
3. Install packages: `pkg> add Statistics`
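In the REPL this looks roughly like the following (the project name shown in the prompt is simply the folder name):

```
# Started from $DATA/myproject/ - press ] to enter the package manager
(@v1.5) pkg> activate .
(myproject) pkg> add Statistics
```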
Uploading your code
This is a matter of taste, but I prefer to be able to modify any of my code both locally and remotely. I therefore initialise `$DATA/myproject` as a Git repository. (sidenote: A quick overview to get started with Git can be found here. )
I can then edit a local copy and `git push` any of the local changes, or vice versa. The project dependencies for this project are tracked by Julia in the `Project.toml` and `Manifest.toml` files. Changes in both files are also tracked via Git to make sure the same dependencies are used locally and remotely.
This setup allows me to test or debug any changes to my code locally and be certain that it will run the same way on the remote node. Debugging the code on the remote node is more time-consuming because your compute jobs do not necessarily execute immediately. For example, it might take several minutes until you receive the error message that you misspelled a function name.
Sometimes I want to upload files from my machine without tracking them via git, e.g. large dataset files like `dataset.csv`. To transfer files I use `scp`. (sidenote: Syntax: `scp [OPTION] [user@SRC_HOST:]file1 [user@DEST_HOST:]file2` ) To upload `dataset.csv` I simply type:
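Run from the local machine; the hostname and the remote path (the value of `$DATA` on the cluster) are placeholders:

```bash
# Upload dataset.csv into the project folder on the cluster
scp dataset.csv your_username@arc-login.arc.ox.ac.uk:/path/to/DATA/myproject/
```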
To transfer a result file `results.csv` back to the current folder on my local machine I can use the same command:
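Again run from the local machine, with placeholder host and path:

```bash
# Download results.csv from the cluster into the current local directory
scp your_username@arc-login.arc.ox.ac.uk:/path/to/DATA/myproject/results.csv .
```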
Scheduling jobs
Oxford ARC uses the SLURM workload manager to request and manage compute jobs. Assuming we are still in the folder `$DATA/myproject`, the basic workflow to create a new job is:
- write a Julia script `myscript.jl` to run your code
- manage all Julia project dependencies using the environment files `Project.toml` and `Manifest.toml`
- write a job submission script `run_job.sh` that requests compute resources and tells the compute nodes what to do
Let’s assume we want to run the following script `myscript.jl` to (inefficiently) compute and print the 20th Fibonacci number:
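A minimal sketch of such a script, using the naive recursive implementation (the exact code is an assumption):

```julia
# myscript.jl - naive recursive Fibonacci (intentionally inefficient)
function fib(n)
    n <= 1 && return n
    return fib(n - 1) + fib(n - 2)
end

println(fib(20))
```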
To run this script on one of the compute nodes we define a submission script `run_job.sh`:
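A sketch of such a submission script; the partition name, email address, and exact Julia path are assumptions that you will need to adapt:

```bash
#!/bin/bash
#SBATCH --job-name=fibonacci
#SBATCH --time=00:05:00                       # request 5 minutes of compute time
#SBATCH --ntasks=1                            # one CPU
#SBATCH --partition=htc                       # partition name is an assumption - check your cluster docs
#SBATCH --mail-type=ALL                       # send job status updates by email
#SBATCH --mail-user=your.name@example.ac.uk   # placeholder address

# Instantiate the project environment and run the script
$DATA/julia-1.5.3-linux-x86_64/bin/julia --project=. -e 'using Pkg; Pkg.instantiate(); include("myscript.jl")'
```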
Lines starting with `#SBATCH` are SLURM commands. We request 5 minutes of computation time on one CPU. We further want to run it on the ARC-HTC partition and get status updates about our job sent to our email.
The last line in the script calls Julia, instantiates the project environment, (sidenote: Suggested reading: Working with environments. ) i.e. installs any package dependencies defined in `Project.toml`, and runs our script.
We then submit the job from the shell with:
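```bash
sbatch run_job.sh
```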
Depending on the available resources, the job is then queued and waits for execution. You can look up the current job status with:
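```bash
squeue -u your_username   # replace with your ARC user name
```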
To cancel the job, use either of the following:
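For example, by job id or for all of your jobs at once:

```bash
scancel <job_id>          # cancel a specific job by its id
scancel -u your_username  # cancel all of your queued and running jobs
```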
Job arrays
Sometimes we want to run the same job multiple times with only minor modifications, e.g. running our algorithm with one of the hyperparameters changed. For this case SLURM job arrays are quite useful. They execute the submission script multiple times and allow you to run different versions of the same script.
The ability to run time-consuming jobs in parallel can be a big time saver.
Let’s assume that we want to calculate the Fibonacci number for `n = 20, 30, 40` in separate jobs. To achieve this we add one line with an array command to our submission script `run_job.sh`:
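For three jobs this could be (1-based indices, so they line up with Julia's array indexing):

```bash
#SBATCH --array=1-3
```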
The job will now run three times. The array id is given by the environment variable `SLURM_ARRAY_TASK_ID`, which we can use inside our Julia script to select `n`:
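A sketch of the modified script (the exact mapping from task id to `n` is an assumption):

```julia
# myscript.jl, modified to pick n based on the SLURM array task id
function fib(n)
    n <= 1 && return n
    return fib(n - 1) + fib(n - 2)
end

# SLURM_ARRAY_TASK_ID runs from 1 to 3, matching --array=1-3
n_values = [20, 30, 40]
n = n_values[parse(Int, ENV["SLURM_ARRAY_TASK_ID"])]

println(fib(n))
```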
Multithreading
To use multithreading in our Julia script we have to request multiple cores on a computing node. The following script `my_multithreaded_script.jl` is used to compute the Fibonacci number for different `n` on multiple threads: (sidenote: This is not the best example for multithreading, because computing the Fibonacci number for the highest `n` computes the Fibonacci number for all lower `n` in the process. )
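A sketch of what such a script could look like; the particular values of `n` are an assumption:

```julia
# my_multithreaded_script.jl - compute Fibonacci numbers for several n, one per thread
using Base.Threads

function fib(n)
    n <= 1 && return n
    return fib(n - 1) + fib(n - 2)
end

ns = [30, 32, 34, 36, 38, 40, 42, 44]  # example inputs, one per available thread
results = zeros(Int, length(ns))

# Distribute the loop iterations over the available Julia threads
@threads for i in eachindex(ns)
    results[i] = fib(ns[i])
end

println(results)
```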
To run this script we use the following modified SLURM submission script:
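A sketch of the modified submission script (partition name, email address, and Julia path are again assumptions):

```bash
#!/bin/bash
#SBATCH --job-name=fibonacci_mt
#SBATCH --time=00:10:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=8                     # request 8 cores on one node
#SBATCH --partition=htc                       # partition name is an assumption
#SBATCH --mail-type=ALL
#SBATCH --mail-user=your.name@example.ac.uk   # placeholder address

# Start Julia with 8 threads
export JULIA_NUM_THREADS=8
$DATA/julia-1.5.3-linux-x86_64/bin/julia --project=. -e 'using Pkg; Pkg.instantiate(); include("my_multithreaded_script.jl")'
```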
The only changes are that we now request 8 cores on one node and export the environment variable `JULIA_NUM_THREADS` to start Julia with 8 threads.
Ensuring consistent results
Let’s assume that you want to benchmark your algorithm, e.g. measure its execution time. In order to generate consistent results from multiple runs, you have to make sure that the Julia script is executed on the same hardware every time. If no hardware is specified SLURM will just run the job on the next available node. If the node has eight CPUs and your job runs on four of them, then the performance of your job will depend on what other jobs run on the remaining four CPUs.
To ensure consistent results (sidenote: There will still be some variance in your time measurement. If more accuracy is required, I suggest running the script a number of times (using a job array) and averaging the result. ) you will have to specify the node hardware and request all the CPUs on a node. Information about the different nodes within the Oxford HTC partition can be found here.
Let’s assume that we want to run our job on a SandyBridge E5-2650 (2GHz) node. We can request exclusive access to a whole node by using SLURM constraints. Just add the following lines to the `run_job.sh` submission script:
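For example (the exact feature string for the constraint is cluster-specific and a placeholder here; check the ARC node documentation for the correct name):

```bash
#SBATCH --exclusive                      # request the whole node for this job
#SBATCH --constraint='cpu_sku:E5-2650'   # feature name is a placeholder - use the one listed for your cluster
```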