One of the bigger productivity improvements for me when tweaking and benchmarking algorithms was the ability to run Julia code on my university’s HPC nodes. This allowed me to test different versions of my code in parallel while still being able to do other work on my local machine. This is a quick tutorial on how to install Julia (sidenote: This tutorial is intended for a Julia project, but most of the content applies if you want to run jobs in C++/R/Python/etc. ) on a computing cluster and use SLURM to manage compute jobs.

I am writing this based on my experience with the University of Oxford HPC cluster (Arcus-HTC), which runs CentOS Linux 7 and uses SLURM for job scheduling. Depending on your university or company setup, your experience may therefore differ. In particular, the ARC login nodes are connected to the internet, which makes installing Julia packages a lot easier.

Login and installing Julia

Once you have an account for Oxford ARC, connect to the University VPN and login via ssh:

ssh -X USERNAME@oscgate.arc.ox.ac.uk

You’ll likely connect to one of the cluster’s login nodes. Oxford ARC has different partitions you can choose for your jobs:

  • Arcus-B for multi-node parallel computation
  • Arcus-HTC for high-throughput, lower core-count jobs

I want to use the HTC nodes, so I will again connect to them via ssh:

ssh -X arcus-htc

On the login node, two important folders are available for each user:

  • $HOME/ - points to the home directory associated with your user account
  • $DATA/ - points to a folder for storing larger files

To install Julia (sidenote: Some version of Julia/Python/R/Matlab might already be available on ARC. Check with module avail. ) we simply download the Julia binaries to our $DATA/ folder and unzip them:

cd $DATA
wget https://julialang-s3.julialang.org/bin/linux/x64/1.5/julia-1.5.3-linux-x86_64.tar.gz
tar -xzf julia-1.5.3-linux-x86_64.tar.gz

To run Julia on the login node we can type:

./julia-1.5.3-linux-x86_64/bin/julia

(You should generally not run any significant computations on the login nodes, but it is a good way to check that everything works. It is also a good idea to create a symlink or alias to $DATA/julia-1.5.3-linux-x86_64/bin/julia.)
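
Such a symlink or alias could look like this (a minimal sketch; it assumes the paths used above and that $HOME/bin is on your PATH):

# option 1: symlink the binary into a folder on your PATH (assumes $HOME/bin is on your PATH)
mkdir -p $HOME/bin
ln -s $DATA/julia-1.5.3-linux-x86_64/bin/julia $HOME/bin/julia

# option 2: define an alias in your ~/.bashrc
echo 'alias julia="$DATA/julia-1.5.3-linux-x86_64/bin/julia"' >> ~/.bashrc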

Installing Julia packages

You’ll find your Julia home folder, where packages are stored, at $HOME/.julia/. Since the ARC login nodes are connected to the internet, you can simply install any packages using the normal workflow. I am assuming here that we want to set up some scripts in the project folder $DATA/myproject/:

1. Run Julia from the project folder
2. Type `]` to enter the package manager and create a new project environment with `pkg> activate .`
3. Install packages, e.g. `pkg> add Statistics` (see the sketch below)
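
Equivalently, these steps can be scripted via the Pkg API, which is convenient if you prefer a one-off setup script over the interactive pkg> prompt (a minimal sketch, run from within $DATA/myproject; Statistics is just an example package):

# run from within $DATA/myproject
import Pkg
Pkg.activate(".")       # create/activate the project environment in the current folder
Pkg.add("Statistics")   # example package; this records it in Project.toml and Manifest.toml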

Uploading your code

This is a matter of taste, but I prefer to be able to modify any of my code both locally and remotely. I therefore initialise $DATA/myproject as a Git repository. (sidenote: A quick overview to get started with Git can be found here. ) I can then edit a local copy and git push any of the local changes, or vice versa. The project dependencies for this project are tracked by Julia in the Project.toml and Manifest.toml files. Changes in both files are also tracked via Git to make sure the same dependencies are used locally and remotely.

This setup allows me to test or debug any changes to my code locally and be certain that it will run the same way on the remote node. Debugging the code on the remote node is more time-consuming because your compute jobs do not necessarily execute immediately. For example, it might take several minutes until you receive the error message telling you that you misspelled a function name.
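
A bare-bones version of this workflow might look as follows (a sketch; the remote URL is a placeholder for wherever you host the repository):

# on the local machine: initialise the repository and push it
cd myproject
git init
git add .
git commit -m "Initial commit"
git remote add origin git@example.com:USERNAME/myproject.git   # placeholder URL
git push -u origin master

# on the cluster: clone the repository into $DATA and pull later changes
cd $DATA
git clone git@example.com:USERNAME/myproject.git
cd myproject
git pull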

Sometimes I want to upload files from my machine without tracking them via git, e.g. large dataset files like dataset.csv. To transfer files I use scp. (sidenote: Syntax: scp [OPTION] [user@SRC_HOST:]file1 [user@DEST_HOST:]file2 ) To upload dataset.csv I simply type:

scp dataset.csv USERNAME@oscgate.arc.ox.ac.uk:/home/USERNAME/.

To transfer a result file results.csv back to the current folder on my local machine I can use the same command:

scp USERNAME@oscgate.arc.ox.ac.uk:/home/USERNAME/results.csv .
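
scp can also copy whole folders with the -r flag. For example, uploading a (hypothetical) data/ folder would look like this:

scp -r data USERNAME@oscgate.arc.ox.ac.uk:/home/USERNAME/.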

Scheduling jobs

Oxford ARC uses the SLURM workload manager to request and manage compute jobs. Assuming we are still in the folder $DATA/myproject the basic workflow to create a new job is:

  • write a Julia script myscript.jl to run your code
  • manage all Julia project dependencies using the environment files Project.toml and Manifest.toml
  • write a job submission script run_job.sh that requests compute resources and tells the compute nodes what to do

Let’s assume we want to run the following script myscript.jl to (inefficiently) compute and print the 20th Fibonacci number:

function fibonacci(n::Int64)
    if n <= 1
        return n
    else
        return fibonacci(n - 1) + fibonacci(n - 2)
    end
end

println("The 20-th fibonacci number is $(fibonacci(20)).")

To run this script on one of the compute nodes we define a submission script run_job.sh:

#!/bin/bash

#SBATCH --time=0:05:00
#SBATCH --job-name="Fibonacci_calculation"
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --partition=htc
#SBATCH --output="Fibonacci.out"
#SBATCH --error="Fibonacci.err.out"
#SBATCH --mail-type=ALL
#SBATCH --mail-user=YOUR_EMAIL_ADDRESS

$DATA/julia-1.5.3-linux-x86_64/bin/julia --project -e 'import Pkg; Pkg.instantiate();
include("myscript.jl")'

Lines starting with #SBATCH are SLURM commands. We request 5 minutes of computation time on one CPU. We further want to run the job on the Arcus-HTC partition and have status updates about our job sent to our email address.

The last line in the script calls Julia, instantiates the project environment, (sidenote: Suggested reading: Working with environments. ) i.e. installs any package dependencies defined in Project.toml, and runs our script.
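
If you prefer, the same step can also be run interactively on the login node before submitting the job, which makes missing or broken dependencies easier to spot (a sketch, run from $DATA/myproject):

# activate the project environment in the current folder and install its dependencies
import Pkg
Pkg.activate(".")
Pkg.instantiate()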

We then submit the job using the shell with

sbatch run_job.sh

Depending on the available resources, the job is then queued until it is executed. You can look up the current job status with

squeue -u USERNAME

To cancel a single job or all of your jobs, use one of the following:

scancel [JOBNUMBER]
scancel -u [USERNAME]

Job arrays

Sometimes we want to run the same job multiple times with only minor modifications, e.g. running our algorithm with one of the hyperparameters changed. For this case SLURM job arrays are quite useful. They execute the submission script multiple times and allow you to run different versions of the same script. The ability to run time-consuming jobs in parallel can be a big time saver. Let’s assume that we want to calculate the Fibonacci number for n = 20, 30, 40 in separate jobs.

To achieve this we add one line with an array command to our submission script run_job.sh:

#!/bin/bash

#SBATCH --time=0:05:00
#SBATCH --job-name="Fibonacci_calculation"
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --partition=htc
#SBATCH --output="Fibonacci_%a.out"
#SBATCH --error="Fibonacci_%a.err.out"
#SBATCH --mail-type=ALL
#SBATCH --mail-user=YOUR_EMAIL_ADDRESS
#SBATCH --array=1-3

$DATA/julia-1.5.3-linux-x86_64/bin/julia --project -e 'import Pkg; Pkg.instantiate();
include("myscript.jl")'

The job will now run three times. The array id is given by the environment variable SLURM_ARRAY_TASK_ID, which we can use inside our Julia script to select n:

# get the environment variable
task_id = Base.parse(Int, ENV["SLURM_ARRAY_TASK_ID"])
n_arr = [20; 30; 40]
n = n_arr[task_id]

function fibonacci(n::Int64)
    if n <= 1
        return n
    else
        return fibonacci(n - 1) + fibonacci(n - 2)
    end
end

println("The $(n)-th fibonacci number is $(fibonacci(n)).")

Multithreading

To use multithreading in our Julia script, we have to request multiple cores on a compute node. The following script my_multithreaded_script.jl computes the Fibonacci number for different n on multiple threads: (sidenote: This is not the best example for multithreading, because computing the Fibonacci number for the highest n computes the Fibonacci number for all lower n in the process. )

# check number of threads
println("Number of threads: $(Threads.nthreads())")

n_arr = [5; 10; 15; 20; 25; 30; 35; 40]

function fibonacci(n::Int64)
    if n <= 1
        return n
    else
        return fibonacci(n - 1) + fibonacci(n - 2)
    end
end

# use multiple threads to compute the fibonacci number for each n in n_arr
fib = zeros(length(n_arr))
Threads.@threads for k = 1:length(n_arr)
    fib[k] = fibonacci(n_arr[k])
end

To run this script we use the following modified SLURM submission script:

#!/bin/bash

#SBATCH --time=0:05:00
#SBATCH --job-name="Fibonacci_multithreaded_calculation"
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8
#SBATCH --partition=htc
#SBATCH --output="Fibonacci.out"
#SBATCH --error="Fibonacci.err.out"
#SBATCH --mail-type=ALL
#SBATCH --mail-user=YOUR_EMAIL_ADDRESS

export JULIA_NUM_THREADS=8
$DATA/julia-1.5.3-linux-x86_64/bin/julia --project -e 'import Pkg; Pkg.instantiate();
include("my_multithreaded_script.jl")'

The only changes are that we now request 8 cores on one node and export the environment variable JULIA_NUM_THREADS to start Julia with 8 threads.
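
As a variant, and depending on how your cluster is configured, a single multithreaded process can also be described to SLURM as one task with several CPUs via --cpus-per-task; SLURM then exposes the granted core count as SLURM_CPUS_PER_TASK, which avoids hard-coding the thread count twice. A sketch of the relevant lines (not the submission script used above):

#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8

# reuse the number of granted CPUs as the Julia thread count
export JULIA_NUM_THREADS=$SLURM_CPUS_PER_TASK
$DATA/julia-1.5.3-linux-x86_64/bin/julia --project -e 'import Pkg; Pkg.instantiate();
include("my_multithreaded_script.jl")'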

Ensuring consistent results

Let’s assume that you want to benchmark your algorithm, e.g. measure its execution time. In order to generate consistent results from multiple runs, you have to make sure that the Julia script is executed on the same hardware every time. If no hardware is specified, SLURM will simply run the job on the next available node. If the node has eight CPUs and your job only runs on four of them, then the performance of your job will depend on what other jobs run on the remaining four CPUs.

To ensure consistent results (sidenote: There will still be some variance in your time measurement. If more accuracy is required, I suggest running the script a number of times (using a job array) and averaging the result. ) you will have to specify the node hardware and request all the CPUs on a node. Information about the different nodes within the Oxford HTC partition can be found here.
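
If the documentation is not at hand, you can usually also query the available node types and their feature tags directly from the login node with sinfo (a sketch; the exact partition name and feature strings depend on the cluster):

# list the nodes in the htc partition with their CPU count and feature tags
sinfo -p htc -N -o "%N %c %f"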

Let’s assume that we want to run our job on a SandyBridge E5-2650 (2GHz) node. We can request exclusive access to a whole node by using SLURM constraints. Just add the following lines to the run_job.sh submission script:

#SBATCH --constraint='cpu_sku:E5-2650'
#SBATCH --exclusive
Comments / questions / mistakes? Please get in touch with me via email (public key).