The Slurm Workload Manager, formerly known as the Simple Linux Utility for Resource Management (SLURM), is a free and open-source job scheduler for Linux and Unix-like operating systems, used by many of the world’s supercomputers and computer clusters.

User instructions

Each workstation in the cluster can be used to run a single-processor calculation via a batch system provided by Slurm. Slurm works in a very similar fashion to the queueing systems on compute clusters such as Imperial’s HPC systems. From your workstation you can submit calculations to run on free workstations. Similarly, part of the power of your workstation can be used by others in the group.

To view the list of jobs queueing and running on the cluster type:

$ squeue

Each job has a job id which is a number assigned sequentially to submitted jobs. Only the header is given if nothing is currently running on the cluster.
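The listing can be narrowed down with squeue’s filtering options, for example by user or by job id (the username and job id below are placeholders):

```
$ squeue -u jbloggs
$ squeue -j 53
```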

The status of the workstations in the cluster can also be seen using

$ sinfo -N

for a brief listing. Or

$ scontrol show nodes

for a more detailed list.

Calculations can be submitted using

$ sbatch submit.sbatch

where submit.sbatch is a submit script containing the commands to execute on the remote workstation, along with details of the requested resources. The submit script does not need to be executable; indeed, I would advise against making it so, to reduce the possibility of your script accidentally being executed directly, which may have unintended consequences. Depending on how many other jobs are running, your calculation may or may not start immediately.
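On submission, sbatch reports the id assigned to the job; the --parsable option prints just the id, which is convenient in scripts (the job ids shown are illustrative):

```
$ sbatch submit.sbatch
Submitted batch job 53
$ sbatch --parsable submit.sbatch
54
```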

A job can be deleted using the ‘scancel’ command. For example, to delete the job ‘53’:

$ scancel 53
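scancel can also remove jobs in bulk, for example all jobs belonging to one user (replace the username with your own):

```
$ scancel -u jbloggs
```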

Documentation is available both as manpages and at the SchedMD website.

If you are familiar with Torque, SchedMD provides a page listing the equivalent Slurm commands and variables.

Submit scripts

The submit script contains the commands to be executed on the workstation. Note that by default the submit script does not specify which workstation it will run on, so it cannot rely upon files or directories existing in /workspace. Instead, the best strategy is to keep all the necessary files in your home or data directory, copy them to workspace, run the calculation there, and then copy the output back to your home directory. For Hess group users running on maxwell, the data directory can be used directly, as it is local to maxwell.

Job settings (e.g. time requested) can be set using lines beginning with #SBATCH. Please see the sbatch manpage for more details.

The default time limit is 1 hour; the maximum is 5 days. Note that running calculations may need to be interrupted at short notice for urgent security updates.
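The --time directive accepts several formats; a bare number is interpreted as minutes. For example (the values are illustrative):

```
#SBATCH --time 30           # 30 minutes
#SBATCH --time 02:00:00     # 2 hours (HH:MM:SS)
#SBATCH --time 2-00:00:00   # 2 days (D-HH:MM:SS)
```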

An annotated sample script which creates a temporary directory to run the calculation as described above is as follows:


#!/bin/bash
#SBATCH --job-name ExampleScript
#SBATCH --time 0-02:00:00
#SBATCH --mail-type END
#SBATCH --mail-user <your email address>

# Create a local directory to run in. (The exact path is an example; any
# directory under the local workspace will do.)
scratch=/workspace/$USER/$SLURM_JOB_ID
mkdir -p "$scratch"

# Copy input file (called in this case input_file) to the directory the job
# will run in. Slurm will start in the directory you submit your job from -
# so be sure this is in the home or data directory, as workspace isn't
# shared between nodes.
cp input_file "$scratch"
cd "$scratch"

# Run program (test_prog.x in this example).
echo "Executing in $scratch on $(hostname)"
~/bin/test_prog.x input_file > output

# Copy back to submit directory. This is wrapped in an if statement so the
# working copy is only cleaned up if it has been copied back.
if cp output "$SLURM_SUBMIT_DIR"; then
  # Clean up.
  cd "$SLURM_SUBMIT_DIR"
  rm -rf "$scratch"
else
  echo "Copy failed. Data retained in $scratch on $(hostname)"
fi

Further documentation is available in the Quick start section of the Slurm manual. Note that parts of it (in particular, requesting more than a single processor for the default ‘pc’ queue) do not apply to our system.

Queues / partitions

There are several partitions (queues are known as partitions in Slurm) available to users: pc (the default), a partition with your username if you have an assigned pc, and compute for the Hess group. The partitions have different restrictions and so are suitable for different kinds of jobs:

Partition  Runs on       Max cores per job  Max memory per job  Max walltime  Uses
pc         workstations  1                  4GB                 5 days        serial jobs
compute    maxwell       32                 128GB               5 days        serial, parallel and large memory jobs
compute    bloch1 & 2    24                 192GB               5 days        serial, parallel and large memory jobs
$user      your pc       ~8                 ~15GB               5 days        serial and small parallel jobs

The compute queue is only available to members of the Hess group.

The $user queue is to allow individual users to queue larger calculations that will run on their own pc.

To see the full properties of the partitions, run

$ sinfo -s

or

$ scontrol show partition

maxwell, bloch1 and bloch2 are medium-sized compute servers. maxwell possesses 4 8-core Xeon processors, 128GB RAM and 5TB storage, and so can do small MPI jobs; it also acts as a file server for /data. bloch1 and bloch2 are newer systems (Spring 2016), each with 2 12-core Xeon processors, 192GB RAM and 12TB storage. Note that bloch1 and bloch2 also have a large shared directory in /bloch_storage/share. The Hess group funded the purchase of these systems, so running jobs on them is currently restricted to the Hess group.

A partition can be selected by adding, for example

#SBATCH --partition jbloggs

or

#SBATCH --partition compute

to the submit script as appropriate. The pc queue is the default queue, so this is only needed in practice for running calculations on a compute server or your user queue. To request one of the bloch servers for a calculation on the compute queue, you can add

#SBATCH --partition compute
#SBATCH --constraint bloch

to your script. This will allow your calculation to run on any node flagged with the “bloch” tag, i.e. either bloch1 or bloch2. To select a particular node you can use instead, for example:

#SBATCH --partition compute
#SBATCH --nodelist bloch2

Most, but not all, workstations have AVX instructions. A job can be restricted to run only on machines with AVX instructions by doing:

#SBATCH --constraint avx

By default, a job sent to the compute queue will use 1 core and 2GB RAM per core. It is possible to change this using the resource options:

#SBATCH --ntasks 4
#SBATCH --mem-per-cpu 3G

to request, for example, 4 cores and 3GB RAM per core (12GB total). Note that it is not possible to run jobs which together require all of the available memory on the system concurrently, as some memory is required by the operating system.

Hyperthreading is currently enabled on maxwell and the bloch servers, so (in principle) twice as many threads can be used as cores requested. The performance benefits (if any) depend heavily upon the application.
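For threaded codes, one way to make use of the allocation is to set the thread count from Slurm's environment inside the submit script. A minimal sketch (the fallback handling is an assumption; adjust for your application):

```shell
# Size the OpenMP thread pool from the Slurm allocation. SLURM_NTASKS is set
# by Slurm inside a job; the fallback of 1 covers running the script by hand.
export OMP_NUM_THREADS=${SLURM_NTASKS:-1}
echo "OMP_NUM_THREADS=$OMP_NUM_THREADS"
```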

Another, more detailed, example job script for running in the compute queue on one of the bloch servers is as follows:

#!/bin/bash
#SBATCH --job-name ExampleScript
#SBATCH --partition compute
#SBATCH --time 0-08:00:00
#SBATCH --constraint bloch
#SBATCH --ntasks=12
#SBATCH --mail-type END
#SBATCH --mail-user <your email address>

# This script does the following:
#   Start from some home or data directory containing the workload in a
#     directory called "run", which contains the calculation script (called
#     calculate.sh below - the name is an example).
#   Copy the workload to a remote scratch location.
#   Execute "run/calculate.sh".
#   Copy everything back after completion.

# Give the directory you want your calculation to run in. (The path here is
# an example - use the shared scratch location available to you.)
# A subdirectory of this will be created later.
SCRATCHDIR=/bloch_storage/share/$USER

# We use the name of the directory the job was submitted from, along
# with the job id, to create a directory for this calculation.
RUNDIR=$SCRATCHDIR/$(basename "$SLURM_SUBMIT_DIR")-$SLURM_JOB_ID

# Set the directory with data that needs to be copied.
SOURCEDIR=$SLURM_SUBMIT_DIR/run
# The copy of it that the calculation will actually run in.
DESTDIR=$RUNDIR/run

# Make sure this directory and calculation script exist.
if [ -d "${SOURCEDIR}" -a -e "${SOURCEDIR}/calculate.sh" ]; then

  # Make the run directory (and suppress the error if it already exists).
  mkdir -p "$RUNDIR"
  # Copy our calculation directory to the subdirectory of the above.
  cp -r "$SOURCEDIR" "$RUNDIR"

  # Go to this directory and run our calculation.
  cd "$DESTDIR"
  . ./calculate.sh

  # Copy our calculation data back to a subdirectory of the directory the
  # job was submitted from. This is wrapped in an if statement so the
  # working copy is only cleaned up if it has been copied back.
  cd "$SLURM_SUBMIT_DIR"
  if cp -r "$DESTDIR" "$SLURM_SUBMIT_DIR/run-$SLURM_JOB_ID"; then
    # Clean up.
    rm -rf "$DESTDIR"
  else
    echo "Copy failed. Data retained in $DESTDIR on $(hostname)"
  fi
else
  echo 'No "run" directory or "run/calculate.sh" script!'
fi

In this case the script requests 12 cores on one of the bloch servers, copies the full subdirectory ./run to a shared bloch scratch directory, runs the calculation there, and copies everything back once the job has completed.


Slurm is released under the GNU General Public License (GPL) v2.

Admin notes

The necessary packages are installed from the Debian repositories via puppet. This is slurmctld on the servers, and slurm-client and slurmd on the nodes. This will automatically pull in the necessary dependencies such as munge.

First, to set up munge, which is used for authenticating nodes in the cluster, we generate a strong munge key on the server:

dd if=/dev/random bs=1 count=1024 > /etc/munge/munge.key

This is distributed to the nodes via puppet. It should only be readable by root, but to allow distribution I set it to 440 in the puppet module with the group as puppet, and this is changed to 400 on the node. Puppet is set to restart the munge service once the file is copied.
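Once the key is in place, munge authentication can be checked by generating a credential on one machine and decoding it on another (the hostname is a placeholder):

```
$ munge -n | unmunge                  # local check
$ munge -n | ssh chlorine01 unmunge   # cross-node check
```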

Then I set up the file /etc/slurm-llnl/slurm.conf as follows

# CMTH slurm.conf file
# See the slurm.conf man page for more information.
# Default workstation node config - note we err on the low side for memory,
# and default to having avx available
NodeName=DEFAULT TmpDisk=1024 Sockets=1 CoresPerSocket=4 ThreadsPerCore=2 RealMemory=15000 Feature=workstation,avx

# Default partition config
PartitionName=DEFAULT MaxTime=5-00:00:00 DefaultTime=0-01:00:00 State=UP PriorityTier=10

# Definitions of pcs and compute servers and their assignment to the
# different partitions are done in separate files.
Include pc.conf
Include compute.conf
# And this is a list of user dedicated queues, allowing people higher priority
# use of their own workstation, without cpu or memory limits below what is
# available.
Include user.conf

The man page for slurm.conf is very detailed. Note that several of the directories mentioned in the file need to be created or it will complain.
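For example, the spool and state directories typically need creating by hand and should be owned by the slurm user. The paths below are examples only; match them to the SlurmdSpoolDir and StateSaveLocation values in your slurm.conf:

```
$ mkdir -p /var/spool/slurmd /var/spool/slurmctld
$ chown slurm:slurm /var/spool/slurmd /var/spool/slurmctld
```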

The configurations for the different partitions were then contained in the other included files: pc.conf:

# NB if the node doesn't follow the defaults as set in the main slurm.conf,
# they should be specified here (we err on the low side for memory):
# Sockets=1 CoresPerSocket=4 ThreadsPerCore=2 RealMem=15000
# You can check with "slurmd -C" on a node
NodeName=tycpc31 Feature=workstation
NodeName=teller Feature=workstation
NodeName=chlorine[01-09] Feature=workstation

# Unfortunately the Partition node list needs to be defined in a single line.
# I suggest using a script to update it. NodeNames should be separated by a
# comma only (no space).
PartitionName=pc Default=YES PriorityTier=1 MaxNodes=1 MaxCPUsPerNode=1 MaxMemPerNode=4096 Nodes=\
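Such a script can be as simple as joining a list of hostnames with commas. A sketch (the list is inlined here for illustration, but would normally come from a file of node names, one per line):

```shell
#!/bin/bash
# Node names taken from the pc.conf example above, one per line.
node_list='tycpc31
teller
chlorine[01-09]'

# Join with commas only (no spaces), as required by the Nodes= field.
nodes=$(printf '%s\n' "$node_list" | paste -sd, -)
echo "Nodes=$nodes"
```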


and compute.conf:

NodeName=bloch[1-2] RealMemory=192000 Sockets=2 CoresPerSocket=12 ThreadsPerCore=2 Feature=compute_server,bloch
NodeName=maxwell RealMemory=128000 Sockets=4 CoresPerSocket=8 ThreadsPerCore=2 Feature=compute_server,maxwell

PartitionName=compute AllowGroups=hess DefMemPerCPU=2000 Nodes=maxwell,bloch[1-2]

And user.conf, which lists one partition per user and machine. Note that it’s easier to use AllowGroups rather than AllowAccounts here.

Note that since ReturnToService=0 is set in slurm.conf, workstations which users reboot do not automatically start accepting jobs once they come back up. I set it like this since Slurm can be a little over-eager in sending jobs to machines that are booting, making it possible for jobs to fail as they try to start before the NFS mounts are available.

Every few days you can check which machines are marked as “down” in sinfo. There may be issues with them if users are rebooting them frequently. You can mark a number of nodes as available again in the queue with e.g. scontrol update state=resume node=chlorine03,tycpc34. Nodes marked as “down*” are currently uncontactable and have likely crashed. To see when a machine became unresponsive you can check the output of e.g. scontrol show node chlorine03.

To manually mark a machine as unavailable in the queue, e.g. for maintenance, you can run scontrol update state=drain reason="Maintenance" node=chlorine03. This allows whatever jobs are running to finish, and no more will be sent until the node is resumed as described above.