.. index:: cluster; job submission
.. index:: slurm

.. _slurm:

slurm
=====

Description
-----------

The Slurm Workload Manager (formerly known as the Simple Linux Utility for
Resource Management, or SLURM) is a free and open-source job scheduler for
Linux and Unix-like kernels, used by many of the world's supercomputers and
computer clusters.

User instructions
-----------------

Each workstation in the cluster can be used to run a single-processor
calculation via a batch system provided by Slurm. Slurm works in a very
similar fashion to the queueing systems on compute clusters such as Imperial's
HPC systems. From your workstation you can submit calculations to run on free
workstations. Similarly, part of the power of your workstation can be used by
others in the group.

To view the list of jobs queueing and running on the cluster, type:

.. code-block:: bash

    $ squeue

Each job has a job id, which is a number assigned sequentially to submitted
jobs. Only the header is shown if nothing is currently running on the cluster.

The status of the workstations in the cluster can also be seen using

.. code-block:: bash

    $ sinfo -N

for a brief listing, or

.. code-block:: bash

    $ scontrol show nodes

for a more detailed list.

Calculations can be submitted using

.. code-block:: bash

    $ sbatch submit.sbatch

where submit.sbatch is a submit script containing the commands to execute on
the remote workstation, along with details of the requested resources. The
submit script does not need to be executable; indeed I would advise against
making it executable, to reduce the possibility of the script accidentally
being run directly, which may have unintended consequences. Depending on how
many other jobs are running, your calculation may or may not start
immediately.

A job can be deleted using the ``scancel`` command. For example, to delete the
job with id 53:

.. code-block:: bash

    $ scancel 53

Documentation is available both as manpages and at the
`SchedMD <https://www.schedmd.com/>`_ website. If you are familiar with
Torque, there is a page listing the equivalent Slurm commands and variables at
http://www.nersc.gov/users/computational-systems/cori/running-jobs/for-edison-users/torque-moab-vs-slurm-comparisons/

Submit scripts
^^^^^^^^^^^^^^

The submit script contains the commands to be executed on the workstation.
Note that by default the submit script does not specify which workstation it
will run on, so it cannot rely upon files or directories existing in
/workspace. Instead, the best strategy is to keep all the necessary files in
your home or data directory, copy them to workspace, run the calculation
there, and then copy the output back to your home directory. For Hess group
users running on maxwell, the data directory can be used directly as it is
local to maxwell.

Job settings (e.g. the time requested) can be set using lines beginning with
#SBATCH. Please see the sbatch manpage for more details. The default time
limit is 1 hour; the maximum is 5 days. Note that running calculations may
need to be interrupted at short notice for urgent security updates.

An annotated sample script, which creates a temporary directory to run the
calculation in as described above, is as follows:

.. code-block:: bash

    #!/bin/bash
    #SBATCH --job-name ExampleScript
    #SBATCH --time 0-02:00:00
    #SBATCH --mail-type END
    #SBATCH --mail-user your.email.address@imperial.ac.uk

    # Create a local directory to run in.
    scratch=/workspace/$USER/scratch/$SLURM_JOB_ID
    mkdir -p $scratch
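
    # (Optional check, not part of the original recipe: stop early if the
    # scratch directory could not be created, e.g. because /workspace is
    # missing or full on this node.)
    if [ ! -d "$scratch" ]; then
        echo "Failed to create $scratch on $(hostname)" >&2
        exit 1
    fi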

    # Copy the input file (called input_file in this example) to the
    # directory the job will run in. Slurm starts in the directory you
    # submitted the job from, so be sure this is in your home or data
    # directory, as workspace is not shared between nodes.
    cp input_file $scratch
    cd $scratch

    # Run the program (test_prog.x in this example).
    echo Executing in $scratch on $(hostname)
    ~/bin/test_prog.x input_file > output

    # Copy the output back to the submit directory. This is wrapped in an if
    # statement so the working copy is only cleaned up if it has been copied
    # back.
    if cp output $SLURM_SUBMIT_DIR
    then
        # Clean up.
        rm -rf $scratch
    else
        echo "Copy failed. Data retained in $scratch on $(hostname)"
    fi

Further documentation is available in the
`Quick start <https://slurm.schedmd.com/quickstart.html>`_ section of the
Slurm manual. Note that parts of it (in particular, requesting more than a
single processor in the default 'pc' queue) do not apply to our system.

queues / partitions
^^^^^^^^^^^^^^^^^^^

There are several partitions (queues are known as partitions in Slurm)
available to users: pc (the default), a partition with your username if you
have an assigned pc, and compute for the Hess group. The partitions have
different restrictions and so are suitable for different kinds of jobs:

========= ============ ================= ================== ============ ======================================
Partition Runs on      Max cores per job Max memory per job Max walltime Uses
========= ============ ================= ================== ============ ======================================
pc        workstations 1                 4GB                5 days       serial jobs
compute   maxwell      32                128GB              5 days       serial, parallel and large memory jobs
compute   bloch1 & 2   24                192GB              5 days       serial, parallel and large memory jobs
$user     your pc      ~8                ~15GB              5 days       serial and small parallel jobs
========= ============ ================= ================== ============ ======================================

The compute queue is only available to members of the Hess group. The $user
queue allows individual users to queue larger calculations that will run on
their own pc.

To see the full properties of the partitions, run

.. code-block:: bash

    $ sinfo -s

or

.. code-block:: bash

    $ scontrol show partition

maxwell, bloch1 and bloch2 are medium-sized compute servers. maxwell has 4
8-core Xeon processors, 128GB RAM and 5TB storage, so it can run small MPI
jobs; it also acts as the file server for /data. bloch1 and bloch2 are newer
systems (Spring 2016), each with 2 12-core Xeon processors, 192GB RAM and 12TB
storage. Note that bloch1 and bloch2 also have a large shared directory in
/bloch_storage/share. The Hess group funded the purchase of these systems, so
running jobs on them is currently restricted to the Hess group.

A partition can be selected by adding, for example,

.. code-block:: bash

    #SBATCH --partition jbloggs

or

.. code-block:: bash

    #SBATCH --partition compute

to the submit script as appropriate. The pc queue is the default, so in
practice this is only needed for running calculations on a compute server or
on your user queue.

To request one of the bloch servers for a calculation on the compute queue,
you can add

.. code-block:: bash

    #SBATCH --partition compute
    #SBATCH --constraint bloch

to your script. This allows your calculation to run on any node flagged with
the "bloch" feature, i.e. either bloch1 or bloch2. To select a particular node
instead, use, for example:

.. code-block:: bash

    #SBATCH --partition compute
    #SBATCH --nodelist bloch2
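
To check which features, CPU counts and memory sizes each node actually
advertises before choosing a constraint or node, one option is ``sinfo`` with
a custom format string (these are standard sinfo format codes; adjust to
taste):

.. code-block:: bash

    # Node name, partition, CPUs, memory (MB) and features for each node.
    $ sinfo -N -o "%N %P %c %m %f"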
Most, but not all, workstations have AVX instructions. A job can be restricted
to run only on machines with AVX instructions using:

.. code-block:: bash

    #SBATCH --constraint avx

By default, a job sent to the compute queue will use 1 core and 2GB RAM per
core. This can be changed using the resource options

.. code-block:: bash

    #SBATCH --ntasks 4
    #SBATCH --mem-per-cpu 3G

to request, for example, 4 cores and 3GB RAM per core (12GB in total). Note
that it is not possible to run jobs which together require all of the memory
available on a system, as some memory is required by the operating system.

Hyperthreading is currently enabled on maxwell and bloch, so (in principle)
twice as many threads can be used as cores requested. The performance benefits
(if any) depend heavily upon the application.

Another, more detailed, example job script for running in the compute queue on
one of the bloch servers is as follows:

.. code-block:: bash

    #!/bin/bash
    #SBATCH --job-name ExampleScript
    #SBATCH --partition compute
    #SBATCH --time 0-08:00:00
    #SBATCH --constraint bloch
    #SBATCH --ntasks=12
    #SBATCH --mail-type END
    #SBATCH --mail-user your.email.address@imperial.ac.uk

    # This script does the following:
    # - starts from a home or data directory containing the workload in a
    #   directory called "run",
    # - copies the workload to a scratch location on the remote node,
    # - executes "run/run.sh",
    # - copies everything back after completion.

    # The directory you want your calculation to run in. A subdirectory of
    # this will be created later.
    RUNDIR="/workspace/${USER}/runs"

    # Use the name of the directory the job was submitted from, along with
    # the job id, to create a directory for this calculation.
    NAME=$(basename $SLURM_SUBMIT_DIR)
    DESTDIR=${RUNDIR}/${SLURM_JOB_ID}_${NAME}

    # The directory with the data that needs to be copied.
    SOURCEDIR=${SLURM_SUBMIT_DIR}/run

    # Make sure this directory and the calculation script exist.
    if [ -d "${SOURCEDIR}" -a -e "${SOURCEDIR}/run.sh" ]; then

        # Make the run directory (and suppress the error if it already
        # exists).
        mkdir -p $RUNDIR

        # Copy our calculation directory to the subdirectory of the above.
        cp -r $SOURCEDIR $DESTDIR

        # Go to this directory and run our calculation.
        cd $DESTDIR
        . ./run.sh

        # Copy our calculation data back to a subdirectory of the directory
        # the job was submitted from. This is wrapped in an if statement so
        # the working copy is only cleaned up if it has been copied back.
        if cp -r $DESTDIR ${SLURM_SUBMIT_DIR}/${SLURM_JOB_ID}
        then
            # Clean up.
            rm -rf $DESTDIR
        else
            echo "Copy failed. Data retained in $DESTDIR on $(hostname)"
        fi

    else
        echo 'No "run" directory or "run/run.sh" script!'
    fi

In this case the script requests 12 cores on one of the bloch servers, copies
the full subdirectory ./run to a scratch directory on that server, and copies
it back once the job has completed.

Source
------

https://www.schedmd.com/downloads.php and https://github.com/SchedMD/slurm

License
-------

GNU General Public License (GPL) V2

Admin notes
-----------

The necessary packages are installed from the Debian repositories via puppet:
slurmctld on the servers, and slurm-client and slurmd on the nodes. This
automatically pulls in the necessary dependencies such as munge.

First, to set up munge, which is used for authenticating nodes in the cluster,
we generate a strong munge key on the server:

.. code-block:: bash

    dd if=/dev/random bs=1 count=1024 > /etc/munge/munge.key

This is distributed to the nodes via puppet. It should only be readable by
root, but to allow distribution I set it to 440 in the puppet module with the
group as puppet, and this is changed to 400 on the node. Puppet is set to
restart the munge service once the file is copied.
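
Once the key is in place, a quick sanity check (not part of the original setup
notes) is to encode a credential on the server and decode it on a node; the
node name here is just one of the workstations from the partition files below:

.. code-block:: bash

    # Local round trip on the server.
    munge -n | unmunge

    # Encode on the server and decode on a node; this should succeed if the
    # keys match and the clocks are in sync.
    munge -n | ssh chlorine01 unmunge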
Then I set up the file /etc/slurm-llnl/slurm.conf as follows:

.. code-block:: bash

    #
    # CMTH slurm.conf file
    #
    # See the slurm.conf man page for more information.
    #
    ClusterName=CMTH
    ControlMachine=thomson
    #
    SlurmUser=slurm
    SlurmctldPort=7002
    SlurmdPort=7003
    AuthType=auth/munge
    StateSaveLocation=/var/spool/slurm/state
    SlurmdSpoolDir=/var/spool/slurm/slurmd
    SwitchType=switch/none
    MpiDefault=none
    SlurmctldPidFile=/var/run/slurm-llnl/slurmctld.pid
    SlurmdPidFile=/var/run/slurm-llnl/slurmd.pid
    ProctrackType=proctrack/linuxproc
    CacheGroups=0
    ReturnToService=0
    #
    # TIMERS
    SlurmctldTimeout=300
    SlurmdTimeout=300
    InactiveLimit=0
    MinJobAge=300
    KillWait=30
    Waittime=0
    #
    # SCHEDULING
    SchedulerType=sched/backfill
    SelectType=select/linear
    FastSchedule=1
    #
    # LOGGING
    SlurmctldDebug=info
    SlurmdDebug=info
    JobCompType=jobcomp/filetxt
    JobCompLoc=/var/log/slurm-llnl/jobcompletion
    #
    # ACCOUNTING
    JobAcctGatherType=jobacct_gather/linux
    JobAcctGatherFrequency=30
    AccountingStorageType=accounting_storage/filetxt
    AccountingStorageLoc=/var/log/slurm-llnl/accounting
    #
    # Default workstation node config - note we err on the low side for memory,
    # and default to having avx available
    NodeName=DEFAULT TmpDisk=1024 Sockets=1 CoresPerSocket=4 ThreadsPerCore=2 RealMemory=15000 Feature=workstation,avx
    # Default partition config
    PartitionName=DEFAULT MaxTime=5-00:00:00 DefaultTime=0-01:00:00 State=UP PriorityTier=10
    # Definitions of pcs and compute servers and their assignment to the
    # different partitions are done in separate files.
    Include pc.conf
    Include compute.conf
    # And this is a list of user dedicated queues, allowing people higher
    # priority use of their own workstation, without cpu or memory limits
    # below what is available.
    Include user.conf

The man page for slurm.conf is very detailed. Note that several of the
directories mentioned in the file need to be created or it will complain.

The configurations for the different partitions were then contained in the
other included files.

pc.conf:

.. code-block:: bash

    # NB if the node doesn't follow the defaults as set in the main slurm.conf,
    # they should be specified here (we err on the low side for memory):
    # Sockets=1 CoresPerSocket=4 ThreadsPerCore=2 RealMem=15000
    # You can check with "slurmd -C" on a node
    NodeName=tycpc31 Feature=workstation
    NodeName=teller Feature=workstation
    NodeName=chlorine[01-09] Feature=workstation
    # Unfortunately the Partition node list needs to be defined in a single line.
    # I suggest using a script to update it. NodeNames should be separated by a
    # comma only (no space).
    PartitionName=pc Default=YES PriorityTier=1 MaxNodes=1 MaxCPUsPerNode=1 MaxMemPerNode=4096 Nodes=\
    chlorine[01-09],tycpc31,teller

compute.conf:

.. code-block:: bash

    NodeName=bloch[1-2] RealMemory=192000 Sockets=2 CoresPerSocket=12 ThreadsPerCore=2 Feature=compute_server,bloch
    NodeName=maxwell RealMemory=128000 Sockets=4 CoresPerSocket=8 ThreadsPerCore=2 Feature=compute_server,maxwell
    PartitionName=compute AllowGroups=hess DefMemPerCPU=2000 Nodes=maxwell,bloch[1-2]

And user.conf, which lists one partition per user and machine. Note it's
easier to use AllowGroups rather than AllowAccounts here.

Note that, since ``ReturnToService=0`` is set above, workstations which users
reboot do not automatically start accepting jobs once they come back up. I set
it like this since slurm can be a little over eager in sending jobs to
machines that are booting, making it possible for jobs to fail as they try to
start before the NFS mounts are available.
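
As a concrete illustration of the user.conf entries mentioned above, a
hypothetical stanza for the user jbloggs might look like the following (the
username, group and workstation name are illustrative, reusing names from the
examples above, and it assumes each user has a group matching their username):

.. code-block:: bash

    # Hypothetical entry: give jbloggs higher-priority use of the whole of
    # tycpc31 (the node itself is defined in pc.conf), with no extra CPU or
    # memory caps.
    PartitionName=jbloggs AllowGroups=jbloggs PriorityTier=10 MaxNodes=1 Nodes=tycpc31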
Every few days you can check which machines are marked as "down" in ``sinfo``.
There may be issues with them if users are rebooting them frequently. You can
mark a number of nodes as available again in the queue with e.g.
``scontrol update state=resume node=chlorine03,tycpc34``. Nodes marked as
"down*" are currently uncontactable and have likely crashed. To see when a
machine became unresponsive, check the output of e.g.
``scontrol show node chlorine03``.

To manually mark a machine as unavailable in the queue, e.g. for maintenance,
run ``scontrol update state=drain reason="Maintenance" node=chlorine03``. This
allows whatever jobs are currently running to finish and sends no more jobs to
the node until it is resumed as described above.
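
For quick reference, those node-management commands can be collected into a
short crib; ``sinfo -R``, which lists down or drained nodes together with the
recorded reason, is the only command not already mentioned above, and the node
name is the illustrative one used earlier:

.. code-block:: bash

    # List down, drained or failing nodes with the recorded reason.
    sinfo -R

    # Return a node to service once it is healthy again.
    scontrol update state=resume node=chlorine03

    # Take a node out of service for maintenance; running jobs finish but no
    # new jobs are sent until the node is resumed.
    scontrol update state=drain reason="Maintenance" node=chlorine03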