.. index:: cluster; job submission
.. index:: slurm

.. _slurm:

slurm
=====

Description
-----------

The Slurm Workload Manager (formerly known as the Simple Linux Utility for
Resource Management, or SLURM) is a free and open-source job scheduler for
Linux and Unix-like kernels, used by many of the world's supercomputers and
computer clusters.

User instructions
-----------------

Each workstation in the cluster can be used to run a single-processor
calculation via a batch system provided by Slurm. Slurm works in a very
similar fashion to the queueing systems on compute clusters such as Imperial's
HPC systems. From your workstation you can submit calculations to run on free
workstations. Similarly, part of the power of your workstation can be used by
others in the group.

To view the list of jobs queueing and running on the cluster, type:

.. code-block:: bash

    $ squeue

Each job has a job id, which is a number assigned sequentially to submitted
jobs. Only the header is shown if nothing is currently running on the cluster.

The status of the workstations in the cluster can also be seen using

.. code-block:: bash

    $ sinfo -N

for a brief listing, or

.. code-block:: bash

    $ scontrol show nodes

for a more detailed list.

Calculations can be submitted using

.. code-block:: bash

    $ sbatch submit.sbatch

where submit.sbatch is a submit script containing the commands to execute on
the remote workstation, along with details of the requested resources. The
submit script does not need to be executable; indeed I would advise against
making it executable, to reduce the possibility of the script accidentally
being run directly, which may have unintended consequences. Depending on how
many other jobs are running, your calculation may or may not start
immediately.

A job can be deleted using the ``scancel`` command. For example, to delete the
job with id 53:

.. code-block:: bash

    $ scancel 53

Documentation is available both as manpages and at the
`SchedMD <https://www.schedmd.com/>`_ website. If you are familiar with
Torque, there is a page listing the equivalent Slurm commands and variables at
http://www.nersc.gov/users/computational-systems/cori/running-jobs/for-edison-users/torque-moab-vs-slurm-comparisons/

Submit scripts
^^^^^^^^^^^^^^

The submit script contains the commands to be executed on the workstation.
Note that by default the submit script does not specify which workstation it
will run on, so it cannot rely upon files or directories existing in
/workspace. Instead, the best strategy is to keep all the necessary files in
your home or data directory, copy them to workspace, run the calculation
there, and then copy the output back to your home directory. For Hess group
users running on maxwell, the data directory can be used directly as it is
local to maxwell.

Job settings (e.g. the time requested) can be set using lines beginning with
#SBATCH. Please see the sbatch manpage for more details. The default time
limit is 1 hour; the maximum is 5 days. Note that running calculations may
need to be interrupted at short notice for urgent security updates.

An annotated sample script, which creates a temporary directory to run the
calculation in as described above, is as follows:

.. code-block:: bash

    #!/bin/bash
    #SBATCH --job-name ExampleScript
    #SBATCH --time 0-02:00:00
    #SBATCH --mail-type END
    #SBATCH --mail-user your.email.address@imperial.ac.uk

    # Create a local directory to run in.
    scratch=/workspace/$USER/scratch/$SLURM_JOB_ID
    mkdir -p $scratch
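
    # (Optional check, not part of the original recipe: stop early if the
    # scratch directory could not be created, e.g. because /workspace is
    # missing or full on this node.)
    if [ ! -d "$scratch" ]; then
        echo "Failed to create $scratch on $(hostname)" >&2
        exit 1
    fi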

    # Copy the input file (called input_file in this example) to the
    # directory the job will run in. Slurm starts in the directory you
    # submitted the job from, so be sure this is in your home or data
    # directory, as workspace is not shared between nodes.
    cp input_file $scratch
    cd $scratch

    # Run the program (test_prog.x in this example).
    echo Executing in $scratch on $(hostname)
    ~/bin/test_prog.x input_file > output

    # Copy the output back to the submit directory. This is wrapped in an if
    # statement so the working copy is only cleaned up if it has been copied
    # back.
    if cp output $SLURM_SUBMIT_DIR
    then
        # Clean up.
        rm -rf $scratch
    else
        echo "Copy failed. Data retained in $scratch on $(hostname)"
    fi

Further documentation is available in the
`Quick start <https://slurm.schedmd.com/quickstart.html>`_ section of the
Slurm manual. Note that parts of it (in particular, requesting more than a
single processor in the default 'pc' queue) do not apply to our system.

queues / partitions
^^^^^^^^^^^^^^^^^^^

There are several partitions (queues are known as partitions in Slurm)
available to users: pc (the default), a partition with your username if you
have an assigned pc, and compute for the Hess group. The partitions have
different restrictions and so are suitable for different kinds of jobs:

========= ============ ================= ================== ============ ======================================
Partition Runs on      Max cores per job Max memory per job Max walltime Uses
========= ============ ================= ================== ============ ======================================
pc        workstations 1                 4GB                5 days       serial jobs
compute   maxwell      32                128GB              5 days       serial, parallel and large memory jobs
compute   bloch1 & 2   24                192GB              5 days       serial, parallel and large memory jobs
$user     your pc      ~8                ~15GB              5 days       serial and small parallel jobs
========= ============ ================= ================== ============ ======================================

The compute queue is only available to members of the Hess group. The $user
queue allows individual users to queue larger calculations that will run on
their own pc.

To see the full properties of the partitions, run

.. code-block:: bash

    $ sinfo -s

or

.. code-block:: bash

    $ scontrol show partition

maxwell, bloch1 and bloch2 are medium-sized compute servers. maxwell has 4
8-core Xeon processors, 128GB RAM and 5TB storage, so it can run small MPI
jobs; it also acts as the file server for /data. bloch1 and bloch2 are newer
systems (Spring 2016), each with 2 12-core Xeon processors, 192GB RAM and 12TB
storage. Note that bloch1 and bloch2 also have a large shared directory in
/bloch_storage/share. The Hess group funded the purchase of these systems, so
running jobs on them is currently restricted to the Hess group.

A partition can be selected by adding, for example,

.. code-block:: bash

    #SBATCH --partition jbloggs

or

.. code-block:: bash

    #SBATCH --partition compute

to the submit script as appropriate. The pc queue is the default, so in
practice this is only needed for running calculations on a compute server or
on your user queue.

To request one of the bloch servers for a calculation on the compute queue,
you can add

.. code-block:: bash

    #SBATCH --partition compute
    #SBATCH --constraint bloch

to your script. This allows your calculation to run on any node flagged with
the "bloch" feature, i.e. either bloch1 or bloch2. To select a particular node
instead, use, for example:

.. code-block:: bash

    #SBATCH --partition compute
    #SBATCH --nodelist bloch2
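
To check which features, CPU counts and memory sizes each node actually
advertises before choosing a constraint or node, one option is ``sinfo`` with
a custom format string (these are standard sinfo format codes; adjust to
taste):

.. code-block:: bash

    # Node name, partition, CPUs, memory (MB) and features for each node.
    $ sinfo -N -o "%N %P %c %m %f"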
Most, but not all, workstations have AVX instructions. A job can be restricted
to run only on machines with AVX instructions using:

.. code-block:: bash

    #SBATCH --constraint avx

By default, a job sent to the compute queue will use 1 core and 2GB RAM per
core. This can be changed using the resource options

.. code-block:: bash

    #SBATCH --ntasks 4
    #SBATCH --mem-per-cpu 3G

to request, for example, 4 cores and 3GB RAM per core (12GB in total). Note
that it is not possible to run jobs which together require all of the memory
available on a system, as some memory is required by the operating system.

Hyperthreading is currently enabled on maxwell and bloch, so (in principle)
twice as many threads can be used as cores requested. The performance benefits
(if any) depend heavily upon the application.

Another, more detailed, example job script for running in the compute queue on
one of the bloch servers is as follows:

.. code-block:: bash

    #!/bin/bash
    #SBATCH --job-name ExampleScript
    #SBATCH --partition compute
    #SBATCH --time 0-08:00:00
    #SBATCH --constraint bloch
    #SBATCH --ntasks=12
    #SBATCH --mail-type END
    #SBATCH --mail-user your.email.address@imperial.ac.uk

    # This script does the following:
    # - starts from a home or data directory containing the workload in a
    #   directory called "run",
    # - copies the workload to a scratch location on the remote node,
    # - executes "run/run.sh",
    # - copies everything back after completion.

    # The directory you want your calculation to run in. A subdirectory of
    # this will be created later.
    RUNDIR="/workspace/${USER}/runs"

    # Use the name of the directory the job was submitted from, along with
    # the job id, to create a directory for this calculation.
    NAME=$(basename $SLURM_SUBMIT_DIR)
    DESTDIR=${RUNDIR}/${SLURM_JOB_ID}_${NAME}

    # The directory with the data that needs to be copied.
    SOURCEDIR=${SLURM_SUBMIT_DIR}/run

    # Make sure this directory and the calculation script exist.
    if [ -d "${SOURCEDIR}" -a -e "${SOURCEDIR}/run.sh" ]; then

        # Make the run directory (and suppress the error if it already
        # exists).
        mkdir -p $RUNDIR

        # Copy our calculation directory to the subdirectory of the above.
        cp -r $SOURCEDIR $DESTDIR

        # Go to this directory and run our calculation.
        cd $DESTDIR
        . ./run.sh

        # Copy our calculation data back to a subdirectory of the directory
        # the job was submitted from. This is wrapped in an if statement so
        # the working copy is only cleaned up if it has been copied back.
        if cp -r $DESTDIR ${SLURM_SUBMIT_DIR}/${SLURM_JOB_ID}
        then
            # Clean up.
            rm -rf $DESTDIR
        else
            echo "Copy failed. Data retained in $DESTDIR on $(hostname)"
        fi

    else
        echo 'No "run" directory or "run/run.sh" script!'
    fi

In this case the script requests 12 cores on one of the bloch servers, copies
the full subdirectory ./run to a scratch directory on that server, and copies
it back once the job has completed.

Source
------

https://www.schedmd.com/downloads.php and https://github.com/SchedMD/slurm

License
-------

GNU General Public License (GPL) V2

Admin notes
-----------

The necessary packages are installed from the Debian repositories via puppet:
slurmctld on the servers, and slurm-client and slurmd on the nodes. This
automatically pulls in the necessary dependencies such as munge.

First, to set up munge, which is used for authenticating nodes in the cluster,
we generate a strong munge key on the server:

.. code-block:: bash

    dd if=/dev/random bs=1 count=1024 > /etc/munge/munge.key

This is distributed to the nodes via puppet. It should only be readable by
root, but to allow distribution I set it to 440 in the puppet module with the
group as puppet, and this is changed to 400 on the node. Puppet is set to
restart the munge service once the file is copied.
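
Once the key is in place, a quick sanity check (not part of the original setup
notes) is to encode a credential on the server and decode it on a node; the
node name here is just one of the workstations from the partition files below:

.. code-block:: bash

    # Local round trip on the server.
    munge -n | unmunge

    # Encode on the server and decode on a node; this should succeed if the
    # keys match and the clocks are in sync.
    munge -n | ssh chlorine01 unmunge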
Then I set up the file /etc/slurm-llnl/slurm.conf as follows:

.. code-block:: bash

    #
    # CMTH slurm.conf file
    #
    # See the slurm.conf man page for more information.
    #
    ClusterName=CMTH
    ControlMachine=thomson
    #
    SlurmUser=slurm
    SlurmctldPort=7002
    SlurmdPort=7003
    AuthType=auth/munge
    StateSaveLocation=/var/spool/slurm/state
    SlurmdSpoolDir=/var/spool/slurm/slurmd
    SwitchType=switch/none
    MpiDefault=none
    SlurmctldPidFile=/var/run/slurm-llnl/slurmctld.pid
    SlurmdPidFile=/var/run/slurm-llnl/slurmd.pid
    ProctrackType=proctrack/linuxproc
    CacheGroups=0
    ReturnToService=0
    #
    # TIMERS
    SlurmctldTimeout=300
    SlurmdTimeout=300
    InactiveLimit=0
    MinJobAge=300
    KillWait=30
    Waittime=0
    #
    # SCHEDULING
    SchedulerType=sched/backfill
    SelectType=select/linear
    FastSchedule=1
    #
    # LOGGING
    SlurmctldDebug=info
    SlurmdDebug=info
    JobCompType=jobcomp/filetxt
    JobCompLoc=/var/log/slurm-llnl/jobcompletion
    #
    # ACCOUNTING
    JobAcctGatherType=jobacct_gather/linux
    JobAcctGatherFrequency=30
    AccountingStorageType=accounting_storage/filetxt
    AccountingStorageLoc=/var/log/slurm-llnl/accounting
    #
    # Default workstation node config - note we err on the low side for memory,
    # and default to having avx available
    NodeName=DEFAULT TmpDisk=1024 Sockets=1 CoresPerSocket=4 ThreadsPerCore=2 RealMemory=15000 Feature=workstation,avx
    # Default partition config
    PartitionName=DEFAULT MaxTime=5-00:00:00 DefaultTime=0-01:00:00 State=UP PriorityTier=10
    # Definitions of pcs and compute servers and their assignment to the
    # different partitions are done in separate files.
    Include pc.conf
    Include compute.conf
    # And this is a list of user dedicated queues, allowing people higher
    # priority use of their own workstation, without cpu or memory limits
    # below what is available.
    Include user.conf

The man page for slurm.conf is very detailed. Note that several of the
directories mentioned in the file need to be created or it will complain.

The configurations for the different partitions were then contained in the
other included files.

pc.conf:

.. code-block:: bash

    # NB if the node doesn't follow the defaults as set in the main slurm.conf,
    # they should be specified here (we err on the low side for memory):
    # Sockets=1 CoresPerSocket=4 ThreadsPerCore=2 RealMem=15000
    # You can check with "slurmd -C" on a node
    NodeName=tycpc31 Feature=workstation
    NodeName=teller Feature=workstation
    NodeName=chlorine[01-09] Feature=workstation
    # Unfortunately the Partition node list needs to be defined in a single line.
    # I suggest using a script to update it. NodeNames should be separated by a
    # comma only (no space).
    PartitionName=pc Default=YES PriorityTier=1 MaxNodes=1 MaxCPUsPerNode=1 MaxMemPerNode=4096 Nodes=\
    chlorine[01-09],tycpc31,teller

compute.conf:

.. code-block:: bash

    NodeName=bloch[1-2] RealMemory=192000 Sockets=2 CoresPerSocket=12 ThreadsPerCore=2 Feature=compute_server,bloch
    NodeName=maxwell RealMemory=128000 Sockets=4 CoresPerSocket=8 ThreadsPerCore=2 Feature=compute_server,maxwell
    PartitionName=compute AllowGroups=hess DefMemPerCPU=2000 Nodes=maxwell,bloch[1-2]

And user.conf, which lists one partition per user and machine. Note it's
easier to use AllowGroups rather than AllowAccounts here.

Note that, since ``ReturnToService=0`` is set above, workstations which users
reboot do not automatically start accepting jobs once they come back up. I set
it like this since slurm can be a little over eager in sending jobs to
machines that are booting, making it possible for jobs to fail as they try to
start before the NFS mounts are available.
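
As a concrete illustration of the user.conf entries mentioned above, a
hypothetical stanza for the user jbloggs might look like the following (the
username, group and workstation name are illustrative, reusing names from the
examples above, and it assumes each user has a group matching their username):

.. code-block:: bash

    # Hypothetical entry: give jbloggs higher-priority use of the whole of
    # tycpc31 (the node itself is defined in pc.conf), with no extra CPU or
    # memory caps.
    PartitionName=jbloggs AllowGroups=jbloggs PriorityTier=10 MaxNodes=1 Nodes=tycpc31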
Every few days you can check which machines are marked as "down" in ``sinfo``.
There may be issues with them if users are rebooting them frequently. You can
mark a number of nodes as available again in the queue with e.g.
``scontrol update state=resume node=chlorine03,tycpc34``. Nodes marked as
"down*" are currently uncontactable and have likely crashed. To see when a
machine became unresponsive, check the output of e.g.
``scontrol show node chlorine03``.

To manually mark a machine as unavailable in the queue, e.g. for maintenance,
run ``scontrol update state=drain reason="Maintenance" node=chlorine03``. This
allows whatever jobs are currently running to finish and sends no more jobs to
the node until it is resumed as described above.
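
For quick reference, those node-management commands can be collected into a
short crib; ``sinfo -R``, which lists down or drained nodes together with the
recorded reason, is the only command not already mentioned above, and the node
name is the illustrative one used earlier:

.. code-block:: bash

    # List down, drained or failing nodes with the recorded reason.
    sinfo -R

    # Return a node to service once it is healthy again.
    scontrol update state=resume node=chlorine03

    # Take a node out of service for maintenance; running jobs finish but no
    # new jobs are sent until the node is resumed.
    scontrol update state=drain reason="Maintenance" node=chlorine03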