Using the SSCC Slurm Cluster
Slurm (Simple Linux Utility for Resource Management) is a powerful system for managing and scheduling computing jobs that is very popular at large-scale research computing centers. The SSCC is currently running a pilot with a small Slurm cluster as we all learn more about it, but we anticipate that it will become the main way SSCC researchers run large research computing jobs. This article will show you how to use the SSCC's Slurm cluster.
Slurm Quick Start
To submit a job to Slurm, the basic syntax is:
ssubmit --cores=C --mem=Mg "command"
where C is the number of cores or threads to use, M is the amount of memory to use in gigabytes, and command is the command you'd normally use to run the job directly on the server. Note that the job must be able to run in the background, but do not put an ampersand at the end of the command. For example, to run a Stata do file called my_do_file.do using 32 cores and 20GB of memory, use:
ssubmit --cores=32 --mem=20g "stata -b do my_do_file"
You can submit multi-node jobs to Slurm by specifying --nodes, in which case --cores and --mem specify the cores and memory to use on each node. You can submit to partitions other than the default with --partition (ssubmit also accepts an abbreviated form). You can check the status of your job and see what else is in the Slurm queue with squeue. You can cancel a job with scancel plus the job identifier, which you can get from squeue.
Slurm is a job management and scheduling utility for clusters of Linux servers. You submit jobs to it like HTCondor, but Slurm makes it easier to specify the resources you need, can run multi-server jobs, and makes it easier to set priorities.
The SSCC is currently running a small pilot Slurm cluster containing five servers. The goals of the pilot are:
- Ensure that most or all of the Linux research computing SSCC members do can be run through Slurm, and figure out how
- Help SSCC members learn to use Slurm
- Observe how SSCC members use Slurm and craft policies that successfully balance giving everyone access to computing resources and allowing heavy users to do resource-intensive work
All of these will be a work in progress during the pilot period, so expect some problems to arise.
Keep in mind that SSCC staff have limited expertise in some of the software used by SSCC researchers, including Matlab, Julia, FORTRAN, and software used for biomedical research. Slurm is heavily used for this kind of computing and we're confident solutions exist for most or all of the problems you may encounter, but we may not be able to help you find them. Send your questions to the Help Desk and we'll help as much as we can. Also send us your solutions, and we'll share them with others.
The Slurm Cluster
The pilot cluster currently consists of five servers (slurm001-slurm005) with 44 cores and 384GB of memory each. These servers were originally purchased by SMPH, so SMPH researchers have priority on them. We anticipate adding more servers to the cluster as the pilot continues.
When you submit a job to Slurm, you submit it to a partition. Partitions can have different priorities, time limits, and other properties. While a job is only submitted to one partition, servers can belong to (take jobs from) multiple partitions. The SSCC's pilot Slurm cluster currently has the following partitions:
|Partition Name||Servers||Max Job Length||Notes|
|sscc||slurm001-slurm003||7 Days||Default partition|
|sscc-short||slurm001-slurm005||6 Hours||slurm005 is reserved for short jobs|
|smph||slurm001-slurm003||7 Days||Usable by SMPH researchers, preempts jobs in normal priority partitions|
|smph-short||slurm001-slurm005||6 Hours||Usable by SMPH researchers, preempts jobs in normal priority partitions|
If you do not specify a partition, your job will go to sscc. To specify a different partition, use --partition (ssubmit also accepts an abbreviated form).
Jobs submitted to high priority partitions (smph and smph-short) will preempt jobs submitted to normal priority partitions (sscc and sscc-short) if they cannot run otherwise. We hope this will be rare. Preempted jobs are put back in the queue to run again, but lose any unsaved progress. SMPH researchers have priority on all the servers in the pilot cluster because they were purchased by SMPH, but we anticipate moving servers purchased by the SSCC into the cluster in the near future.
Jobs submitted to partitions with equal priority will not preempt each other. In choosing which job to run next, Slurm first considers the amount of Slurm computing time each user has used recently (past use decays exponentially with a half life of one week), with users with less past usage getting higher priority. It then considers how long each job has been waiting in the queue, with jobs that have been waiting longer getting higher priority.
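To make the one-week half life concrete, here is a minimal sketch of how much weight past usage would carry after a given number of days. The function name usage_weight is hypothetical (it is not a Slurm command), and Slurm's real priority calculation involves more than this single factor; the sketch only illustrates exponential decay with a 7-day half life.

```shell
# Hypothetical helper: weight given to usage that is $1 days old,
# assuming usage decays exponentially with a half life of 7 days.
usage_weight() {
    awk -v age="$1" -v half_life=7 \
        'BEGIN { printf "%.3f\n", exp(-log(2) * age / half_life) }'
}

usage_weight 0    # usage from today counts in full
usage_weight 7    # usage from a week ago counts half as much
usage_weight 14   # usage from two weeks ago counts a quarter as much
```

In other words, a heavy run two weeks ago affects your priority only a quarter as much as the same run today.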
Note that when people submit many jobs to Linstat it tries to run all of them at once, and the busier it gets the slower they all run. With Slurm, jobs will always get the cores and memory they request and run at full speed, but if the cluster is busy they may need to wait for their turn before they start. Both systems will get about the same amount of work done in a given amount of time, but with Slurm the jobs that are run first will get done sooner while the jobs that are run last get done no later--a Pareto improvement.
You can submit a job to Slurm using the ssubmit command. The general syntax is:
ssubmit resources "command"
where resources is a description of the computing resources your job needs, and command is the Linux command that will run your job.
The ssubmit command is just a wrapper written by SSCC staff for the standard Slurm sbatch command: it takes your command, turns it into a submit file, and then sends it to Slurm with sbatch. Advanced users are welcome to use sbatch directly; documentation can be found at https://slurm.schedmd.com/documentation.html. Note that the --cores switch used by ssubmit translates to --cpus-per-task in standard Slurm, and --partition must be spelled out.
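For readers who want to try sbatch directly, the following is a rough sketch of what a submit file equivalent to the quick-start Stata example might look like. The actual file that ssubmit generates is not shown in this article, so the layout here is an assumption; make_submit_file is a hypothetical helper, while the #SBATCH options themselves (--partition, --nodes, --cpus-per-task, --mem) are standard sbatch options.

```shell
# Hypothetical helper: print a submit file for one command.
# Arguments: number of cores, memory (e.g. 20G), command to run.
make_submit_file() {
    cores=$1
    mem=$2
    cmd=$3
    cat <<EOF
#!/bin/bash
#SBATCH --partition=sscc
#SBATCH --nodes=1
#SBATCH --cpus-per-task=$cores
#SBATCH --mem=$mem
$cmd
EOF
}

# Print the submit file for the Stata example (assumed equivalent of
# ssubmit --cores=32 --mem=20g "stata -b do my_do_file").
make_submit_file 32 20G "stata -b do my_do_file"
```

Saving that output to a file (say, my_job.sh) and running sbatch my_job.sh on the cluster would submit the job.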
--partition tells ssubmit which partition to send your job to. If it is not specified the job will be sent to the sscc partition.
--nodes tells ssubmit how many nodes (servers) to use. If it is not specified the job will use one node. If you use multiple nodes, then --cores and --mem specify the cores and memory to be used on each node.
--cores tells ssubmit how many logical cores your job needs. If it is not specified the job will get one core.
--mem tells ssubmit how much memory your job needs. Put a g after the number to specify that the unit is gigabytes. If it is not specified the job will get one gigabyte of memory.
Your job will always get the cores and memory it asks for, but no more. If you request too few cores, your job will run more slowly than it could. But if you request too little memory, your job will crash when it runs out. So consider including a safety margin in your memory request, perhaps 25% more than you expect to need (e.g. if you know your job needs 40GB, ask for 50GB). On the other hand, as long as your job is running no one else can use the cores and memory you requested, so do not request more cores or memory than your job will actually use (plus a safety margin).
- If a job can use multiple cores but does not ask you to specify how many to use, it probably defaults to all the cores in the server.
- Stata always uses 32 cores; SAS uses 4 by default.
- Most statistical software loads the data it will use into memory (SAS is a notable exception). The amount of memory needed by a data set is usually similar to or smaller than the amount of disk space the data set uses (in Windows, right-click on the file and choose Properties to see how big it is). Keep in mind what you will then do with the data: making a second copy of the data will double the memory used, for example. Combining data sets will often result in a data set that's bigger than the sum of the sizes of the original data.
- You can run the Linux top command while your job is running on Linstat to monitor what computing resources it is using.
Note for advanced users: the servers in the Slurm cluster have "hyperthreading" turned off so there is just one logical core per physical core.
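The memory safety margin suggested above can be worked out with a line of shell arithmetic. This is only a sketch: mem_with_margin is a hypothetical helper, the 25% figure is the rule of thumb from this article, and the core count in the example is arbitrary. The script prints the ssubmit command rather than running it.

```shell
# Hypothetical helper: add a 25% safety margin to a memory estimate
# given in whole gigabytes, rounding up to the next whole gigabyte.
mem_with_margin() {
    echo $(( ($1 * 125 + 99) / 100 ))
}

needed_gb=40                                  # what you expect the job to use
request_gb=$(mem_with_margin "$needed_gb")    # 50 for a 40GB estimate

# Show the command you would run (the core count here is illustrative).
echo "ssubmit --cores=4 --mem=${request_gb}g \"R CMD BATCH --no-save my_R_script.R\""
```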
Linux Commands for Slurm
stata -b do my_do_file
R CMD BATCH --no-save my_R_script.R
matlab -nodisplay < my_matlab_script.m
Slurm will send any text output that would normally go to your screen to a file whose name includes the JOBID of the job. If this output is important, we recommend you send it to a file of your choice instead by adding > outputfile to the end of the command, where outputfile should be replaced by the desired file name. For example:
matlab -nodisplay < my_matlab_script.m > my_matlab_script.log
You can check the status of your job and see what else is in the Slurm queue with squeue. In the output, NODELIST tells you which node(s) your job is running on. If that column shows (Resources) instead, that means that the computing resources needed to run your job are not available at the moment. Your job will wait in the queue until other jobs finish, and then start as soon as possible. (BeginTime) means your job was preempted by a higher priority job and put back in the queue; it will be run again shortly if resources are available.
The squeue output also tells you the JOBID of each job; you can cancel one of your jobs with:
scancel JOBID
where JOBID should be replaced by the number of the job you want to cancel.