Platform R: Debugging Slurm Jobs
Job States
Several Slurm commands, squeue and scontrol in particular, can report the state of a job. Over the course of its lifetime, a job passes through several states.
The most common job states are:
| Job State (long form) | Job State (short form) | Explanation |
|---|---|---|
| CANCELLED | CA | The job was killed, either from the command line or due to exceeding resources. |
| COMPLETED | CD | The job and all tasks have finished. |
| COMPLETING | CG | Some part of the job is still running. |
| FAILED | F | The job ended with a non-zero exit status. |
| NODE_FAIL | NF | The node the job was running on had a problem. |
| OUT_OF_MEMORY | OOM | The job exceeded the memory limits in the request. |
| PENDING | PD | The job is queued but not yet running. |
| RUNNING | R | The job has been allocated resources and is running. |
See Job State Codes in the squeue manual page for the full list.
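A quick way to see these states in practice (the job ID `12345` below is hypothetical; substitute one of your own):

```shell
# List your own jobs; the ST column shows the short-form state code
squeue --me

# Show the long-form state of a single job (hypothetical job ID 12345)
squeue -j 12345 -o "%T"
```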
Slurm Job Reasons
While a job is pending or after it has ended, you can see the reason with `scontrol show job <jobid>`.
There are many reason codes; see the list of job reason codes in the official Slurm documentation for the full set.
Some of the more common job reasons are:
| Reason Code | Explanation |
|---|---|
| Priority | Higher-priority jobs are currently running. Your job will eventually run. |
| Dependency | The job is waiting for a job it depends on to finish; it will run when that job finishes. |
| Resources | The job is waiting for resources to become available. Your job will eventually run. |
| BeginTime | The job's earliest start time has not yet been reached. This can sometimes be due to a full job queue or a low job priority. |
| InvalidQoS | The job's QoS is invalid. Cancel the job and resubmit with a correct QoS. |
| TimeLimit | The job ran over its time limit. |
| NonZeroExitCode | The program returned a non-success exit code. Check the error log for the message. |
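To see which of these reason codes applies to a job (job ID `12345` is hypothetical):

```shell
# Full record for a job, including its Reason field
scontrol show job 12345

# Or just the state and reason via squeue format options
squeue -j 12345 -o "%T %r"
```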
Jobs Not Running
If the cluster is already busy and your job requests a large amount of memory or many CPUs, it may take a while for it to start.
Use `squeue --me -t PD` to see the reason a pending job hasn't started.
Slurm can give an approximate start time for a job, but it is only an estimate: `squeue -j <jobid> --start`
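Putting those together, a minimal pending-job check might look like this (column choices are illustrative; `12345` is a hypothetical job ID):

```shell
# List your pending jobs with ID, name, and reason code
squeue --me -t PD -o "%i %j %r"

# Ask Slurm for an estimated start time -- only an estimate
squeue -j 12345 --start
```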
Failed Jobs
Errors from your programs go into the output log by default. You can send them to a separate file by setting `--error` in your batch script.
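A minimal batch script sketch showing separate output and error files (the job name, filenames, and `./my_program` are hypothetical; `%j` expands to the job ID):

```shell
#!/bin/bash
#SBATCH --job-name=example
#SBATCH --output=example_%j.out   # stdout goes here by default
#SBATCH --error=example_%j.err    # stderr goes to a separate file

./my_program   # hypothetical program; replace with your own
```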
You can use `sacct -j <jobid>` to get a short summary of your job's state and reason, and `scontrol show job <jobid>` will give a very full report including the job state and reason. If you cancel a job, the reason field may preserve the reason from whatever error or state caused the cancellation.
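For a compact post-mortem of a failed job, a format string like the following can help (job ID `12345` is hypothetical; the field list is one reasonable choice, not the only one):

```shell
# State and exit code per job step
sacct -j 12345 --format=JobID,JobName,State,ExitCode
```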

