Platform R: Debugging Slurm Jobs

Job States

Several Slurm commands report the state of a job, squeue and scontrol in particular. Over the course of its lifetime, a job passes through several states.

The most common job states are:

Job State (long form)   Job State (short form)   Explanation
CANCELLED               CA                       The job was killed, either from the command line or for exceeding its resources.
COMPLETED               CD                       The job and all of its tasks have finished.
COMPLETING              CG                       The job is finishing, but some part of it is still running.
FAILED                  F                        The job ended with a non-zero exit status.
NODE_FAIL               NF                       The node the job was running on had a problem.
OUT_OF_MEMORY           OOM                      The job exceeded the memory limit set in its request.
PENDING                 PD                       The job is queued but not yet running.
RUNNING                 R                        The job has been allocated resources and is running.

See Job State Codes in the squeue manual page for the full list.
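The short and long forms above can be translated with a small lookup. The following is a minimal sketch, not part of Slurm: the function name state_long_name is hypothetical, and it covers only the common states listed in the table.

```shell
# Hypothetical helper: expand a short Slurm state code into its long form.
# Covers only the common states from the table; NF is the short code the
# squeue manual page lists for NODE_FAIL.
state_long_name() {
    case "$1" in
        CA)  echo "CANCELLED" ;;
        CD)  echo "COMPLETED" ;;
        CG)  echo "COMPLETING" ;;
        F)   echo "FAILED" ;;
        NF)  echo "NODE_FAIL" ;;
        OOM) echo "OUT_OF_MEMORY" ;;
        PD)  echo "PENDING" ;;
        R)   echo "RUNNING" ;;
        *)   echo "UNKNOWN" ;;
    esac
}

# Example use on a cluster (requires Slurm):
#   squeue --me -h -o "%i %t" | while read -r id code; do
#       echo "$id $(state_long_name "$code")"
#   done
```

In practice squeue can print either form itself: in an -o format string, %t gives the short code and %T the long name.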

Slurm Job Reasons

You can see a job's reason with scontrol show job <jobid>.

There are many reason codes; for the full set, see the list of job reason codes in the official Slurm documentation.

Some of the more common job reasons are:

Reason Code       Explanation
Priority          Higher-priority jobs are ahead of yours in the queue. Your job will eventually run.
Dependency        The job is waiting for a job it depends on to finish. It will run once that job finishes.
Resources         The job is waiting for resources to become available. Your job will eventually run.
BeginTime         The job's earliest start time has not yet been reached. This can sometimes be caused by a full job queue or by your job having a low priority.
InvalidQOS        The job's QoS is invalid. Cancel the job and resubmit it with a correct QoS.
TimeLimit         The job ran over its time limit.
NonZeroExitCode   The program returned a non-zero exit code. Check the error log for the message.
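To pull just the reason code out of the longer scontrol report, a filter along these lines may help. This is a sketch under assumptions: the function name extract_reason is hypothetical, and the sample line only imitates the key=value layout that scontrol show job prints.

```shell
# Hypothetical helper: read `scontrol show job` output on stdin and
# print the value of the first Reason= field it finds.
extract_reason() {
    tr ' ' '\n' | sed -n 's/^Reason=//p' | head -n 1
}

# Illustrative input line, shaped like scontrol output (not a real job):
printf '   JobState=PENDING Reason=Priority Dependency=(null)\n' | extract_reason
# prints: Priority

# On a cluster (requires Slurm):
#   scontrol show job <jobid> | extract_reason
```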

Jobs Not Running

If the cluster is already busy and your job requests a lot of memory or CPU, it may take a while to start.

Use squeue --me -t PD to see the reason a pending job hasn't started. 

Slurm can give an approximate start time for a job, but it is only an estimate: squeue -j <jobid> --start
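If you have many pending jobs, it can help to tally them by reason. The sketch below assumes one reason code per input line; the function name count_reasons is hypothetical.

```shell
# Hypothetical helper: tally reason codes (one per input line),
# most frequent first.
count_reasons() {
    sort | uniq -c | sort -rn
}

# On a cluster (requires Slurm); -h drops the header line and the
# %r format field prints only the reason:
#   squeue --me -t PD -h -o "%r" | count_reasons
```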

Failed Jobs

By default, errors from your programs are written to the same log file as their normal output. To send errors to a separate file, set --error when you submit the job.
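A batch script that splits the two streams might look like the sketch below. The job name, the file name patterns, and the report function are assumptions chosen for illustration; %j in a file name is replaced by the job ID.

```shell
#!/bin/bash
#SBATCH --job-name=debug-example
#SBATCH --output=job_%j.out   # standard output goes here
#SBATCH --error=job_%j.err    # standard error goes here instead

# Hypothetical helper: write one line to each stream so the split
# between the two log files is easy to see.
report() {
    echo "progress: $1"
    echo "warning: $1" >&2
}

report "step 1"
```

Submitted with sbatch, this should leave the progress line in the .out file and the warning line in the .err file.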

Use sacct -j <jobid> for a short summary of your job's state and exit code; adding --format=JobID,State,ExitCode,Reason also shows the job reason. For a much fuller report that includes the job state and job reason, use scontrol show job <jobid>. If you cancel a job, its job reason may still show the reason from whatever error or state led to the cancellation.
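For jobs with many steps, the sacct summary can be filtered down to the entries that did not complete. This is a sketch only: the function name not_completed is hypothetical, and it assumes pipe-separated input with the fields JobID, State, and ExitCode in that order.

```shell
# Hypothetical helper: read "JobID|State|ExitCode" lines on stdin and
# print every entry whose state is not COMPLETED.
not_completed() {
    awk -F'|' '$2 != "COMPLETED" { print $1 ": " $2 " (exit code " $3 ")" }'
}

# On a cluster (requires Slurm); --parsable2 gives pipe-separated
# fields and --noheader drops the column titles:
#   sacct -j <jobid> --format=JobID,State,ExitCode --parsable2 --noheader | not_completed
```

Note that jobs that are still running or pending are also reported by this filter, since it keeps everything except COMPLETED.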



Doc ID: 157401
Owned by: William A. in SMPH Research Applications
Created: 2025-12-12
Updated: 2025-12-22
Sites: SMPH Research Applications