Platform R: Slurm Job Suspension Process

 

Platform R’s architecture balances security, performance, and usability for all Platform R users. Occasionally, large job submissions overload Platform R components. This results in poor performance for other users or, in the worst case, a denial of service.

To ensure a fair and performant platform for all researchers, a Platform R engineer will put on hold large or inefficient job submissions that cause issues for other researchers. The engineer will then work with the affected researcher to release the held jobs in batches or to make their Slurm workflows more efficient.

Procedure

Platform R engineers will consider putting jobs on hold:

  • When they receive a Navigator ticket reporting performance issues or a denial of service.
  • When they notice performance issues based on alerts and metrics.

When either condition is met, the engineer takes the following steps.

  1. Identify the Slurm jobs and the user that are causing the performance issues.
  2. Put the jobs on hold.
  3. Email the user to let them know their jobs were put on hold, and create a Navigator ticket on the researcher's behalf.
  4. Work with the researcher to release the jobs in batches or to reduce the impact of their Slurm workflows on the cluster.
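
As an illustration only, steps 1, 2, and 4 could be sketched with standard Slurm commands (squeue, scontrol hold, scontrol release). The sketch below only builds the command lines and does not run them. The function names, the username researcher1, and the batch size are hypothetical and are not part of the Platform R tooling. Note that scontrol hold prevents pending jobs from starting; it does not stop jobs that are already running.

```python
from typing import List


def list_pending_cmd(user: str) -> List[str]:
    """Build the squeue command that prints one pending job ID per line for a user."""
    return ["squeue", f"--user={user}", "--states=PENDING", "--noheader", "--format=%i"]


def hold_cmd(job_ids: List[str]) -> List[str]:
    """Build a single scontrol command that holds every job in job_ids."""
    # scontrol accepts a comma-separated list of job IDs.
    return ["scontrol", "hold", ",".join(job_ids)]


def release_cmds(job_ids: List[str], batch_size: int) -> List[List[str]]:
    """Build one scontrol release command per batch of at most batch_size jobs."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [
        ["scontrol", "release", ",".join(job_ids[i:i + batch_size])]
        for i in range(0, len(job_ids), batch_size)
    ]


if __name__ == "__main__":
    # Hypothetical job IDs, as squeue might report them for the user.
    jobs = ["101", "102", "103", "104", "105"]
    print(" ".join(list_pending_cmd("researcher1")))
    print(" ".join(hold_cmd(jobs)))
    for cmd in release_cmds(jobs, batch_size=2):
        print(" ".join(cmd))
```

Each returned list can be passed to subprocess.run(cmd, check=True) by an account with the required Slurm privileges. Releasing in small batches, with a pause to watch cluster metrics between batches, keeps the load on shared components gradual.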

Keywords:
platform r, slurm
Doc ID:
162333
Owned by:
William A. in SMPH Research Applications
Created:
2026-06-30
Updated:
2026-06-30
Sites:
SMPH Research Applications