Platform R: Slurm Job Suspension Process
Platform R’s architecture balances security, performance, and usability for all Platform R users. Occasionally, large job submissions overload Platform R components, which results in poor performance for other users or, in the worst case, a denial of service.
To ensure a fair and performant platform for all researchers, a Platform R engineer will put on hold large or inefficient job submissions that cause issues for other researchers. The engineer will then work with the affected researcher to release the held jobs in batches or to make their Slurm workflows more efficient.
Procedure
Platform R engineers will consider putting jobs on hold:
- When they receive a Navigator ticket reporting performance issues or denial of service.
- When they notice performance issues based on alerts and metrics.
When this happens, the engineer takes the following steps:
- Identify the Slurm jobs that are causing the performance issues and the user who submitted them.
- Put the jobs on hold.
- Email the user to let them know their jobs were put on hold, and create a Navigator ticket on the researcher's behalf.
- Work with the researcher to release the jobs in batches or to make their Slurm workflows have less impact on the cluster.
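
The hold and batched-release steps above could be scripted roughly as follows. This is a minimal sketch, not Platform R's actual tooling: it assumes the engineer has Slurm operator privileges, that only pending jobs are held (`scontrol hold` prevents pending jobs from starting and does not stop running ones), and that the helper names and any batch size are hypothetical choices for illustration.

```python
import subprocess
from typing import Iterator, List, Sequence


def pending_jobs_command(user: str) -> List[str]:
    """Build the squeue call that lists a user's pending job IDs."""
    # -h: no header, -r: one job array element per line,
    # -t PENDING: only jobs that have not started, -o %i: print just the ID.
    return ["squeue", "-h", "-r", "-u", user, "-t", "PENDING", "-o", "%i"]


def hold_command(job_ids: Sequence[str]) -> List[str]:
    """Build the scontrol call that holds the given jobs."""
    # scontrol takes the job list as one comma-separated argument.
    return ["scontrol", "hold", ",".join(job_ids)]


def release_command(job_ids: Sequence[str]) -> List[str]:
    """Build the scontrol call that releases previously held jobs."""
    return ["scontrol", "release", ",".join(job_ids)]


def batches(job_ids: Sequence[str], size: int) -> Iterator[List[str]]:
    """Split job IDs into consecutive batches of at most `size` (hypothetical helper)."""
    if size < 1:
        raise ValueError("batch size must be at least 1")
    for start in range(0, len(job_ids), size):
        yield list(job_ids[start:start + size])


def run(command: List[str]) -> str:
    """Run a Slurm command and return its standard output."""
    result = subprocess.run(command, check=True, capture_output=True, text=True)
    return result.stdout


def hold_all_pending(user: str) -> List[str]:
    """Hold every pending job for `user`; return the held IDs for later release."""
    job_ids = run(pending_jobs_command(user)).split()
    if job_ids:
        run(hold_command(job_ids))
    return job_ids
```

An engineer might call `hold_all_pending` once for the affected user, keep the returned IDs, and then, after agreeing on a pace with the researcher, call `run(release_command(batch))` for each batch from `batches(job_ids, size)`, checking cluster metrics between batches.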
