Slow Slurm on rackham (closed)

Slurm (the workload manager we use to schedule jobs) does not always like it when it has too many jobs to keep track of. This has happened quite a few times lately.

When this happens, squeue or sbatch/salloc may be very slow or even time out with an error like:

slurm_load_jobs error: Socket timed out on send/recv operation

We’re aware of this and are doing what we can to improve the situation, but a lot of jobs means a lot of work, which takes time.
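
If you call squeue or sbatch from your own scripts, one way to cope with these timeouts is to retry with increasing pauses instead of failing on the first error. The following is only a minimal sketch, not an official recommendation: the function name run_with_backoff, the number of attempts, and the delays are assumptions chosen for illustration. The long, doubling pauses are deliberate, since rapid retries would only add load to an already busy Slurm.

```python
import subprocess
import time


def run_with_backoff(cmd, attempts=5, first_delay=30.0, timeout=120.0,
                     run=subprocess.run, sleep=time.sleep):
    """Run cmd, retrying with doubling pauses if it fails or hangs.

    Returns the command's stdout on success and raises RuntimeError
    after the last failed attempt. The run and sleep parameters exist
    only so the function can be exercised without a real Slurm.
    """
    delay = first_delay
    last_error = ""
    for attempt in range(1, attempts + 1):
        try:
            result = run(cmd, capture_output=True, text=True, timeout=timeout)
            if result.returncode == 0:
                return result.stdout
            # Non-zero exit, e.g. "Socket timed out on send/recv operation"
            last_error = result.stderr.strip()
        except subprocess.TimeoutExpired:
            last_error = f"no answer within {timeout} s"
        if attempt < attempts:
            sleep(delay)   # wait before bothering Slurm again
            delay *= 2     # 30 s, 60 s, 120 s, ...
    raise RuntimeError(
        f"{cmd[0]} failed after {attempts} attempts: {last_error}"
    )
```

A call could look like run_with_backoff(["squeue", "-u", "myuser"]), where myuser is a placeholder for your own user name. Be careful when wrapping sbatch this way: a submission that timed out on the client side may still have been accepted, so check the queue before resubmitting.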

Final ticket report

We have not seen this issue for some time now.

Affected systems: rackham

Written by Support Team on November 16, 2018
