Slow Slurm on rackham (closed)
Slurm (the workload manager we use to schedule jobs) does not always cope well when it has too many jobs to keep track of. This has happened quite a few times lately.
When this happens, squeue or sbatch/salloc may be very slow or even time out with an error like
slurm_load_jobs error: Socket timed out on send/recv operation
We’re aware of this and are doing what we can to improve the situation, but a lot of jobs means a lot of work, which takes time.
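If a script of yours calls squeue and hits this timeout, one option is to wait and try again with increasing pauses rather than retrying immediately, which would only add to the load. The following is a minimal sketch, not an official tool: the function name run_with_backoff, the number of attempts, and the delays are illustrative assumptions. It is best suited to read-only commands such as squeue; an sbatch call that times out may still have submitted the job, so retrying it blindly could submit the job twice.

```python
import subprocess
import time

# Text taken from the error message shown above.
TIMEOUT_MARKER = "Socket timed out on send/recv operation"


def run_with_backoff(cmd, attempts=4, base_delay=30.0,
                     runner=subprocess.run, sleep=time.sleep):
    """Run a Slurm client command, waiting longer after each timeout.

    `attempts` and `base_delay` are illustrative defaults (assumptions),
    not values recommended by the support team.
    """
    for attempt in range(attempts):
        result = runner(cmd, capture_output=True, text=True)
        if result.returncode == 0:
            return result.stdout
        if TIMEOUT_MARKER not in result.stderr:
            # A different failure: retrying will not help, so report it.
            raise RuntimeError(result.stderr.strip())
        if attempt < attempts - 1:
            # Double the pause each time: 30 s, 60 s, 120 s, ...
            sleep(base_delay * 2 ** attempt)
    raise TimeoutError(f"{cmd[0]} still timing out after {attempts} attempts")


# Example call (hypothetical user name):
#   print(run_with_backoff(["squeue", "-u", "myuser"]))
```

The runner and sleep parameters exist only so the function can be tried out without a live Slurm installation; in normal use they can be left at their defaults.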
Final ticket report
We have not seen this issue for some time now.
Affected systems: rackham
Written by Support Team on