Failed jobs on Rackham (closed)
On Thursday 2024-11-21 and Friday 2024-11-22, several jobs on Rackham failed immediately after starting. The problem was fixed on Friday at 13:00. The failed jobs got the status CANCELLED in Slurm.
The cause was a single bad compute node whose failed hard drive had not been detected.
If you had jobs that failed immediately, including interactive jobs, it was likely due to this problem.
To verify whether a specific job was affected, run the following sacct command and check for node r1198 in the NodeList:
$ sacct -X -o Node -j <YOUR_JOB_ID>
NodeList
---------------
r1198
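If you have many jobs to check, the per-job lookup above can be scripted. The following is a minimal sketch, not an official tool: `filter_node` is a hypothetical helper name, and it assumes pipe-separated rows of the form `JobID|State|NodeList`, as produced by sacct with the `-P -n` options.

```shell
# Hypothetical helper (not part of Slurm): print the job ID and state of
# every row whose NodeList field is exactly the given node name.
# Expects "JobID|State|NodeList" rows on stdin.
# Note: multi-node jobs report a compressed list such as r[1197-1199],
# which this exact match will not catch.
filter_node() {
    awk -F'|' -v node="$1" '$3 == node { print $1, $2 }'
}

# Intended use on the cluster (requires sacct; dates cover the incident):
#   sacct -X -P -n -S 2024-11-21 -E 2024-11-23 -o JobID,State,NodeList | filter_node r1198
```

Any job listed by this filter ran on r1198 during the incident window. sacct also has its own `-N`/`--nodelist` option for selecting jobs by node, which may be preferable if you need multi-node jobs handled correctly.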
Affected systems: rackham
Written by Support Team on