Failed jobs on Rackham closed

On Thursday 2024-11-21 and Friday 2024-11-22, several jobs on Rackham failed immediately after starting. The problem was fixed on Friday at 13:00. The failed jobs got the status CANCELLED in Slurm.

The cause was a single bad compute node with a failed hard drive that was not detected.

If you had jobs that failed immediately, including interactive jobs, it was likely due to this problem.

To verify whether a specific job was affected, run the following sacct command and check for node r1198 in the NodeList:

$ sacct -X -o Node -j <YOUR_JOB_ID>
       NodeList 
--------------- 
          r1198
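
If you have many jobs to check, the same lookup can be done for a whole time window instead of one job ID at a time. The following is only a sketch: the helper name on_bad_node is made up for illustration, and it assumes single-node jobs, so a multi-node job whose NodeList is written as a range (for example r[1197-1199]) will not be matched.

```shell
# Hypothetical helper (not part of Slurm): reads lines of the form
# "JobID|State|NodeList" on stdin, as produced by sacct with -n -P,
# and prints the job ID and state of jobs whose NodeList is exactly r1198.
on_bad_node() {
    awk -F'|' '$3 == "r1198" { print $1, $2 }'
}

# Possible use on a login node (requires Slurm's sacct; dates cover the incident):
#   sacct -X -n -P -u "$USER" -S 2024-11-21 -E 2024-11-23 \
#         -o JobID,State,NodeList | on_bad_node
```

Here -S and -E limit the query to the two days of the incident, -n drops the header and -P makes the output pipe-separated so that it is easy to filter.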