Exhausted controller file system on Irma closed
The main filesystem on the server hosting the Slurm controller on Irma has at times been completely full. This has prevented the submission of new jobs and may also have caused other intermittent errors. We plan to expand the main filesystem, and we are also removing old, unnecessary files. We apologize for the inconvenience if individual jobs fail to submit properly. If you experience repeated issues, please contact support. We expect to resolve the issue more fully early next week (the week starting August 9), when staffing increases again. While the upgrade is being carried out, all access to the Slurm controller (submitting jobs and checking job status) may be blocked.
Jobs already running on compute nodes are not expected to be affected by these issues.
Update 2021-08-05 21:03
The controller node is currently under maintenance to address the underlying issue. It is expected to be back up later this evening.
Update 2021-08-05 22:05
The expansion was successful. Slurm seems to be operating normally again.