Crex problems closed

Crex, the main project file system for the clusters Rackham and Snowy, has been showing problems over the last few days. We have been in contact with the vendor to resolve them. The problems got drastically worse during the afternoon and evening of December 22. We are trying to reboot selected parts of the system to restore availability.

If you are running jobs with unusual or very heavy I/O loads, please review them, unless they have already been killed by these problems.

Update 2021-12-23 10:28

The system is now back online. We’re still seeing very heavy load, and we are not sure whether this is due to specific jobs with very aggressive I/O patterns. Please review your own jobs.

Update 2021-12-23 11:08

From what we can tell, a single user was running repeated scans over a single large file with indexing explicitly turned off. While the file system can provide very high performance, a very large number of jobs each reading the same file in full can still cause trouble. We are not sure that this was the root cause, but things do look better right now.
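
As a general illustration of why indexing matters for this kind of access pattern, here is a minimal Python sketch. It is not the tool or file format the affected jobs used, which we do not describe here; the function names and the line-oriented file layout are assumptions made for the example. The idea is to scan the file once to record where each record starts, and then seek directly to a record instead of reading the whole file again for every lookup.

```python
import os
import tempfile


def build_line_index(path):
    """Scan the file once and record the byte offset where each line starts."""
    offsets = []
    with open(path, "rb") as f:
        pos = 0
        for line in f:
            offsets.append(pos)
            pos += len(line)
    return offsets


def read_line(path, offsets, n):
    """Fetch line n by seeking to its recorded offset instead of rescanning."""
    with open(path, "rb") as f:
        f.seek(offsets[n])
        return f.readline().rstrip(b"\n")


def demo():
    """Build an index for a small temporary file and look up one line."""
    with tempfile.NamedTemporaryFile("wb", delete=False) as tmp:
        tmp.write(b"alpha\nbeta\ngamma\n")
        path = tmp.name
    try:
        offsets = build_line_index(path)
        return offsets, read_line(path, offsets, 2)
    finally:
        os.remove(path)
```

With an index like this, each lookup reads only the bytes it needs, so many concurrent jobs put far less load on the shared file system than the same number of full scans would. If your tool offers built-in indexing, keeping it enabled is usually the simpler choice.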

Update 2021-12-26 12:00

Overall, operations have been more stable since the reboot and the removal of the jobs from one specific user. We consider the current issue closed.