Issues with Crex (Rackham & Snowy project storage) closed

There are unfortunately new issues with the storage system Crex. Crex is used for storing project directories and large data sets on Rackham and Snowy. The new problems first appeared on the evening of Saturday, December 7th and manifest, as before, as slow or no access to data under /proj and /data. This is the second time within a week that we have had issues with Crex, and we are working together with the vendor to resolve this as quickly as possible. This news item will be updated as we know more.

Update 2019-12-11 16:00

The suggested culprit is a known bug in our version of the Lustre filesystem. We have implemented a workaround suggested by the vendor until we can perform a larger update of the filesystem at a suitable time (likely early 2020, due to the Christmas holidays). Today we performed several tests in an attempt to trigger the problem in a controlled setting - and we have so far been unable to reproduce the issue. This hopefully implies that the workaround indeed works, but we need to monitor the system closely as the vendor is still performing a technical analysis, and has not yet convincingly reported on the cause.

With no signs of further issues we have decided to once more allow jobs to run as normal (that is, without specifying --reservation=job_might_be_killed).

If you experience slow or no access to your project directory, or any related issues, please contact UPPMAX support at support@uppmax.uu.se.

Update 2019-12-09 09:19

We are working with the vendor to resolve this issue and will not fully restart the queues until we believe the problem has been resolved.

However, the file system is currently available and if you have a pressing need to run jobs, you can do so by using the reservation job_might_be_killed.

Our SLURM user guide at https://www.uppmax.uu.se/support/user-guides/slurm-user-guide/ explains how to use reservations, but the short version is that you add

--reservation=job_might_be_killed

to your sbatch, salloc or interactive command.
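As a sketch, the flag can be added to any of the three commands mentioned above. The script name, project account, and time limit shown here are placeholders, not real values from this announcement:

```shell
# Hedged sketch: adding the reservation flag to common SLURM commands.
# "my_job.sh", the account "snicXXXX-Y-ZZ", and the time limit are
# hypothetical placeholders; substitute your own project and script.

# Batch submission:
#   sbatch --reservation=job_might_be_killed -A snicXXXX-Y-ZZ my_job.sh
# Interactive allocation:
#   salloc --reservation=job_might_be_killed -A snicXXXX-Y-ZZ -t 1:00:00
# UPPMAX interactive wrapper:
#   interactive --reservation=job_might_be_killed -A snicXXXX-Y-ZZ

# The flag itself, for copy-paste:
echo "--reservation=job_might_be_killed"
```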

Jobs in that reservation may run but might be killed due to issues or because we need to shut things down as part of work to address any issues.