Issues with Crex (Rackham & Snowy project storage) closed
There are unfortunately new issues with the storage system Crex. Crex is used
for storing project directories and large data sets on Rackham and Snowy. The
new problems first appeared on the evening of Saturday, December 7th, and manifest,
as before, in slow or no access to data under
/data. This is the
second time we have had issues with Crex within a week and we are working
together with the vendor to solve this as quickly as possible. This news item
will be updated as we learn more.
Update 2019-12-11 16:00
The suspected culprit is a known bug in our version of the Lustre filesystem. We have implemented a workaround suggested by the vendor until we can perform a larger update of the filesystem at a suitable time (likely early 2020, due to the Christmas holidays). Today we ran several tests in an attempt to trigger the problem in a controlled setting, and so far we have been unable to reproduce the issue. This hopefully means that the workaround is effective, but we need to monitor the system closely, as the vendor is still performing a technical analysis and has not yet conclusively reported on the cause.
With no signs of further issues, we have decided to once more allow jobs to run as normal (that is, without specifying --reservation=job_might_be_killed).
If you experience slow or no access to your project directory, or any related issues, please contact UPPMAX support at firstname.lastname@example.org
Update 2019-12-09 09:19
We are working with the vendor to resolve this issue and will not start the queues properly until we believe the problem has been resolved.
However, the file system is currently available, and if you have a pressing need
to run jobs, you can do so by using the reservation job_might_be_killed.
Our Slurm user guide at https://www.uppmax.uu.se/support/user-guides/slurm-user-guide/ is a good place to learn how to use it, but the short version is that you add --reservation=job_might_be_killed when submitting your job.
Jobs in that reservation may run but might be killed due to issues or because we need to shut things down as part of work to address any issues.
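A minimal sketch of a job script that uses the reservation could look like the following. Only the reservation name comes from this notice; the file name job.sh, the account name, the time limit and the payload are placeholders made up for illustration, so replace them with your own.

```shell
#!/bin/bash
# job.sh: hypothetical job script; account, time limit and payload are placeholders.
#SBATCH --account=my_project                 # assumption: replace with your own project name
#SBATCH --time=00:10:00                      # assumption: a short time limit
#SBATCH --reservation=job_might_be_killed    # reservation named in this notice

# Placeholder payload. Jobs in this reservation can be killed at any time,
# so real work should write its results to disk frequently.
message="job started in reservation job_might_be_killed"
echo "$message"
```

Such a script would be submitted with sbatch job.sh. The same flag can instead be given on the command line, as in sbatch --reservation=job_might_be_killed job.sh, which leaves the script itself unchanged for when the queues run as normal again.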