Crex is down (closed)

Crex is now back up and seems to be working, but we’re still investigating together with the vendor what happened.

Crex ran into new issues yesterday evening around 19:30 and was down until 10:30.

This affected jobs on Rackham and Snowy that write output to Crex. Jobs on Rackham and Snowy that write to local disk or the home directory were not affected. In addition, all access to project folders on Crex was unavailable or restricted. Access will be restored during the day as we work on the issue.

We have a priority issue open with our vendor and are working on troubleshooting and resolving the problem.

Update 2022-02-03 14:45

Crex seems to be working fine after the February maintenance window.

The previous problem was caused by a deadlock in Lustre, which triggered a failover that did not work as expected. We have tuned a setting that will hopefully make the failover work better. Lustre updates that may help fix the deadlock are also available, and we plan to apply them at a future service stop.

We will continue to track the performance of Crex.

Update 2022-02-01 10:30

Crex is back up. We’re investigating what happened and are keeping an extra eye on performance.

Update 2022-02-01 09:00

Crex ran into new issues yesterday evening and is currently down. We have a priority issue open with our vendor and are working on troubleshooting and resolving the problem.

Update 2022-01-28 18:30

The performance of Crex is now back to normal. If you still experience storage-related issues, please contact UPPMAX support.

Most of the metadata performance recovered yesterday (2022-01-17) evening. Some errors remained on the file system servers and technicians from DDN helped us resolve these today.

Original status

The main project file system Crex for the clusters Rackham and Snowy has been running slower since yesterday evening. We are still investigating the cause of this. We will update the status here as soon as we have made some progress.