Issues with Crex (upgrades started 2021-01-21) closed

We noticed a problem with the metadata servers of Crex, which made the project storage system on Rackham and Snowy unusable. Following vendor upgrades of client and server software, we are now allowing all kinds of jobs again.

Update 2021-01-25 14:00

Jobs have run successfully during the weekend and we no longer see any convincing signs that issues persist. We will continue to track Crex by monitoring the load and log messages. If you have any questions or observe performance-related issues, please contact support at support@uppmax.uu.se.

Update 2021-01-22 13:50

Most nodes are now updated. We’ve started allowing normal job time limits on most nodes. We hope that most problems are resolved, but we’re not yet confident about that.

Update 2021-01-21 14:30

We have continued running with some slowdowns, which might be due to a bug in the filesystem server and/or client. Therefore, the vendor will prepare an upgrade for us today (Thursday), with the work possibly running into tomorrow.

The availability of Crex may get worse during the upgrade, although this is supposed to be an “online” operation that can be executed while jobs are running. The filesystem should never become fully unavailable, but some operations may temporarily take several minutes to complete.

Update 2021-01-19 00:00

We have continued troubleshooting and possibly found a partial explanation based on an interaction between two specific jobs and the storage solution. Those jobs have been cancelled. We are still in contact with the vendor. The queues are partially open again, but for now only jobs shorter than half a day are allowed, to leave room for an extraordinary update if that turns out to be needed.

Update 2021-01-18 17:00

We have performed metadata performance tests continuously during the day. While performance has recovered, we still see some suspicious log entries suggesting that the filesystem is still having some issues with file locks. Unless we or the vendor see more concrete errors later today, we feel confident enough to release a few Rackham nodes and allow short jobs to run.

Update 2021-01-15 18:30

The filesystem has been started again. Metadata performance is very low, which means that running commands like “ls” is very slow. Job queues are still stopped.

We continue troubleshooting this together with the vendor.

Update 2021-01-15 14:00

The project storage system Crex is shut down and unavailable.

The storage system vendor has identified parts of the problem as a known software defect and is installing fixes. Consistency checks and repairs are also being performed. We do not have any further details or time estimates yet.