Issues with Crex (closed)

There have been reports of slow or no access to the project storage system on Rackham and Snowy.

We are investigating the issue. In the meantime, we recommend using the node-local scratch storage for jobs with intensive file access. See our user guide for details: https://www.uppmax.uu.se/support/user-guides/how-to-use-the-nodes-own-hard-drive-for-analysis/
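As an illustration only (not a substitute for the user guide linked above), the sketch below shows the general staging pattern in Python: copy input data from project storage to the node-local scratch directory, do the file-intensive work there, and copy the results back once at the end of the job. The environment variable name SNIC_TMP and the example project paths are assumptions made for this sketch; check the user guide for the exact variable and recommended workflow on our systems.

    import os
    import shutil

    # Node-local scratch directory; a job-specific path is assumed to be
    # exported in an environment variable (SNIC_TMP is assumed here).
    scratch = os.environ.get("SNIC_TMP", "/scratch")

    # Hypothetical example paths on project storage (Crex).
    project_input = "/proj/snicXXXX-YY-ZZ/data/input.dat"
    project_results = "/proj/snicXXXX-YY-ZZ/results"

    # 1. Stage the input data to local scratch before the file-intensive work.
    local_input = os.path.join(scratch, os.path.basename(project_input))
    shutil.copy(project_input, local_input)

    # 2. Run the analysis against the local copy (placeholder for the real work).
    local_output = os.path.join(scratch, "output.dat")
    with open(local_input) as fin, open(local_output, "w") as fout:
        for line in fin:
            fout.write(line)

    # 3. Copy the results back to project storage once, at the end of the job.
    os.makedirs(project_results, exist_ok=True)
    shutil.copy(local_output, project_results)

The point of the pattern is that the many small reads and writes happen on the node's own disk, while project storage only sees one copy in and one copy out.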

Update 2020-08-06 17:00

We performed maintenance on Crex on August 5. We replaced a faulty IOM module in a storage enclosure and a SAS cable that connects one of the controllers to the same enclosure. This replacement recovers a large part of the performance lost at the end of July. We also installed a new package specifically patched to include a fix for our version of Lustre. The fix prevents a bad code path in the server that crashes one of the MDS servers. Together, the replaced hardware and the software update should return Crex to its previous state and performance. If you have any questions, please contact support at support@uppmax.uu.se.

Update 2020-07-31 16:00

We are experiencing poor performance for metadata operations on Crex (even a command as simple as ls -l can trigger it). We have updated our maintenance plans for next week and will perform more extensive maintenance on Crex to solve this issue. See the service day news for August for more information.

Update 2020-07-17 09:00

As a consequence of the storage system's metadata server (MDS) being out of service, the compute nodes of the Snowy cluster cannot mount the Crex filesystem. As a result, they are unavailable to the queueing system Slurm. We are working on a workaround with help from the manufacturer.

Update 2020-07-15 12:30

So far we have determined that one of the metadata servers of Crex became unhealthy and was eventually shut down by the system maintenance service. We have opened a case with the manufacturer and are waiting for their response.