Issues with Crex closed

There is a problem with the metadata performance on Crex. Normal file system commands like “cd” or “ls” are slower than usual. The problem started yesterday. The behaviour is similar to the previous problems but so far not as severe.

We are investigating and have contacted the storage system vendor.

Update 2021-05-11 13:00

File system performance has been back to normal for the last few weeks. We will continue to monitor performance and work together with PIs to remove expired project data. Please contact support at support@uppmax.uu.se if you make any further observations or have questions.

Update 2021-03-22 11:00

Performance recovered when we identified a job doing a lot of random I/O on a large dataset. If the dataset fits in the available local RAM on the computation node, we recommend using, for example, vmtouch; if it does not, we recommend using the scratch partition of the local disk.
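As a rough illustration of the recommendation above, the following Python sketch picks between the two approaches. It is only a sketch under stated assumptions: the function names, the 50 % headroom factor and the way available memory is measured are hypothetical choices for illustration, not UPPMAX tooling, and the code does not call vmtouch itself.

```python
import os
import shutil
from pathlib import Path


def dataset_size_bytes(root):
    """Return the total size in bytes of all regular files under root."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks so linked data is not counted twice.
            if os.path.isfile(path) and not os.path.islink(path):
                total += os.path.getsize(path)
    return total


def available_ram_bytes():
    """Estimate free physical memory (Linux-specific sysconf names)."""
    return os.sysconf("SC_AVPHYS_PAGES") * os.sysconf("SC_PAGE_SIZE")


def choose_strategy(dataset_bytes, ram_bytes, headroom=0.5):
    """Pick 'page-cache' if the data fits in a fraction of RAM, else 'local-scratch'.

    headroom is an assumed safety margin that leaves memory for the job itself.
    """
    if dataset_bytes <= ram_bytes * headroom:
        return "page-cache"
    return "local-scratch"


def stage_to_scratch(src, scratch_root):
    """Copy the dataset once (sequentially) to node-local scratch; return the new path."""
    dest = Path(scratch_root) / Path(src).name
    shutil.copytree(src, dest, dirs_exist_ok=True)
    return dest
```

A job could call choose_strategy(dataset_size_bytes(path), available_ram_bytes()) at start-up and then either preload the files into memory (for example with vmtouch) or call stage_to_scratch and do its random I/O on the local copy instead of on Crex.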

Update 2021-03-24 17:30

We have problems again with Crex being slow. We are investigating jobs on the cluster and have contacted the storage system vendor.

Update 2021-03-25 09:00

Crex is still intermittently slow. We have drained the computation nodes so that currently no new jobs are started. We plan to open up for new jobs gradually, when possible. We are in contact with the vendor. We are also restarting nodes that seem to hold problematic locks, which may affect other nodes' access to the file system.

Update 2021-03-25 09:25

We plan to do a restart of one of the two metadata servers at 09:30 for the Crex file system. This will stop all access to Crex during the restart and the failover. Jobs will continue when the metadata service is running again.

Update 2021-03-25 11:45

We are slowly opening up Rackham and Snowy for new jobs while keeping an eye on file system performance.

Update 2021-03-26 12:00

Crex is working fine now.

Update 2021-04-12 09:30

Crex is running slow again. We will do another metadata server restart to mitigate this. We have identified a bug which may be the culprit. A fix for this will hopefully be in the next release of ExaScaler, which is expected soon. ExaScaler from DDN is the packaged Lustre file system that Crex is running.

In the meantime, we keep monitoring the health of the system. Note that this bug only appears together with high load from user jobs. Try to make sure that you use the scratch file system or other options for frequent non-sequential reads and writes and for any temporary files. This will also improve your experience when Crex is slow.
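As an example of keeping temporary files off Crex, the Python sketch below does its intermediate work in a scratch directory and copies only the final result back in one sequential write. This is a minimal sketch under assumptions: the environment variable names SNIC_TMP and TMPDIR are assumed to point at node-local scratch (check what your job environment actually sets), and scratch_dir and run_with_scratch are hypothetical helper names.

```python
import os
import shutil
import tempfile
from pathlib import Path


def scratch_dir(env_vars=("SNIC_TMP", "TMPDIR")):
    """Return the first existing directory named by env_vars.

    The variable names are assumptions; falls back to the system temp dir.
    """
    for var in env_vars:
        value = os.environ.get(var)
        if value and os.path.isdir(value):
            return Path(value)
    return Path(tempfile.gettempdir())


def run_with_scratch(work, final_dir):
    """Run work(tmp_path) in a scratch temp dir and keep only its result file.

    work must return the path of the result file it wrote inside tmp_path.
    All other temporary files are deleted when the function returns.
    """
    final_dir = Path(final_dir)
    final_dir.mkdir(parents=True, exist_ok=True)
    with tempfile.TemporaryDirectory(dir=scratch_dir()) as tmp:
        result = Path(work(Path(tmp)))
        # One sequential copy to the shared file system at the end.
        return Path(shutil.copy2(result, final_dir / result.name))
```

The design choice here is that the shared file system only sees a single write of the finished result, while all frequent small reads and writes happen on the local disk.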

Update 2021-04-13 13:30

The file system servers were upgraded yesterday. We will begin upgrading the clients today with the latest Lustre patches from DDN. It looks good so far.

Update 2021-04-17 14:55

Client updates have been ongoing. During Friday, we observed a new kind of behavior, possibly related to a network cable becoming unreliable. We have disconnected that network port. We are also seeing some other changes in behavior after the update which we have communicated to the vendor. For a while, new jobs on Rackham and Snowy were paused, but since things are looking slightly better again, they’re now resumed.

You might have experienced jobs that were killed or suffered I/O errors during Friday-Saturday. We’re monitoring the situation, but it’s possible that the intended fixes from the vendor introduced new problems.

Update 2021-04-19 16:50

Crex still has problems that make it slow. We are still working on the issue together with technicians from DDN.

Update 2021-04-23 13:00

Performance was normal from Monday noon until today.

Around 10:00 today performance started to deteriorate, and Crex is currently very slow. Technicians from DDN are actively investigating.

We understand that this issue is frustrating when trying to work on Rackham/Snowy. This problem has a very high priority for us and we are doing our best to fix it. We have escalated the problem with DDN, and had a meeting which resulted in several action points for how to proceed with the troubleshooting effort.

Update 2021-05-06 17:00

We have not seen this performance problem during the last week.

During the maintenance window yesterday, additional software updates were installed by DDN.

We are monitoring the system, but please contact support@uppmax.uu.se if you notice any file system issues.