Crex is slow (closed)

Crex, the main project file system for the clusters Rackham and Snowy, has been running slow this afternoon. We are trying to identify the jobs causing this. The slowness may also be a side effect of the file system getting full.

We will continue our work on removing old, expired projects.
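
For context, the fill level of a Lustre file system such as Crex can be checked from any client node; a minimal sketch, assuming the mount point is /crex (the path is an illustrative assumption):

    # Show space usage per MDT/OST in human-readable form
    lfs df -h /crex
    # Show inode (file count) usage, which also matters for a "full" file system
    lfs df -i /crex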

Update 2021-10-11 10:00

We replaced one of the controllers on Crex last week. During the past week, with input from DDN (the vendor), we also tuned down two Lustre parameters (max_rpcs_in_flight and max_dirty_mb), which may decrease performance for a single job/node. Hopefully this will give Crex more stable performance. We may, with input from the vendor, change this tuning again. If this affects your performance in any noticeable way, please let us know.
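
For reference, these are standard Lustre client tunables that can be inspected with lctl; a minimal sketch, assuming a typical Lustre client setup (the values shown are illustrative, not the ones applied on Crex):

    # View the current per-OST settings on a client node
    lctl get_param osc.*.max_rpcs_in_flight osc.*.max_dirty_mb
    # Tuning them down (illustrative values only; set_param requires root)
    lctl set_param osc.*.max_rpcs_in_flight=8
    lctl set_param osc.*.max_dirty_mb=32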

Update 2021-10-01 14:30

Starting a few hours before lunch today, we saw degraded performance again. We will continue to work with the vendor to identify the cause.

Update 2021-09-30 23:30

The technician from DDN determined that the performance degradation was related to one of the two redundant hardware storage controllers.

At approximately 22:00 all file system load was moved away from the suspect controller and onto the other one, and after this the performance recovered.

Troubleshooting of the controller will continue tomorrow, but for now the system seems to be stable with normal performance.

Update 2021-09-30 16:43

After filing a support case with our vendor, we tried the same reboot procedure that was successful last time. This time it did not alleviate most of the load. A representative from our vendor is now actively monitoring the system.

We urge all users to avoid jobs that do random reads from large reference files directly on Crex, as outlined in the main text of the issue above. Try using scratch or vmtouch instead.
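
As an example of what that can look like in a batch job, the reference file can be staged to node-local scratch once, or pre-loaded into the node's page cache with vmtouch; a minimal sketch (the $SNIC_TMP scratch variable, the paths and the tool name are illustrative assumptions):

    # Copy the reference to node-local scratch at the start of the job,
    # then let the analysis read it from there instead of from Crex
    cp /proj/myproject/refs/reference.fa "$SNIC_TMP/"
    mytool --reference "$SNIC_TMP/reference.fa" ...

    # Alternative: pre-load ("touch") the file into the page cache so that
    # subsequent random reads are served from memory rather than from Crex
    vmtouch -t /proj/myproject/refs/reference.fa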

Update 2021-09-29 13:00

Crex is intermittently running slow again. We are investigating.

Update 2021-09-07 13:30

Performance on Crex has been stable after the reboots. We are closing this ticket.

Update 2021-09-04 08:21

Crex was slow during most of Friday. We explored the quota hypothesis, but we could not confirm that any specific project repeatedly hitting its quota was the cause. Still, we saw very high load in the quota subsystem on the storage servers.

Thus, we reported the issue to the vendor and started a controlled reboot of the affected storage servers. This involved a graceful transition of all storage to another server while each server was rebooting. The process was started around 15:40 and completed around 17:30. No I/O should have failed during this time, but waiting times (already long due to the issues) might have gotten longer for a while.

When the final reboot was complete, the load subsided. We no longer see the high number of quota-related requests. Our own metrics also indicate that other general operations on Crex are back to normal speed. Our case with the vendor remains open so that we can do a proper post-mortem and see if there is an underlying issue that should be addressed. This scenario was different from the previous periods of unresponsiveness we have encountered.

Update 2021-09-03 08:40

Crex is still running slow and we are working on it. We have stopped a job that was creating a large number of small files, which generated a lot of metadata load. We are also investigating whether quota overruns may cause high metadata load.
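
If you want to check whether your own project is close to its quota, the standard Lustre quota command shows usage per group; a minimal sketch (the project group name and mount point are illustrative assumptions):

    # Show block and inode usage for a project group in human-readable form
    lfs quota -h -g snic2021-x-yyy /crex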