Issues with Crex (file system for Rackham) closed

Users may experience slow access to project directories and files on Rackham.

Some users have reported that commands such as “ls” or “ll” take a long time to complete, and that in some cases they cannot read files. So far the problem has been hard to reproduce consistently, but we are working on it and hope to isolate the cause soon.

Only the filesystem that houses the project directories is affected. Home directories are not affected, and users should be able to log in and access their files there. GUI programs like MobaXterm and WinSCP will be slow or may even crash when their connection is set up to access project directories.

Final ticket report

The main issues seem to have been resolved since at least 12:00 on 2019-08-28. Some Rackham compute nodes had issues that lingered until the evening.

Update 2019-08-29 08:00

The problem has been linked to an unscheduled, automatic reboot of one of the storage servers. The storage server became unresponsive after 18:00 on Tuesday and lost contact with the storage cluster, and our cluster manager software decided it needed a reboot. Once the server came back up, the problem appears to have persisted, and I/O performance did not return to normal until we manually lowered the load. What caused the server to become unresponsive is not yet clear. At this moment we believe the most likely cause is a very high load on specific parts of the filesystem.