Intermittent I/O-errors on Rackham and Snowy closed

The project storage system attached to Rackham and Snowy is unfortunately still having problems. For most users this will show up as degraded performance when reading from and writing to /proj and /data; for example, changing into your project directory with cd and listing your files with ls may take longer than usual. Reads and writes in the project directories may also fail with input/output errors, as seen below:

rsync: readlink_stat("/crex/proj/snic2019-X-YY/archive/publicityimgs/636_E11_x.tif") failed: Input/output error (5)
rsync: readlink_stat("/crex/proj/snic2019-X-YY/archive/testimg/1000_A1_1_features.csv") failed: Cannot send after transport endpoint shutdown (108)

At the moment the errors do not affect all read/write operations, so your job or interactive session may not be affected. Until the problem is fixed, however, we urge all users to be extra cautious and to double-check their output files for errors.
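If you want to check whether files in a directory can currently be read, one simple (if slow) approach is to force a full read of every file and watch for read errors. The sketch below is only an illustration, not an official tool, and uses the placeholder project path /proj/snic2019-X-YY, which you would replace with your own project directory:

# Minimal sketch: read every file under the project directory and report any that cannot be read.
# Replace /proj/snic2019-X-YY with your own project path.
find /proj/snic2019-X-YY -type f -exec sh -c 'cat "$1" > /dev/null || echo "READ ERROR: $1"' _ {} \;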

The first signs of the problem appeared on Snowy yesterday at 16:15 (January 28th).

Update 2020-01-30 15:00

Work is progressing together with the vendor. A loss of network connectivity at Uppsala University prevented this update from being published earlier.

Update 2020-01-31 11:15

Today and yesterday we have continued to see significant intermittent performance issues with the storage system. We have now begun upgrading the storage routers' HCAs to newer firmware, and we have tuned the system to reduce the load on the metadata servers. Parts of the system will be restarted during the day. The queues have been stopped for now while we perform the upgrade.

Update 2020-02-06 09:00

During the February maintenance window we replaced hardware in two of our storage routers. The Lustre client on the remaining Rackham and Snowy compute nodes was also updated to the latest version provided by our vendor, and the cluster fabric switch firmware was updated (as the fabric connects to Crex). Finally, we restarted Crex to make sure the backend services started cleanly. Despite these efforts, we are unfortunately not very confident that this solves the problem, as the root cause has not yet been identified by us or by the vendor. Our investigation has revealed several severe Lustre bugs that are not fixed in the version of Lustre we are running, so we have escalated our support case with the vendor and asked for an upgrade plan for the backend and the clients. An upgrade will most likely require additional downtime. More information will be provided.

At this particular moment we have no reports of issues with Crex, and the queues have been running since yesterday. This issue will receive further updates and will remain open until we can confirm that the problem has been solved.

Update 2020-02-07 16:00

So far so good. There have been no signs or reports of performance issues since the service day, the internal logs report nothing unusual, and the storage load is normal. The vendor is preparing updated Lustre packages, which we expect to receive next week. If this good behaviour continues we aim to upgrade during a service window.

Please let us know if you have any questions at support@uppmax.uu.se.

Update 2020-02-10 09:00

We have not had any errors since the maintenance day last week, when we upgraded the storage client software and replaced hardware in two Crex storage routers. Performance is back to normal. If you have any issues you wish to report, please contact support at support@uppmax.uu.se. The investigation will remain open for a while, as we are planning further upgrades in cooperation with the vendor for increased stability and (metadata) performance.

Update 2020-02-11 12:30

We are again experiencing high load on the metadata servers. This will negatively impact performance. We are working on this issue.

Update: The load was caused by jobs that overloaded the storage system. The performance recovered after the jobs were stopped.

Update 2020-02-14 13:30

The storage system has not had any internal issues since last week that should have had a significant impact on performance or stability. The vendor is still preparing an upgraded Lustre client (2.12.4).

Update 2020-02-19 14:00

The project storage system remains healthy. Further upgrades will be scheduled for an upcoming maintenance day.