Crex - Transport endpoint shutdown (closed)

Some users see the following error when accessing files under /proj on Rackham and Snowy. It is caused by a problem with the Lustre file system on Crex, the project storage system.

$ cd /proj/<my_project_directory>
$ ls -l
ls: cannot access <file>: Cannot send after transport endpoint shutdown
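A quick way to check whether a directory shows this symptom is to look for the error text in the output of ls. The following is only a minimal sketch: the function name check_endpoint_shutdown is hypothetical, and it assumes the error message is printed exactly as shown above.

```shell
# Hypothetical helper (not an official tool): print any "transport endpoint
# shutdown" errors that ls reports for the given directory.
# Exit status is 0 if at least one such error was found, non-zero otherwise.
check_endpoint_shutdown() {
    # LC_ALL=C keeps the message in English so the pattern matches.
    # "2>&1 >/dev/null" sends only the error stream into the pipe.
    LC_ALL=C ls -l "${1:-.}" 2>&1 >/dev/null |
        grep 'Cannot send after transport endpoint shutdown'
}

# Example (run from a login node, with your own project directory as argument):
#   check_endpoint_shutdown /proj/my_project_directory && echo "affected"
```

No output and a non-zero exit status mean that ls did not report this particular error for the directory.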

We are investigating this issue.

Update 2021-08-30 16:30

The problem only affects the login node rackham2. We will restart rackham2 to try to resolve the problem.

Update 2021-08-30 23:30

Restarting rackham2 fixed the problem.

The problem on rackham2 started on Friday evening, 2021-08-27T20:37:06. Compute nodes and Rackham’s other login nodes were not affected.

Technical details: rackham2 lost its connection to crex-OST0036, one of 84 OSTs (Object Storage Targets) in the file system. The Lustre client failed to reconnect and got stuck in an error state. Any attempt to read, write or stat a file striped to this OST resulted in errors.
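The OST name can be turned into a numeric index, which helps when comparing it with per-file stripe information. The following is a minimal sketch, assuming the usual Lustre convention that the four characters after "OST" are a hexadecimal index. The function name ost_index is hypothetical.

```shell
# Hypothetical helper: convert an OST name such as "crex-OST0036"
# (optionally with a "_UUID" suffix) into its decimal index.
# Assumption: the digits after "OST" are hexadecimal.
ost_index() {
    hex="${1##*OST}"     # "crex-OST0036_UUID" -> "0036_UUID"
    hex="${hex%%_*}"     # "0036_UUID" -> "0036"
    printf '%d\n' "0x${hex}"
}

ost_index crex-OST0036   # prints 54
```

If the Lustre lfs tool is available on the node, lfs getstripe <file> lists the OST indices a file is striped over, and those can be compared with this value. Whether that command itself works on a client stuck in the error state is not covered by the details above.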