Issues with Crex (/proj, /sw, /sw/data) on Rackham and Snowy closed

Crex, the project and software storage for the Rackham and Snowy clusters, became partially unresponsive on the afternoon of Friday, 11th of November, and has continued to have issues over the weekend of 11-13 November. We have identified and removed a couple of bad links in the storage fabric, so the problem may be alleviated somewhat, but normal production status is unlikely to return before Monday, 14th of November.

Running processes may have been affected. Jobs may have to be restarted.

We apologize for this inconvenience.

Update 2022-11-13 20:15

Two faulty network cables that were contributing incorrect network packets have been isolated. After a reboot of Crex, the system is back online. We still see intermittent slowness, which was also present earlier in the week. We believe these two issues are only partially related; any intermittent issue can grow larger under high load. The job queues are open again, but jobs that were running during the weekend might have failed due to the problems.

We will continue troubleshooting this with the vendor in the coming week since the system is still in an unexpected state.