Castor is back in production after RAID issues - information for those affected
On Sunday 24th of May, between 12:45:15 and 12:52:06, three drives in the same RAID set in Castor failed. While investigating, we also discovered a fourth drive that had bad sectors but had not yet been automatically removed from the array. The drive failures caused the disk array to go offline, which made parts of the project and home filesystem inaccessible. This affected many Bianca projects. We canceled any running jobs and restricted access for the affected projects between 24th of May and 16th of June while we worked to bring the filesystem back online. Castor was put back in production on June 16th, and shortly afterwards we re-opened Bianca for all projects. Most projects in Bianca have been unaffected by this, although you will likely have seen the system news about the incident.
We have repaired the RAID set and checked the volumes and filesystem that make use of it. We have seen no signs of data corruption; however, any I/O performed during the 7 minutes in which the drives failed (at 12:45:15, 12:45:33 and 12:52:06) was at an increased risk of failing. After the third drive failed, the disk array went offline and did not accept any more I/O. If you were working on the system at this time or had Slurm jobs running, please double-check your jobs and output.
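As a rough aid for deciding which jobs to double-check, the sketch below tests whether a job's run time overlaps the failure window given above. It is only an illustration: the function name `overlaps_incident` is hypothetical, the year is passed in as a parameter, and it assumes you have already obtained each job's start and end time yourself (for example from Slurm accounting with something like `sacct --format=JobID,Start,End,State`).

```python
from datetime import datetime


def overlaps_incident(job_start, job_end, year):
    """Return True if a job ran at any point during the drive-failure window.

    The window runs from the first drive failure (12:45:15) to the third
    (12:52:06) on 24 May. `year` is supplied by the caller.
    """
    window_start = datetime(year, 5, 24, 12, 45, 15)
    window_end = datetime(year, 5, 24, 12, 52, 6)
    # Two closed intervals overlap when each one starts before the other ends.
    return job_start <= window_end and job_end >= window_start
```

A job flagged by this check is not necessarily damaged. It only means the job was running while I/O was at risk, so its output deserves a closer look.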
We have also discovered that the wharf remained operational between 24th of May and 16th of June. If you uploaded data between these dates, please verify that the upload completed with no errors. On our end, we have seen no signs of errors during transfers.
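One possible way to verify an upload is to compare a checksum of the original file with a checksum of the uploaded copy. The sketch below is a minimal example and not an official UPPMAX procedure: the helper name `sha256_of_file` is hypothetical, and it assumes you can run Python both where the original file lives and where the uploaded copy ended up.

```python
import hashlib


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of the file at `path`.

    The file is read in chunks so that large files do not have to fit
    in memory.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

If the digest computed on the original file matches the digest computed on the uploaded copy, the two files have the same content. A mismatch suggests the file should be uploaded again.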
This spring, UPPMAX has seen an increased number of drive failures in Castor involving a particular brand and model, and we have asked the drive vendor for assistance in analyzing the failures. The same brand and model of drive was involved in this incident, and if it turns out that the drives are not up to specification, we will work to have them replaced.
If you have any questions, please contact us at support@uppmax.uu.se.
UPPMAX Support Team