Limited access for some projects in Bianca (closed)

We have an ongoing incident with one of the three parts of the Castor storage system. Castor provides project and home storage to Bianca. On Sunday the 24th of May, between 12:45:15 and 12:52:06, three drives in the same RAID set broke. The RAID sets in Castor use standard dual parity (RAID 6), which tolerates up to two broken drives before there is a risk of real data loss. We lost three drives, apparently within seven minutes, so the parity is exhausted and we are seeing on-site data loss. Any I/O to this RAID set (on server castor43) will produce errors, since not all blocks that make up a file can be accessed. As a result, many SNIC-SENS projects have experienced issues when trying to read or write files after 12:52:06 on Sunday the 24th of May.

Losing three genuinely broken drives in the same RAID set is very unlucky, so we are suspicious of castor43 and other dependent storage components. Affected SNIC-SENS projects have been temporarily disabled on Bianca, and the project members are no longer able to log in or use the wharf. The reason we do not allow logins is that the home directories may also be affected. We are currently investigating the issue. If we are unable to recover the RAID set, we will restore project data from our off-site backups.

If you are able to log in to Bianca, you are not affected.
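As a deliberately simplified illustration of why dual parity tolerates exactly two failures (plain integer sums and made-up block values below, not the Galois-field encoding real RAID 6 uses): two independent parity values give two equations, so at most two missing blocks can be solved for, and a third loss leaves the system underdetermined.

    # Simplified sketch only -- real RAID 6 uses Galois-field arithmetic,
    # but the counting argument is the same.
    data = [7, 3, 9, 4, 1, 8, 2, 6, 5, 0]                # ten data blocks (made-up values)
    P = sum(data)                                        # first parity: plain sum
    Q = sum(i * d for i, d in enumerate(data, start=1))  # second parity: weighted sum

    def recover_two(i, j, surviving):
        """Recover lost blocks i and j (1-indexed, i < j) from the survivors and P, Q."""
        s = P - sum(surviving.values())                    # d_i + d_j
        sw = Q - sum(k * v for k, v in surviving.items())  # i*d_i + j*d_j
        d_j = (sw - i * s) // (j - i)
        d_i = s - d_j
        return d_i, d_j

    # Losing two blocks (say numbers 3 and 7) is still recoverable:
    surviving = {k: d for k, d in enumerate(data, start=1) if k not in (3, 7)}
    print(recover_two(3, 7, surviving))                    # -> (9, 2)

    # Losing a third block leaves three unknowns but only two equations,
    # which is the underdetermined situation castor43 ended up in.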

UPPMAX Support Team

Update 2020-06-16 16:00

We are closing this system news item as all projects are now able to log in to Bianca again. Please read our summary for more information.

Update 2020-06-16 10:00

We have reopened Bianca for all affected projects. We luckily did not discover double-overlapping bad sectors and were able to reconstruct a consistent RAID-set. The volumes and the filesystem that consume the RAID-set have been checked and report no issues. We believe all data should thus be up to date, and we did not need to restore from backups. However, you should inspect jobs and output that were running or generated during the time of the drive failures (2020-05-24 12:45:15 to 12:52:06).
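If you want a starting point for that inspection, the minimal sketch below (Python; the project path is hypothetical and local time is assumed for the failure window quoted above) lists files whose modification time falls inside the window:

    import os
    from datetime import datetime

    # Failure window from this report (local time assumed)
    START = datetime(2020, 5, 24, 12, 45, 15).timestamp()
    END = datetime(2020, 5, 24, 12, 52, 6).timestamp()

    def modified_in_window(root):
        """Yield files under `root` whose modification time falls inside the window."""
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                try:
                    mtime = os.path.getmtime(path)
                except OSError:
                    continue  # entries we cannot stat deserve a manual look too
                if START <= mtime <= END:
                    yield path

    # Example (hypothetical project directory):
    # for path in modified_in_window("/proj/sens2020xxx"):
    #     print(path)

Files flagged this way are only candidates; whether a result is actually affected depends on what the job was doing at the time.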

Update 2020-06-15 17:00

We are running our final checks on Castor. So far things are looking good, and if that continues, we expect all affected projects to be able to return to Bianca tomorrow. More information will be posted tomorrow.

Update 2020-06-11 17:00

The copy of the affected parts of the project directories has unfortunately taken longer than expected and is still running. At the current rate we expect to be done around lunch tomorrow, after which we can start preparing Bianca for full production. Looking at the current status, and assuming no more surprises, we expect Bianca to be reopened for all projects by next week.

Update 2020-06-10 17:00

The filesystem checks completed with no issues. We have begun reading back data from the affected project directories, including nobackup. The full copy should be completed by tomorrow morning, and we will then be able to assess how much was recoverable.

Update 2020-06-10 15:00

We have been able to write a new virtual drive back to the storage system and have successfully mounted the filesystem again. This is a good sign that we might be able to recover data. We are running consistency checks on the filesystem now and will update later tonight once the report comes in.

Update 2020-06-08 17:00

We have been able to construct a consistent RAID-set from disk dumps. We are now working on writing the resulting virtual drive back to the storage server. Once this is done, the plan is to check the filesystem, which is striped across two additional virtual drives, and remount it if possible. If all goes well, we will know by the end of tomorrow, or possibly the first part of Wednesday, whether we are able to recover data that was written after the last run of the backups, as well as data, such as nobackup, that we do not back up.

Update 2020-06-04 15:00

Work continues trying to restore data from the drives in the RAID.

Update 2020-06-03 15:00

The status is similar to yesterday. We are working on restoring data from the drives in the RAID, both in a virtual environment and soon in a separate storage system with an identical RAID-controller. At this time we unfortunately do not expect to have Bianca open for all projects this week. We are sorry for this inconvenience and are working as fast as possible to recover and get all projects back into production.

Update 2020-06-02 09:00

Work continues. Projects related to the SIMPLER infrastructure are now able to log in to Bianca again. Please note, however, that we may need to disable access again without prior notice.

Updates are lagging behind as we prioritize retrieving data and getting Castor back into production.

Update 2020-06-01 17:00

We have read back data from the backups and have it readily available if we are unable to recover files from the RAID-set. Much of the time spent so far has been on checking drives for bad sectors and creating an image of each of the 12 drives in the RAID. Today we completed the last image, and tomorrow we will be able to work on a bit-for-bit copy (excluding unreadable sectors) of the RAID.
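For the curious, the minimal sketch below illustrates the idea behind such an image (a simplified illustration only; in practice dedicated imaging tools are used, and the device and output paths here are hypothetical). It copies the device block by block, zero-fills blocks that cannot be read, and records their offsets:

    BLOCK = 4096  # copy granularity; real tools work at sector level

    def image_drive(src, dst, badlog):
        """Copy `src` to `dst` block by block, writing zeros where reads
        fail and logging the offsets of unreadable blocks to `badlog`."""
        with open(src, "rb", buffering=0) as fin, \
             open(dst, "wb") as fout, \
             open(badlog, "w") as log:
            offset = 0
            while True:
                fin.seek(offset)
                try:
                    chunk = fin.read(BLOCK)
                except OSError:          # unreadable sectors raise I/O errors
                    chunk = None
                if chunk is None:
                    fout.write(b"\x00" * BLOCK)   # placeholder for lost data
                    log.write("bad block at offset %d\n" % offset)
                    offset += BLOCK
                    continue
                if not chunk:            # end of device
                    break
                fout.write(chunk)
                offset += len(chunk)

    # Example (hypothetical paths, run with sufficient privileges):
    # image_drive("/dev/sdX", "/scratch/sdX.img", "/scratch/sdX.badblocks")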

We have had many failed drives of the same model this spring. We suspect that this particular model, or possibly this batch, is not working according to the vendor's specification. Starting today we are working together with the vendor on analyzing the most recently failed drives (all of which are of the same model).

Update 2020-06-01 15:00

Work continues. We will summarize the days work at 17:00.

Update 2020-06-01 12:00

Work continues.

Update 2020-06-01 09:00

Work continues. The restore from backups is almost completed. We will continue to investigate if we can salvage the RAID-set.

Update 2020-05-29 15:00

This will be the final update this week. Work will resume on Monday. It is still not possible to give an accurate estimate of when Bianca can be opened again for all projects; however, we will be in a better position to answer this question once the restore from backups completes (expected on Sunday).

Update 2020-05-29 12:00

Work continues.

Update 2020-05-29 09:00

Work continues. A restore from backups was started last night and is expected to take several days to complete. We are examining the original RAID-set in a new storage system to see if it can be recovered. Replacement drives have been delivered by the vendor.

Update 2020-05-28 17:00

Work continues tomorrow.

Update 2020-05-28 15:00

Work continues restoring from backups and examining the failed drives. At this time we do not expect to have Castor open for the affected projects this week.

Update 2020-05-28 12:00

Work continues.

Update 2020-05-28 09:00

Work continues.

Update 2020-05-27 17:00

Work continues tomorrow.

Update 2020-05-27 15:00

Work continues. Additional replacement drives have been sent by the vendor and are expected to arrive tomorrow. Once they arrive we will be able to replace all failed drives.

Update 2020-05-27 12:00

The failed drives have been dumped and examined. One of the three failed drives shows less damage than the others, and we might be able to use this drive to get the RAID-set back without resorting to creating a new one. This is one track being explored alongside restoring from backups. The reason we would prefer not to restore from backups is time: the backups are located at PDC and stored on tape, and reading back non-sequential data can take a long time.

Update 2020-05-27 09:00

Work continues.

Update 2020-05-26 16:00

We are able to repurpose existing non-critical drives in Castor while we wait for replacement drives to be sent from the vendor. The broken drives are now being dumped and examined in a separate SENS-system. So far we have no indications that there is a problem with the storage server, backplane, SATA-controller, etc. that would allow us to recover the RAID set by replacing hardware. We are working in parallel to restore from backups. The recovery process will most likely require us to migrate a few active projects from the affected parts of Castor. The project members will be notified about this move. Once we start reading back from the backups we will be able to provide an estimated time of recovery.

Update 2020-05-26 12:00

Our vendor is preparing to send replacement drives. We always keep replacement drives on site, but this time we only have two available, so we need at least one more to be able to start a rebuild. We expect the drives to be delivered within a day.

Update 2020-05-26 09:00

Work in progress.

Update 2020-05-25 17:00

We have checked logs from the controller and SMART data from the broken drives. As it looks now, the drives themselves, and not anything else, are broken. We can see many unreadable, broken sectors. This suggests that the storage server castor43 is OK, which, contrary to what one might think, is not good news, as it implies that the RAID set cannot be recovered. We have reached out to the drive vendor to confirm our conclusions. Tomorrow we will start to prepare for off-site backup recovery. Next update tomorrow at 09:00.
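As an illustration of the kind of check involved, the minimal sketch below (Python; the device path is hypothetical, and which attributes matter can vary by drive model) reads the output of smartctl -A and picks out attributes that commonly indicate failing media:

    import subprocess

    # SMART attributes that commonly indicate failing media
    SUSPECT_ATTRS = {"Reallocated_Sector_Ct",
                     "Current_Pending_Sector",
                     "Offline_Uncorrectable"}

    def media_health(device):
        """Return the RAW_VALUE of suspect SMART attributes for `device`."""
        out = subprocess.run(["smartctl", "-A", device],
                             capture_output=True, text=True).stdout
        values = {}
        for line in out.splitlines():
            fields = line.split()
            # attribute rows have at least 10 columns; RAW_VALUE is the tenth
            if len(fields) >= 10 and fields[1] in SUSPECT_ATTRS:
                values[fields[1]] = fields[9]
        return values

    # Example (hypothetical device path, typically run as root):
    # print(media_health("/dev/sdX"))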

Update 2020-05-25 14:15

We are working on this issue and have restricted logins for the affected project members. The queues for these projects have also been stopped. If you are able to log in to Bianca, you should not be affected by this issue.

Update 2020-05-25 09:00

Investigation begins.