UPPMAX power failure closed
At 00:57 CEST on Monday, May 29th, a power outage caused the cooling system at Ångström Laboratory to shut down, leading to a rapid increase in temperature within the compute hall. To prevent further temperature escalation and safeguard the equipment, all systems in the compute hall were forcibly powered off. The cooling system was restored at approximately 05:00.
Due to the elevated temperatures experienced during the outage, additional inspections are required to ensure that the compute hall and the compute, storage, and network hardware are functioning as expected. So far, we have identified an issue with one of the two UPS units.
Throughout the day, we will provide regular updates regarding the progress of the recovery efforts and the status of the affected equipment. We are working diligently to resolve any issues and restore normal operations as soon as possible.
Please check back for further updates.
Update 2023-05-30 15:00
Bianca is back in production.
Update 2023-05-30 11:00
Miarka has returned to production. We are making progress on Bianca and expect it to return to production soon.
Update 2023-05-30 08:00
Rackham and Snowy returned to production yesterday evening. We are making progress on Bianca and expect it to return to production soon.
Update 2023-05-29 16:30
UPPMAX Cloud and SSC EAST-1 have returned to production. Bianca will most likely return to production tomorrow morning. Rackham, Snowy and Miarka will likely return to production in part later today.
Update 2023-05-29 16:00
Rackham Login Nodes: The login nodes for Rackham are now accessible again. We are closely monitoring the system’s stability, and once we are confident that everything is functioning as expected, we will release the queues for user access.
Miarka Login Nodes: The login nodes for Miarka will be made available once the storage system has been updated.
Bianca, UPPMAX Cloud, and SSC EAST-1: The team is currently working on resolving any remaining issues with Bianca, the UPPMAX Cloud, and SSC EAST-1. Thus far, no significant issues have been identified, which is encouraging news.
Update 2023-05-29 11:00
The compute hall is fully operational again. We are now working on restoring systems.