Cooling failure closed
We had a significant loss of the central cooling starting after 3 PM. Akademiska hus were notified at 4 PM. At this point, we don’t know how bad it will get. It is possible that nodes will turn off automatically in order to avoid overheating. If the problem persists, we will be forced to turn off the storage systems as well.
At this moment, all user-facing services are functioning normally, but that will soon change unless cooling returns. There is nothing UPPMAX staff can do to accelerate this.
Update: 2023-11-28 11:00
All systems, including Bianca, are now back in production.
Update: 2023-11-25 16:26
Most of Bianca has now shut down automatically to decrease the heat output. More systems will do so as the temperature rises.
Update: 2023-11-25 17:10
After on-site presence from the university physical security contractor, Akademiska hus, UPPMAX, and our datacenter facility contractor, temperatures have now stabilized and are slowly coming down. We might not get all Bianca nodes back online until normal office hours.
No storage systems have been affected and we hope that the situation will not worsen.
Update: 2023-11-25 18:20
A malfunctioning valve in the central cooling system, related to heat recovery, has finally been identified. The flow of coolant water was severely limited, which caused our equipment to overheat.
Bianca nodes have been turned back on, but we might need to make some additional efforts during the coming week to bring them back to normal function.
Update: 2023-11-27 08:28
More careful checks indicate that parts of Snowy were also turned off due to overheating. Miarka was turned off to the extent that it became inaccessible. In addition to Bianca compute nodes, some storage was also turned off due to the heat. This means that it can take longer to bring Bianca back to full operation. Don’t expect Bianca to be fully back on Monday.