Problems with Bianca and Cygnus closed
The storage system for Bianca is currently down. We are troubleshooting together with the storage system vendor.
Update 2025-05-23 18:15
Short story: Cygnus is up and running again now. Bianca should then be working again. But let us know if it is not!
Longer story: One physical component in Cygnus failed. With help from the vendor, we have now moved the services from the broken component to another part of Cygnus. Cygnus is now up and running again with reduced redundancy. Bianca should start working again once it has access to the file system. But please let us know if things are not working as expected for you!
Update 2025-05-23 19:00
The system has stopped working again. We will resume troubleshooting, but the system will probably be unavailable during the weekend.
Update 2025-05-26 13:00
We have been working with the vendor this morning to get Cygnus up and stable again.
Update 2025-05-26 17:00
We continue to work with the vendor.
Update 2025-05-26 18:15
Cygnus is up and running for projects that are already running on Bianca. We are having problems starting up new nodes, which we will look into. Tomorrow we will hopefully receive replacement parts and continue working with the vendor.
Update 2025-05-27 11:00
Starting new nodes on Bianca still does not work. We are working with the vendor. Replacement parts will hopefully arrive today.
Update 2025-05-28 09:00
The replacement part has arrived and was physically installed yesterday afternoon. The services are now running where they should, with full redundancy. However, when mounting the file system on the Bianca virtual cluster clients, something is still timing out somewhere, even though it is possible to mount. After restarting the InfiniBand subnet manager last night, there seems to be an improvement. We are now testing various things to confirm this.
Update 2025-05-28 10:30
Startup of login nodes on Bianca is now back to 10 minutes and seems to be stable. Not good, but not bad either.
Update 2025-05-28 17:30
We believe that most, if not all, projects are working as usual on Bianca. Please use Bianca again, and let us know if things are not working as expected.
Update 2025-06-02 16:00
We are still seeing problems starting up nodes on Bianca. We are working on it.
Update 2025-06-03 15:00
As far as we can tell, the problems in Cygnus have been resolved.
We do not know whether it was related, but during the weekend two of the LNet routers for Cygnus started to misbehave in relation to InfiniBand. Since these were the ones with the fewest routers running, they got all the new routers, which prevented new clusters from starting as they should. Once we had disabled them for investigation, new clusters could start again.
Update 2025-06-04 17:30
This ticket will be closed, since we have begun an upgrade of Cygnus in the service window for June, shutting Bianca and Cygnus down. As far as we could tell earlier, the problems with Bianca and Cygnus were resolved yesterday.