Bianca network issues closed
The Bianca cluster is currently experiencing network issues resulting in failures to start login nodes and slow and intermittent access to storage. Troubleshooting is ongoing.
Update 2025-12-04 16:30
We have not experienced any problems so far after the restart of all old Bianca login nodes and the modified settings for the Wharf VM, but we are keeping an eye on this. We are closing the ticket for now.
Update 2025-12-01 16:30
The shutdown of the 43 old login nodes that had been running since before last Thursday is now in progress. The 60 login nodes that have been started since Thursday are kept running. Login nodes that have jobs running on compute nodes will be started up again.
Update 2025-12-01 10:28
It seems that compute and login nodes that have been running since before the file system issues began keep having intermittent errors. These errors have knock-on effects that also affect Wharf access, among other things.
For these reasons, we will actively reboot all login nodes that have been running since last week. We have already had to do that manually for several projects that reported problems. This means that you can lose ongoing interactive sessions. We apologize for that.
Compute nodes that have been running for a long time will be “drained”. This means that they will also be rebooted once any jobs currently running on them have completed. The intent is to avoid prematurely killing any ongoing job.
Hopefully, this will resolve all lingering problems. Thank you for your patience.
Update 2025-11-28 16:30
Long explanation (short version below):
Earlier this week we saw Cygnus misbehaving, and traced it to an InfiniBand (IB) card causing errors in one of the host machines running the virtual IB-Ethernet routers for the Lustre file system. Cygnus is the file system for Bianca. The faulty IB card disrupted the entire IB fabric, which affected storage for all Cygnus users — essentially all of Bianca. We resolved the immediate issue by restarting that server, and things looked stable afterwards.
However, the IB-fabric problems seem to have caused issues for already-running machines that had Cygnus mounted. We didn’t notice this right away, since in our initial tests all virtual machines in Bianca were able to mount Cygnus. Some directories were still not working, likely because certain clients had stopped communicating with some of the Cygnus servers for unknown reasons, causing I/O to stall, or because of some other problem with the state of the Lustre file system client.
We also discovered that one of the Bianca servers, Wharf, reacted poorly to the long timeouts and ended up blocking all users, even those without a problematic directory.
We are now restarting all affected clients as we discover them. We have added options to the VM running Wharf to make it less susceptible to blocking and, last but not least, we are in the process of restarting the entire Cygnus file system.
TL;DR: We are in the process of resolving the problems affecting Bianca and Cygnus. If you are experiencing problems, please report back to us. We may need to restart your virtual Bianca cluster.
Update 2025-11-27 12:17
Wharf is still experiencing issues; troubleshooting is ongoing.
Update 2025-11-27 08:35
Wharf is up and running and being monitored.
Update 2025-11-26 16:11
Unfortunately, Wharf is still having issues.
Update 2025-11-26 15:25
Wharf and file transfers are operational again.
Update 2025-11-26 14:10
Wharf and file transfers are not functional at the moment.
Update 2025-11-26 13:00
File transfers are still having issues.
Update 2025-11-26 10:00
File transfers are functional. Some compute nodes are still having issues accessing storage.
Update 2025-11-26 08:15
The issues with login nodes and storage have largely been resolved but will be monitored. File transfers and Wharf are currently not functional.