Rackham problem with login nodes closed
Two of the three login nodes for Rackham have had problems mounting the home directories from the file server Gorilla during the day. They were evicted from the file server due to too many reconnects. We have restarted them and changed settings on the file server to reduce the risk of this happening again. We still do not know the root cause, but our guess is that we notice it on the login nodes because they are heavily used by many users.
The address rackham.uppmax.uu.se points to all three login nodes using round-robin DNS. Ideally, this means that a connecting client rotates between them, so if one is not working, the next connection attempt should go to one that works. However, because of how DNS is cached, you will most probably get the same server the next time you connect.
So if rackham.uppmax.uu.se is not working for you, you can try connecting to rackham1.uppmax.uu.se, rackham2.uppmax.uu.se or rackham3.uppmax.uu.se instead while we are trying to solve this.
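For those who prefer to script the "try each server" step, the following is a minimal sketch, not an official UPPMAX tool. It assumes that SSH listens on the default port 22, and the function name first_reachable is made up for this illustration. It only checks that a node accepts a TCP connection; it cannot tell whether the home directory is mounted correctly on that node.

```python
import socket

# Hostnames of the individual login nodes, as listed in this announcement.
LOGIN_NODES = [
    "rackham1.uppmax.uu.se",
    "rackham2.uppmax.uu.se",
    "rackham3.uppmax.uu.se",
]


def first_reachable(hosts, port=22, timeout=5.0, connect=socket.create_connection):
    """Return the first host that accepts a TCP connection, or None.

    port=22 is an assumption (default SSH port). The connect argument
    exists only so the function can be exercised without a network.
    """
    for host in hosts:
        try:
            conn = connect((host, port), timeout=timeout)
        except OSError:
            # Connection refused, timed out, or name lookup failed:
            # move on to the next login node.
            continue
        conn.close()
        return host
    return None
```

Calling first_reachable(LOGIN_NODES) returns the name of the first node that answers, which can then be used directly in the ssh command instead of rackham.uppmax.uu.se.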
We are sorry for the trouble this is causing.
Update 2025-12-05 15:45
Everything is still working fine but we keep our eyes open.
Update 2025-12-05 16:30
We can see the problem again: a client is reconnecting, which stalls the file system. However, the client is no longer being evicted, which is good. So not everything is working as we want, but the problem resolves itself after some delay.
Now we hope things will work fine during the weekend. Remember to try all the servers if one of them is having problems.
Update 2025-12-15 10:30
The login nodes have been stable since we fine-tuned the configuration. We still need some more fine-tuning, but it is stable.