Status of Pelle
What is the status of Pelle?
TL;DR: We are in the process of installing and configuring Gorilla and Pelle. No major obstacles so far.
Update on Wednesday 2025-04-09
All hardware is on site. We are installing the Gorilla file system, which runs CephFS from the Red Hat Storage distribution. This is the same file system that we run on the Vulpes file server for Miarka, so we have previous experience with it. However, we do our own installation; it is not a turnkey solution. We are now testing the file system. During this testing we have discovered things about the hardware that we were not aware of before and that had to be fixed, for example how CRC is handled in the switches and the hardware. During the process leading up to our acceptance of the hardware, it became obvious that this combination of hardware was the first of its kind that the vendor had set up. That is of course not unusual for new hardware, but all the small things take time to fix.
This may be a curiosity, but one new thing for us is that the hard drives have multiple actuators. Half of each drive is read by the first actuator and the other half by the second. They behave like two hard drives bolted together. We are testing striping across these two halves in order to get the full performance out of the drives.
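To illustrate the idea of striping across the two halves, here is a minimal sketch of a RAID-0 style address mapping. It is not our actual configuration: the function name `stripe_location`, the 1 MiB stripe size and the layout are assumptions made purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Location:
    actuator: int  # which half of the drive (0 or 1)
    offset: int    # byte offset within that half


def stripe_location(logical_offset: int,
                    stripe_size: int = 1 << 20,  # assumed 1 MiB stripes
                    actuators: int = 2) -> Location:
    """Map a logical byte offset to (actuator, offset) in round-robin stripes."""
    if logical_offset < 0:
        raise ValueError("logical_offset must be non-negative")
    # Which stripe the offset falls in, and how far into that stripe it is.
    stripe_index, within = divmod(logical_offset, stripe_size)
    # Consecutive stripes alternate between the actuators.
    actuator = stripe_index % actuators
    # Each actuator stores every `actuators`-th stripe back to back.
    row = stripe_index // actuators
    return Location(actuator, row * stripe_size + within)
```

With this mapping, a large sequential read touches both halves alternately, so both actuators can work at the same time instead of one sitting idle.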
The file systems for Rackham (called Crex) and Bianca (called Cygnus) use Lustre. Our existing file system for Miarka (called Vulpes) and the new one for Pelle (called Gorilla) use CephFS. When we synced the data over from Lupus (also running Lustre) to Vulpes, we felt that Lustre gave better metadata performance than CephFS. We feel it is important to test the metadata performance of Gorilla before it is put into production, in order to find things we can tune. Using the NVMe drives in an intelligent way can help with this.
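As a rough illustration of what a metadata test measures, the sketch below times file creation, `stat` and removal of many empty files in one directory. It is a simplified stand-in and not the tooling we use; dedicated benchmarks such as mdtest do this in parallel from many clients. The function names and the file count are assumptions for the example.

```python
import os
import tempfile
import time


def metadata_benchmark(directory: str, n_files: int = 1000) -> dict:
    """Return operations per second for create, stat and unlink."""
    paths = [os.path.join(directory, f"f{i:06d}") for i in range(n_files)]
    timings = {}

    start = time.perf_counter()
    for path in paths:
        # "x" creates the file and fails if it already exists.
        with open(path, "x"):
            pass
    timings["create"] = time.perf_counter() - start

    start = time.perf_counter()
    for path in paths:
        os.stat(path)
    timings["stat"] = time.perf_counter() - start

    start = time.perf_counter()
    for path in paths:
        os.unlink(path)
    timings["unlink"] = time.perf_counter() - start

    # Guard against a zero elapsed time on very fast file systems.
    return {op: n_files / max(t, 1e-9) for op, t in timings.items()}


def run_in_temp_dir(parent: str, n_files: int = 1000) -> dict:
    """Run the benchmark in a scratch directory created under `parent`."""
    with tempfile.TemporaryDirectory(dir=parent) as scratch:
        return metadata_benchmark(scratch, n_files)
```

Calling `run_in_temp_dir` with a directory on the file system under test gives a first single-client impression; the numbers depend heavily on caching and on how many clients run at once.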
We have installed a rudimentary setup of the Rocky 9 Linux distribution on Pelle. Compared to the Rackham nodes, the Pelle nodes have quite a lot of cores and RAM. We plan to run virtual login nodes on top of the physical login nodes. This way we can also run some other cluster-related services on the physical login nodes.
We are setting up LNet routers between the Pelle Ethernet network and the Crex/Rackham InfiniBand network. We plan to mount Crex on Pelle initially to give our users a seamless migration, since that lets us migrate project by project to the new file system. Crex will be mounted in a similar way to how Bianca mounts Cygnus: the file server sends the data by LNet over InfiniBand to the LNet router, which forwards it by LNet over Ethernet to the cluster nodes.
During the autumn we started, but did not complete, an upgrade of Rackham to Rocky 9. Everything takes longer than expected, and we decided to continue running Scientific Linux 7 on Rackham until we shut it down. We let small parts of Rackham run Rocky 9, which allowed us to do some preparations. Pelle will run only Rocky 9. We are glad now that we started this work back then.
Currently only UPPMAX system and application experts are let in to Pelle. As soon as we have the rudimentary setup with Domus (for home directories), Gorilla and Crex (with project directories) mounted, we hope to let interested users in. We will let you know.
At the moment it is a bit hard to give exact dates for when things will happen. We are a bit unsure about the time schedule, but this is what lies ahead:
- Accept the hardware (Done 2025-02-07)
- Configure the network (Basic setup done around 2025-03-31)
- Set up Gorilla (In progress, file system is currently being tested)
- Set up Pelle (In progress, the basic kickstart and hardware config is done)
- Get Crex mounted on Pelle as well, over LNet routers (In progress)
- Set up Slurm and the module system (In progress)
- Start by letting some users in on Pelle
- Start migrating projects from Crex to Gorilla for users running on Pelle
- Let all users in on Pelle
- Shut most of Rackham down
- Complete migration of Crex and Domus to Gorilla
- Shut down Domus and Crex