May maintenance window closed
The service window on May 6th will begin at 09:00 CEST. The queues on Irma and Bianca will not be stopped. The queues on Rackham and Snowy will be stopped while we upgrade the project storage system Crex. Crex is a DDN ExaScaler system, and the upgrade takes it from ExaScaler 4 to ExaScaler 5. The upgrade will give us an updated Lustre version; most importantly, we expect it to permanently fix some of the instability issues we faced at the end of last year. ExaScaler 5 will also bring higher performance when migrating metadata between the metadata targets.
You can follow our progress on this page throughout the day.
Update 2020-05-12 17:00
After successful testing on the filesystem, we have now started the queues on
both Rackham and Snowy. The clusters are back in production and Crex seems
stable so far.
A small warning about the command uquota, which is not showing correct
information right now. We are in the process of making minor changes to our
internal tools that will fix this.
With this we are closing the maintenance window post.
Update 2020-05-12 15:00
No issues so far. Final tests in progress.
Update 2020-05-12 12:00
The mounting of /proj on the compute nodes went well. The results of the
performance testing are looking promising. Testing will continue and we will
perform an update of our configuration to add redundancy back to the OSSes,
which was missing from the latest configuration change. Next update at 15:00.
Update 2020-05-12 09:00
We believe many of the issues have been solved. We are now testing mounting /proj on the compute nodes.
Update 2020-05-11 17:00
Work continues together with the vendor to return Crex to production.
Update 2020-05-11 15:00
We are still not ready to put Crex back into production. The storage system is having issues balancing OSTs across all four OSSes. We are working together with the vendor to sort this issue out.
Update 2020-05-11 12:00
We are still working on resolving minor issues with the storage system. Next update at 15:00 today.
Update 2020-05-11 09:00
Today we continue with the effort to bring Crex back to production after the upgrade. Our expectation is that we will be able to start the queues on Rackham and Snowy today or tomorrow. Next update comes at 12:00.
Update 2020-05-08 17:00
The maintenance on Crex is completed; however, a few issues remain that need to be checked before we can return access to the filesystem and start the Slurm queues. Overall, the upgrade has gone well, with few hiccups along the way. Work will resume on Monday morning and we aim to have Rackham and Snowy back in production early next week.
Update 2020-05-08 15:00
We are now making good progress with the upgrade process, but some work remains. Next update at 17:00.
Update 2020-05-08 12:00
Maintenance ongoing without any issues. Next update at 15:00.
Update 2020-05-08 09:00
Maintenance ongoing. The final steps of the upgrade are now in progress. We apologize for missing this update at 09:00.
Update 2020-05-07 17:00
The maintenance for Crex will continue tomorrow. The first part of the work involves upgrading Lustre, which is almost completed; the second part involves updating firmware, which is ongoing.
Update 2020-05-07 15:00
Work on Crex continues with minor issues. The upgrade will most likely take the rest of today, so at this time we do not expect Rackham and Snowy to return to production today. Note that you may still log in to rackham.uppmax.uu.se and access your home directory.
Update 2020-05-07 12:00
Work on Crex continues with minor issues.
Update 2020-05-07 09:00
The remaining work on Crex is expected to take 12 to 17 hours, according to the upgrade plan we have received from DDN. We aim to bring Rackham and Snowy back into production tomorrow. Updates will come if the circumstances change; otherwise we will announce when Crex is ready.
Update 2020-05-06 17:00
Our work on Crex will continue and we expect it to be completed tomorrow if everything goes smoothly. Until then, the queues on Rackham and Snowy will remain stopped, and project directories will be unavailable. Next update at 09:00 tomorrow morning.
Update 2020-05-06 15:00
The work on Crex is moving slower than anticipated due to an issue replacing a
malfunctioning controller in the DDN EF4024 RAID system. The login nodes are
online for Rackham, where you can access your home directory, but the project
storage directories (under /proj) will remain unavailable until the
maintenance on Crex is completed. Maintenance is completed on Irma, Grus and
Bianca. Next update at 17:00.
Update 2020-05-06 12:00
The maintenance is proceeding without issues so far. Next update at 15:00.
Update 2020-05-06 09:00
Maintenance begins. /proj on Rackham and Snowy will very
soon become unavailable as we begin working on the project
storage system (Crex).