UPPMAX Cloud slow instance boot closed
There is currently an issue with the backend infrastructure of the UPPMAX Cloud, and users are experiencing problems starting new VMs. One of the last steps of the boot process is polling the control plane for instance metadata (such as hostnames, SSH keys, user scripts, etc.). During high load, parts of the control plane are unable to respond in time, leading to cloud-init timeouts. In the worst case this may lead to a broken compute node (as no keys were injected, you will not be able to log in).
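To illustrate the failure mode described above, here is a minimal sketch of a poll-with-deadline loop. It is not the actual cloud-init implementation; `poll_metadata`, the `fetch` callable, and the timeout values are hypothetical names and numbers chosen for illustration only.

```python
import time


def poll_metadata(fetch, timeout_s=10.0, interval_s=1.0,
                  clock=time.monotonic, sleep=time.sleep):
    """Call fetch() until it returns a value or the deadline passes.

    fetch is a hypothetical callable that returns the instance metadata,
    or None while the control plane has not answered yet.
    """
    deadline = clock() + timeout_s
    while True:
        result = fetch()
        if result is not None:
            return result
        # Give up if another wait would take us past the deadline.
        if clock() + interval_s > deadline:
            raise TimeoutError("metadata service did not respond in time")
        sleep(interval_s)
```

If the control plane answers only after the deadline, the loop raises instead of returning metadata, which corresponds to the case where an instance boots without its SSH keys.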
The example below shows the resulting SSH failure, which occurs because the keys were not injected during instance creation.
$ ssh firstname.lastname@example.org
email@example.com: Permission denied (publickey)
As mentioned, the problem is known to occur during periods of high load, e.g. when multiple instances are created at roughly the same time. Boot performance is improved in later versions of the cloud platform, which we are currently working on migrating to.
Unfortunately, we have no quick fix for this issue. In the meantime we will try to reduce the load on the control plane to give the network components time to respond.
Update 2018-10-17 08:00
We have reduced the load on the cloud control plane, which should result in slightly better usability overall; however, the main problem is not expected to be solved. As we continue to look for better solutions, we recommend that users, if possible, spread their work evenly across the day. If you know that many users will be working at the same time, this issue might become more severe.
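One way to follow the recommendation above is to stagger instance creation instead of launching everything at once. The following is a minimal sketch only: `stagger_offsets`, `launch_staggered`, and the `create_instance` callable are hypothetical helpers, not part of any UPPMAX tooling, and `create_instance` stands in for whatever command or API call you normally use to start a VM.

```python
import time


def stagger_offsets(n_instances, window_s):
    """Return evenly spaced start offsets (in seconds) within window_s."""
    if n_instances <= 0:
        return []
    step = window_s / n_instances
    return [i * step for i in range(n_instances)]


def launch_staggered(names, window_s, create_instance, sleep=time.sleep):
    """Create the named instances one at a time, spread across window_s.

    create_instance is a hypothetical callable taking an instance name.
    """
    previous = 0.0
    for name, offset in zip(names, stagger_offsets(len(names), window_s)):
        sleep(offset - previous)  # wait until this instance's slot
        previous = offset
        create_instance(name)
```

For example, `stagger_offsets(4, 3600)` spaces four launches 15 minutes apart over one hour, so the control plane handles one metadata request at a time rather than four at once.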
Update 2018-11-04 11:00
As a step towards solving this issue, the compute nodes will be rebooted during the maintenance day on November 6th. If you are running services that need to be shut down manually, please do so before 09:00 CET. All VMs still running after 09:00 on November 6th will be powered off.