UPPMAX cloud issues with external network (FIP) closed

Update 2023-06-15 09:00

Everything looks correct now. In addition to running our own tests, we have asked users to confirm, and so far nobody has reported any remaining issues. It is particularly difficult to test all aspects of a cloud this size, since project-local network configuration plays a part.

We are working on resolving an issue with external access to VMs. If you start a VM, you might not be able to connect to it. This is most likely fallout from the power-down on Monday, May 29th. It does not affect all users.
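For users who want to confirm external access to their own VM, a minimal sketch of a reachability check follows. It only tests whether a TCP connection to the VM's floating IP can be opened. The function name `can_connect`, the default port 22 (SSH) and the example address are assumptions for illustration, not part of any UPPMAX tooling.

```python
import socket


def can_connect(host: str, port: int = 22, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable networks.
        return False


# Example usage (203.0.113.10 is a documentation-only placeholder address;
# replace it with the floating IP of your own VM):
#     print(can_connect("203.0.113.10"))
```

A successful connection only shows that the port is reachable from where the check is run. A failure can also be caused by security group rules or by a service that is not running on the VM, so it does not by itself prove a network fault.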

Update 2023-06-14 17:00

We have found a potential fix and have been applying it to projects during the afternoon. In all cases so far the fix has worked. The fix will run to completion later this evening. Until then we remain confident but cannot confirm that all affected networks will be fixed.

We are grateful for your patience.

Update 2023-06-14 11:00

We have made progress and are working on returning access to the cloud.

We have received a query from one of our users regarding the impact of the recent networking issue on their volumes and instance data. We want to address these concerns promptly and provide reassurance.

At this point, we have found NO indication that any data has been affected by the networking problem. Our investigation reveals that the issue primarily stems from the handling of SDNs (Software-Defined Networks) by the neutron component. It is important to note that this problem is separate from the storage system responsible for storing user data, including volumes.

As operators, we have continued to access our internal infrastructure, which is also stored within the same storage system, without encountering any issues.

If you have any further questions or concerns, please feel free to reach out to our support team at support@uppmax.uu.se. We appreciate your patience and trust as we work towards a resolution.

Update 2023-06-12 15:00

We are attempting to bring up the virtual networks by disabling the HA features in neutron. This will create a single point of failure, but will allow us to determine whether the issue is caused by a broken HA setup. We expect this to take some time, and we apologize for the delay, as we are aware that many users are blocked by these persistent issues.

Update 2023-06-08 17:00

We apologize for the delay in providing an update. We want to assure you that our team is actively working to resolve the issue, and we will provide you with further information as soon as it becomes available. Although progress has been made, we require additional time to fully resolve the issue.

Here’s an overview of what has occurred: During the sudden shutdown on May 29th, the network controllers were left in an inconsistent state. Initially, this issue went unnoticed as our internal projects and associated virtual infrastructure, including VMs, continued to function as expected. Normally, any residual state, such as files or temporary database tables, would be automatically cleaned up when the services stop and start. However, during an abrupt shutdown, the cleanup process is limited to what occurs upon service startup.

In this specific case, the ‘neutron’ component in OpenStack, which manages the network infrastructure, fails to recreate certain elements of the virtual infrastructure, such as network interfaces and keepalived processes, that we typically provide out-of-the-box for each project. Despite our repeated attempts to instruct ‘neutron’ to recreate the necessary infrastructure, some components have not been successfully restored. Additionally, we have observed a deterioration in the high availability (HA) features of ‘neutron’ as it struggles to determine the active controller among the three available (while the others remain in standby mode).
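To illustrate the HA symptom described above, here is a toy sketch of the condition a healthy setup should satisfy: exactly one of the three controllers is active and the others are in standby. This is not neutron's actual election logic. The function name, the controller names and the result labels are hypothetical and chosen only for illustration.

```python
def classify_ha_state(states: dict[str, str]) -> str:
    """Classify reported controller states for one virtual router.

    `states` maps a controller name to "active" or "standby" (hypothetical input).
    """
    active = [name for name, state in states.items() if state == "active"]
    if len(active) == 1:
        return "healthy"            # exactly one active, the rest in standby
    if not active:
        return "no-active"          # nobody is forwarding traffic
    return "multiple-active"        # more than one controller claims to be active


# Example with three hypothetical controllers:
#     classify_ha_state({"ctrl1": "standby", "ctrl2": "active", "ctrl3": "standby"})
```

Either of the two unhealthy outcomes would match a setup that struggles to determine the active controller.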

We sincerely apologize for any inconvenience caused, and we appreciate your patience as we work diligently to resolve this issue and restore full functionality. We will continue to provide updates on our progress and any further developments. Thank you for your understanding.

Update 2023-06-07 17:00

Update 2023-06-07 11:00

The UPPMAX cloud is still under investigation. We have made some progress isolating the issue but have not yet found the root cause of why some virtual networks remain unavailable.

Update 2023-06-02 17:00

Unfortunately, there are still issues preventing the cloud from being fully operational. We will continue to work to have the cloud back in production as soon as possible.

Update 2023-06-02 12:00

The problem is confirmed to be related to the hard shutdown. We have found corrupted state caused by the abrupt power-down. We have managed to clean up one part, and some of the virtual networks are now working better. However, there are remaining issues that do not appear to heal themselves even after the old state files have been cleaned up. The cloud will remain unavailable as we continue to work diligently to restore the last networks.

Update 2023-06-01 17:00

This problem appears to be related to the hard shutdown, leaving the virtual network configuration in a broken state. At this time, roughly 20 virtual networks are affected. We are making slow progress recovering the networks, and hope to have this issue resolved before the weekend. We apologize for this inconvenience.

Update 2023-05-31 17:00

We have identified this issue as a problem with neutron, but unfortunately we have not yet found the root cause. The symptom, however, is clear: neutron is not creating the virtual infrastructure needed to allow external access to the VMs. We will continue to work to have this solved as soon as possible.

Update 2023-05-31 11:00

We have restarted the network service in OpenStack. All users might temporarily lose connection to their VMs. We apologize for this inconvenience.