Over the past six days, we experienced two network issues affecting the blade center hosting nodes vps-001 to vps-016. These issues disrupted internet access, although the internal cluster network and access to our Ceph storage remained unaffected.
Upon investigation, we identified the problem as a malfunctioning switch within the blade center. Although the switch appeared to be running, it was not forwarding any traffic. During the first incident, a cold reboot of the switch by our on-site team temporarily restored normal functionality. The hardware had been running smoothly for over a year and performed as expected after the reboot.
However, the issue recurred yesterday, which led us to decide to replace the switch. The replacement switch was installed in the blade center today and is currently being prepared for operation.
To prevent further disruptions, we will migrate to the new switch during an emergency maintenance window this weekend, once all necessary preparations and reviews have been completed. We will announce the exact timing of this maintenance shortly. The migration will cause short disruptions to the internet access of the virtual machines on the affected nodes.
As we have already identified the cause and are taking the necessary steps to resolve it, we kindly ask you not to submit additional support tickets about network connectivity on the affected VPS nodes if brief outages occur or during the upcoming maintenance. We will keep you updated on our progress. Thank you for your understanding.