Outage of several VPS Nodes

Incident Report for dataforest

Postmortem

Over the past six days, we experienced two network issues affecting the blade center hosting nodes vps-001 to vps-016. These issues disrupted internet access, although the internal cluster network and access to our Ceph storage remained unaffected.

Upon investigation, we identified the problem as a malfunctioning switch within the blade center. Although the switch still appeared to be operational, it was not forwarding any traffic. During the first incident, a cold reboot of the switch by our on-site team temporarily restored normal functionality. The hardware had been running smoothly for over a year and performed as expected after the reboot.

However, the issue recurred yesterday, prompting us to replace the switch. The spare part was installed in the blade center today and is currently being prepared for operation.

To prevent further disruptions, we will migrate to the new switch during an emergency maintenance window this weekend, once all necessary preparations and reviews have been completed. We will announce the exact timing of this maintenance shortly. The migration will cause short disruptions to the internet access of the virtual machines on the affected nodes.

As we have already identified the issue and are taking the necessary steps to resolve it, we kindly ask you to refrain from submitting additional support tickets related to network connectivity for the affected VPS nodes in case of any brief outages or during the upcoming maintenance. We will keep you updated on the progress and thank you for your understanding.

Posted Oct 18, 2024 - 22:58 CEST

Resolved

We were able to recover services about 1 minute ago by performing a cold reboot of one of the blade center's switches. The blade center's network consists of multiple switches for redundancy, so the failure of a single switch should not cause an outage. In this case it did, and we are still investigating why. We had already ordered a replacement switch to prevent the issue from happening again, so urgent maintenance may be necessary in the coming days. We will follow up tomorrow.

Posted Oct 17, 2024 - 08:55 CEST

Identified

The blade center hosting the nodes vps-001 to vps-016 has lost internet connectivity. One of our technicians is heading to the datacenter to investigate the cause. The internal cluster network is up and running, which is why customers can still use VNC but have no internet connectivity.

Posted Oct 17, 2024 - 08:10 CEST

Investigating

Several VPS nodes running our Avoro "VPS" product line, as well as some PHP-Friends Black Friday servers from 2023, are unreachable. We are currently investigating the issue.

Posted Oct 17, 2024 - 08:00 CEST

This incident affected: [Datacenter] maincubes FRA01 (Virtual Servers).