Regional Connectivity

Incident Report for SimpleVoIP LLC

Postmortem

SimpleVoIP experienced major service disruptions on Saturday, Aug 8 and Monday, Aug 10 that caused both call completion and phone registration failures across a large portion of our customer base. Saturday's issue primarily impacted our US-Central zone, while Monday's issue impacted US-West. In both cases, the failure in the primary cluster also impacted our redundant clusters until the primary cluster was isolated. We are sorry for all of the trouble and any lost business that these incidents caused, and we are currently taking major steps to prevent a recurrence and improve overall platform stability.

A core router in US-Central failed into a degraded state on Saturday morning, which caused a significant jump in latency on all traffic to and from that datacenter. When a cluster is unavailable, traffic automatically routes to our redundant clusters, transparently moving phone registrations and voice traffic to a healthy cluster until the affected cluster is restored. In this case, the zone was still responding to health checks and initial phone registration requests, so the conditions for automatic failover were not met, but the high latency prevented most registrations and call attempts from completing successfully. Due to a software bug that we have since discovered, the redundant clusters kept trying to synchronize call and registration status with the US-Central cluster over their shared messaging bus, and the delays on those messages compounded until they took down services on those clusters as well. Once we shut down the connection to the US-Central cluster, the other two zones were restored to service, and phones could register to those zones. The same issue appeared Monday morning in US-West; that time we were able to isolate the issue more quickly, take that cluster offline, and restore services.
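To illustrate why a degraded but still-responding cluster can slip past automatic failover, the sketch below contrasts a liveness-only health check with one that also considers latency. It is not SimpleVoIP's actual failover logic: the `ProbeResult` structure, both function names, the thresholds, and the sample round-trip times are all hypothetical values chosen for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class ProbeResult:
    """One health-check probe against a cluster (hypothetical structure)."""
    responded: bool
    round_trip_ms: Optional[float] = None  # None when no response arrived


def liveness_only_unhealthy(probes: Sequence[ProbeResult]) -> bool:
    """Flag the cluster only when every probe went unanswered."""
    return bool(probes) and not any(p.responded for p in probes)


def latency_aware_unhealthy(
    probes: Sequence[ProbeResult],
    max_round_trip_ms: float = 150.0,  # assumed threshold, not a real setting
    max_bad_fraction: float = 0.5,     # assumed tolerance, not a real setting
) -> bool:
    """Flag the cluster when too many probes are missing or too slow."""
    if not probes:
        return False
    bad = sum(
        1
        for p in probes
        if not p.responded
        or p.round_trip_ms is None
        or p.round_trip_ms > max_round_trip_ms
    )
    return bad / len(probes) > max_bad_fraction


# A degraded cluster: every probe is answered, but slowly.
# The 400 ms figure is illustrative only, not a measured value.
degraded = [ProbeResult(responded=True, round_trip_ms=400.0) for _ in range(10)]

print(liveness_only_unhealthy(degraded))  # liveness check sees a healthy cluster
print(latency_aware_unhealthy(degraded))  # latency-aware check flags it
```

Under these assumptions, the liveness-only check never triggers failover because every probe is answered, while the latency-aware check marks the cluster unhealthy. A real threshold would need to be tuned against normal inter-region latency to avoid spurious failovers.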

We swapped out our router in US-Central on the morning of Aug 9. We also swapped out the US-West router on the morning of Aug 12, but we had to revert that change after the new equipment caused issues that impacted some call activity on US-West. We will be swapping out the US-East and US-West routers as soon as we can confirm that those changes will have no negative impact, and we are accelerating plans for significant network upgrades (mentioned in the RFO for our July 9 DDoS attack) to ensure that our network remains as stable as possible. A patch is currently being drafted to address the software bug that caused these issues to spread across clusters. Please be on the lookout for notifications of upcoming maintenance, and let our support or account teams know if you have any questions or concerns.

Posted Aug 13, 2020 - 14:55 PDT

Resolved

All services are restored and operating normally. An RFO is pending, but the root cause appears to be a core router that failed into a degraded state, adding 60-80 ms of latency to all circuits without actually shutting them down.

Emergency maintenance to replace the equipment will be announced in a subsequent update once we are ready to schedule it. Technicians are en route now for that task, but all customer services are operating normally at this time.

Posted Aug 08, 2020 - 12:39 PDT

Identified

Some inbound calls are not being properly delivered. Our engineers are working on it.

Posted Aug 08, 2020 - 12:17 PDT

Monitoring

Phone registrations have stabilized. We will continue to keep an eye on the system status while we complete repairs on the US-Central region.

Posted Aug 08, 2020 - 11:41 PDT

Identified

We are working on a residual issue causing a subset of phones to fail to register to us on some attempts, causing them to go offline periodically.

Posted Aug 08, 2020 - 11:18 PDT

Monitoring

Inbound call completion services are being restored now, and at this point services should be stable on US-East and US-West. We are maintaining the failover state on US-Central until we can confirm that the underlying issue has been resolved.

Posted Aug 08, 2020 - 09:50 PDT

Update

At this point outbound services should be restored, and most phones should be registered to either US-East or US-West. We are still working on restoring inbound call services.

Posted Aug 08, 2020 - 09:43 PDT

Update

Inbound calls are still failing to complete. We are working to remedy this.

Posted Aug 08, 2020 - 09:25 PDT

Update

We have rerouted services around US-Central and isolated that datacenter to prevent it from impacting the other two regions. At this point services should be stabilizing in the US-East and US-West regions, and phone registrations and calls should be failing over to those regions as appropriate. We are still working on mitigating the underlying issue in US-Central.

Posted Aug 08, 2020 - 09:15 PDT

Identified

We are receiving reports of call completion issues in other zones as well. Our engineering team is working to resolve this as quickly as possible.

Posted Aug 08, 2020 - 09:00 PDT

Investigating

We are investigating reports of a potential issue in our US-Central regional data center. We will provide updates as soon as we have identified the cause and impact of this event.

Posted Aug 08, 2020 - 08:47 PDT

This incident affected: SimpleVoIP Hosted PBX and SimpleVoIP Admin Portal.