SimpleVoIP experienced major service disruptions on Saturday, Aug 8 and Monday, Aug 10 that caused both call completion and phone registration failures across a large portion of our customer base. Saturday's issue primarily impacted our US-Central zone, while Monday's impacted US-West. In both cases, the failure in the primary cluster also impacted our redundant clusters until the primary cluster was isolated. We are sorry for the trouble and any lost business that these incidents caused, and we are taking major steps to prevent a recurrence and improve overall platform stability.
A core router in US-Central failed into a degraded state on Saturday morning, causing a significant jump in latency on all traffic to and from that datacenter. When a cluster becomes unavailable, traffic automatically routes to our redundant clusters, transparently moving phone registrations and voice traffic to a healthy cluster until the affected cluster is restored. In this case, the zone was still responding to health checks and initial phone registration requests, so the conditions for automatic failover were never met, yet the high latency prevented most registrations and call attempts from completing successfully.

Due to a software bug we have since identified, the redundant clusters kept trying to synchronize call and registration status with the US-Central cluster over their shared messaging bus, and the delay on those messages compounded until it took down services on those clusters as well. Once we shut down the connection to the US-Central cluster, the other two zones were restored to service, and phones could register to those zones. The same issue appeared Monday morning in US-West; that time we were able to isolate the issue more quickly, take the cluster offline, and restore service.
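To make the failover gap concrete, here is a minimal sketch (in Go, with purely illustrative names, not our production code) of a health predicate like the one described above: it checks reachability and registration success but not latency, so a slow-but-responsive zone still passes. A latency-aware variant is shown for contrast as one plausible tightening, not a committed change.

```go
package main

import (
	"fmt"
	"time"
)

// ClusterProbe captures one health-check cycle against a zone. Every name
// here is an illustrative stand-in, not a production identifier.
type ClusterProbe struct {
	RespondsToPing  bool          // zone answers health-check pings
	AcceptsRegister bool          // zone accepts an initial phone registration
	RoundTripTime   time.Duration // observed request latency
}

// healthyAsDeployed mirrors the predicate as it behaved during the incidents:
// a zone that answers pings and initial registrations counts as healthy no
// matter how slowly it answers, so automatic failover never triggers.
func healthyAsDeployed(p ClusterProbe) bool {
	return p.RespondsToPing && p.AcceptsRegister
}

// healthyWithLatencyCheck is one plausible tightening (an assumption, not a
// committed change): sustained high latency also marks the zone unhealthy.
func healthyWithLatencyCheck(p ClusterProbe, maxRTT time.Duration) bool {
	return p.RespondsToPing && p.AcceptsRegister && p.RoundTripTime <= maxRTT
}

func main() {
	degraded := ClusterProbe{
		RespondsToPing:  true,
		AcceptsRegister: true,
		RoundTripTime:   4 * time.Second, // far too slow for calls to complete
	}
	fmt.Println(healthyAsDeployed(degraded))                             // true: no failover
	fmt.Println(healthyWithLatencyCheck(degraded, 500*time.Millisecond)) // false: would fail over
}
```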
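The cross-cluster bug behaved like a blocking send to a slow consumer: each state-sync message waited out the degraded cluster's full processing delay, and each message's wait stacked onto the next. This toy simulation (again with hypothetical names, not our actual bus code) shows how the delay compounds:

```go
package main

import (
	"fmt"
	"time"
)

// StateUpdate stands in for a call/registration state message on the shared
// messaging bus; the type and field names are hypothetical.
type StateUpdate struct {
	CallID string
}

func main() {
	// An unbuffered channel models a bus on which the peer must consume each
	// message before the next send can proceed.
	bus := make(chan StateUpdate)

	// The degraded cluster: it still consumes sync messages, but takes
	// 800ms to process each one.
	go func() {
		for range bus {
			time.Sleep(800 * time.Millisecond)
		}
	}()

	// The healthy cluster's sync loop: every blocking send waits out the
	// peer's full delay, so each message's wait stacks onto the next and
	// everything queued behind this loop falls further behind.
	start := time.Now()
	for i := 1; i <= 3; i++ {
		bus <- StateUpdate{CallID: fmt.Sprintf("call-%d", i)}
		fmt.Printf("message %d synced after %v\n", i, time.Since(start).Round(time.Millisecond))
	}
}
```

Each successive sync lands roughly a full peer-delay later than the previous one; once a healthy cluster's own call handling sits behind that queue, it degrades too, which matches what we observed.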
We swapped out our router in US-Central on the morning of Aug 9. We also swapped out the US-West router on the morning of Aug 12, but we had to revert that change after issues with the new equipment impacted some call activity on US-West. We will swap out the US-East and US-West routers as soon as we can confirm that those changes will have no negative impact, and we are accelerating plans for significant network upgrades (mentioned in the RFO for our July 9 DDoS attack) to ensure that our network remains as stable as possible. We are also drafting a patch to address the software bug that allowed these issues to spread across clusters. Please be on the lookout for notifications of upcoming maintenance, and let our support or account teams know if you have any questions or concerns.
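For the technically inclined, the sketch below shows one guard a patch of this kind might add: a bounded send that sheds (or could locally queue) an update rather than letting a slow peer stall call processing. This is an illustration of the general shape only, not the actual patch.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// StateUpdate again stands in for a cross-cluster sync message; hypothetical.
type StateUpdate struct {
	CallID string
}

var errPeerTooSlow = errors.New("peer sync timed out; shedding update")

// syncBounded caps how long a healthy cluster will wait on the shared bus.
// If the peer cannot keep up, the update is shed (or could be queued locally
// for later replay) instead of stalling the call processing behind it.
func syncBounded(bus chan<- StateUpdate, u StateUpdate, timeout time.Duration) error {
	select {
	case bus <- u:
		return nil
	case <-time.After(timeout):
		return errPeerTooSlow
	}
}

func main() {
	bus := make(chan StateUpdate) // no consumer: models an unresponsive peer

	err := syncBounded(bus, StateUpdate{CallID: "call-1"}, 200*time.Millisecond)
	if err != nil {
		fmt.Println(err) // fails fast instead of letting the delay compound
	}
}
```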