SimpleVoIP Support received reports of inbound and outbound call failures starting at around 8am CDT on April 15. Our Engineering team investigated example calls provided by affected customers and identified one server in US-Central and one in US-West that were routing network traffic incorrectly. After those servers' routes were corrected just before 10am CDT, most call traffic was restored, but some desktop app users were unable to connect until an additional change was made at around 3pm CDT.
At 2am CDT on April 15, we made a routing change to adjust the default route for outbound carrier traffic. Most of our carrier routes have specific overrides to the default, so we expected any potential impact to be limited to toll-free traffic. Unfortunately, the configuration update and the service restart it required changed which physical interface two of our servers used for their traffic, creating a routing asymmetry that caused traffic from those servers to be ignored by the rest of our platform. The result was a roughly 50% failure rate for call completion, along with some impact to phone/softphone registration, across both the US-Central and US-West zones; depending on how a given customer's traffic and registrations were distributed, this may have appeared as high as a 100% failure rate for certain locations.
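For context on how this failure mode can be detected, here is a minimal sketch (Python, assuming Linux servers and the standard `ip` tool; the expected interface name and probe addresses are hypothetical) that asks the kernel which interface it would use to reach a carrier-facing address and flags any answer that differs from what the rest of the platform expects:

```python
#!/usr/bin/env python3
"""Minimal egress-interface check. Assumes a Linux host with the `ip` tool;
the expected interface name and probe addresses are hypothetical examples."""
import subprocess

EXPECTED_INTERFACE = "eth0"  # interface the platform expects carrier traffic to use
PROBE_DESTINATIONS = ["198.51.100.10", "198.51.100.20"]  # example carrier-facing IPs


def egress_interface(destination: str) -> str:
    """Ask the kernel which interface it would use to reach `destination`."""
    out = subprocess.run(
        ["ip", "route", "get", destination],
        capture_output=True, text=True, check=True,
    ).stdout.split()
    # Output looks like: "198.51.100.10 via 203.0.113.1 dev eth0 src ..."
    return out[out.index("dev") + 1]


def main() -> int:
    mismatches = [
        (dst, dev)
        for dst in PROBE_DESTINATIONS
        if (dev := egress_interface(dst)) != EXPECTED_INTERFACE
    ]
    for dst, dev in mismatches:
        print(f"ASYMMETRY RISK: traffic to {dst} would leave via {dev}, "
              f"expected {EXPECTED_INTERFACE}")
    return 1 if mismatches else 0


if __name__ == "__main__":
    raise SystemExit(main())
```

A check of this kind runs in seconds per server, so it can be repeated on every host immediately after any routing or interface change.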
The asymmetric route was corrected by adjusting the preferred interface of the two impacted servers to match the rest of our server pool, whose members were still handling traffic normally.
However, forcing these servers to use a single interface for all of their traffic broke connectivity for some desktop apps that relied on the other interface. Those users remained unable to connect until a second configuration update, made at around 3pm CDT, restored desktop app connectivity.
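For illustration only (this describes a common Linux mechanism, not necessarily the exact change we made): when one interface carries the default route, a policy-routing rule can still send replies addressed from the secondary interface back out through that interface, which is the kind of adjustment a follow-up configuration update can introduce. The interface names, addresses, and table number below are hypothetical.

```python
#!/usr/bin/env python3
"""Illustrative policy-routing sketch: let traffic sourced from a secondary
interface's address leave via that interface, while the primary interface
keeps the default route. All values are hypothetical; requires root."""
import subprocess

SECONDARY_ADDR = "192.0.2.10"  # address some clients reach on the second interface
SECONDARY_GW = "192.0.2.1"     # gateway on that subnet
SECONDARY_IF = "eth1"
TABLE = "100"                  # dedicated routing table for the secondary interface

COMMANDS = [
    # Send packets sourced from the secondary address through a dedicated table...
    ["ip", "rule", "add", "from", SECONDARY_ADDR, "lookup", TABLE],
    # ...and give that table its own default route out of the secondary interface.
    ["ip", "route", "add", "default", "via", SECONDARY_GW, "dev", SECONDARY_IF,
     "table", TABLE],
]

for cmd in COMMANDS:
    subprocess.run(cmd, check=True)
```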
The configurations for these servers have been updated and saved to our configuration management system, which should prevent this specific issue from recurring.
We performed only brief testing after the planned routing change because we expected no impact. This was a clear oversight; had we run our full test suite, we would have caught the issue overnight. We are planning work to automate more of this post-maintenance testing so that we can run it consistently, and we will require the full test suite for all configuration changes going forward, no matter how harmless they seem.
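As a concrete example of what that automation could look like, the sketch below (Python; the server hostnames are hypothetical) sends a SIP OPTIONS request over UDP to each signaling server and fails if any server does not answer. A real post-maintenance suite would go further and cover registration and test calls, but even lightweight checks of this kind can catch basic reachability problems shortly after a change.

```python
#!/usr/bin/env python3
"""Minimal post-maintenance smoke test: send a SIP OPTIONS "ping" over UDP to
each signaling server and confirm a SIP response arrives. Hostnames are
hypothetical examples."""
import socket
import uuid

SERVERS = ["sip-central.example.com", "sip-west.example.com"]  # hypothetical
SIP_PORT = 5060
TIMEOUT_SECONDS = 5


def options_ping(server: str) -> bool:
    """Return True if `server` answers a SIP OPTIONS request."""
    call_id = uuid.uuid4().hex
    request = (
        f"OPTIONS sip:{server} SIP/2.0\r\n"
        f"Via: SIP/2.0/UDP 0.0.0.0:5060;branch=z9hG4bK{call_id[:8]}\r\n"
        f"Max-Forwards: 70\r\n"
        f"From: <sip:healthcheck@example.com>;tag={call_id[:10]}\r\n"
        f"To: <sip:{server}>\r\n"
        f"Call-ID: {call_id}\r\n"
        f"CSeq: 1 OPTIONS\r\n"
        f"Content-Length: 0\r\n\r\n"
    )
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(TIMEOUT_SECONDS)
        try:
            sock.sendto(request.encode(), (server, SIP_PORT))
            response, _ = sock.recvfrom(4096)
            return response.startswith(b"SIP/2.0")
        except OSError:  # timeout, DNS failure, or network error
            return False


if __name__ == "__main__":
    failures = [s for s in SERVERS if not options_ping(s)]
    for server in failures:
        print(f"FAILED: no SIP response from {server}")
    raise SystemExit(1 if failures else 0)
```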
We apologize for the impact to your users and customers. We will continue to improve our monitoring and processes to ensure that the impact of any future server maintenance is minimized.