Call Completion

Incident Report for SimpleVoIP LLC

Postmortem

Incident Summary 

SimpleVoIP Support received reports of inbound and outbound call failures starting at around 8am CDT on April 15. Our Engineering team investigated the example calls provided and identified one server in US-Central and one in US-West that were routing network traffic incorrectly. After the routes on those servers were corrected just before 10am CDT, most call traffic was restored, but some desktop app users were unable to connect until an additional change was made at around 3pm CDT.

Root Cause 

At 2am CDT on April 15, we made a routing change to adjust the default route for outbound carrier traffic. Most of our carrier routes have specific overrides to the default, so we expected any impact to be limited to toll-free traffic. Unfortunately, the config update and the service restart required for this change affected the physical interfaces used by two of our servers, creating a routing asymmetry that caused traffic from those servers to be ignored by the rest of our platform. This resulted in a roughly 50% failure rate for call completion and some impact to phone/softphone registration across both the US-Central and US-West zones. For certain customer locations, the failure rate may have appeared as high as 100%, depending on how their traffic and registrations were distributed across servers.
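
To illustrate why a platform-wide failure rate of roughly 50% could look like 100% at a single location, the sketch below computes the share of one location's traffic that lands on impacted servers. It is an illustration only: the server names, the traffic counts, and the function name are hypothetical and are not taken from our platform.

```python
def location_failure_rate(traffic_by_server, impacted_servers):
    """Fraction of a location's traffic handled by impacted servers.

    traffic_by_server: dict mapping a (hypothetical) server name to the
    number of calls or registrations that location sends through it.
    impacted_servers: set of server names whose traffic is being ignored.
    """
    total = sum(traffic_by_server.values())
    if total == 0:
        return 0.0
    failed = sum(
        count
        for server, count in traffic_by_server.items()
        if server in impacted_servers
    )
    return failed / total


impacted = {"central-gw-2"}  # hypothetical name for an impacted server

# A location spread evenly across a healthy and an impacted server.
spread = {"central-gw-1": 20, "central-gw-2": 20}
# A location whose devices all registered to the impacted server.
pinned = {"central-gw-2": 40}

print(location_failure_rate(spread, impacted))
print(location_failure_rate(pinned, impacted))
```

Under these assumed numbers, the evenly spread location loses half of its traffic while the pinned location loses all of it, even though the same single server is at fault in both cases.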

Resolution 

The asymmetric route was corrected by adjusting the preferred interface of the two impacted servers to match the rest of our server pool, whose members were still handling traffic normally.  
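
The comparison described above, checking each server's preferred interface against the rest of the pool, can be sketched as a simple outlier check. This is a minimal sketch under stated assumptions, not our actual tooling: the server names, interface names, and the function name are hypothetical, and it assumes the preferred interface for each server has already been collected by some other means.

```python
from collections import Counter


def find_interface_outliers(preferred_by_server):
    """Return servers whose preferred interface differs from the pool majority.

    preferred_by_server: dict mapping a (hypothetical) server name to the
    name of the interface it currently prefers for platform traffic.
    """
    if not preferred_by_server:
        return []
    counts = Counter(preferred_by_server.values())
    # The most common interface is treated as the pool's expected setting.
    majority_interface = counts.most_common(1)[0][0]
    return sorted(
        server
        for server, interface in preferred_by_server.items()
        if interface != majority_interface
    )


pool = {
    "central-gw-1": "eth0",
    "central-gw-2": "eth1",  # hypothetical server with a changed interface
    "central-gw-3": "eth0",
    "west-gw-1": "eth0",
}
print(find_interface_outliers(pool))
```

A majority vote is only a heuristic: it assumes most of the pool is healthy, which matched this incident but would not hold if a change affected more than half of the servers.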

However, forcing these servers to use a single interface for all of their traffic broke connectivity for some desktop apps that relied on the other interface. A second configuration update, made at around 3pm CDT, specifically repaired desktop app connectivity.

Preventive Measures 

The configurations for these servers have been updated and saved to our configuration management system, which should prevent this specific issue from recurring.

We performed only brief testing after the planned routing change because we expected it to have no impact. This was a clear oversight, and we would have caught this issue overnight if we had run our full test suite. Work is planned to automate more of this post-maintenance testing so that we can run it more consistently, and we will require the full test suite for all configuration changes moving forward, no matter how harmless they seem.
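
As a rough sketch of what requiring the full suite could look like in an automated post-maintenance step, the example below refuses to treat a change as verified unless every required check has run and passed. The check names and the function name are hypothetical placeholders chosen for illustration; they do not describe our real test suite.

```python
# Hypothetical check names, chosen to mirror the traffic types in this report.
REQUIRED_CHECKS = (
    "inbound_call",
    "outbound_call",
    "toll_free_call",
    "registration",
    "desktop_app_connect",
)


def unmet_checks(results, required=REQUIRED_CHECKS):
    """Return the required checks that were skipped or did not pass.

    results: dict mapping a check name to True (passed) or False (failed).
    A check that is missing from results counts as unmet, so a brief,
    partial test run cannot be mistaken for a full one.
    """
    return [name for name in required if results.get(name) is not True]


# A brief test run that only exercised basic calls.
brief_run = {"inbound_call": True, "outbound_call": True}
missing = unmet_checks(brief_run)
if missing:
    print("Change not verified; unmet checks:", ", ".join(missing))
```

Treating a skipped check the same as a failed one is the point of the design: the gate passes only when the whole list has been exercised, regardless of how low-risk the change appears.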

We apologize for the impact to your users and customers. We will continue to improve our monitoring and processes to ensure that the impact of any future server maintenance is minimized.

Posted 18 days ago. Apr 21, 2025 - 13:58 PDT

Resolved

All services are stable following the routing changes. A post-mortem analysis will be posted here within the next 2 days.
Posted 23 days ago. Apr 15, 2025 - 21:44 PDT

Monitoring

The change to US-West has been applied, and we are seeing normal call traffic there as well. At this time there should be no further impact. If you experience any issues, please let our support team know.
Posted 24 days ago. Apr 15, 2025 - 08:04 PDT

Update

Engineers have performed a routing change to correct the issue in US-Central. Calls are completing normally there now.

We also identified a server in US-West subject to the same issue, and we are implementing the same solution there. We should have that resolved within a few minutes.
Posted 24 days ago. Apr 15, 2025 - 07:56 PDT

Identified

The impacted calls appear to be limited to the US-Central region. We have identified a SIP gateway server as a common element in all of the provided examples, and we are working to remove it from routing to alleviate the issue while we investigate further.
Posted 24 days ago. Apr 15, 2025 - 07:07 PDT

Investigating

We are investigating reports of some calls not completing on our system. We will provide updates as soon as we have identified the cause and impact of this event.
Posted 24 days ago. Apr 15, 2025 - 06:32 PDT
This incident affected: SimpleVoIP Hosted PBX.