Call Completion
Incident Report for SimpleVoIP LLC
Postmortem

Incident Description

We received reports of inbound and outbound call completion issues as well as some web portal errors on Wednesday, June 1. After isolating these reports to our US-East datacenter, our engineering team was able to determine that our US-East application servers, which handle web traffic and interface with our SIP devices, had entered a bad state that limited the amount of traffic they could process. We were able to restart the affected services and quickly restore voice and web traffic to 100%. 

The issue recurred on Thursday, June 2, with the same application servers failing to handle a subset of their requests. We isolated the issue more quickly this time and restarted the affected services again. Unfortunately, because a high (but not abnormal) volume of API traffic was being directed to these servers during startup, the services did not return to a fully healthy state. This delayed the fix for our end users until we were able to re-route traffic and reduce the load on those servers.

Action Items

We apologize for the frustration caused by these disruptions to both your customers and your end users. Any downtime is bad for business, and minimizing the impact of any interruptions in service is our highest priority. Since these issues occurred, we have implemented a more efficient routing policy for our API traffic that directs requests preferentially to servers with lower traffic. This should help ensure that any required service restarts are not impacted by normal traffic, and it should preserve a more equitable distribution of traffic across our availability zones.
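To illustrate the general idea of preferring servers with lower traffic, here is a minimal sketch of least-loaded routing. It is not our production routing policy: the Server class, the in_flight load metric, and the pick_server and route functions are hypothetical names chosen for this example, and a real policy would also need to account for server health and warm-up state.

```python
from dataclasses import dataclass


@dataclass
class Server:
    name: str
    in_flight: int = 0  # hypothetical load metric: requests currently being handled


def pick_server(servers):
    """Return the server with the fewest in-flight requests (ties go to the first listed)."""
    if not servers:
        raise ValueError("no servers available")
    return min(servers, key=lambda s: s.in_flight)


def route(servers, n_requests):
    """Assign requests one at a time, each to the currently least-loaded server."""
    for _ in range(n_requests):
        pick_server(servers).in_flight += 1
    return {s.name: s.in_flight for s in servers}
```

In this sketch, new requests go to the least-busy servers first, so an uneven starting load evens out as traffic arrives rather than piling onto servers that are already busy.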

We have also identified a signature of these service failures in our monitoring data and are adding proactive monitoring for it, which will enable us to catch this bad service state and resolve any future occurrences within minutes. Thank you for your patience as we continue to improve our systems and processes to best serve you and your customers. As always, we welcome any feedback, which may be submitted to our Support team or to your Account Manager.

Posted Jun 07, 2022 - 11:22 PDT

Resolved
All services have been stable for the past 6 hours, which includes our peak traffic period. We now consider this incident fully resolved.
Posted Jun 01, 2022 - 19:43 PDT
Monitoring
The rolling restarts have completed, and all services in US-East are now functioning normally.
Posted Jun 01, 2022 - 13:51 PDT
Update
Our engineering team is now performing a rolling service restart on our US-East cluster to resolve this issue with as little disruption as possible. Phones should re-register without interruption, while web users may be asked to log in again.
Posted Jun 01, 2022 - 13:40 PDT
Identified
We are investigating reports of some calls not completing on our system. We believe we have identified the cause and are taking steps to resolve this issue.
Posted Jun 01, 2022 - 13:33 PDT
This incident affected: SimpleVoIP Hosted PBX.