Call Completion in US-East
Incident Report for SimpleVoIP LLC
Postmortem

Incident Description

We received reports of inbound and outbound call completion issues as well as some web portal errors on Wednesday, June 1. After isolating these reports to our US-East datacenter, our engineering team was able to determine that our US-East application servers, which handle web traffic and interface with our SIP devices, had entered a bad state that limited the amount of traffic they could process. We were able to restart the affected services and quickly restore voice and web traffic to 100%. 

The issue recurred on Thursday, June 2, when the same application servers failed to handle a subset of their requests. We isolated the issue more quickly this time and restarted the affected services again. Unfortunately, a high (but not abnormal) volume of API traffic was directed to these servers during the startup process, so the services did not fully return to a healthy state. This delayed the fix for our end users until we were able to re-route traffic and reduce the load on those servers.

Action Items

We apologize for the frustration caused by these disruptions to both your customers and your end users. Any downtime is bad for business, and minimizing the impact of any interruptions in service is our highest priority. Since these issues occurred, we have implemented a more efficient routing policy for our API traffic that directs requests preferentially to servers with lower traffic. This should help ensure that any required service restarts are not impacted by normal traffic, and should preserve a more equitable distribution of traffic across our availability zones.
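
As a rough illustration of what "directing requests preferentially to servers with lower traffic" can look like, the sketch below picks the healthy server with the fewest in-flight requests. It is not SimpleVoIP's actual routing policy, which this report does not detail. The `Server` class, its `active_requests` and `healthy` fields, and the function names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Server:
    """Hypothetical view of one application server as seen by the router."""
    name: str
    active_requests: int = 0   # assumed metric: requests currently in flight
    healthy: bool = True       # assumed flag: False while a service is restarting


def pick_server(servers):
    """Return the healthy server with the fewest in-flight requests."""
    candidates = [s for s in servers if s.healthy]
    if not candidates:
        raise RuntimeError("no healthy servers available")
    # min() returns the first server among those tied for the lowest load.
    return min(candidates, key=lambda s: s.active_requests)


def route_request(servers):
    """Assign one request to the least-loaded healthy server and record it."""
    server = pick_server(servers)
    server.active_requests += 1
    return server.name
```

Under this kind of policy, a server marked unhealthy during a restart receives no new requests, and once it is marked healthy again it takes on load only in proportion to how lightly loaded it is relative to its peers.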

We have also identified a signature of these service failures in our monitoring system, which we are targeting for proactive monitoring so that we can catch this bad service state and resolve any future occurrences within minutes. Thank you for your patience as we continue to improve our systems and processes to best serve you and your customers. As always, we welcome any feedback, which may be submitted to our Support team or to your Account Manager.

Posted Jun 07, 2022 - 11:22 PDT

Resolved
The US-East cluster has been stable for the past 12 hours. Our engineering team has restored all traffic to primary routes. This incident is now resolved.
Posted Jun 02, 2022 - 21:34 PDT
Monitoring
We have stabilized all services on US-East, and all phone and web traffic should be processing normally at this time. We will keep voice traffic on its secondary route while we continue to closely monitor our US-East servers.
Posted Jun 02, 2022 - 10:39 PDT
Identified
Our engineers have temporarily redirected phone traffic away from US-East while we continue working to resolve this issue. Inbound and outbound calls are completing normally over their redundant routes. Call Center users may notice slow portal performance while this issue is active.
Posted Jun 02, 2022 - 09:37 PDT
Investigating
We are investigating reports of some calls not completing on our US-East zone. We will provide updates as soon as we have identified the cause and impact of this event.
Posted Jun 02, 2022 - 09:25 PDT
This incident affected: SimpleVoIP Hosted PBX and SimpleVoIP Admin Portal.