Administration Portal
Incident Report for SimpleVoIP LLC
Postmortem

We received reports of errors and slow response times within our web portal on Thursday, March 23. We quickly isolated these web portal issues to a backend API handled primarily by servers in our US-West datacenter. We redirected that API traffic to US-East in an attempt to mitigate the issues. Unfortunately, we then began receiving reports that call traffic was being impacted as well.

After isolating these call reports to our US-West datacenter, our engineering team was able to determine that our US-West application servers, which handle web traffic and interface with our SIP devices, had entered a bad state that limited the amount of traffic they could process. We were able to restart the affected services and restore voice and web traffic to 100%. 

In looking into the underlying cause, we found that the network connection between our application servers and the virtual host running one of our database servers was degraded during the initial outage period and intermittently afterward. This slowed response times for the database lookups involved in both API handling and call routing, which produced the observed errors and delays and eventually led to the faulty service state on our application servers.

We apologize for the frustration caused by these disruptions to both your customers and your end users. Any downtime is bad for business, and minimizing the impact of any interruptions in service is our highest priority. Since these issues occurred, we have begun work to migrate our database servers from their current virtual hosts to new physical hosts that will minimize the points of failure between the DB servers and the rest of our platform stack, in addition to providing improved performance. You should receive maintenance notifications for these migrations over the next several weeks.

Thank you for your patience as we continue to improve our systems and processes to best serve you and your customers. As always, we welcome any feedback, which may be submitted to our Support team or to your Account Manager.

Posted Apr 10, 2023 - 14:44 PDT

Resolved
The system has been stable since traffic was moved from US-West. Our engineers performed a service restart on the US-West cluster once it was safe to do so, and all services there are stable once again. We will give it a bit more time before moving all traffic back to default routes. More information will be available in the post-mortem, which will be posted following further analysis by our Engineering team.
Posted Mar 23, 2023 - 20:38 PDT
Monitoring
We have moved phone registrations and voice traffic off of the US-West zone. At this point all call traffic should be routing normally. Please contact our Support team if your users are experiencing any further issues.
Posted Mar 23, 2023 - 14:57 PDT
Update
We have received examples of long post-dial delay specific to our US-West cluster. We are routing traffic away from US-West so that we can work on that server without disrupting active traffic. This should also fix the post-dial delay issues.
Posted Mar 23, 2023 - 14:23 PDT
Identified
Our engineers have identified a network issue impacting our database servers that coincides with the period of call disruption, and they are working to fully resolve it. We still see some sporadic API timeouts, but we believe that the inbound/outbound call issues are resolved at this time. Please let our Support team know if your users are continuing to experience call completion issues.
Posted Mar 23, 2023 - 14:08 PDT
Investigating
We are still seeing some limited timeouts from the API impacting Portal operations. In addition, we saw a brief period during which call traffic was severely impacted. Some users are reporting ongoing post-dial delay on outbound calls. The engineering team is working to isolate the cause.
Posted Mar 23, 2023 - 13:45 PDT
Monitoring
Our engineering team has fixed the issue with the affected database server and restarted our API application. Admin portal functionality should be fully restored at this time.
Posted Mar 23, 2023 - 11:00 PDT
Update
Diverting voice traffic away from our US-West zone has resolved the outbound dialing issues. We are still working on restoring full functionality to the admin portal.
Posted Mar 23, 2023 - 10:04 PDT
Update
We have received examples of outbound call failures. At this time we believe these failures are isolated to our US-West zone. Our engineers have routed phone traffic away from this zone to attempt to mitigate the issue.
Posted Mar 23, 2023 - 09:07 PDT
Identified
Our engineers have identified a database server with elevated response times. They are working to resolve this issue as quickly as possible.
Posted Mar 23, 2023 - 08:32 PDT
Investigating
We are investigating reports of issues with our Administration Portal located at https://portal.simplevoip.us. We will provide updates as soon as we have identified the cause and impact of this event.

**This should not be impacting voice or messaging services.**
Posted Mar 23, 2023 - 07:55 PDT
This incident affected: SimpleVoIP Hosted PBX and SimpleVoIP Admin Portal.