Inbound and Outbound call disruption
Incident Report for SimpleVoIP LLC
Postmortem

Incident Details and Remediation Steps

We received reports on Wednesday, April 24, two days after our migration to Private Cloud, of inbound caller ID not displaying properly for certain customers. We attempted to fix that issue right away by making a minor change to our caller ID lookup system. This change was applied hastily and in fact broke caller ID lookups for all users configured to use that system. Rather than failing gracefully as they normally would and returning an empty caller ID name, lookups under the broken change failed in a way that interfered with the call path. The result was failures of both inbound and outbound calls for all users that relied on this caller ID system.

We rolled the change back as soon as we identified it as the root cause of the inbound/outbound issues, which immediately fixed the problem.

Future Prevention

First, the technical side: one of the benefits of the migration to Private Cloud was access to a new method of looking up and applying caller ID for inbound calls. This new method is significantly more fault-tolerant than the previous method. Put simply, calls will proceed regardless of what happens during the lookup process. We have fully switched to this new lookup method as of Monday 5/13 after a phased rollout, and at this point the only side effect of a caller ID lookup failure is a missing caller ID name on an individual call.
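
As a rough illustration of the fail-open behavior described above, the sketch below wraps a lookup so that any error or timeout degrades to an empty caller ID name instead of blocking the call. It is not our production code: the names `lookup_caller_name`, `safe_caller_name`, and `route_inbound_call`, the 0.5 second timeout, and the sample phone number are all hypothetical and chosen only for the example.

```python
import concurrent.futures

# Hypothetical stand-in for a caller ID name lookup backend.
# Here it always fails, to simulate an outage of the lookup system.
def lookup_caller_name(number: str) -> str:
    raise ConnectionError("lookup backend unavailable")

# Small shared pool so a slow lookup cannot block the call-handling thread.
_executor = concurrent.futures.ThreadPoolExecutor(max_workers=4)

def safe_caller_name(number, lookup=lookup_caller_name, timeout_s=0.5):
    """Return a caller ID name, or "" if the lookup errors or times out."""
    future = _executor.submit(lookup, number)
    try:
        name = future.result(timeout=timeout_s)
    except Exception:
        # Fail open: any lookup problem yields an empty name, nothing more.
        return ""
    return name if isinstance(name, str) else ""

def route_inbound_call(number, lookup=lookup_caller_name):
    """Connect the call regardless of the lookup outcome (illustrative only)."""
    name = safe_caller_name(number, lookup)
    return {"from": number, "caller_name": name, "status": "connected"}

if __name__ == "__main__":
    # Assumed sample number; the call still connects with an empty name.
    print(route_inbound_call("+15555550100"))
```

The design point is that the lookup result is treated as optional decoration on the call: the routing function never lets a lookup exception or delay propagate into the call path.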

More importantly, this issue represents one of the worst failings in IT: a self-inflicted outage. Following the Private Cloud migration and the subsequent scramble to repair true service disruptions (see the Inbound Calling issues on Monday, 4/22), our engineering team was in such a hurried frame of mind that we rushed out what we believed to be a quick fix and made a minor problem far worse as a result. For this we cannot apologize enough. This was a major wake-up call that we needed to tighten our change management processes back down to pre-migration standards:

* For any non-critical issues, a change process will be followed that requires advance customer notification and an off-hours maintenance window.
* These changes will be tested in advance of the maintenance, and testing will be performed after the maintenance to ensure that the intended issue is fixed and that there are no side effects.
* Our standard maintenance window is Sunday nights/Monday mornings from 2am ET/11pm PT to 5am ET/2am PT.
* Any issues requiring immediate intervention will be posted on our Status Page at status.simplevoip.us, and notifications will be sent to all tech contacts subscribed to our status updates that emergency maintenance is being performed.

Posted May 15, 2019 - 08:17 PDT

Resolved
We received reports earlier today of inbound caller ID not displaying properly for certain customers. We applied a fix intended to resolve that issue, but it ended up interfering with both inbound and outbound call traffic. Once we isolated the cause of the inbound and outbound call issues, we were able to revert the caller ID fix and restore all services.

Call traffic has been stable since that fix was rolled back. If any of your users are experiencing call completion issues, please bring them to the attention of our support team.
Posted Apr 24, 2019 - 13:28 PDT
Monitoring
A fix has been implemented and we are monitoring to ensure that there are no further issues. Please let us know if you experience any inbound or outbound call failures after 1:25pm EDT/10:25am PDT.
Posted Apr 24, 2019 - 10:25 PDT
Investigating
We have received widespread reports of inbound and outbound call failures. Our engineering team is currently investigating.
Posted Apr 24, 2019 - 10:15 PDT
This incident affected: SimpleVoIP Hosted PBX.