On Wednesday, April 24, two days after our migration to Private Cloud, we received reports that inbound caller ID was not displaying properly for certain customers. We attempted to fix the issue right away by making a minor change to our caller ID lookup system. The change was applied hastily and in fact broke caller ID lookups for all users who were configured to use that system. Rather than failing gracefully as normal and returning an empty caller ID name, the broken change caused lookups to fail in a way that interfered with the call path, resulting in failures of both inbound and outbound calls for all users who relied on this caller ID system.
We rolled the change back as soon as we identified it as the root cause of the inbound/outbound issues, which resolved the problem immediately.
First, the technical side: one of the benefits of the migration to Private Cloud was access to a new method of looking up and applying caller ID for inbound calls. This new method is significantly more fault-tolerant than the previous one. Put simply, calls will proceed regardless of what happens during the lookup process. We fully switched to this new lookup method on Monday, 5/13, after a phased rollout, and at this point the only side effect of a caller ID lookup failure is a missing caller ID name on an individual call.
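The fail-open behavior described above can be sketched roughly as follows. This is an illustration only, not our production lookup code: the function names, the 0.5-second timeout, and the shape of the call record are all assumptions made for the example. The point it shows is that any error or delay in the lookup is reduced to an empty caller ID name, and call routing never depends on the lookup succeeding.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

# Shared worker pool so that a slow lookup never blocks the call path itself.
_POOL = ThreadPoolExecutor(max_workers=4)


def caller_id_name(
    number: str,
    lookup: Callable[[str], str],  # hypothetical lookup backend
    timeout_s: float = 0.5,        # assumed time budget, not a real setting
) -> str:
    """Return the caller ID name, or "" if the lookup fails or times out."""
    future = _POOL.submit(lookup, number)
    try:
        return future.result(timeout=timeout_s)
    except Exception:
        # Covers both lookup errors and timeouts: fail open with an empty name.
        return ""


def route_inbound_call(number: str, lookup: Callable[[str], str]) -> dict:
    """Hypothetical call setup: the call connects whatever the lookup does."""
    name = caller_id_name(number, lookup)
    return {"from": number, "caller_id_name": name, "connected": True}
```

With a lookup backend that raises an exception or hangs, `route_inbound_call` still returns a connected call, just with an empty `caller_id_name`.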
More importantly, this issue represents one of the worst failings in IT: a self-inflicted outage. Following the Private Cloud migration and the subsequent scramble to repair true service disruptions (see the Inbound Calling issues on Monday, 4/22), our engineering team was in a frame of mind where we rushed out what we believed to be a quick fix and made a minor problem far worse as a result. For this we cannot apologize enough. This was a major wake-up call that we needed to tighten our change management processes back down to pre-migration standards:
* For any non-critical issues, a change process will be followed that requires advance customer notification and an off-hours maintenance window.
* These changes will be tested in advance of the maintenance, and testing will be performed after the maintenance to ensure that the intended issue is fixed and that there are no side effects.
* Our standard maintenance window is Sunday nights/Monday mornings from 2am ET/11pm PT to 5am ET/2am PT.
* Any issues requiring immediate intervention will be posted on our Status Page at status.simplevoip.us, and all tech contacts subscribed to our status updates will be notified that emergency maintenance is being performed.