Summary:
On April 22nd, 2024, at 6:15 AM PT (13:15 UTC), we began receiving reports of scheduled campaigns that were delayed or had not been published at all. Two sources of the delays were identified and subsequently addressed in two separate hotfixes.
Impact:
Impact was most visible in campaign reporting delivery metrics, which showed that campaigns had either not gone out at the expected time or that email deliveries had arrived well after the scheduled time. Not all campaigns were affected, and actual delays ranged from several minutes up to an hour or longer in a small number of instances.
Root Cause:
The root cause was determined to be related to a scheduled database upgrade performed on April 19th, which degraded the performance of the scheduling service. There were two underlying observable symptoms:
- On April 22nd, the actual delivery of some emails was slower than expected because several database queries were not optimized for the new database software version deployed on April 19th. After the upgrade, these queries ran slower under load levels higher than those covered by initial testing.
- The number of scheduled campaigns not executing at the precise scheduled time increased dramatically, also following the database upgrade, as a result of several newly uncovered bugs in the scheduling service itself.
Mitigation:
A number of mitigation measures were put into place to address different aspects of this platform incident over the course of several days.
- The database query optimizations were deployed in a hotfix on April 22nd at 4:30 PM PT (23:30 UTC). This was specifically aimed at addressing the email delivery slowness issue.
- For customers who opened support tickets about specific delayed campaigns, those campaigns were manually published as part of the individual tickets. In addition, a separate query was run on an as-needed basis to proactively identify other campaigns in a similar state, which were also manually published.
- A second hotfix was deployed on April 24th at 11:30 AM PT (18:30 UTC) to add an automated backstop measure to catch and publish any campaigns that had been scheduled at an earlier time but had not actually started.
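For illustration only, the backstop idea described above (find campaigns whose scheduled time has passed but which never started, then publish them) can be sketched as follows. This is not the deployed hotfix: the `Campaign` fields, the `find_overdue` and `run_backstop` functions, the `publish` callback, and the five-minute grace period are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Callable, List, Optional


@dataclass
class Campaign:
    # Hypothetical record shape; the real scheduler's schema is not described here.
    campaign_id: str
    scheduled_at: datetime
    started_at: Optional[datetime] = None


def find_overdue(campaigns: List[Campaign], now: datetime,
                 grace: timedelta = timedelta(minutes=5)) -> List[Campaign]:
    """Return campaigns scheduled more than `grace` ago that have not started."""
    return [
        c for c in campaigns
        if c.started_at is None and now - c.scheduled_at > grace
    ]


def run_backstop(campaigns: List[Campaign], now: datetime,
                 publish: Callable[[Campaign], None],
                 grace: timedelta = timedelta(minutes=5)) -> List[str]:
    """Publish every overdue campaign and return the IDs that were published."""
    published = []
    for campaign in find_overdue(campaigns, now, grace):
        publish(campaign)          # assumed hook into the publishing path
        campaign.started_at = now  # mark as started so a later run skips it
        published.append(campaign.campaign_id)
    return published
```

Marking each campaign as started once it has been published makes repeated runs safe: a second pass over the same data finds nothing left to publish.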
Recurrence Prevention:
We have committed to the following actions to fully resolve the incident and eliminate the reliance on the mitigation measure currently in place.
- Create improved platform alerting for campaign delivery times to identify and address degraded state earlier.
- Fix the remaining 3 bugs uncovered during the incident investigation and make the scheduler service itself more robust.
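As a rough sketch of what alerting on campaign delivery times could look like, the example below flags a degraded state when too large a share of recent deliveries arrive later than a tolerance after their scheduled time. The function names, the five-minute tolerance, and the 5% alert threshold are assumptions made for illustration, not the planned alerting design.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

# Each delivery is a (scheduled_time, actual_delivery_time) pair.
Delivery = Tuple[datetime, datetime]


def late_fraction(deliveries: List[Delivery],
                  tolerance: timedelta = timedelta(minutes=5)) -> float:
    """Fraction of deliveries that arrived more than `tolerance` after schedule."""
    if not deliveries:
        return 0.0
    late = sum(1 for scheduled, actual in deliveries
               if actual - scheduled > tolerance)
    return late / len(deliveries)


def should_alert(deliveries: List[Delivery],
                 tolerance: timedelta = timedelta(minutes=5),
                 max_late_fraction: float = 0.05) -> bool:
    """True when the share of late deliveries exceeds the alert threshold."""
    return late_fraction(deliveries, tolerance) > max_late_fraction
```

Alerting on the share of late deliveries rather than on a single slow send keeps one outlier from paging anyone while still surfacing a broad slowdown early.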