Platform Service Degradation - Campaign Email Delivery Delayed
Incident Report for Firstup
Postmortem

Summary:

On April 22nd, 2024, at 6:15 AM PT (13:15 UTC), we began receiving reports of scheduled campaigns that were delayed or had not been published at all.  Two causes of the delays were identified and subsequently addressed in two separate hotfixes.

Impact:

Impact was most visible in campaign reporting, where delivery metrics showed that campaigns had either not gone out at the expected time or that the emails themselves had arrived well after the scheduled time.  Not all campaigns were affected.  Actual delays ranged from several minutes to an hour or longer in a small number of instances.

Root Cause:

The root cause was determined to be related to a scheduled database upgrade performed on April 19th, which degraded the performance of the scheduling service.  There were two observable symptoms:
 

  1. On April 22nd, the actual delivery of some emails was slower than expected because several database queries were not optimized for the new database software version deployed on April 19th.  After the upgrade, these queries ran slower under load levels higher than those initially tested.
  2. The number of scheduled campaigns that did not execute at their scheduled time increased dramatically, also following the database upgrade, as a result of several newly uncovered bugs in the scheduling service itself.

Mitigation:

A number of mitigation measures were put into place to address different aspects of this platform incident over the course of several days.

  • The database query optimizations were deployed in a hotfix on April 22nd at 4:30 PM PT (23:30 UTC).  This hotfix specifically targeted the slow email delivery.
  • For customers who opened support tickets about specific delayed scheduled campaigns, those campaigns were manually published as part of the individual support tickets.  In addition, a separate query was run as needed to proactively identify other campaigns in a similar state, which were then manually published as well.
  • A second hotfix was deployed on April 24th at 11:30 AM PT (18:30 UTC) to add an automated backstop measure to catch and publish any campaigns that had been scheduled at an earlier time but had not actually started.
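
The automated backstop described in the last item can be illustrated with a minimal Python sketch. It is not the code that was deployed: the Campaign record, the five-minute grace period, and the publish callback are all hypothetical names and values chosen only to show the general idea of catching campaigns whose scheduled time has passed without a start.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Callable, Iterable, List, Optional


@dataclass
class Campaign:
    # Hypothetical record shape; the real schema is not described in this report.
    campaign_id: str
    scheduled_at: datetime
    started_at: Optional[datetime] = None  # None means the campaign never started


def find_overdue(
    campaigns: Iterable[Campaign], now: datetime, grace: timedelta
) -> List[Campaign]:
    """Return campaigns scheduled at least `grace` ago that have not started."""
    return [
        c
        for c in campaigns
        if c.started_at is None and c.scheduled_at + grace <= now
    ]


def run_backstop(
    campaigns: Iterable[Campaign],
    publish: Callable[[Campaign], None],
    now: datetime,
    grace: timedelta = timedelta(minutes=5),  # assumed value, for illustration only
) -> List[str]:
    """Publish every overdue campaign and return the IDs that were published."""
    overdue = find_overdue(campaigns, now, grace)
    for campaign in overdue:
        publish(campaign)
        # Mark as started so a later pass does not publish the campaign twice.
        campaign.started_at = now
    return [c.campaign_id for c in overdue]


if __name__ == "__main__":
    now = datetime(2024, 4, 24, 18, 30, tzinfo=timezone.utc)
    queue = [
        Campaign("late", now - timedelta(minutes=30)),
        Campaign("upcoming", now + timedelta(minutes=30)),
    ]
    print(run_backstop(queue, lambda c: None, now=now))
```

Run periodically, a check of this kind catches campaigns the scheduler missed. The grace period keeps it from racing the scheduler on campaigns that are only a few moments past their scheduled time.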

Recurrence Prevention:

We have committed to the following actions to fully resolve the incident and eliminate reliance on the mitigation measure currently in place.

  • Create improved platform alerting for campaign delivery times to identify and address a degraded state earlier.
  • Fix the remaining 3 bugs uncovered during the incident investigation and make the scheduler service itself more robust.
Posted May 13, 2024 - 22:13 UTC

Resolved
Marking incident as resolved and all components fully operational. The automated mitigation measure has been demonstrated to be effective while the remaining recurrence prevention items work their way through the system.
Posted May 13, 2024 - 22:12 UTC
Update
We continue to monitor the services that were impacted for any further issues.
Posted Apr 24, 2024 - 17:29 UTC
Monitoring
We have deployed a hotfix to address the database performance issue as it relates to the campaign email delivery queue.

All affected services remain fully stable and available. We will be placing these services under monitoring for now.
Posted Apr 23, 2024 - 02:06 UTC
Identified
We have identified a database performance issue, and are working to address it.

The email delivery pipeline queue was backed up, resulting in campaign email deliveries being delayed. This queue has since caught up, and campaign emails are now being delivered as expected.

We will provide another update as soon as more information is made available.
Posted Apr 22, 2024 - 19:38 UTC
Update
We continue to investigate the delays in campaign email deliveries.

Another update in 1 hour.
Posted Apr 22, 2024 - 19:01 UTC
Update
We continue to investigate the delays in campaign email deliveries. We have observed that the campaigns are publishing as expected, so there is no need to republish them.

Another update within 1 hour.
Posted Apr 22, 2024 - 17:50 UTC
Investigating
We are currently investigating reports of delayed campaign email deliveries and associated reporting.

We will provide you with an update within 1 hour.
Posted Apr 22, 2024 - 16:53 UTC
This incident affected: Products (New Studio, Classic Studio).