Platform Service Degradation - Scheduled Campaigns Not Sending
Incident Report for Firstup
Postmortem

Summary:

On Wednesday, May 15th, 2024, from approximately 3:00 AM PT until 10:22 AM PT, we received customer reports that scheduled campaigns were not being delivered as expected. The reports described campaigns that remained in the ‘Scheduled’ queue and were not sent on time, as well as campaigns that were marked as ‘Delivered’ but showed no actual sends to users in the Campaign Delivery Report. The problem was traced to a regression in a service used for scheduling, which introduced a new post-related variable that was not correctly initialized in a specific campaign. As a result, this campaign and all subsequently scheduled campaigns were not automatically published during the affected time.

Impact:

Scheduled campaigns either remained in the scheduled queue or were incorrectly marked as delivered, and in both cases were not actually sent to the target audience. We confirmed that at least 50 campaigns scheduled to publish experienced a delay.

Root Cause: 

The latest deployment of the service used for campaign scheduling contained an internal permissions check that failed because an expected default value was not set. This failure resulted in an elevated error rate.

Mitigation:

Once the root cause was identified using a source code analyzer, we corrected the post that was missing the expected default value. This allowed all of the remaining scheduled campaigns to be published. Any other campaigns that had not gone out on time were then manually published by the incident management team.

Recurrence Prevention:

The following changes have been implemented to prevent campaign delays caused by internal deployment activities:

  • A hotfix was deployed on Monday, June 3rd, 2024, in which the internal permissions method was updated to ensure the correct default value is set for the new post variable, irrespective of whether the associated feature flag is enabled.
  • Improved error logging in the scheduling software to make it easier and faster to identify irrecoverable errors that can cause the scheduler to queue up and delay sends.
  • Updated the runbooks, which are linked from monitors, with steps to troubleshoot delayed campaigns.
  • Improved monitoring of the schedule_publish endpoint for any errors.
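
For illustration only, the idea behind the hotfix (resolving the default value before the feature flag is consulted) can be sketched as below. This is a minimal sketch and not the actual service code: the names Post, review_required, resolve_review_required, can_publish and DEFAULT_REVIEW_REQUIRED are all hypothetical, and the real variable and permissions method are not described in this report.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical default for the new post-related variable.
DEFAULT_REVIEW_REQUIRED = False


@dataclass
class Post:
    post_id: int
    # Hypothetical new variable; None models a post saved without it being set.
    review_required: Optional[bool] = None


def resolve_review_required(post: Post) -> bool:
    """Return the variable's value, falling back to the default when it is unset."""
    if post.review_required is None:
        return DEFAULT_REVIEW_REQUIRED
    return post.review_required


def can_publish(post: Post, feature_enabled: bool) -> bool:
    """Hypothetical permissions check run before a scheduled post is published."""
    # Resolve the default first, outside any feature-flag branch, so the
    # check never operates on an unset value whichever way the flag is set.
    review_required = resolve_review_required(post)
    if not feature_enabled:
        return True
    return not review_required
```

In this sketch, a post that was saved without the new variable passes the check with the flag on or off, instead of raising an error that could hold up the posts scheduled after it.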
Posted Jun 11, 2024 - 16:38 UTC

Resolved
We have identified that the root cause was related to a governor service change. The latest deployment of the governor service caused an increased error rate. This caused a brief delay; however, the impacted delivery systems are processing as normal. Additional details will be outlined in our postmortem for this service degradation.
Posted Jun 05, 2024 - 04:04 UTC
Update
We are continuing to work on a fix for this issue.
Posted May 15, 2024 - 17:59 UTC
Identified
We have identified a database performance issue and are working to address it. A permission error caused publishing delays; the problematic post has now been fixed.

We will provide another update as soon as more information is made available.
Posted May 15, 2024 - 17:57 UTC
Investigating
We are currently investigating reports of scheduled campaigns not sending.

We will provide you with an update within 1 hour.
Posted May 15, 2024 - 16:22 UTC
This incident affected: Products (New Studio, Classic Studio) and Platforms (US Firstup Platform).