Platform Service Degradation - Scheduled Campaigns Not Publishing At Scheduled Time
Incident Report for Firstup
Postmortem

Summary:

On February 28th, 2024, starting at around 1:11 PM PT (18:11 UTC), we began receiving reports that some users had not received an email from a scheduled campaign. On February 29th, 2024, we received additional reports that some scheduled campaigns were still showing in the scheduled folder in Studio past their scheduled publish time.

Impact:

Impact was primarily related to campaigns that were scheduled to publish between 02.28.2024 at 11:16 AM ET and 02.29.2024 at 1:06 PM ET.

Root Cause:

The root cause was determined to be memory exhaustion in our core database on 02.28.2024 at 11:16 AM ET, which triggered an automatic database failover by the AWS infrastructure failure service. After the failover, the dependent services that manage scheduled campaigns did not automatically reconnect to the failover database and therefore could not initiate a “publish” event for scheduled campaigns at the scheduled time.
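
The failure mode described here, a service holding on to a connection that is no longer usable after a failover, is commonly handled by discarding the stale connection and reconnecting when an operation fails. The following is a minimal sketch of that idea and not the affected services' actual code. `ReconnectingClient`, `connect`, and the retry parameters are hypothetical names chosen for illustration, and `ConnectionError` stands in for whatever error the real database driver raises.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class ReconnectingClient:
    """Wraps a connection factory and reconnects when an operation fails.

    Hypothetical sketch: `connect` stands in for whatever driver call opens
    a connection to the database endpoint.
    """

    def __init__(self, connect: Callable[[], object], max_attempts: int = 3,
                 base_delay: float = 0.5) -> None:
        self._connect = connect
        self._max_attempts = max_attempts
        self._base_delay = base_delay
        self._conn: Optional[object] = None

    def run(self, operation: Callable[[object], T]) -> T:
        """Run `operation(conn)`, reconnecting on ConnectionError."""
        last_error: Optional[Exception] = None
        for attempt in range(self._max_attempts):
            try:
                if self._conn is None:
                    self._conn = self._connect()
                return operation(self._conn)
            except ConnectionError as exc:
                last_error = exc
                self._conn = None  # drop the stale connection
                if attempt < self._max_attempts - 1:
                    # exponential backoff before the next attempt
                    time.sleep(self._base_delay * (2 ** attempt))
        raise ConnectionError(
            f"operation failed after {self._max_attempts} attempts"
        ) from last_error
```

With a wrapper of this kind, a scheduler that calls `client.run(...)` for each publish attempt would open a fresh connection on the next attempt instead of requiring a manual restart. Whether that fits the affected services depends on details not covered in this report.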

Mitigation:

The immediate problem was mitigated by querying the database for past-due scheduled campaigns and manually publishing them. Additionally, the services responsible for scheduled campaigns were manually restarted to establish connections to the failover database, which allowed them to initiate “publish” events for scheduled campaigns as expected.
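
As an illustration of the first mitigation step, a query for past-due scheduled campaigns might look like the sketch below. The table name `campaigns`, the columns `status` and `publish_at`, and the status values are assumptions made for this example and are not taken from the actual schema. An in-memory SQLite database is used only so the sketch is self-contained; the production database is Postgres.

```python
import sqlite3
from datetime import datetime, timezone


def find_past_due_campaigns(conn: sqlite3.Connection, now: datetime) -> list[int]:
    """Return ids of campaigns still marked 'scheduled' whose publish time has passed.

    Timestamps are stored as UTC ISO 8601 strings in one consistent format,
    so comparing them as strings matches chronological order.
    """
    rows = conn.execute(
        "SELECT id FROM campaigns "
        "WHERE status = 'scheduled' AND publish_at <= ? "
        "ORDER BY publish_at",
        (now.isoformat(),),
    ).fetchall()
    return [row[0] for row in rows]


def build_demo_db() -> sqlite3.Connection:
    """Create a throwaway database with hypothetical sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE campaigns (id INTEGER PRIMARY KEY, status TEXT, publish_at TEXT)"
    )
    conn.executemany(
        "INSERT INTO campaigns VALUES (?, ?, ?)",
        [
            (1, "scheduled", "2024-02-28T17:00:00+00:00"),  # past due, never published
            (2, "published", "2024-02-28T15:00:00+00:00"),  # published on time
            (3, "scheduled", "2024-03-05T12:00:00+00:00"),  # still in the future
        ],
    )
    return conn


if __name__ == "__main__":
    demo = build_demo_db()
    check_time = datetime(2024, 2, 29, 18, 0, tzinfo=timezone.utc)
    print(find_past_due_campaigns(demo, check_time))
```

The campaigns returned by such a query are the candidates for manual publishing.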

Recurrence Prevention:

The incident response team's post-mortem meeting identified the following recurrence prevention measures:

●      Removal of SQL comments to reduce database memory consumption.

●      Increase database instance size by upgrading the Postgres version.

●      Improve monitoring and alerting on database connections and memory usage using dedicated dashboards that include links to runbooks and mitigation instructions.

●      Fix failover and error handling in the affected services.

Posted Mar 21, 2024 - 16:10 UTC

Resolved
This service degradation is now considered resolved, and all impacted services have remained available and stable.
Posted Mar 07, 2024 - 17:35 UTC
Monitoring
We have identified a potential issue that caused some scheduled campaigns not to publish at the scheduled time. This only affected campaigns that were scheduled at a specific moment in time, and those campaigns have been manually published. Any campaigns scheduled to publish from now on should not experience any issues and should publish at the scheduled time.

We will provide additional details in our postmortem for this service degradation.

This incident is now considered mitigated.
Posted Feb 29, 2024 - 18:08 UTC
Update
We continue to investigate the cause of this service degradation, and will provide another update within 1 hour.
Posted Feb 29, 2024 - 17:31 UTC
Update
We have manually published all campaigns that were scheduled on or after 2/28/2024 but did not publish at the expected time.

We continue to investigate the cause of this service degradation, and will provide another update within 1 hour.
Posted Feb 29, 2024 - 16:34 UTC
Investigating
We are currently investigating reports that some scheduled campaigns did not publish at the scheduled time.

We will provide an update within 1 hour.
Posted Feb 29, 2024 - 16:11 UTC
This incident affected: Products (New Studio).