On February 28th, 2024, starting at around 1:11 PM PT (18:11 UTC), we started receiving reports that some users had not received an email from a scheduled campaign, and subsequently additional reports on February 29th, 2024, where some scheduled campaigns were still showing in the scheduled folder in Studio past their scheduled publish time.
Impact was primarily related to campaigns that were scheduled to publish between 02.28.2024 at 11:16 AM ET and 02.29.2024 at 1:06 PM ET.
The root cause was determined to be memory exhaustion in our core database on 02.28.2024 at 11:16 AM ET, which triggered an automatic database failover by AWS infrastructure failure service. Post-failover, dependent services that manage scheduled campaigns did not automatically reconnect to the failover database, and therefore could not initiate a “publish” event for scheduled campaigns at the scheduled time.
The immediate problem was mitigated by querying the database for past-due scheduled campaigns and manually publishing them. Additionally, the services responsible for scheduled campaigns were manually restarted to establish connections to the failover database, in effect allowing them to initiate “publish” events for scheduled campaigns as expected.
An incident response team post-mortem meeting revealed the following as recurrence prevention measures to be taken:
● Removal of SQL comments to reduce database memory consumption.
● Increase database instance size by upgrading the Postgres version.
● Improve monitoring and alerting on database connections and memory usage using dedicated dashboards that include links to runbooks and mitigation instructions.
● Fix failover and error handling in the affected services.