Campaign Delivery Service Disruption

Incident Report for Firstup

Postmortem

Summary:

On Saturday, February 22nd, 2025, starting at 6:45 PM PT, the US-East RDS database cluster responsible for campaign delivery for US-hosted customers entered a crash loop during a database table repacking process, forcing the database to restart after every crash. Platform communications could not be processed and sent while the database was restarting, so delivery was intermittent rather than continuous.

Severity:

Sev2

Scope:

The scope of this service disruption was restricted to customers on the US platform utilizing the US-East database cluster.

Impact:

Some campaign communications, on any delivery channel, may have experienced slight delivery delays for the duration of this incident (8hrs 27mins).

Customers on Employee Intranet solutions, Web Experience, Studio, custom API solutions, and Embedded content endpoints would not have seen any impact.

Root Cause:

The root cause of this incident was a break in integrity between a database table and its index, following a repacking task run on the table to improve database performance. During the repacking, the table's index was identified as invalid and was skipped by the repacking process. The resulting mismatch between the table and its index caused the database to enter a crash loop starting at 6:45 PM PT, restarting after every crash in an attempt to recover.

In the periods between a successful database restart and the next database crash, other database processes were running as expected and campaign communications were sent out successfully.

Mitigation:

To mitigate this incident, the invalid index for the table was removed and recreated on Sunday, February 23rd, 2025, at 3:12 AM PT, in effect ending the crash loop on the database.
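The 8hrs 27mins duration quoted under Impact can be checked from the start and mitigation times given in this report. The following is a minimal sketch, assuming that "PT" in late February means Pacific Standard Time (UTC-8); the variable names are illustrative only.

```python
from datetime import datetime, timedelta, timezone

# Assumption: "PT" in late February is Pacific Standard Time (UTC-8).
PST = timezone(timedelta(hours=-8), "PST")

# Times taken from the Summary and Mitigation sections of this report.
crash_loop_start = datetime(2025, 2, 22, 18, 45, tzinfo=PST)
index_recreated = datetime(2025, 2, 23, 3, 12, tzinfo=PST)

duration = index_recreated - crash_loop_start
hours, remainder = divmod(int(duration.total_seconds()), 3600)
minutes = remainder // 60

print(f"{hours}hrs {minutes}mins")
```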

Recurrence Prevention:

To prevent this incident from recurring, we have performed the following actions:

  • Checked all database indexes, and corrected any identified invalid indexes, including the offending index in this incident.
  • Added a parameter to the table repacking task that skips repacking any table whose indexes are found to be invalid, thereby preserving relational integrity and avoiding a database crash.
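
The second action above amounts to a pre-flight check: before repacking, a table is skipped entirely if any of its indexes is invalid. The sketch below illustrates that rule only; it is not the actual task. This report does not name the database engine or the repacking tool, so the catalog query assumes a PostgreSQL-compatible database (where `pg_index.indisvalid` marks invalid indexes), and `plan_repack` and the table and index names are hypothetical.

```python
from typing import Dict, List, Tuple

# Assumption: a PostgreSQL-compatible catalog. This query lists every index
# with its validity flag; its rows could be grouped by table to build the
# input for plan_repack below.
INDEX_VALIDITY_QUERY = """
SELECT t.relname AS table_name,
       i.relname AS index_name,
       x.indisvalid AS is_valid
FROM pg_index x
JOIN pg_class t ON t.oid = x.indrelid
JOIN pg_class i ON i.oid = x.indexrelid
"""


def plan_repack(
    indexes_by_table: Dict[str, List[Tuple[str, bool]]],
) -> Tuple[List[str], Dict[str, List[str]]]:
    """Split tables into those safe to repack and those to skip.

    indexes_by_table maps a table name to (index_name, is_valid) pairs.
    A table is skipped as a whole if any of its indexes is invalid.
    """
    to_repack: List[str] = []
    skipped: Dict[str, List[str]] = {}
    for table, indexes in indexes_by_table.items():
        invalid = [name for name, is_valid in indexes if not is_valid]
        if invalid:
            # Record which indexes need fixing before a later repack.
            skipped[table] = invalid
        else:
            to_repack.append(table)
    return to_repack, skipped


# Hypothetical example data, not taken from the incident.
example = {
    "deliveries": [("deliveries_pkey", True), ("deliveries_user_idx", False)],
    "campaigns": [("campaigns_pkey", True)],
}
to_repack, skipped = plan_repack(example)
print(to_repack)
print(skipped)
```

Skipping the whole table, rather than only the invalid index, matches the stated goal of keeping a table and its indexes consistent with each other.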
Posted Mar 03, 2025 - 15:57 UTC

Resolved

The applied fix has restored operations, and campaign delivery has resumed. Any campaigns scheduled during the incident have now been sent and delivered successfully.

This issue is now resolved. Thank you for your patience while we investigated and implemented a fix. If you experience any further issues, please contact Support at support.firstup.io.
Posted Feb 23, 2025 - 13:23 UTC

Identified

We are implementing a fix to attempt to mitigate the incident affecting campaign delivery. We'll provide further updates once the fix has been applied and we have monitored the service.
Posted Feb 23, 2025 - 11:38 UTC

Update

We are continuing to investigate the issue affecting campaign delivery to email and push notification channels for our US customers.

We will provide further updates as soon as we have more information.
Posted Feb 23, 2025 - 10:30 UTC

Update

We are continuing to investigate the issue affecting campaign delivery to email and push notification channels for our US customers. Our teams are actively working to resolve the issue as soon as possible.

We will provide our next update in 30 minutes or sooner if there are significant developments.
Posted Feb 23, 2025 - 09:28 UTC

Investigating

We are currently investigating an issue affecting campaign delivery to email and push notification channels for our US customers.

Publishing to the feed and Notification Center is not affected. Content will continue to publish and display on web, mobile, and other endpoints as expected, and notifications are populating in the Notification Center.

Customers in our EU communities (EU DC) are unaffected.

We are actively working to identify the root cause and will provide an update within 30 minutes.
Posted Feb 23, 2025 - 08:55 UTC
This incident affected: Platforms (US Firstup Platform).