Delay with Campaign Delivery and User Imports being uploaded.

Incident Report for Firstup

Postmortem

Summary:

On Thursday, February 13th, 2025, reports of delays in campaign email deliveries, as well as processing failures associated with user import files, were received via tickets to our Customer Support Center. After correlating the number of tickets received with the nature of the issues reported, a platform incident was declared at 10:02 AM PT and an incident response team began investigating the reports. A service degradation notification was subsequently published on our status page at 10:15 AM PT.

The investigation revealed that the two reported issues were unrelated, so troubleshooting of each issue continued independently and concurrently during the incident.

Severity:

Sev2

Scope:

Campaign communication delays:

  • Customers on the Firstup platform with email campaigns published on 2/13/2025 between 7:00 AM PT and 7:40 AM PT.

User import/sync issue:

  • Customers on the Firstup platform who uploaded user sync and user import files on 2/13/2025 between 2:09 AM PT and 11:22 AM PT.

Impact:

  • Campaigns in the scope above may have experienced email delivery delays of 20 to 30 minutes.
  • User import and user sync files in the scope above may have returned processing errors or been stuck in a processing state for up to 9 hours 13 minutes.

All other platform services remained uninterrupted during this incident.

Root Causes:

Campaign communication delays:

  • A long-running query in the database created a bottleneck, slowing the processing of queries associated with the campaigns in scope of the incident until the long-running query completed at 7:40 AM PT.

User import/sync issue:

  • Following a backend configuration change at 2:09 AM PT that was intended to improve Multi-Program user membership, user re-indexing activity started to surge as customers uploaded their user sync files. This overloaded the indexing queue and caused delays in bulk uploads and snapshot creation.

Mitigation:

Campaign communication delays:

  • No mitigation steps were taken to resolve the campaign communication delays, as the long-running query completed on its own, eliminating the slowdown in processing normal campaign publishing queries.

User import/sync issue:

  • The offending configuration change was identified internally and reverted at 9:36 AM PT. However, the backlog of messages waiting in the queue had grown so large that additional processing resources were added at 10:43 AM PT to help drain it. The queue returned to normal levels by 11:22 AM PT.
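
As a rough illustration of why adding processing resources shortens the time needed to drain a backlog, the sketch below estimates drain time from the backlog size and the arrival and processing rates. The function name and all figures are hypothetical and are not measurements from this incident.

```python
def minutes_to_drain(backlog, arrival_per_min, service_per_min):
    """Estimate minutes until a queue backlog is empty.

    Returns 0.0 when there is no backlog, and None when messages arrive
    at least as fast as they are processed (the queue never drains).
    All inputs are hypothetical; they are not figures from the incident.
    """
    if backlog <= 0:
        return 0.0
    net_rate = service_per_min - arrival_per_min  # messages cleared per minute
    if net_rate <= 0:
        return None
    return backlog / net_rate


# Hypothetical example: doubling processing capacity from 100 to 200
# messages per minute cuts the estimated drain time for the same backlog.
before = minutes_to_drain(6000, 50, 100)
after = minutes_to_drain(6000, 50, 200)
print(before, after)
```

Under these assumed rates, the net drain rate rises from 50 to 150 messages per minute, so the estimate falls to one third of its original value.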

Recurrence Prevention:

To prevent a recurrence of the issues in this incident due to the same root causes, we have performed or committed to perform the following actions:

Campaign communication delays:

  • Created a monitor that will alert on long-running queries, to allow for faster intervention before a platform incident happens.
  • Move the offending query to a separate data pipeline not utilized by the campaign delivery process.
  • Add Statement and Deadlock timeouts to the campaign delivery processing database to reduce the impact of long-running queries.
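
The core check behind a long-running query monitor can be sketched as follows. This is a minimal illustration only: the `RunningQuery` type, the `find_long_running` function, and the 10-minute threshold are assumptions made for the example, not a description of the monitor deployed on the platform.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class RunningQuery:
    query_id: str         # hypothetical identifier for an active query
    started_at: datetime  # when the query began executing


def find_long_running(queries, now, threshold):
    """Return queries running longer than `threshold`, oldest first."""
    overdue = [q for q in queries if now - q.started_at > threshold]
    return sorted(overdue, key=lambda q: q.started_at)


# Hypothetical snapshot of active queries at 7:30 AM.
now = datetime(2025, 2, 13, 7, 30)
active = [
    RunningQuery("report-export", datetime(2025, 2, 13, 7, 0)),
    RunningQuery("campaign-send", datetime(2025, 2, 13, 7, 28)),
]
for q in find_long_running(active, now, timedelta(minutes=10)):
    print(f"ALERT: {q.query_id} has been running for {now - q.started_at}")
```

In practice the list of active queries would come from the database's own activity view, and the alert would be routed to an on-call channel rather than printed.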

User import/sync issue:

  • Add monitoring and alerting for User-Sync and Bulk-Upload errors on a platform level.
  • Evaluate permanently increasing processing resources to account for unforeseen surges in the message queue.
  • Review front-end UI error messaging to make it more intuitive.
Posted Mar 05, 2025 - 17:26 UTC

Resolved

All impacted systems have remained available and fully functional during the monitoring phase of this service degradation. This incident is now resolved.
Posted Feb 21, 2025 - 20:51 UTC

Monitoring

The User Imports issue has also been mitigated, and all user import files are now processing as expected.

We will be placing the impacted services under monitoring for now.
Posted Feb 13, 2025 - 19:27 UTC

Identified

The email delivery delays issue experienced today has been mitigated, and all email communications should have been delivered by now.

We continue to investigate why some User Import jobs are stuck, and will provide another update within 1 hour.
Posted Feb 13, 2025 - 19:02 UTC

Update

We are continuing to investigate this issue.
Posted Feb 13, 2025 - 18:15 UTC

Investigating

We are investigating reports that campaign communication deliveries on the EU and US platforms are delayed. This may include campaign emails, push notifications and assistant notifications. This is also causing an issue with the User Import feature, as some uploads are stuck in a processing state for an extended period of time.
Posted Feb 13, 2025 - 18:15 UTC
This incident affected: Ecosystem (User Sync), Products (Creator Studio, Classic Studio, Web Experience, Mobile Experience), and 3rd-Party Dependencies (SendGrid API v3).