Summary:
On Thursday, February 13th, 2025, our Customer Support Center received tickets reporting delays in campaign email deliveries and processing failures for user import files. After correlating the number of tickets received with the nature of the issues reported, a platform incident was declared at 10:02 AM PT and an incident response team began investigating. A service degradation notification was published on our status page at 10:15 AM PT.
Investigations revealed that the two reported issues were unrelated, so troubleshooting of each issue proceeded independently and concurrently during the incident.
Severity:
Sev2
Scope:
Campaign communication delays:
- Customers on the Firstup platform with email campaigns published on 2/13/2025 between 7:00 AM PT and 7:40 AM PT.
User import/sync issue:
- Customers on the Firstup platform who uploaded user sync and user import files on 2/13/2025 between 2:09 AM PT and 11:22 AM PT.
Impact:
- Campaigns in the scope above may have experienced email delivery delays of 20 to 30 minutes.
- User import and user sync files in the scope above may have returned processing errors or remained stuck in a processing state for up to 9 hours 13 minutes.
All other platform services remained uninterrupted during this incident.
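The window lengths quoted above follow directly from the timestamps in the Scope section. The short sketch below recomputes them; the helper names `window_minutes` and `as_hours_minutes` are illustrative only and are not part of any Firstup tooling.

```python
from datetime import datetime


def window_minutes(start: str, end: str, day: str = "2025-02-13") -> int:
    """Return the length in minutes of a same-day window given 'H:MM AM/PM' times."""
    fmt = "%Y-%m-%d %I:%M %p"
    begin = datetime.strptime(f"{day} {start}", fmt)
    finish = datetime.strptime(f"{day} {end}", fmt)
    return int((finish - begin).total_seconds() // 60)


def as_hours_minutes(total_minutes: int) -> str:
    """Format a minute count as 'Xh Ym'."""
    hours, minutes = divmod(total_minutes, 60)
    return f"{hours}h {minutes}m"


# Times taken from the Scope section of this report (all PT, same day).
import_window = window_minutes("2:09 AM", "11:22 AM")
campaign_window = window_minutes("7:00 AM", "7:40 AM")

print(as_hours_minutes(import_window))    # 9h 13m
print(as_hours_minutes(campaign_window))  # 0h 40m
```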
Root Causes:
Campaign communication delays:
- A long-running database query created a bottleneck that slowed the processing of queries associated with the campaigns in scope of the incident until the long-running query completed at 7:40 AM PT.
User import/sync issue:
- At 2:09 AM PT, a backend configuration change intended to improve Multi-Program user membership was made. As customers then uploaded their user sync files, user re-indexing activity surged, which overloaded the indexing queue and caused delays in bulk uploads and snapshot creation.
Mitigation:
Campaign communication delays:
- No mitigation steps were taken for the campaign communication delays. The long-running query completed on its own, which ended the slowdown in processing normal campaign publishing queries.
User import/sync issue:
- The offending configuration change was identified internally and reverted at 9:36 AM PT. However, the backlog of queued messages had grown large enough that additional processing resources were added at 10:43 AM PT to help drain the queue. The queue returned to normal levels by 11:22 AM PT.
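To illustrate why reverting the change alone did not clear the backlog quickly, the sketch below models drain time as backlog divided by the net processing rate (processing rate minus arrival rate). This is a simplified model, not the platform's capacity tooling, and every number in it is hypothetical: the report does not state queue depth or throughput figures.

```python
import math


def drain_minutes(backlog: int, arrivals_per_min: float, processed_per_min: float) -> float:
    """Estimate minutes to empty a queue, assuming constant arrival and processing rates.

    Returns math.inf when processing does not outpace arrivals,
    because the backlog would never shrink under that assumption.
    """
    net_rate = processed_per_min - arrivals_per_min
    if net_rate <= 0:
        return math.inf
    return backlog / net_rate


# Hypothetical figures, chosen only to show the effect of adding capacity.
backlog = 60_000
baseline = drain_minutes(backlog, arrivals_per_min=500, processed_per_min=800)
scaled_up = drain_minutes(backlog, arrivals_per_min=500, processed_per_min=2_000)
overloaded = drain_minutes(backlog, arrivals_per_min=900, processed_per_min=800)

print(baseline)    # 200.0
print(scaled_up)   # 40.0
print(overloaded)  # inf
```

Under this model, extra processing resources shorten recovery only by widening the gap between the processing rate and the arrival rate, so the benefit depends on how much new work is still arriving.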
Recurrence Prevention:
To prevent a recurrence of these issues from the same root causes, we have performed or have committed to perform the following actions:
Campaign communication delays:
- Created a monitor that will alert on long-running queries, to allow for faster intervention before a platform incident happens.
- Move the offending query to a separate data pipeline not utilized by the campaign delivery process.
- Add statement and deadlock timeouts to the campaign delivery processing database to reduce the impact of long-running queries.
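As an illustration of the long-running query check described above, the following is a minimal sketch and not the monitor deployed on the platform. The `RunningQuery` type, the `find_long_running` function, the query names, and the 5-minute threshold are all hypothetical. The report does not name the database engine; if it were PostgreSQL, a snapshot like this could be read from the `pg_stat_activity` view, and a server-side cap could be set with the `statement_timeout` setting.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class RunningQuery:
    """One entry in a snapshot of currently executing queries (hypothetical shape)."""
    query_id: str
    started_at: datetime


def find_long_running(queries, now, threshold):
    """Return queries that have been running for at least `threshold`, oldest first."""
    offenders = [q for q in queries if now - q.started_at >= threshold]
    return sorted(offenders, key=lambda q: q.started_at)


# Illustrative snapshot; the query names and start times are made up.
now = datetime(2025, 2, 13, 7, 30)
snapshot = [
    RunningQuery("campaign-publish", datetime(2025, 2, 13, 7, 29)),
    RunningQuery("report-export", datetime(2025, 2, 13, 7, 0)),
]

flagged = find_long_running(snapshot, now, threshold=timedelta(minutes=5))
print([q.query_id for q in flagged])  # ['report-export']
```

Passing `now` in explicitly, rather than reading the clock inside the function, keeps the check deterministic and easy to test.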
User import/sync issue:
- Add monitoring and alerting for User-Sync and Bulk-Upload errors on a platform level.
- Evaluate permanently increasing processing resources to account for unforeseen surges in the message queue.
- Review front-end UI error messages to make them more intuitive.
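The platform-level alerting on User-Sync and Bulk-Upload errors listed above could take many forms. One simple approach, sketched here under assumptions, is to track the failure ratio over the most recent N file-processing outcomes and alert when it exceeds a limit. The `ErrorRateMonitor` class, the window size, and the 30% limit are hypothetical and are not taken from the platform.

```python
from collections import deque


class ErrorRateMonitor:
    """Alert when the failure ratio over the last `window_size` outcomes exceeds a limit."""

    def __init__(self, window_size: int, max_error_ratio: float):
        if window_size <= 0:
            raise ValueError("window_size must be positive")
        self._outcomes = deque(maxlen=window_size)  # True = success, False = failure
        self._max_error_ratio = max_error_ratio

    def record(self, succeeded: bool) -> bool:
        """Record one outcome and return whether an alert should fire."""
        self._outcomes.append(succeeded)
        return self.should_alert()

    def should_alert(self) -> bool:
        # Wait for a full window so a single early failure does not trigger an alert.
        if len(self._outcomes) < self._outcomes.maxlen:
            return False
        failures = sum(1 for ok in self._outcomes if not ok)
        return failures / len(self._outcomes) > self._max_error_ratio


# Illustrative run: 7 successes then 4 failures with a 10-outcome window.
monitor = ErrorRateMonitor(window_size=10, max_error_ratio=0.3)
alerts = [monitor.record(ok) for ok in [True] * 7 + [False] * 4]
print(alerts[-2], alerts[-1])  # False True
```

In this run the tenth outcome fills the window at exactly 30% failures, which does not exceed the limit; the eleventh pushes the ratio to 40% and triggers the alert.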