EU Communities - Campaign Send Service Disruption
Incident Report for Firstup
Postmortem

Summary:
On July 23rd, 2024, beginning at approximately 6:04 AM ET (10:04 UTC), we started receiving reports that published email campaigns were not reaching their intended audiences. The reports came from customers hosted on the EU Firstup platform.

Scope and Impact:
This incident was restricted to customers hosted on the EU Firstup platform and impacted email campaigns published between 4:00 AM ET (08:00 UTC) and 7:43 AM ET (11:43 UTC).

Root Cause:
An update to a user’s profile erroneously triggered a simultaneous update of more than 400 campaigns authored by that user. This overloaded the backend user planning processes and delayed the delivery of other campaigns.
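The effect described above can be illustrated with a minimal sketch. It assumes a single shared first-in, first-out queue served by a fixed pool of workers, which may not match the actual planning architecture. The function name, worker count, and per-job duration are hypothetical; only the figure of roughly 400 campaign updates comes from this report.

```python
def fifo_finish_times(jobs, workers, seconds_per_job):
    """Finish time of each job in a shared FIFO queue (illustrative model only).

    Assumes every job takes the same time and that `workers` jobs run in parallel,
    so the job at queue position i finishes at the end of round i // workers.
    """
    return {
        name: (position // workers + 1) * seconds_per_job
        for position, name in enumerate(jobs)
    }


# Hypothetical burst: about 400 campaign updates triggered by one profile change.
burst = [f"user-campaign-update-{n}" for n in range(400)]

# Another customer's campaign arrives just behind the burst.
queue = burst + ["other-campaign"]

# Worker count and job duration are assumed values, not measured ones.
delayed = fifo_finish_times(queue, workers=4, seconds_per_job=30.0)
baseline = fifo_finish_times(["other-campaign"], workers=4, seconds_per_job=30.0)

print(baseline["other-campaign"])  # wait when the queue is otherwise empty
print(delayed["other-campaign"])   # wait when queued behind the burst
```

Under these assumed numbers, the unrelated campaign waits behind every job in the burst, which is the kind of delay reported during the incident window.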

Mitigation:
The system recovered on its own once all of this user’s campaigns had been updated, and processing for the delivery of other campaigns then resumed.

Recurrence Prevention:
The following changes have been implemented to prevent campaign delays caused by processing spikes from individual user planning processes:

  • Enhanced Monitoring: By the time the investigation into the incident began, the processing queue had recovered on its own and delivery had resumed. To catch such delays sooner, we have added a delivery-specific time threshold to our processing monitors. When the threshold is exceeded, additional cloud compute capacity is automatically brought online to help prevent a deadlock condition caused by a single planning process monopolizing shared resources.
  • Updated Testing and Runbooks: We have updated our regression testing and runbooks to include steps for troubleshooting delayed campaigns linked to planner queue activity.
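
The time-threshold monitor described in the first item can be sketched roughly as follows. This is an assumption-laden illustration, not Firstup's implementation: the `QueueSnapshot` type, the `desired_workers` function, and every threshold value are hypothetical, and a real monitor would call a cloud provider's scaling API rather than return a number.

```python
from dataclasses import dataclass


@dataclass
class QueueSnapshot:
    """Hypothetical view of the delivery queue at one point in time."""
    oldest_job_age_seconds: float  # how long the oldest pending delivery has waited
    workers: int                   # compute workers currently online


def desired_workers(snapshot, max_age_seconds=300.0, step=2, max_workers=20):
    """Return the worker count the monitor should request.

    If the oldest pending delivery has waited longer than the threshold,
    ask for `step` more workers, capped at `max_workers`. Otherwise keep
    the current count. All default values are assumed for illustration.
    """
    if snapshot.oldest_job_age_seconds > max_age_seconds:
        return min(snapshot.workers + step, max_workers)
    return snapshot.workers


# A delivery has been waiting 10 minutes with 4 workers online: scale up.
print(desired_workers(QueueSnapshot(oldest_job_age_seconds=600.0, workers=4)))

# The queue is healthy: no change.
print(desired_workers(QueueSnapshot(oldest_job_age_seconds=60.0, workers=4)))
```

Keying the check on how long a delivery has waited, rather than on queue length alone, means the monitor reacts to delayed campaigns even when the backlog comes from a single planning process.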
Posted Aug 08, 2024 - 01:01 UTC

Resolved
This platform service degradation is now resolved, and an RCA will be provided once a full incident postmortem has been completed.
Posted Aug 06, 2024 - 18:32 UTC
Monitoring
Queued campaigns are now successfully being delivered and we are currently monitoring the issue.
Posted Jul 23, 2024 - 11:47 UTC
Investigating
We are investigating a potential service disruption affecting campaign sending from EU communities.

Users in communities hosted on our EU infrastructure may not be receiving campaign emails or may be experiencing email delays at this time.

Our next update will be in 30 minutes.
Posted Jul 23, 2024 - 11:42 UTC
This incident affected: Platforms (EU Firstup Platform).