Summary:
On July 10th, 2024, beginning at approximately 1:30 PM ET (17:30 UTC), we began receiving reports of published email campaigns that had not been delivered to their intended audiences for over an hour. Given the volume of reports, a platform incident was declared at 2:18 PM ET (18:18 UTC) and an incident response team began investigating. A second platform incident was declared on July 11th, 2024, after we received reports of audiences being inaccessible or taking too long to load.
Impact:
The service degradation was intermittent and restricted to the US platform. It affected access to some audiences and some campaigns published from July 10th, 2024, at 11:20 AM ET (15:20 UTC) through July 15th, 2024, at 6:23 PM ET (22:23 UTC).
Root Cause:
Both incidents stemmed from an overload of the ElasticSearch service, which resolves Audiences to User IDs and email addresses. Requests that ElasticSearch cannot process are temporarily stored in an error queue; a surge of these queued error messages overwhelmed the service, causing it to intermittently stop serving requests until it could catch up.
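The failure mode can be illustrated with a minimal sketch (Python; names such as index_document and ERROR_QUEUE are hypothetical, not our production code): requests the search backend cannot process are parked on an error queue and later replayed against the same backend, so a surge of errors adds retry traffic on top of normal load.

```python
from collections import deque

# Hypothetical sketch of the error-queue pattern described above.
ERROR_QUEUE = deque()   # holds requests the search backend could not process
MAX_RETRIES = 5


def index_document(doc):
    """Stand-in for a request to the search backend; raises under overload."""
    raise TimeoutError("backend overloaded")  # simulate the failure mode


def handle_request(doc, attempt=0):
    try:
        index_document(doc)
    except TimeoutError:
        if attempt < MAX_RETRIES:
            # Failed requests are parked on the error queue for later replay.
            ERROR_QUEUE.append((doc, attempt + 1))


def reprocess_errors():
    # Queued errors are replayed against the same backend, so a surge of
    # errors becomes extra retry traffic on an already strained cluster.
    while ERROR_QUEUE:
        doc, attempt = ERROR_QUEUE.popleft()
        handle_request(doc, attempt)


handle_request({"campaign": "example"})
reprocess_errors()
print(f"unresolved errors remaining: {len(ERROR_QUEUE)}")
```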
Mitigation:
The issue was immediately addressed by reducing the number of workers sending requests to ElasticSearch and increasing the number of nodes processing those requests. This reduced the strain on ElasticSearch, allowing the request queue to clear faster. Additionally, the error messages were manually reprocessed, making audiences accessible and campaigns publishable again.
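As a rough illustration of this mitigation, the sketch below (Python; replay_message, WORKERS, and BATCH_SIZE are hypothetical stand-ins, not our tooling) replays queued error messages with a smaller worker pool and pauses between batches so the backend's request queue can drain.

```python
import time
from concurrent.futures import ThreadPoolExecutor

WORKERS = 4          # reduced pool size to ease load on the search backend
BATCH_SIZE = 100
PAUSE_SECONDS = 1.0


def replay_message(msg):
    """Stand-in for re-submitting one failed request to the search backend."""
    print(f"reprocessed {msg}")


def reprocess(error_messages):
    with ThreadPoolExecutor(max_workers=WORKERS) as pool:
        for start in range(0, len(error_messages), BATCH_SIZE):
            batch = error_messages[start:start + BATCH_SIZE]
            # Submit one bounded batch, wait for it to finish, then pause
            # so the backend's request queue can clear between batches.
            list(pool.map(replay_message, batch))
            time.sleep(PAUSE_SECONDS)


if __name__ == "__main__":
    reprocess([f"msg-{i}" for i in range(250)])
```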
Recurrence Prevention:
Errors in the queue are normal and typically resolve through automatic reprocessing. However, to prevent future occurrences: