Summary:
On July 10th, 2024, beginning at approximately 1:30 PM ET (17:30 UTC), we began receiving reports of published email campaigns that had not been delivered to their intended audiences for over an hour. Given the volume of reports, a platform incident was declared at 2:18 PM ET (18:18 UTC) and an incident response team began investigating. A second platform incident was declared on July 11th, 2024, after we received reports of audiences being inaccessible or taking too long to load.
Impact:
The service degradation was intermittent and restricted to the US platform. It affected access to some audiences and some campaigns published from July 10th, 2024, at 11:20 AM ET (15:20 UTC) through July 15th, 2024, at 6:23 PM ET (22:23 UTC).
Root Cause:
Both incidents stemmed from an overload of the ElasticSearch service, which resolves Audiences to User IDs and email addresses. Requests that ElasticSearch cannot process are temporarily stored in an error queue; a surge of these queued error messages overwhelmed the service, causing it to intermittently stop serving requests until it could catch up.
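The failure mode can be illustrated with a minimal sketch (Python; names such as index_document and ERROR_QUEUE are hypothetical, not our production code): requests the search backend cannot process are parked on an error queue and later replayed against the same backend, so a surge of errors adds retry traffic on top of normal load.

```python
from collections import deque

# Hypothetical sketch of the error-queue pattern described above.
ERROR_QUEUE = deque()   # holds requests the search backend could not process
MAX_RETRIES = 5


def index_document(doc):
    """Stand-in for a request to the search backend; raises under overload."""
    raise TimeoutError("backend overloaded")  # simulate the failure mode


def handle_request(doc, attempt=0):
    try:
        index_document(doc)
    except TimeoutError:
        if attempt < MAX_RETRIES:
            # Failed requests are parked on the error queue for later replay.
            ERROR_QUEUE.append((doc, attempt + 1))


def reprocess_errors():
    # Queued errors are replayed against the same backend, so a surge of
    # errors becomes extra retry traffic on an already strained cluster.
    while ERROR_QUEUE:
        doc, attempt = ERROR_QUEUE.popleft()
        handle_request(doc, attempt)


handle_request({"campaign": "example"})
reprocess_errors()
print(f"unresolved errors remaining: {len(ERROR_QUEUE)}")
```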
Mitigation:
The issue was immediately addressed by reducing the number of workers sending requests to ElasticSearch and increasing the number of nodes processing those requests. This reduced the strain on ElasticSearch, allowing the request queue to clear faster. Additionally, the error messages were manually reprocessed, making audiences accessible and campaigns publishable again.
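As a rough illustration of this mitigation, the sketch below (Python; replay_message, WORKERS, and BATCH_SIZE are hypothetical stand-ins, not our tooling) replays queued error messages with a smaller worker pool and pauses between batches so the backend's request queue can drain.

```python
import time
from concurrent.futures import ThreadPoolExecutor

WORKERS = 4          # reduced pool size to ease load on the search backend
BATCH_SIZE = 100
PAUSE_SECONDS = 1.0


def replay_message(msg):
    """Stand-in for re-submitting one failed request to the search backend."""
    print(f"reprocessed {msg}")


def reprocess(error_messages):
    with ThreadPoolExecutor(max_workers=WORKERS) as pool:
        for start in range(0, len(error_messages), BATCH_SIZE):
            batch = error_messages[start:start + BATCH_SIZE]
            # Submit one bounded batch, wait for it to finish, then pause
            # so the backend's request queue can clear between batches.
            list(pool.map(replay_message, batch))
            time.sleep(PAUSE_SECONDS)


if __name__ == "__main__":
    reprocess([f"msg-{i}" for i in range(250)])
```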
Recurrence Prevention:
Errors in the queue are normal and typically resolve through automatic reprocessing. However, to prevent future occurrences: