Platform Service Degradation - Studio Slow Performance
Incident Report for Firstup
Postmortem

Summary:

On July 10th, 2024, beginning at approximately 1:30 PM ET (17:30 UTC), we began receiving reports that published email campaigns had not reached their intended audiences more than 1 hour after publication. Due to the number of reports received, a platform incident was declared at 2:18 PM ET (18:18 UTC) and an incident response team began investigating. A second platform incident was declared on July 11th, 2024, after we received reports of audiences that were inaccessible or took too long to load.

Impact:

The service degradation was intermittent and restricted to the US platform. It affected access to some audiences and the delivery of some campaigns published from July 10th, 2024, at 11:20 AM ET (15:20 UTC) through July 15th, 2024, at 6:23 PM ET (22:23 UTC).

Root Cause:

Both incidents stemmed from an overload of the ElasticSearch service, which resolves Audiences to User IDs and email addresses. Messages that ElasticSearch cannot process are stored temporarily in an error queue. A surge of these error messages overwhelmed the service, causing it to intermittently stop serving requests until it could catch up.

Mitigation:

The issue was immediately addressed by reducing the number of workers sending requests to ElasticSearch and increasing the number of nodes processing those requests. This reduced the strain on ElasticSearch, allowing the request queue to clear faster. Additionally, the error messages were manually reprocessed, making audiences accessible and campaigns publishable again.
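
For illustration only, the two mitigation steps (capping how many workers send requests at once, and reprocessing messages that previously errored) can be sketched as below. This is not the code used during the incident. The function `drain_with_limit` and its parameters `send`, `max_workers`, and `max_attempts` are hypothetical names chosen for this sketch, and `send` stands in for whatever call submits a message to the search service.

```python
from collections import deque
from concurrent.futures import ThreadPoolExecutor


def drain_with_limit(messages, send, max_workers=2, max_attempts=3):
    """Send each message with at most `max_workers` requests in flight.

    Hypothetical sketch: `send` is any callable that raises on failure.
    Failed messages are retried in later passes, up to `max_attempts`
    attempts each. Returns (delivered, failed) lists.
    """
    pending = deque((m, 1) for m in messages)
    delivered, failed = [], []
    # The pool size is the cap on concurrent requests to the backend.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while pending:
            batch = list(pending)
            pending.clear()
            futures = [(m, n, pool.submit(send, m)) for m, n in batch]
            for m, n, fut in futures:
                try:
                    fut.result()
                    delivered.append(m)
                except Exception:
                    if n < max_attempts:
                        pending.append((m, n + 1))  # reprocess in the next pass
                    else:
                        failed.append(m)
    return delivered, failed
```

Lowering `max_workers` trades throughput for less pressure on the backend, which is the same trade-off described above: fewer concurrent requests give the service room to clear its queue.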

Recurrence Prevention:

Errors in the queue are normal and typically resolve through automatic reprocessing. However, to prevent future occurrences:

  • We doubled ElasticSearch’s processing power on July 15th, 2024, at 6:23 PM ET (22:23 UTC) to better handle spikes.
  • We enabled additional monitoring and dashboards for early detection and mitigation of potential issues.
  • We will investigate and address the sources of the errors to ensure a healthier service.
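
As a rough illustration of the early-detection idea, a monitor might alert only when the error-queue depth stays above a threshold for several consecutive samples, so that the normal, self-resolving errors described above do not trigger it. This is a minimal sketch under that assumption, not a description of the monitoring that was deployed. The name `should_alert` and the `threshold` and `consecutive` parameters are hypothetical.

```python
def should_alert(depth_samples, threshold, consecutive=3):
    """Return True if the last `consecutive` samples all exceed `threshold`.

    Hypothetical sketch: `depth_samples` is a list of queue-depth readings,
    oldest first. A single brief spike does not raise an alert.
    """
    if len(depth_samples) < consecutive:
        return False
    return all(d > threshold for d in depth_samples[-consecutive:])
```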
Posted Jul 26, 2024 - 17:42 UTC

Resolved
This incident is now resolved, and all associated services are fully operational.
Posted Jul 26, 2024 - 17:42 UTC
Monitoring
Studio performance has now been restored and all functionality should be available.

We will be placing these services under monitoring for now.
Posted Jul 15, 2024 - 20:59 UTC
Update
We continue to investigate the recurrence of this issue and will provide another update within 1 hour.
Posted Jul 15, 2024 - 20:15 UTC
Investigating
We have received reports of a recurrence of this issue (including delayed campaign deliveries), and are actively investigating it again.

We will provide another update within 1 hour.
Posted Jul 15, 2024 - 19:13 UTC
Monitoring
Studio performance has now been restored and all functionality should be available.

We will be placing these services under monitoring for now.
Posted Jul 11, 2024 - 21:32 UTC
Update
We continue to work on a fix for this issue and will provide another update within the next hour, or as soon as the fix is deployed.
Posted Jul 11, 2024 - 21:21 UTC
Update
We continue to work on a fix for this issue and will provide another update within the next hour, or as soon as the fix is deployed.
Posted Jul 11, 2024 - 20:16 UTC
Identified
We have identified the cause of this service degradation, and are working to mitigate the issue.

We will provide another update within 1 hour.
Posted Jul 11, 2024 - 19:15 UTC
Investigating
We are currently investigating reports of slow Studio performance with errors loading some pages, including Audiences.

We will provide you with an update within 1 hour.
Posted Jul 11, 2024 - 19:07 UTC
This incident affected: Products (Creator Studio).