Summary:
On Tuesday, May 14th, 2024, starting at around 11:25 AM EDT (15:25 UTC), we received reports that some users saw errors while accessing the Studio platform or the Web Experience. Reported error messages included:
Scope:
This incident primarily affected users who attempted to access Studio services and, to a lesser degree, users who tried to access the Web Experience between 11:25 AM EDT and 12:09 PM EDT.
Root Cause:
An underlying service (Athena), which is used as part of our machine learning and AI infrastructure, had trouble connecting to one of our core database servers due to high network latency. The service's timeouts were configured too long for its access pattern and the data it uses, causing it to block incoming connections for an extended period. As a result, services that depend on Athena also timed out, producing the Studio service degradation and error messages observed by impacted users.
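To illustrate why an overly long timeout turns a slow database into a stalled service, the sketch below applies Little's law: the number of requests held open is roughly the arrival rate multiplied by how long each request is held. The pool size and arrival rate are hypothetical, since Athena's real values are not stated in this report. The two timeout values are the ones given under Recurrence Prevention.

```python
# Hypothetical figures for illustration only; not Athena's real configuration.
POOL_SIZE = 100          # assumed number of connections Athena can hold open
ARRIVAL_RATE_PER_S = 10.0  # assumed incoming requests per second


def in_flight_requests(arrival_rate_per_s: float, timeout_s: float) -> float:
    """Little's law: requests held open = arrival rate x time each one is held.

    During a database stall, each request is held for the full timeout.
    """
    return arrival_rate_per_s * timeout_s


def pool_saturates(pool_size: int, arrival_rate_per_s: float, timeout_s: float) -> bool:
    """True if stalled requests would occupy the whole pool before timing out."""
    return in_flight_requests(arrival_rate_per_s, timeout_s) >= pool_size


for timeout_s in (60.0, 5.0):
    held = in_flight_requests(ARRIVAL_RATE_PER_S, timeout_s)
    print(timeout_s, held, pool_saturates(POOL_SIZE, ARRIVAL_RATE_PER_S, timeout_s))
# 60.0 600.0 True
# 5.0 50.0 False
```

Under these assumed numbers, a 60-second timeout holds six times more requests than the pool can carry, so new callers are blocked, while a 5-second timeout keeps the pool half free.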
Mitigation:
The immediate impact was mitigated by performing a rolling restart of the affected services, and all Studio functions were restored by 12:09 PM EDT (16:09 UTC).
Recurrence Prevention:
To prevent a recurrence of this incident, the Time-To-Live (TTL) for connection requests from Athena to our core database will be reduced from the default of 60 seconds to 5 seconds. This should greatly reduce the backlog of requests from other services to Athena when the database is slow to respond.
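As a rough sketch of what a bounded connection attempt looks like, the example below opens a TCP connection that gives up after 5 seconds. It assumes the TTL described here behaves like a client-side connection timeout. The function name and constant are hypothetical, and the report does not name the actual database driver or its settings.

```python
import socket

# 5 seconds is the value stated in this report; the constant name is hypothetical.
DB_CONNECT_TIMEOUT_S = 5.0


def connect_with_timeout(
    host: str, port: int, timeout_s: float = DB_CONNECT_TIMEOUT_S
) -> socket.socket:
    """Open a TCP connection, failing fast instead of blocking indefinitely.

    Raises an OSError subclass (for example TimeoutError) if the connection
    cannot be established within timeout_s seconds.
    """
    # create_connection applies the timeout to the connect attempt and leaves
    # it set on the returned socket, so later reads and writes are bounded too.
    return socket.create_connection((host, port), timeout=timeout_s)
```

A caller would typically catch `OSError` around this call and return an error to its own client right away, rather than holding the upstream request open while the database is unreachable.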