Platform Service Degradation - Studio Performance Degraded or Inaccessible
Incident Report for Firstup
Postmortem

Summary:

On Tuesday, May 14th, 2024, starting at around 11:25 AM EDT (15:25 UTC), we received reports that some users saw errors while accessing the Studio platform or the Web Experience. Reported error messages included:

  • We’re sorry, but something went wrong.
  • 502 Bad Gateway.
  • There was an error processing your request. Please try again.

Scope:

This incident primarily affected users who attempted to access Studio services and, to a lesser degree, users who tried to access the Web Experience between 11:25 AM EDT and 12:09 PM EDT.

Root Cause:

An underlying service (Athena), which is part of our machine learning and AI infrastructure, could not reliably connect to one of our core database servers due to high network latency. The service's configured timeouts were too long for its access pattern and the data it uses, so it blocked incoming connections for an extended period. As a result, services that depend on Athena also timed out, causing the Studio service degradation and the error messages observed by impacted users.
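
As a rough illustration of why a long timeout can block incoming connections, the sketch below assumes a service with a fixed pool of workers, where each stalled database call holds a worker until its timeout expires. The pool size and the function name are hypothetical and are not taken from Athena's actual configuration; the 60-second and 5-second values are the ones cited under Recurrence Prevention.

```python
def max_stalled_requests_per_second(workers: int, timeout_seconds: float) -> float:
    """Upper bound on requests per second a worker pool can accept while
    every call stalls until its timeout (each worker is held for the
    full timeout before it can take the next request)."""
    if workers <= 0 or timeout_seconds <= 0:
        raise ValueError("workers and timeout_seconds must be positive")
    return workers / timeout_seconds


POOL_SIZE = 30  # hypothetical pool size, for illustration only

for timeout in (60, 5):
    rate = max_stalled_requests_per_second(POOL_SIZE, timeout)
    print(f"timeout={timeout:>2}s: at most {rate:.1f} requests/s accepted "
          f"while database calls are stalling")
```

Under these assumptions, a pool that accepts only 0.5 requests per second at a 60-second timeout accepts 6 per second at 5 seconds, so callers wait far less before they get a response or an error.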

Mitigation:

The immediate impact was mitigated by performing a rolling restart of the affected services, and all Studio functions were restored by 12:09 PM EDT (16:09 UTC).

Recurrence Prevention:

To prevent a recurrence of this incident, the Time-To-Live (TTL) for connection requests from Athena to our core database will be reduced from the default of 60 seconds to 5 seconds. Stalled requests will then fail quickly instead of holding connections open, which will greatly reduce the backlog of requests from other services to Athena.
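
A minimal sketch of what a bounded connection attempt can look like is shown below. It treats the TTL as a client-side connection timeout, which is an assumption, and uses Python's standard `socket.create_connection`. The names `open_db_connection` and `DatabaseUnavailable` are hypothetical; Athena's real client code and database driver are not described in this report.

```python
import socket

# 5 seconds is the new value cited in this report; the previous default was 60.
DB_CONNECT_TIMEOUT_SECONDS = 5.0


class DatabaseUnavailable(Exception):
    """Raised when the database cannot be reached (hypothetical name)."""


def open_db_connection(host: str, port: int,
                       timeout_seconds: float = DB_CONNECT_TIMEOUT_SECONDS) -> socket.socket:
    """Open a TCP connection, giving up after timeout_seconds.

    The same timeout stays set on the returned socket, so later reads
    and writes are bounded as well.
    """
    try:
        return socket.create_connection((host, port), timeout=timeout_seconds)
    except OSError as exc:  # covers timeouts and refused connections
        raise DatabaseUnavailable(
            f"could not connect to {host}:{port} (timeout {timeout_seconds}s)"
        ) from exc
```

A caller can catch `DatabaseUnavailable` and return an error promptly, rather than leaving its own upstream callers waiting.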

Posted Jun 10, 2024 - 23:33 UTC

Resolved
This incident has been resolved.
Posted May 28, 2024 - 17:02 UTC
Monitoring
We have rolled the affected services to restore functionality and will continue to monitor these services for stability.
Posted May 14, 2024 - 16:08 UTC
Investigating
We are currently investigating reports of Studio performing poorly or returning 5xx errors for some users.

We will provide you with an update in 1 hour.
Posted May 14, 2024 - 15:56 UTC
This incident affected: Products (New Studio, Classic Studio).