Platform Performance Degradation - Intermittent 5xx errors accessing Studio
Incident Report for Firstup
Postmortem

Summary:

On February 8th, 2024, beginning at approximately 1:56 PM EST (18:56 UTC), we started receiving reports of Studio not performing as expected. The symptoms observed by some Studio users included:

· A “failed to fetch” or a “504 Gateway Timeout” error message.

· Unusually slow performance.

A recurrence of this incident was also observed on April 24th, 2024.

Impact:

Studio users who were actively navigating Studio or using any Studio function during these incidents were impacted by the service disruption.

Root Cause:

Studio services were failing to establish TCP connections to the Identity and Access Management (IAM) service because of a backlog of TCP connection requests. The backlog built up because connection requests that had already failed were not dropped; instead, they kept retrying for an extended period.

Mitigation:

On both days, the immediate problem was mitigated by restarting the backend services that had failed TCP connection attempts, in effect purging the connection request queue of stale connections and allowing new connections to be established with the IAM service.

Remediation Steps:

Our engineering team is working on reducing the time-to-live of all TCP connection requests to the IAM service from the default 60 seconds to 10 seconds. This will allow failed connections to be dropped sooner and reduce the backlog of connection requests to IAM.
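
As a rough illustration of why a shorter limit helps, the sketch below assumes the time-to-live described above behaves like a client-side connect timeout. The function names, the host and port arguments, and the failure rate used in the usage note are hypothetical and are not taken from the Studio or IAM code; only the 60-second and 10-second values come from this report.

```python
import socket
from typing import Optional

# Only these two durations come from the report; everything else is illustrative.
DEFAULT_CONNECT_TIMEOUT_S = 60.0
REDUCED_CONNECT_TIMEOUT_S = 10.0


def connect_with_deadline(
    host: str,
    port: int,
    timeout_s: float = REDUCED_CONNECT_TIMEOUT_S,
) -> Optional[socket.socket]:
    """Try once to open a TCP connection, giving up after timeout_s seconds.

    Returns the connected socket, or None if the attempt failed or timed out.
    """
    try:
        return socket.create_connection((host, port), timeout=timeout_s)
    except OSError:
        # Timeouts, refused connections and DNS failures are all OSError subclasses.
        return None


def pending_attempts(failing_attempts_per_s: float, timeout_s: float) -> float:
    """Rough steady-state count of attempts still waiting (rate x time in system)."""
    return failing_attempts_per_s * timeout_s
```

For example, under a purely hypothetical load of 50 failing attempts per second, pending_attempts(50, 60) gives 3000 outstanding attempts, while pending_attempts(50, 10) gives 500. The shorter limit bounds how large the backlog can grow before stale attempts are dropped.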

In addition, we have implemented dashboards to track TCP connection failures and set alerting thresholds on failed TCP connections to help us get ahead of potential platform service disruptions.
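
A threshold alert of this kind can be sketched as a sliding-window counter. This is a minimal illustration only: the class name, the threshold, and the window length are hypothetical, and the report does not describe the monitoring tooling or values actually in use.

```python
from collections import deque
from typing import Deque


class ConnectionFailureMonitor:
    """Counts failed TCP connections inside a sliding time window."""

    def __init__(self, threshold: int, window_s: float) -> None:
        self.threshold = threshold      # failures in the window that trigger an alert
        self.window_s = window_s        # window length in seconds
        self._failures: Deque[float] = deque()

    def record_failure(self, now_s: float) -> bool:
        """Record one failure at time now_s; return True if the alert should fire."""
        self._failures.append(now_s)
        cutoff = now_s - self.window_s
        # Discard failures that have aged out of the window.
        while self._failures and self._failures[0] <= cutoff:
            self._failures.popleft()
        return len(self._failures) >= self.threshold
```

With a hypothetical threshold of 3 failures in 60 seconds, failures at t=0, t=10 and t=20 would trigger the alert on the third call, whereas failures at t=0, t=10 and t=100 would not, because the first two have aged out of the window by then.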

Posted Jun 07, 2024 - 19:57 UTC

Resolved
Studio has remained fully accessible following the bouncing of the affected services.

This platform service degradation is now considered resolved, and an RCA will be provided once a full incident postmortem has been completed.
Posted May 02, 2024 - 15:49 UTC
Update
We have bounced the impacted services to mitigate this performance degradation. Studio is now accessible while we work to identify the root cause of this incident.

We will provide you with another update as soon as more information is made available.
Posted Apr 24, 2024 - 23:48 UTC
Investigating
We are currently investigating reports of intermittent 5xx errors while accessing Studio.

We will provide you with an update within 1 hour.
Posted Apr 24, 2024 - 22:57 UTC
This incident affected: Products (New Studio, Classic Studio).