Studio Performance Degradation
Incident Report for Firstup
Postmortem

Summary:

On February 8th, 2024, beginning at approximately 1:56 PM EST (18:56 UTC), we began receiving reports that Studio was not performing as expected. Some Studio users observed the following symptoms:

· A “failed to fetch” or “504 Gateway Timeout” error message.

· Unusually slow performance.

This incident recurred on April 24th, 2024.

Impact:

Studio users who were actively navigating Studio or using any Studio function while these incidents were in progress were impacted by the service disruption.

Root Cause:

We identified that Studio services were failing to establish TCP connections to the Identity and Access Management (IAM) service because of a backlog of TCP connection requests. The backlog built up because connection requests that had already failed were not dropped; instead, they kept retrying to establish a connection for an extended period.

Mitigation:

On both days, the immediate problem was mitigated by restarting the backend services that had failed TCP connection attempts. The restarts purged the connection request queue of stale connections and allowed new connections to be established with the IAM service.

Remediation Steps:

Our engineering team is working on reducing the time-to-live of all TCP connection requests to the IAM service from the default of 60 seconds to 10 seconds. This will allow failed connection requests to be dropped sooner and will reduce the backlog of connection requests to IAM.
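
For illustration only, the effect of a shorter limit can be sketched in Python. This is not our production code: it assumes the time-to-live described above behaves like a client-side connect timeout, and the names connect_with_deadline and stalled_attempts, as well as the failure rate used in the example, are hypothetical.

```python
import socket
from typing import Optional

# Target value from the remediation step; the previous default was 60 seconds.
IAM_CONNECT_TIMEOUT_S = 10.0


def connect_with_deadline(host: str, port: int,
                          timeout_s: float = IAM_CONNECT_TIMEOUT_S) -> Optional[socket.socket]:
    """Try to open a TCP connection, giving up after timeout_s seconds.

    Returns the connected socket, or None if the attempt failed or timed out.
    """
    try:
        return socket.create_connection((host, port), timeout=timeout_s)
    except OSError:
        # Covers timeouts, refused connections, and unreachable hosts.
        return None


def stalled_attempts(failures_per_second: float, timeout_s: float) -> float:
    """Rough steady-state count of attempts still waiting to be dropped.

    Uses arrival rate multiplied by how long each failed attempt is held.
    """
    return failures_per_second * timeout_s
```

Under a hypothetical rate of 5 failing attempts per second, stalled_attempts(5, 60) gives 300 waiting requests, while stalled_attempts(5, 10) gives 50. The shorter limit reduces the backlog in proportion to the reduction in hold time.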

In addition, we have implemented dashboards to track TCP connection failures and set alerting thresholds on failed TCP connections to help us get ahead of a potential platform service disruption.
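
As a rough sketch of how a threshold on failed TCP connections can work, the Python example below counts failures in a sliding time window and signals when the count reaches a limit. The class name FailureRateAlert, the threshold, and the window length are all hypothetical and do not reflect the actual alerting configuration.

```python
from collections import deque
from typing import Deque


class FailureRateAlert:
    """Signals when failures within a sliding window reach a threshold."""

    def __init__(self, threshold: int, window_s: float) -> None:
        self.threshold = threshold
        self.window_s = window_s
        self._failures: Deque[float] = deque()  # timestamps in seconds, oldest first

    def record_failure(self, now_s: float) -> bool:
        """Record one failed connection; return True if the alert should fire."""
        self._failures.append(now_s)
        cutoff = now_s - self.window_s
        # Discard failures that have aged out of the window.
        while self._failures and self._failures[0] <= cutoff:
            self._failures.popleft()
        return len(self._failures) >= self.threshold


# Example: fire once 3 failures occur within any 60-second window.
alert = FailureRateAlert(threshold=3, window_s=60.0)
fired = [alert.record_failure(t) for t in (0.0, 10.0, 20.0)]
```

In this example the third failure, 20 seconds after the first, is the one that reaches the threshold. A failure arriving after the earlier ones have aged out of the window starts the count again.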

Posted Jun 07, 2024 - 19:55 UTC

Resolved
This incident has been resolved.
Posted Feb 20, 2024 - 17:47 UTC
Monitoring
We have completed the rolling restart of backend services to mitigate this issue, and Studio Services are now available.

We will be placing this issue under monitoring for now.
Posted Feb 08, 2024 - 20:17 UTC
Identified
We will be performing a rolling restart of backend services as we work to mitigate this issue. Studio users may experience a brief moment of Studio Services being unavailable. We will advise once the backend services are back online.
Posted Feb 08, 2024 - 19:53 UTC
Investigating
We are currently investigating reports of Studio performance degradation.

We will provide an update within 1 hour.
Posted Feb 08, 2024 - 19:16 UTC
This incident affected: Products (New Studio).