On February 8th, 2024, beginning at approximately 1:56 PM EST (18:56 UTC), we started receiving reports of Studio not performing as expected. The symptoms observed by some Studio users included:
· A “failed to fetch” or a “504 Gateway Timeout” error message.
· Unusually slow performance.
A recurrence of this incident was also observed on April 24th, 2024.
Studio users who were actively navigating or using Studio functions during these incidents were impacted by the service disruption.
We identified that Studio services were failing to establish TCP connections to the Identity and Access Management (IAM) service because of a backlog of TCP connection requests. The backlog built up because connection requests that had already failed were not dropped; instead, they kept retrying for an extended period.
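As a rough illustration of why long-lived failed attempts pile up, the sketch below applies Little's law (average number pending = arrival rate x time each attempt stays pending). The function name and the rate of 5 failing attempts per second are hypothetical, chosen only for illustration; they are not measurements from these incidents.

```python
def pending_attempts(failing_per_second: float, lifetime_s: float) -> float:
    """Estimate the average number of failed attempts still pending.

    Little's law: average number in the system equals the arrival rate
    multiplied by the time each item spends in the system.
    """
    return failing_per_second * lifetime_s


# Hypothetical rate: 5 failing attempts per second.
long_lived = pending_attempts(5, 60)   # each attempt lingers for 60 seconds
short_lived = pending_attempts(5, 10)  # each attempt lingers for 10 seconds
```

Under this simplified model, the number of stale attempts scales linearly with how long each one is allowed to keep retrying.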
On both days, the immediate problem was mitigated by restarting the backend services that had failed TCP connection attempts, in effect purging the connection request queue of stale connections and allowing new connections to be established with the IAM service.
Our engineering team is working on reducing the time-to-live of all TCP connection requests to the IAM service from the default 60 seconds to 10 seconds. This will allow failed connections to be dropped sooner and reduce the backlog of connection requests to IAM.
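A minimal sketch of what a shorter limit on connection attempts could look like on the client side, assuming the "time-to-live" above behaves like a connect timeout. This is not Studio's actual code: `try_connect`, `IAM_CONNECT_TIMEOUT_S`, and the host and port arguments are hypothetical names used only for illustration.

```python
import socket

# Assumed value taken from the remediation described above (previously 60 seconds).
IAM_CONNECT_TIMEOUT_S = 10.0


def try_connect(host: str, port: int, timeout_s: float = IAM_CONNECT_TIMEOUT_S) -> bool:
    """Attempt one TCP connection and give up after timeout_s seconds."""
    try:
        # create_connection raises an OSError subclass on timeout or refusal.
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # The attempt is dropped here instead of lingering and retrying.
        return False
```

A caller would treat a `False` result as a failed attempt and decide separately whether to retry, so that a single attempt never occupies a slot for longer than the timeout.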
In addition, we have implemented dashboards to track TCP connection failures and set alerting thresholds on failed TCP connections to help us get ahead of a potential platform service disruption.
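One way such an alerting threshold could work is a sliding-window count of failed connections, sketched below. The class name, the 60-second window, and the threshold of 5 failures are hypothetical; the actual dashboards and thresholds are not described in this report.

```python
from collections import deque


class ConnectionFailureMonitor:
    """Signal an alert when too many failures occur within a time window."""

    def __init__(self, window_s: float = 60.0, threshold: int = 5) -> None:
        self.window_s = window_s      # illustrative window length in seconds
        self.threshold = threshold    # illustrative failure count that triggers an alert
        self._failures = deque()      # timestamps of recent failures, oldest first

    def record_failure(self, now: float) -> bool:
        """Record a failure at time `now`; return True if the alert should fire."""
        self._failures.append(now)
        # Evict failures that have fallen out of the window.
        while self._failures and now - self._failures[0] > self.window_s:
            self._failures.popleft()
        return len(self._failures) >= self.threshold
```

Timestamps are passed in explicitly so the behavior is easy to check; a real deployment would more likely rely on the monitoring system's own rate functions.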