Summary:
On Monday, October 14, 2024, starting at 9:41 AM PDT, we received reports that Studio users were seeing the error message “We’re sorry, but something went wrong” when attempting to log into Studio via Single Sign-On (SSO). After correlating customer reports and completing initial troubleshooting, we declared a platform service disruption incident at 11:08 AM PDT and published it on our Status Page at 11:11 AM PDT.
Severity:
Sev2
Scope:
This service disruption was limited to Studio users on the US platform attempting to log into Studio via SSO. Users who were already logged in before the incident, or who used other authentication methods to log into Studio, were unaffected.
Impact:
Users could not log into Studio via SSO for the duration of this incident (1 hour 37 minutes).
Root Cause:
The root cause of this incident was an unexpected hardware failure on the AWS Redis cluster, which triggered a failover event at 9:35 AM PDT. After the failover, the Identity and Access Management (IAM) service did not re-establish its Redis connections to the failover cluster. This disrupted the authentication flow and produced the SSO login error.
Mitigation:
To mitigate this incident, the IAM service was restarted at 11:12 AM PDT to refresh its connections to the failover Redis cluster, which restored Studio SSO login.
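For illustration only, the sketch below shows one way a service could drop a stale connection and reconnect on its own after a failover, instead of needing a restart. It is not the IAM service's code. The names `ReconnectingClient`, `connect`, and `call` are hypothetical, and Python's built-in `ConnectionError` stands in for whatever exception a real Redis client library raises.

```python
import time


class ReconnectingClient:
    """Hypothetical wrapper: rebuilds the underlying client after a connection error."""

    def __init__(self, connect, max_attempts=3, base_delay=0.1, sleep=time.sleep):
        if max_attempts < 1:
            raise ValueError("max_attempts must be at least 1")
        self._connect = connect            # assumed factory that returns a fresh client
        self._max_attempts = max_attempts
        self._base_delay = base_delay      # seconds; doubled after each failed attempt
        self._sleep = sleep                # injectable so tests do not have to wait
        self._client = None                # created lazily on first use

    def call(self, method, *args, **kwargs):
        last_error = None
        for attempt in range(self._max_attempts):
            try:
                if self._client is None:
                    self._client = self._connect()
                return getattr(self._client, method)(*args, **kwargs)
            except ConnectionError as exc:
                last_error = exc
                # Discard the stale client so the next attempt reconnects,
                # for example to the node promoted during a failover.
                self._client = None
                if attempt < self._max_attempts - 1:
                    self._sleep(self._base_delay * (2 ** attempt))
        raise last_error
```

With a wrapper like this, a lookup such as `client.call("get", "session:abc")` would fail over to a newly created connection after the first error, and would only surface the error once all attempts are exhausted.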
Recurrence Prevention:
To prevent this incident from recurring, we will perform the following: