Platform Service Disruption - SSO users can't log into Studio
Incident Report for Firstup
Postmortem

Summary:

On Monday, October 14th, 2024, starting at 9:41 AM PDT, we received reports of Studio users receiving the error message “We’re sorry, but something went wrong” while attempting to log into Studio via Single Sign-On (SSO). After correlating customer reports and completing initial troubleshooting, we declared a platform service disruption incident at 11:08 AM PDT and published it on our Status Page at 11:11 AM PDT.

Severity:

Sev2

Scope:

The scope of this service disruption was restricted to Studio users on the US platform attempting to log into Studio via SSO. Users who were already logged in before the incident, or who used other authentication methods to log into Studio, were unaffected.

Impact:

Users could not log into Studio via SSO for the duration of this incident (1 hour 37 minutes).

Root Cause:

The root cause of this incident was an unexpected hardware failure on the AWS Redis cluster, which triggered a failover event at 9:35 AM PDT. The failover disrupted the authentication flow because the Identity and Access Management (IAM) service did not re-establish its Redis connections to the failover cluster, leading to the SSO login error.

Mitigation:

To mitigate this incident, the IAM service was restarted at 11:12 AM PDT to refresh its connections to the failover Redis cluster, which restored Studio SSO login.

Recurrence Prevention:

To prevent this incident from recurring, we will perform the following:

  • Introduce self-healing for IAM so that it automatically reconnects to Redis following failover events. This enhancement will be released in our upcoming Scheduled Software Release maintenance window on November 12th, 2024.
  • Perform a gap analysis of the existing IAM monitoring and alerting dashboard.
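
For illustration only, the self-healing behavior described above could look roughly like the following minimal Python sketch. It is not the IAM service's actual implementation: SelfHealingClient, FakeNode, and the connect callable are hypothetical names introduced here, and the fake node stands in for a real Redis connection. The idea shown is that a failed call discards the stale connection and opens a new one on the next attempt, instead of requiring a service restart.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class SelfHealingClient:
    """Hypothetical wrapper that reconnects after a connection failure.

    `connect` is an assumed callable that opens a connection to whichever
    node is currently serving traffic (for example, via a cluster endpoint).
    """

    def __init__(self, connect: Callable[[], object],
                 max_attempts: int = 3, backoff_seconds: float = 0.0):
        self._connect = connect
        self._max_attempts = max_attempts
        self._backoff_seconds = backoff_seconds
        self._conn: Optional[object] = None

    def call(self, operation: Callable[[object], T]) -> T:
        last_error: Optional[Exception] = None
        for attempt in range(self._max_attempts):
            try:
                if self._conn is None:
                    # Open a fresh connection on first use or after a failure.
                    self._conn = self._connect()
                return operation(self._conn)
            except ConnectionError as exc:
                # Discard the stale connection so the next attempt reconnects.
                self._conn = None
                last_error = exc
                # Exponential backoff between attempts (0 disables waiting).
                time.sleep(self._backoff_seconds * (2 ** attempt))
        raise ConnectionError("all reconnect attempts failed") from last_error


class FakeNode:
    """Stand-in for a cache node, used only to demonstrate the wrapper."""

    def __init__(self, name: str):
        self.name = name
        self.alive = True

    def get(self, key: str) -> str:
        if not self.alive:
            raise ConnectionError(f"{self.name} is unreachable")
        return f"{key}@{self.name}"
```

In this sketch, if the node behind the cached connection fails and connect starts returning the failover node, the next call raises once internally, drops the stale connection, and succeeds against the new node on the retry. A production version would also need to bound total wait time and emit metrics so that repeated reconnects show up in monitoring.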
Posted Oct 31, 2024 - 18:39 UTC

Resolved
SSO login to Studio has remained available and fully functional throughout the monitoring phase of this incident.

This incident is now resolved.
Posted Oct 22, 2024 - 14:14 UTC
Monitoring
The reported issue has been mitigated, and Studio is now available via SSO.

We will place the impacted services under monitoring for now.
Posted Oct 14, 2024 - 18:20 UTC
Identified
We have identified a potential cause of this issue, and are working to resolve it.

Another update will be provided within 1 hour.
Posted Oct 14, 2024 - 18:13 UTC
Investigating
We are investigating reports of users being unable to log into Studio via Single Sign-On (SSO).

We will provide you with an update within 1 hour.
Posted Oct 14, 2024 - 18:11 UTC
This incident affected: Products (Creator Studio, Classic Studio).