EU Communities - Studio Service Disruption
Incident Report for Firstup
Postmortem

Summary:

On Tuesday, April 16th, 2024, from approximately 9:54 AM UTC to 11:09 AM UTC, EU Studio experienced multiple service disruptions, including general slowness when loading Studio functions, login issues, and HTTP 500 system error messages. A number of backend services were found to be experiencing TCP (Transmission Control Protocol) networking issues that manifested as a variety of user-visible errors and unpredictable product interactions.

Impact:

Affected users were unable to log in to Studio and experienced general slowness and system error messages such as “504 Gateway Timeout” or “502 Bad Gateway” because the backend services were encountering network errors.

Root Cause: 

The root cause was determined to be an unexpected spike in traffic, which caused the number of nodes (worker machines) to increase rapidly to handle the additional workload. The DNS (Domain Name System) requests generated as these nodes were added exceeded the overall capacity for inbound DNS traffic, leading to DNS request timeouts.
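
As a rough illustration of how a fast scale-out can exhaust DNS capacity, the sketch below compares aggregate DNS demand against resolver capacity. The function name, the resolver model, and every number are hypothetical; they are not figures from this incident.

```python
def dns_headroom(node_count, queries_per_node, resolver_count, queries_per_resolver):
    """Return spare DNS capacity in queries per second.

    A negative result means demand exceeds capacity, which is the
    condition under which requests would start to time out.
    """
    demand = node_count * queries_per_node
    capacity = resolver_count * queries_per_resolver
    return capacity - demand


# Hypothetical values, for illustration only.
before_spike = dns_headroom(node_count=20, queries_per_node=200,
                            resolver_count=2, queries_per_resolver=5000)
after_spike = dns_headroom(node_count=60, queries_per_node=200,
                           resolver_count=2, queries_per_resolver=5000)

print(before_spike)  # positive: capacity exceeds demand
print(after_spike)   # negative: demand exceeds capacity
```

In this simplified model the node count triples while resolver capacity stays fixed, so headroom turns negative even though per-node behavior is unchanged.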

Mitigation:

The immediate problem was mitigated by increasing DNS capacity within the EU infrastructure and restarting the affected services, restoring system services and performance by 11:09 AM UTC.

Recurrence Prevention:

The following changes have been implemented to prevent an unexpected loss of DNS service capacity.

  • An alert now fires within the EU infrastructure any time the internal DNS capacity drops below the minimum viable threshold determined by Site Reliability Engineering.
  • Load testing has been performed to ensure scalability and appropriate buffer for potential spikes and organic growth in DNS request volume.
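
The capacity alert described above could be sketched roughly as follows. This is a minimal illustration, not the actual alerting rule: the function name, the use of a resolver count as the capacity measure, and the threshold value are all assumptions.

```python
def dns_capacity_alert(ready_resolvers: int, min_viable: int) -> bool:
    """Return True when DNS capacity is below the minimum viable threshold.

    `ready_resolvers` is a hypothetical capacity measure (healthy DNS
    resolver instances); `min_viable` stands in for the threshold set
    by Site Reliability Engineering.
    """
    return ready_resolvers < min_viable


# Hypothetical threshold, for illustration only.
MIN_VIABLE_RESOLVERS = 3

for ready in (4, 3, 2):
    state = "ALERT" if dns_capacity_alert(ready, MIN_VIABLE_RESOLVERS) else "ok"
    print(f"ready={ready} -> {state}")
```

The check uses a strict comparison, so it stays quiet at exactly the threshold and fires only once capacity falls below it.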
Posted Jun 07, 2024 - 20:09 UTC

Resolved
Studio has remained fully accessible for EU communities following the applied fix.

This platform service degradation is now resolved, and an RCA will be provided once a full incident postmortem has been completed.
Posted May 02, 2024 - 15:56 UTC
Monitoring
We have applied a fix for the issue affecting Studio on EU communities. We are continuing to monitor and will update once we have confirmed that the platform is stable.
Posted Apr 16, 2024 - 11:11 UTC
Update
We are continuing to investigate this issue affecting Studio for EU communities and working to restore service.

We'll provide another update within the next 30 minutes.
Posted Apr 16, 2024 - 10:51 UTC
Investigating
We are investigating a service disruption affecting Studio for EU communities.

These appear to be intermittent issues that leave some users unable to log in to Studio or cause slowness/timeouts.

Our next update will be in 30 minutes.
Posted Apr 16, 2024 - 10:04 UTC
This incident affected: Products (New Studio, Classic Studio) and Platforms (EU Firstup Platform).