On Tuesday, April 16, 2024, from approximately 9:54 AM UTC to 11:09 AM UTC, EU Studio experienced multiple service disruptions, including general slowness when loading Studio functions, login issues, and HTTP 500 system error messages. A number of backend services were found to be experiencing TCP (Transmission Control Protocol) networking issues, which manifested as a variety of user-visible errors and unpredictable product interactions.
Affected users were unable to log in to Studio and experienced general slowness and system error messages such as “504 Gateway Timeout” or “502 Bad Gateway”, caused by network errors in the backend services.
The root cause was determined to be an unexpected spike in traffic, which caused the number of nodes (worker machines) to increase rapidly to handle the additional workload. This led to DNS (Domain Name System) request timeouts, because the DNS traffic generated as these nodes were added exceeded the overall capacity for inbound DNS traffic.
The immediate problem was mitigated by increasing DNS capacity within the EU infrastructure and restarting the affected services, restoring system services and performance by 11:09 AM UTC.
The following changes have been implemented to prevent an unexpected loss of DNS service capacity.