On February 15th, 2024, beginning at approximately 5:50 AM PT (13:50 UTC), we started receiving reports of several platform services being unavailable, including Microapps and Partner APIs. Errors persisted intermittently for just over an hour, primarily affecting these two services as well as any new user requests that required IP address resolution through an authoritative DNS (Domain Name System) server.
The impact primarily affected services with very low DNS TTL (time to live) values and new end-user requests that required a fresh DNS lookup. Observed error conditions included request timeouts and HTTP 500 errors returned at the gateway. Multiple services were in scope of the platform incident; for any given service, availability depended on whether its IP address was already cached locally or whether the DNS request could be serviced within the reduced available capacity.
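For illustration, a record's TTL determines how long a resolver may serve it from cache before it must query an authoritative server again, which is why low-TTL services were hit first. The sketch below uses the third-party dnspython library; "example.com" is a placeholder, not one of the affected services.

```python
# Illustrative sketch only: inspect a record's TTL to see how long a
# resolver may serve it from cache before re-querying an authoritative
# server. Requires the third-party dnspython package.
import dns.resolver

answer = dns.resolver.resolve("example.com", "A")
ttl = answer.rrset.ttl  # seconds the answer may be served from cache

print(f"A records: {[r.address for r in answer]}")
print(f"TTL: {ttl}s")

# With a very low TTL (e.g., 30s), cached entries expire quickly, so
# nearly every new request triggers a fresh authoritative lookup. When
# authoritative capacity drops, those lookups are the first to fail.
```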
The root cause was determined to be an unexpected drop in overall DNS service capacity. An earlier planned maintenance regressed a prior performance improvement, reducing the number of Core DNS services running in production and thus limiting the overall capacity available for inbound DNS traffic.
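As a hypothetical illustration of how such a discrepancy can be caught, assuming the DNS tier runs as a Kubernetes Deployment, a simple check can compare the number of ready DNS replicas against an expected baseline and alarm on any shortfall. The deployment name, namespace, and baseline below are invented for the example, not the platform's actual configuration.

```python
# Hypothetical capacity check, assuming a Kubernetes Deployment named
# "coredns" in namespace "kube-system". Names and the baseline are
# illustrative only. Requires the official kubernetes Python client.
from kubernetes import client, config

EXPECTED_REPLICAS = 12  # assumed baseline from the earlier performance work

def check_dns_capacity() -> None:
    config.load_kube_config()  # use load_incluster_config() inside the cluster
    apps = client.AppsV1Api()
    dep = apps.read_namespaced_deployment("coredns", "kube-system")
    ready = dep.status.ready_replicas or 0
    if ready < EXPECTED_REPLICAS:
        # A maintenance that reverts the replica default to a lower value
        # would surface here before it erodes serving capacity.
        raise RuntimeError(
            f"DNS capacity shortfall: {ready}/{EXPECTED_REPLICAS} replicas ready"
        )

if __name__ == "__main__":
    check_dns_capacity()
```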
The incident response team mitigated the immediate problem by restoring Core DNS capacity as soon as the discrepancy was discovered at 6:30 AM PT (14:30 UTC). Remaining error rates began to improve markedly by 6:45 AM PT (14:45 UTC), and all services were confirmed fully stabilized by 7:15 AM PT (15:15 UTC).
In a technical postmortem, the team reviewed the change management process that allowed an errant default setting for the number of DNS nodes to be pushed to production, how to improve platform alerting on this condition in the future, and how to prevent unexpected loss of DNS service capacity. The following changes have since been instituted: