Platform Services Degradation - Microapps and Partner APIs Unavailable
Incident Report for Firstup
Postmortem

Summary:

On February 15th, 2024, beginning at approximately 5:50 AM PT (13:50 UTC), we started receiving reports of several platform services being unavailable, including Microapps and Partner APIs. Errors persisted intermittently for just over an hour, primarily for these two services, as well as for any new user requests that required IP address resolution through an authoritative DNS (Domain Name System) server.

Impact:

The impact was primarily to services with very low DNS TTL (time to live) thresholds and to new end-user requests that required a fresh DNS lookup. Observed error conditions included request timeouts and HTTP 500 errors. Multiple services were in scope of the platform incident, and availability for a given request depended on whether the service IP had been cached locally or whether the DNS request could be serviced within the reduced available capacity.
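The caching behavior described above can be sketched with a toy stub-resolver cache (a minimal illustration with hypothetical hostnames and addresses, not Firstup's actual resolver): lookups whose records are still within TTL are served from cache and never reach the degraded upstream, while expired or uncached lookups fail.

```python
import time

class CachingResolver:
    """Toy stub-resolver cache illustrating why low-TTL services failed first.

    All names and addresses here are hypothetical, for illustration only.
    """

    def __init__(self, upstream_available=True):
        self.upstream_available = upstream_available
        self._cache = {}  # hostname -> (ip, expires_at)

    def seed(self, hostname, ip, ttl_seconds, now=None):
        now = time.time() if now is None else now
        self._cache[hostname] = (ip, now + ttl_seconds)

    def resolve(self, hostname, now=None):
        now = time.time() if now is None else now
        entry = self._cache.get(hostname)
        if entry and now < entry[1]:
            return entry[0]  # record still within TTL: served from cache
        if not self.upstream_available:
            # cache miss or expired record, and upstream capacity is gone
            raise TimeoutError(f"DNS lookup for {hostname} timed out")
        return "198.51.100.10"  # upstream healthy: a real query would run here

# Upstream capacity is down; only records still within TTL keep working.
resolver = CachingResolver(upstream_available=False)
start = 1000.0
resolver.seed("microapps.example.com", "203.0.113.5", ttl_seconds=30, now=start)

print(resolver.resolve("microapps.example.com", now=start + 10))  # within TTL
try:
    resolver.resolve("microapps.example.com", now=start + 60)  # TTL expired
except TimeoutError as err:
    print("error:", err)
```

A service with a 30-second TTL loses its cache protection almost immediately, which is why low-TTL services were the first to see errors.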

Root Cause:

The root cause was determined to be an unexpected drop in overall DNS service capacity. A planned maintenance regressed an earlier performance improvement, reducing the number of Core DNS instances running in production and thereby limiting the overall capacity available for inbound DNS traffic.
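If Core DNS runs as a Kubernetes Deployment (which the later mention of increasing pods suggests, though this is an assumption), the regression amounts to a replica-count override being reverted to a low default. A hypothetical sketch, with assumed names and values:

```yaml
# Hypothetical sketch, assuming CoreDNS runs as a Kubernetes Deployment.
# The maintenance effectively reverted a tuned override like this one back
# to a low default, shrinking total DNS capacity.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: coredns
  namespace: kube-system
spec:
  replicas: 12   # tuned value from the earlier performance improvement;
                 # regressing to a default such as 2 cuts capacity sharply
```

Because the tuned replica count lived outside the maintenance's configuration source, reapplying that source silently discarded it.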

Mitigation:

The immediate problem was mitigated by restoring Core DNS capacity as soon as the incident response team discovered the discrepancy at 6:30 AM PT (14:30 UTC). Remaining error rates began to improve markedly by 6:45 AM PT (14:45 UTC), and all services were confirmed to be fully stabilized by 7:15 AM PT (15:15 UTC).

Recurrence Prevention:

A technical team postmortem meeting reviewed the change management process that allowed an errant default setting for the number of DNS nodes to be pushed to production, how to improve platform alert visibility into this condition in the future, and how to prevent unexpected loss of DNS service capacity. The following changes have since been instituted:

  • An alert will now fire any time core DNS capacity drops below the minimum viable threshold determined by Site Reliability Engineering.
  • All core service nodes will now launch with an attached DNS service component automatically.
  • Load testing has been performed to ensure scalability and appropriate buffer for potential spikes and organic growth in DNS request volume.
  • Updated infrastructure change management processes to ensure that any future configuration changes persist across service restarts.

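The first bullet above could be implemented with an alerting rule along these lines (a hypothetical Prometheus sketch; the metric, deployment name, and threshold are assumptions, not Firstup's actual configuration):

```yaml
# Hypothetical Prometheus alert sketch; names and threshold are assumed.
groups:
  - name: dns-capacity
    rules:
      - alert: CoreDNSCapacityBelowMinimum
        expr: sum(kube_deployment_status_replicas_ready{deployment="coredns"}) < 8
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "Ready Core DNS replicas below the SRE-defined minimum viable threshold"
```

Alerting on ready replicas (rather than desired replicas) catches both configuration regressions like this one and runtime failures that remove capacity.
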
Posted Mar 15, 2024 - 17:54 UTC

Resolved
This incident has been resolved.
Posted Feb 20, 2024 - 17:47 UTC
Monitoring
All affected services have now been restored and are confirmed as available.

We will now be placing these services under monitoring.
Posted Feb 15, 2024 - 15:21 UTC
Identified
We identified a potential issue with the capacity of our core DNS. We have increased the number of pods to service the traffic level, and are seeing indications that performance is trending toward recovery.

Another update will be provided within 1 hour.
Posted Feb 15, 2024 - 14:48 UTC
Investigating
We are investigating reports of Microapps and Partner APIs being unavailable.

We will provide an update within 1 hour.
Posted Feb 15, 2024 - 14:28 UTC
This incident affected: Ecosystem (Partner API) and Products (Microapps).