Summary:
On August 28th, 2024, at 1:03 PM EDT, system monitors alerted us to failing database health checks, and our engineering team immediately began investigating. Customer reports of core platform endpoints being unresponsive or returning error messages began arriving at 1:12 PM EDT, and a platform incident was declared at 1:21 PM EDT.
Scope:
Any user on the US platform attempting to access or navigate through the Web and Mobile Experience, as well as Studio, was impacted by this incident.
Impact:
Core US platform endpoints such as Web and Mobile Experiences, as well as Studio, were slow to load or became intermittently unavailable for the duration of the incident (48 minutes).
Root Cause:
The root cause was determined to be a slow-running query for “user unread posts” that saw a large spike in traffic after a campaign was published to a large audience. As a result, database CPU utilization spiked and the database stopped accepting new connection requests, causing new Web and Mobile Experience requests, as well as Studio requests, to fail and making the system appear unresponsive.
Mitigation:
The immediate problem was mitigated by halving the number of pods submitting requests to the database. This alleviated the load, restoring database responsiveness and platform endpoint availability by 1:51 PM EDT.
Recurrence Prevention:
To prevent this incident from recurring, our engineering incident response team has optimized the offending slow-running query to perform 2x faster, reducing the database CPU resources it requires.
We are also implementing circuit breakers in the services downstream of the database, so that sustained database failures cause those services to back off rather than pile on connection requests, preventing database CPU overutilization and preserving platform endpoint availability.
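As an illustration of the circuit-breaker pattern described above, the following is a minimal sketch (not our production implementation; class and parameter names are hypothetical). After a configurable number of consecutive failures, the breaker "opens" and rejects calls immediately for a cooldown period, so a struggling database is not flooded with new connection attempts:

```python
import time


class CircuitBreaker:
    """Minimal circuit breaker: after `max_failures` consecutive
    failures, calls are rejected for `reset_timeout` seconds, giving
    the downstream dependency (here, the database) time to recover."""

    def __init__(self, max_failures=5, reset_timeout=30.0):
        self.max_failures = max_failures
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                # Circuit is open: fail fast instead of hitting the DB.
                raise RuntimeError("circuit open: request rejected")
            # Cooldown elapsed: half-open, allow one trial call through.
            self.opened_at = None
            self.failures = 0
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()  # trip the breaker
            raise
        self.failures = 0  # success resets the failure count
        return result
```

In practice a service would wrap each database call in `breaker.call(...)` and translate the fast-fail rejection into a degraded response (for example, omitting the unread-post count) instead of queuing more work against an overloaded database.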