Summary:
On August 28th, 2024, at 1:03 PM EDT, system monitors alerted us to failing database health checks, and our engineering team immediately began investigating. Customer reports of core platform endpoints being unresponsive or returning error messages began arriving at 1:12 PM EDT, and a platform incident was declared at 1:21 PM EDT.
Scope:
Any user on the US platform attempting to access or navigate through the Web and Mobile Experience, as well as Studio, was impacted by this incident.
Impact:
Core US platform endpoints such as Web and Mobile Experiences, as well as Studio, were slow to load or became intermittently unavailable for the duration of the incident (48 minutes).
Root Cause:
The root cause was determined to be a slow-running query for “user unread posts” that saw a large spike in traffic after a campaign was published to a large audience. As a result, database CPU utilization spiked and the database stopped accepting new connection requests, causing new Web and Mobile Experience requests, as well as Studio requests, to fail and making the system appear unresponsive.
Mitigation:
The immediate problem was mitigated by halving the number of pods submitting requests to the database. This alleviated the load, restoring database responsiveness and platform endpoint availability by 1:51 PM EDT.
Recurrence Prevention:
To prevent this incident from recurring, our engineering incident response team has optimized the offending slow-running query to perform 2x faster, reducing the database CPU resources it requires.
We are also implementing circuit breakers in the services downstream of the database, so that sustained database failures cause those services to back off rather than pile on connection requests, preventing database CPU overutilization and preserving platform endpoint availability.
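As an illustration of the circuit-breaker pattern described above, the following is a minimal sketch (not our production implementation; class and parameter names are hypothetical). After a configurable number of consecutive failures, the breaker "opens" and rejects calls immediately for a cooldown period, so a struggling database is not flooded with new connection attempts:

```python
import time


class CircuitBreaker:
    """Minimal circuit breaker: after `max_failures` consecutive
    failures, calls are rejected for `reset_timeout` seconds, giving
    the downstream dependency (here, the database) time to recover."""

    def __init__(self, max_failures=5, reset_timeout=30.0):
        self.max_failures = max_failures
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                # Circuit is open: fail fast instead of hitting the DB.
                raise RuntimeError("circuit open: request rejected")
            # Cooldown elapsed: half-open, allow one trial call through.
            self.opened_at = None
            self.failures = 0
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()  # trip the breaker
            raise
        self.failures = 0  # success resets the failure count
        return result
```

In practice a service would wrap each database call in `breaker.call(...)` and translate the fast-fail rejection into a degraded response (for example, omitting the unread-post count) instead of queuing more work against an overloaded database.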