US Web Experience is currently unavailable
Incident Report for Firstup
Postmortem

Summary:

On September 16th, 2024, at around 11:00 AM PDT, we began receiving customer reports that the Web and Mobile Experiences endpoints were unavailable. After correlating these reports with system monitors, we declared a platform incident at 11:14 AM PDT.

Severity:

Sev1

Scope:

Any user on the US platform attempting to access the Web and Mobile Experiences intermittently received an error message, and the Employee Experience failed to load.

Impact:

The core Web and Mobile Experiences platform endpoints were intermittently unavailable for the duration of the incident (1hr 38mins).

Root Cause:

The root cause was determined to be exhaustion of the available database connections, caused by a sudden burst of user engagement activity that correlated with a small number of high-visibility campaigns. At 10:50 AM PDT, a dependent back-end service entered a crash loop back-off state because its database connection requests were being refused, and it returned an error message to end users.
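As a rough illustration of connection exhaustion (not Firstup's actual implementation), the sketch below models a fixed-size pool that refuses requests once every slot is in use. The names `BoundedPool`, `PoolExhausted`, and `simulate_burst`, and the pool sizes used, are assumptions made up for this example.

```python
import threading
from contextlib import ExitStack, contextmanager


class PoolExhausted(Exception):
    """Raised when no database connection is free within the timeout."""


class BoundedPool:
    """Hypothetical fixed-size pool; the size is an illustrative value."""

    def __init__(self, max_connections: int) -> None:
        self._slots = threading.BoundedSemaphore(max_connections)

    @contextmanager
    def connection(self, timeout: float = 0.05):
        # Wait briefly for a free slot, then fail fast instead of queueing forever.
        if not self._slots.acquire(timeout=timeout):
            raise PoolExhausted("no free database connections")
        try:
            yield object()  # stand-in for a real connection handle
        finally:
            self._slots.release()


def simulate_burst(pool: BoundedPool, concurrent_requests: int) -> int:
    """Try to hold `concurrent_requests` connections at once; return how many were refused."""
    refused = 0
    with ExitStack() as stack:
        for _ in range(concurrent_requests):
            try:
                stack.enter_context(pool.connection(timeout=0.01))
            except PoolExhausted:
                refused += 1
    # Leaving the ExitStack releases every connection that was acquired.
    return refused
```

With a pool of 5 slots, `simulate_burst(BoundedPool(5), 8)` refuses the 3 requests that arrive after the pool is full. A service that treats every refusal as fatal may then restart repeatedly, which matches the crash loop described above.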

Mitigation:

The immediate problem was mitigated by fully redeploying the Employee Experience microservice after more surgical, standardized mitigation maneuvers proved ineffective. Those earlier maneuvers focused on reducing database load by temporarily disabling platform features and functionality that make heavy use of database transactions, which reduced overall error rates but did not eliminate customer impact. Web and Mobile Experience availability was restored by 12:28 PM PDT.
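Temporarily disabling a database-heavy feature is commonly done with a runtime kill switch. The sketch below is a minimal in-memory version under stated assumptions: the `KillSwitches` class, the `campaign_reactions` flag name, and the `record_reaction` handler are hypothetical and are not taken from the Firstup platform.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class KillSwitches:
    """Hypothetical flag store; a real one would live in shared configuration."""

    disabled: Set[str] = field(default_factory=set)

    def disable(self, feature: str) -> None:
        self.disabled.add(feature)

    def enable(self, feature: str) -> None:
        self.disabled.discard(feature)

    def is_enabled(self, feature: str) -> bool:
        return feature not in self.disabled


def record_reaction(
    switches: KillSwitches,
    store: Dict[str, List[str]],
    post_id: str,
    reaction: str,
) -> bool:
    """Persist a reaction unless the feature is switched off; return whether it was stored."""
    if not switches.is_enabled("campaign_reactions"):
        # Shed the write entirely so it never reaches the database.
        return False
    store.setdefault(post_id, []).append(reaction)  # dict stands in for a database write
    return True
```

Flipping the switch with `switches.disable("campaign_reactions")` makes later calls return `False` without touching the store, and `enable` restores normal behavior without a redeploy.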

Recurrence Prevention:

To prevent this incident from recurring, our engineering incident response team has:

  • Increased the available database connections by 40% to absorb unforeseen spikes in platform traffic.
  • Added circuit breakers that intercept abnormal increases in platform traffic, thereby maintaining platform endpoint availability.
  • Added an incident mitigation maneuver that disables campaign reactions, so that a full service redeploy is not required to restore platform availability.
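The report does not describe how the circuit breakers are built. As a generic illustration only, the sketch below shows the classic failure-count pattern: after a configurable number of consecutive failures, the breaker rejects calls immediately until a cool-down elapses, then lets one trial call through. The class name, thresholds, and injectable `clock` parameter are assumptions for this example.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without attempting it."""


class CircuitBreaker:
    """Hypothetical breaker; the threshold and cool-down values are illustrative."""

    def __init__(
        self,
        failure_threshold: int = 3,
        reset_after: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def call(self, func: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after:
                # Open: fail fast so the struggling dependency gets no new load.
                raise CircuitOpenError("circuit open; call rejected")
            # Cool-down elapsed: half-open, so this one trial call goes through.
        try:
            result = func()
        except Exception:
            self._failures += 1
            # A failed trial call, or too many consecutive failures, (re)opens the circuit.
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

Wrapping database calls as `breaker.call(lambda: run_query())`, where `run_query` is a placeholder, means that once the circuit opens, callers get a fast `CircuitOpenError` instead of piling more connection attempts onto an exhausted pool.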
Posted Sep 18, 2024 - 00:46 UTC

Resolved
All affected endpoints have remained stable and available. This incident is now resolved.
Posted Sep 18, 2024 - 00:43 UTC
Update
We are continuing to monitor for any further issues.
Posted Sep 17, 2024 - 18:09 UTC
Update
The unplanned performance enhancement maintenance to the Firstup cloud infrastructure is now complete. All services are now available and fully functional. Please notify our Customer Support team if you experience any issues with Firstup services following this notice.
Posted Sep 16, 2024 - 21:38 UTC
Update
Today at 2:30 PM PT / 9:30 PM UTC, we will perform unplanned maintenance to shore up Firstup cloud infrastructure as a preventative measure, based on technical troubleshooting done since the incident was initially mitigated earlier today. This change may result in a service disruption lasting from a few seconds to several minutes as the changes take effect. We expect the platform to be in a much more stable state once the maintenance is complete, while root cause troubleshooting continues.
Posted Sep 16, 2024 - 20:56 UTC
Monitoring
Web and Mobile Experiences have now been restored. We will be placing the offending services under monitoring for now.
Posted Sep 16, 2024 - 19:41 UTC
Update
We are continuing to work on a fix for this issue.
Posted Sep 16, 2024 - 19:27 UTC
Update
We continue to work on relieving the pressure on database resources. Users currently have intermittent and partial access to the Employee Experience (EE) on both desktop and mobile.

Another update in 30 minutes.
Posted Sep 16, 2024 - 19:26 UTC
Update
We are working on relieving pressure on database resources to restore services.

Another update in 30 minutes.
Posted Sep 16, 2024 - 18:55 UTC
Identified
We have identified a potential cause of this service outage, and are working to restore services.

Another update in 30 minutes.
Posted Sep 16, 2024 - 18:31 UTC
Investigating
We are currently investigating reports of the US Web Experience being unavailable. Studio remains available.
Posted Sep 16, 2024 - 18:14 UTC
This incident affected: Platforms (US Firstup Platform) and Products (Web Experience, Mobile Experience).