On Wednesday, November 13th, 2024, starting at around 6:14 AM PT, we received reports that some users were experiencing general slowness and decreased responsiveness while navigating Firstup’s Studio platform. Shortly thereafter, additional reports followed of some published campaign emails not being delivered and of error messages being returned while navigating within the Employee Experience.
Sev2
The scope of this service degradation included all Firstup customers using the Studio and Employee Experience (EE) endpoints.
For approximately 2.5 hours after the onset of the incident, users were unable to navigate Studio and the EE efficiently or consistently. Symptoms included an extended loading spinner in Studio and missing shortcuts in the EE, with various error messages being returned.
Concurrently, campaign emails were not being delivered consistently to their intended recipients, and user sync files were not being actively processed. Most campaign emails were delivered over the course of the next 3 hours, with a few rare exceptions that required manual intervention and customer coordination. User sync was not restored until the following day, when it was discovered that the process was not actively running. Excluding the user sync process, total incident duration was just over 6 hours.
The root cause has been attributed to a code change released during the Platform Software Release maintenance window the night before. The change caused a slow-running query from one of our back-end services to run for an extended time and consume an exceptionally large amount of database resources, including network connections and CPU processing cycles. Coupled with the normal increase in database requests from our customer base during peak platform utilization hours, starting at around 5:54 AM PT, this exhausted the available database connections and caused CPU overutilization in the database. As a result, new connections to the database could not be established until existing connections were closed and made available for new requests from platform services such as Studio and the EE. The back-end service responsible for campaign email delivery was subject to the same condition and could not process email deliveries as expected; user sync file processing was similarly delayed beyond normal limits.
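To illustrate the failure mode in general terms, the sketch below simulates a bounded database connection pool: when slow queries hold every connection, new requests time out rather than succeed, and service only recovers once connections are released. All names here are hypothetical for illustration; this is not Firstup's actual service code.

```python
import queue


class ConnectionPool:
    """Minimal bounded pool: connections must be checked out and returned."""

    def __init__(self, size, checkout_timeout):
        self._available = queue.Queue()
        for i in range(size):
            self._available.put(f"conn-{i}")
        self._timeout = checkout_timeout

    def acquire(self):
        try:
            return self._available.get(timeout=self._timeout)
        except queue.Empty:
            # Fail fast instead of hanging when the pool is exhausted.
            raise TimeoutError("no database connections available")

    def release(self, conn):
        self._available.put(conn)


pool = ConnectionPool(size=2, checkout_timeout=0.1)

# A slow-running query holds its connections and does not release them...
held = [pool.acquire(), pool.acquire()]

# ...so subsequent requests (e.g. Studio page loads) cannot connect.
try:
    pool.acquire()
    exhausted = False
except TimeoutError:
    exhausted = True

# Once the slow query is halted and its connections are released,
# new requests succeed again -- the effect of the hotfix.
pool.release(held.pop())
recovered = pool.acquire()
```

The pattern shows why halting the offending query restores service: nothing else was broken, the pool was simply starved.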
Various symptoms exhibited during the incident were mitigated in phases.
The most significant service impact was mitigated after a hotfix was released at 8:29 AM PT to halt the aforementioned slow-running query, relieving resource pressure on the database and allowing customer-facing service requests from Studio and the EE to successfully re-establish database connections. Additional resources were also spun up to process the email delivery queue backlog that had accumulated during the incident, which began draining at 12:02 PM PT. Almost all campaign emails were confirmed delivered by 1:10 PM PT, with the exception of ~150 email messages that had experienced an internal error but were later cleared and delivered.
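The backlog-drain step above can be sketched generically: once the database is healthy again, spinning up additional workers lets the queued deliveries be processed in parallel. The names below are hypothetical placeholders, not Firstup's actual delivery service.

```python
from concurrent.futures import ThreadPoolExecutor


def deliver(message):
    """Stand-in for a single email delivery call; returns the message id."""
    return message["id"]


# A backlog that accumulated while the database was saturated.
backlog = [{"id": i} for i in range(100)]

# Additional workers drain the queue in parallel; map preserves order,
# so each queued message is delivered exactly once.
with ThreadPoolExecutor(max_workers=8) as workers:
    delivered = list(workers.map(deliver, backlog))
```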
The following actions have been taken, or have been identified as follow-up commitments, as part of the formal RCA (Root Cause Analysis) process: