Platform Service Degradation - Intermittent Studio Slow Performance
Incident Report for Firstup
Postmortem

Summary:

On Wednesday, November 13th, 2024, starting at around 6:14 AM PT, we received reports of some users experiencing general slowness and decreased responsiveness while navigating Firstup’s Studio platform. Reports of some published campaign emails not being delivered, and of error messages being returned while navigating within the Employee Experience, followed shortly thereafter.

Severity:

Sev2

Scope:

The scope of this service degradation included all Firstup customers utilizing the Studio and the Employee Experience (EE) endpoints.

Impact:

For approximately 2.5 hours after the onset of the incident, users were unable to navigate Studio and the EE efficiently or consistently. Symptoms included a loading spinner that persisted for an extended time in Studio and missing shortcuts in EE, along with errors such as the following:

  • Oops, an error occurred! Sorry about that.
  • TypeError: toMoment(…).format is not a function

Concurrently, campaign emails were not being delivered consistently to their intended recipients, and user sync files were not being actively processed. Most campaign emails were delivered over the course of the next 3 hours, with a few rare exceptions that required manual intervention and customer coordination. User sync was not restored until the following day, when it was discovered that the process was not actively running. Total incident duration, excluding the user sync process, was just over 6 hours.

Root Cause:

The root cause has been attributed to a code change released during the Platform Software Release maintenance window the night before. The change caused a query from one of our back-end services to run for too long and to consume an exceptionally large share of database resources, including network connections and CPU cycles. This, coupled with the normal increase in database requests from our customer base during peak platform utilization hours starting at around 5:54 AM PT, exhausted the available database connections and overutilized the database CPU. As a result, platform services such as Studio and EE could not establish new database connections until existing connections were closed and made available. The back-end service responsible for campaign email delivery was subject to the same condition and could not process email deliveries as expected. Similarly, user sync file processing was delayed beyond normal limits.
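
The exhaustion mechanism described above can be illustrated with a toy model. This is a minimal sketch, not Firstup's actual code: `ConnectionPool`, `PoolExhausted`, and `simulate` are hypothetical names, and the connection limit is an arbitrary number chosen for illustration. It only shows how connections held by long-running queries leave no capacity for short, customer-facing requests.

```python
class PoolExhausted(Exception):
    """Raised when every connection slot is already in use."""


class ConnectionPool:
    """Toy model of a database connection limit (illustrative only)."""

    def __init__(self, max_connections: int) -> None:
        self.max_connections = max_connections
        self.in_use = 0

    def acquire(self) -> None:
        # Fail immediately instead of waiting when the limit is reached.
        if self.in_use >= self.max_connections:
            raise PoolExhausted("no free database connections")
        self.in_use += 1

    def release(self) -> None:
        if self.in_use == 0:
            raise RuntimeError("release() called with no connection held")
        self.in_use -= 1


def simulate(max_connections: int, long_running_queries: int,
             interactive_requests: int) -> dict:
    """Count how many short requests succeed while slow queries hold connections."""
    pool = ConnectionPool(max_connections)

    # Long-running queries take connections and do not return them.
    held = 0
    for _ in range(long_running_queries):
        try:
            pool.acquire()
        except PoolExhausted:
            break
        held += 1

    # Short requests need a connection only briefly and release it right away.
    served = rejected = 0
    for _ in range(interactive_requests):
        try:
            pool.acquire()
        except PoolExhausted:
            rejected += 1
        else:
            served += 1
            pool.release()

    return {"held": held, "served": served, "rejected": rejected}
```

With an assumed limit of 10 connections, `simulate(10, 10, 5)` rejects all five short requests, while `simulate(10, 4, 5)` serves all five. Halting the long-running queries corresponds to moving from the first case to the second.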

Mitigation:

Various symptoms exhibited during the course of the incident were mitigated in phases.
The most significant service impact was mitigated after a hotfix was released at 8:29 AM PT to halt the aforementioned slow-running query. This relieved some resource pressure on the database and allowed customer-facing service requests from Studio and EE to re-establish connections with it. Additional resources were also spun up to process the email delivery queue backlog that had been growing during the incident; the backlog started draining at 12:02 PM PT. Almost all campaign emails were confirmed delivered by 1:10 PM PT, with the exception of ~150 email messages that had experienced an internal error but were later cleared and delivered.

Recurrence Prevention:

The following actions have either been taken or been identified as follow-ups to commit to as part of the formal RCA (Root Cause Assessment) process:

  • Moved the slow-running query to run during after-hours (off customer peak hours).
  • Isolate the campaign email delivery back-end service to its own database cluster to avoid the general database impact on email deliveries.
  • Enhance our software change management policies to release risky back-end changes behind a feature flag for a more controlled release.
  • Update post-incident service verification to ensure that user sync processing has been fully restored and remains functional.
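
Two of the actions above, running the heavy query off peak and gating risky changes behind a feature flag, can be combined into a single guard. The following is a minimal sketch under stated assumptions: the function names are hypothetical, the peak start of 5:54 AM PT comes from this report, and the peak end of 6:00 PM PT is an assumption because the report does not state when peak hours end.

```python
from datetime import time

# Peak window in Pacific Time. The start comes from the report;
# the end is an assumed value, since the report does not give one.
PEAK_START = time(5, 54)
PEAK_END = time(18, 0)


def is_off_peak(local_time: time) -> bool:
    """Return True when the given PT time of day falls outside the peak window."""
    return not (PEAK_START <= local_time < PEAK_END)


def may_run_heavy_query(flag_enabled: bool, local_time: time) -> bool:
    """Allow the heavy query only if its flag is on and it is off peak."""
    return flag_enabled and is_off_peak(local_time)
```

Under these assumptions, `may_run_heavy_query(True, time(2, 0))` allows the query, while the same call at `time(8, 29)` or with the flag disabled does not. Keeping the flag as a separate input means the query can be switched off without a code release.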
Posted Nov 16, 2024 - 02:06 UTC

Resolved
All impacted Firstup platform endpoints have remained stable and fully available.

This incident is now resolved.
Posted Nov 16, 2024 - 02:03 UTC
Monitoring
The hotfix for the campaign email delivery issue has now been vetted and deployed in the production environment. Campaign emails have resumed delivery and may take a few minutes to reach the end user’s inbox.

The affected services will now be placed back under monitoring.
Posted Nov 13, 2024 - 20:01 UTC
Update
We continue working on developing and verifying another hotfix for the email delivery issue.

Another update within 1 hour.
Posted Nov 13, 2024 - 19:48 UTC
Update
We have identified a potential root cause for the residual impact of this incident where email campaigns are yet to be delivered following the hotfix. We are working on developing and verifying another hotfix for that issue.

Another update within 1 hour.
Posted Nov 13, 2024 - 18:47 UTC
Identified
We are currently investigating a residual impact of this incident where some email campaigns are yet to be delivered after the hotfix was released.

We will provide an update within 1 hour.
Posted Nov 13, 2024 - 17:50 UTC
Monitoring
The hotfix for this incident has been released in the production environment. All services should now be restored.

We will be placing the affected services in a monitoring state for now.
Posted Nov 13, 2024 - 16:35 UTC
Update
A hotfix for this incident has been developed and is currently being tested in our staging environment. Once vetted, it will be released in the production environment.

Another update within 1 hour.
Posted Nov 13, 2024 - 16:25 UTC
Identified
We have identified a potential root cause of this incident and are working towards mitigation. Only components in the US datacenter are affected. The EU datacenter remains unaffected.

Another update within 1 hour.
Posted Nov 13, 2024 - 15:30 UTC
Investigating
We are currently investigating reports of intermittent Studio slow performance issues.

We will provide you with an update within 1 hour.
Posted Nov 13, 2024 - 15:07 UTC
This incident affected: Products (Creator Studio, Classic Studio, Web Experience).