Employee Experience and Shortcuts Service Disruption

Incident Report for Firstup

Postmortem

Summary:

On May 15th, 2025, starting at around 6:12 AM PT, we received reports that some end users on the Firstup platform were unable to load the Employee Experience, experienced slow loading, or saw that some shortcuts were missing, with an error message displayed above the available shortcuts. Initial investigation determined that the issue was reproducible but intermittent. A platform incident was declared at 6:49 AM PT and published on our Status Page at 6:58 AM PT, and the incident response team continued investigating.

Severity:

Sev-2

Scope:

Some end users of customers on the Firstup platform who were attempting to access or navigate within the Employee Experience.

Impact:

For the duration of this incident (1 hr 53 mins), some end users on the Firstup platform were intermittently unable to load the Employee Experience, experienced slow loading, or saw that some shortcuts were missing, with an error message displayed above the available shortcuts.

Root Cause:

The root cause of this incident was determined to be an Out-Of-Memory (OOM) condition in the back-end service that provides the primary API used by the web and mobile experience applications. This OOM condition was attributed to a memory leak in the version of the programming language this service is written in, leaving the service intermittently unable to fulfill requests from the Employee Experience.
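
The report does not name the service's language or its monitoring setup. As a minimal sketch of how a steady memory leak might be flagged before it reaches an Out-Of-Memory condition, the Python snippet below checks whether periodic memory samples keep rising. The function name, sample values, and thresholds are hypothetical and are not taken from the Firstup platform.

```python
from typing import Sequence


def looks_like_leak(samples_mb: Sequence[float],
                    min_samples: int = 5,
                    min_growth_mb: float = 50.0) -> bool:
    """Return True when memory usage rises across every consecutive sample
    and the total rise is at least min_growth_mb.

    samples_mb: memory readings in megabytes, oldest first (hypothetical data).
    """
    if len(samples_mb) < min_samples:
        return False  # not enough data to judge a trend
    rising = all(later > earlier
                 for earlier, later in zip(samples_mb, samples_mb[1:]))
    total_growth = samples_mb[-1] - samples_mb[0]
    return rising and total_growth >= min_growth_mb


# Hypothetical readings: one climbing steadily, one fluctuating normally.
leaking = [512.0, 540.0, 575.0, 610.0, 660.0]
healthy = [512.0, 530.0, 505.0, 525.0, 510.0]

print(looks_like_leak(leaking))  # True
print(looks_like_leak(healthy))  # False
```

A strictly rising check is deliberately simple. A service with garbage collection would likely need a smoothed trend rather than a sample-by-sample comparison.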

Mitigation:

To mitigate this incident, the incident response team performed several actions to restore services: restarting the back-end service to free up memory, temporarily increasing the memory size, and scaling up the number of processing pods (in effect replacing pods that were stuck in a CrashLoopBackOff state because they were unable to communicate with the impacted service) to handle requests from the Employee Experience during the peak traffic time.
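
The report does not say how the stuck pods were identified. As an illustrative sketch only, the Python function below scans a pod list in the JSON shape that `kubectl get pods -o json` produces and returns the pods with a container waiting in CrashLoopBackOff. The function name and the sample pod names are hypothetical.

```python
from typing import Any, Dict, List


def crash_looping_pods(pod_list: Dict[str, Any]) -> List[str]:
    """Return names of pods with a container waiting in CrashLoopBackOff."""
    names = []
    for pod in pod_list.get("items", []):
        statuses = pod.get("status", {}).get("containerStatuses") or []
        for status in statuses:
            waiting = status.get("state", {}).get("waiting") or {}
            if waiting.get("reason") == "CrashLoopBackOff":
                names.append(pod["metadata"]["name"])
                break  # one match is enough for this pod
    return names


# Hypothetical pod list; the names are illustrative only.
sample = {
    "items": [
        {"metadata": {"name": "worker-a"},
         "status": {"containerStatuses": [
             {"state": {"waiting": {"reason": "CrashLoopBackOff"}}}]}},
        {"metadata": {"name": "worker-b"},
         "status": {"containerStatuses": [
             {"state": {"running": {"startedAt": "2025-05-15T13:00:00Z"}}}]}},
    ]
}

print(crash_looping_pods(sample))  # ['worker-a']
```

Listing the affected pods first makes it easier to confirm afterwards that scaling up actually replaced them.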

Recurrence Prevention:

To prevent this incident from happening again, we have upgraded to a more recent version of the programming library identified as the cause of the original memory leak.

Posted Jun 04, 2025 - 20:48 UTC

Resolved

This incident is now resolved, and all impacted services have remained stable and available during the monitoring phase. A Root Cause Assessment will be published here as soon as a postmortem of the incident is completed.

Posted May 23, 2025 - 16:33 UTC

Monitoring

We have applied a fix for the issue affecting Employee Experience and Shortcuts and are currently monitoring the platform.

Posted May 15, 2025 - 15:11 UTC

Update

We are continuing to investigate the service disruption affecting the Employee Experience and Shortcuts.

Our next update will be in 30 minutes.

Posted May 15, 2025 - 14:39 UTC

Investigating

We are investigating a service disruption affecting the Employee Experience and Shortcuts.

The Employee Experience (web and mobile) may be intermittently unavailable for some users, and we have reports of Shortcuts missing for users who are able to access the experience.

Our next update will be in 30 minutes.

Posted May 15, 2025 - 13:58 UTC

This incident affected: Products (Web Experience).