On May 15th, 2025, starting at around 6:12 AM PT, we received reports that some end users on the Firstup platform were unable to load the Employee Experience, experienced slow loading, or saw that some shortcuts were missing, with an error message displayed above the available shortcuts. Initial investigation determined that the issue was reproducible but intermittent. A platform incident was declared at 6:49 AM PT and published on our Status Page at 6:58 AM PT, and the incident response team continued its investigation.
Sev-2
Some end users of customers on the Firstup platform who were attempting to access or navigate within the Employee Experience.
For the duration of this incident (1 hr 53 mins), some end users on the Firstup platform were intermittently unable to load the Employee Experience, experienced slow loading, or saw that some shortcuts were missing, with an error message displayed above the available shortcuts.
The root cause of this incident was an Out-Of-Memory (OOM) condition in the back-end service that provides the primary API used by the web and mobile experience applications. The OOM condition was attributed to a memory leak in the version of the programming language in which this service is written, which left the service intermittently unable to fulfill requests from the Employee Experience.
To mitigate this incident, the incident response team took several actions to restore service: restarting the back-end service to free up memory, temporarily increasing the memory size, and scaling up the number of processing pods (in effect replacing pods that were stuck in a CrashLoopBackOff state because they were unable to communicate with the impacted service) so that requests from the Employee Experience could be served during peak traffic.
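For readers unfamiliar with the CrashLoopBackOff state mentioned above, the following is a minimal illustrative sketch of how such pods can be identified. It is not the tooling the incident response team used. It assumes a Kubernetes environment (suggested by the CrashLoopBackOff state) and input shaped like the JSON that `kubectl get pods -o json` returns; the function name, pod names, and sample data are hypothetical.

```python
def crash_looping_pods(pod_list: dict) -> list[str]:
    """Return the names of pods with a container waiting in CrashLoopBackOff.

    `pod_list` is assumed to have the shape of `kubectl get pods -o json`
    output: a dict with an "items" list of pod objects.
    """
    names = []
    for pod in pod_list.get("items", []):
        statuses = pod.get("status", {}).get("containerStatuses", [])
        for container in statuses:
            # "waiting" is only present while the container is not running.
            waiting = container.get("state", {}).get("waiting") or {}
            if waiting.get("reason") == "CrashLoopBackOff":
                names.append(pod["metadata"]["name"])
                break  # one crash-looping container is enough to flag the pod
    return names


# Hypothetical sample data; pod names are made up for illustration.
SAMPLE = {
    "items": [
        {
            "metadata": {"name": "worker-a"},
            "status": {"containerStatuses": [
                {"state": {"running": {"startedAt": "2025-05-15T13:00:00Z"}}}
            ]},
        },
        {
            "metadata": {"name": "worker-b"},
            "status": {"containerStatuses": [
                {"state": {"waiting": {"reason": "CrashLoopBackOff"}}}
            ]},
        },
    ]
}

print(crash_looping_pods(SAMPLE))  # ['worker-b']
```

Pods flagged this way do not recover until the dependency they are waiting on is healthy again, which is why scaling up effectively replaced them with fresh pods.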
To prevent this incident from recurring, we have upgraded to a more recent version of the programming library identified as the cause of the original memory leak.