Platform Service Degradation - Error message while trying to access the Employee Experience

Incident Report for Firstup

Postmortem

Summary:

On March 26th, 2025, at 8:48 AM PT, system monitors alerted us to a potential issue with connecting to our database, and our incident response team immediately began investigating the alert. Starting at around 8:55 AM PT, we began receiving customer reports of users seeing an error message while trying to load the Employee Experience. Following initial correlation of related events, a platform incident was declared at 9:04 AM PT and published on our Status Page at 9:11 AM PT.

Severity:

Sev-2

Scope:

Any end-user attempting to access or navigate through the Web or Mobile Employee Experience.

Impact:

A system error message was returned to any end-user attempting to access or navigate through the Web or Mobile Employee Experience between 8:48 AM and 8:58 AM PT (10 minutes).

Root Cause:

The root cause was determined to be an exhaustion of available connections to our database.

Mitigation:

The system self-healed by 8:58 AM PT. Various services associated with the Employee Experience auto-restarted, which released unused or stale connections and reduced the total number of connections to the database. This allowed end-user requests from the Employee Experience to establish new connections with the database.
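
To illustrate the failure and recovery pattern described above, here is a minimal Python sketch of a database with a fixed connection limit. It is a toy model, not our production code: the class names, the limit of 3, and the `park` and `restart_service` methods are all hypothetical and exist only to show how idle connections can block new requests until they are released.

```python
class PoolExhausted(Exception):
    """Raised when the connection limit has been reached."""


class BoundedPool:
    """Toy model of a database with a fixed connection limit (hypothetical)."""

    def __init__(self, max_connections):
        self.max_connections = max_connections
        self.in_use = set()  # connections actively serving a request
        self.idle = set()    # connections held open but doing no work
        self._next_id = 0

    def total(self):
        """Idle connections still count against the limit."""
        return len(self.in_use) + len(self.idle)

    def acquire(self):
        """Open a new connection, or fail if the limit is reached."""
        if self.total() >= self.max_connections:
            raise PoolExhausted("connection limit reached")
        self._next_id += 1
        self.in_use.add(self._next_id)
        return self._next_id

    def park(self, conn_id):
        """Stop using a connection without closing it (a stale connection)."""
        self.in_use.remove(conn_id)
        self.idle.add(conn_id)

    def restart_service(self):
        """Model a service restart: close every idle connection."""
        released = len(self.idle)
        self.idle.clear()
        return released


if __name__ == "__main__":
    pool = BoundedPool(max_connections=3)
    ids = [pool.acquire() for _ in range(3)]
    pool.park(ids[0])
    pool.park(ids[1])
    try:
        pool.acquire()
    except PoolExhausted as exc:
        print("request failed:", exc)
    print("released on restart:", pool.restart_service())
    print("new connection id:", pool.acquire())
```

In this sketch, the request fails even though only one connection is doing real work, because the two parked connections still occupy slots. Once the restart closes them, the next request succeeds.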

Recurrence Prevention:

The following system enhancements have been implemented or identified as follow-up items to prevent a recurrence of this incident due to the same root cause:

  • We have performed database connection tuning by renaming the connections to use the service_name, which is shared by multiple applications, instead of the individual application_name. This has encouraged connection re-use by applications within a specific service and has lowered the overall number of connections to the database by over 50%.
  • We are looking to upgrade the Postgres Ruby library to better handle connection pinning in the database, which will facilitate more connection re-use. This is slated to be completed in our Scheduled Software Release window on 4/15/2025.
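
The effect of the renaming in the first item can be sketched with some simple arithmetic in Python. This is an illustration under stated assumptions, not a description of our actual pooling layer: it assumes connections are grouped into pools by name and that each pool holds a fixed number of connections. The `Client` type, the service and application names, and the pool size of 5 are all hypothetical, and the resulting numbers are not the real figures from this incident.

```python
from typing import Callable, Iterable, NamedTuple


class Client(NamedTuple):
    """A process that connects to the database (hypothetical model)."""
    service_name: str
    application_name: str


def count_connections(
    clients: Iterable[Client],
    pool_key: Callable[[Client], str],
    pool_size: int,
) -> int:
    """Total connections if each distinct pool key gets its own pool."""
    distinct_pools = {pool_key(client) for client in clients}
    return len(distinct_pools) * pool_size


def by_application(client: Client) -> str:
    """One pool per application (the earlier naming scheme)."""
    return client.application_name


def by_service(client: Client) -> str:
    """One pool per service, shared by its applications."""
    return client.service_name


# Hypothetical fleet: two services, each with three applications.
CLIENTS = [
    Client("feed", "feed-web"),
    Client("feed", "feed-worker"),
    Client("feed", "feed-cron"),
    Client("auth", "auth-web"),
    Client("auth", "auth-worker"),
    Client("auth", "auth-cron"),
]

if __name__ == "__main__":
    print("per application:", count_connections(CLIENTS, by_application, 5))
    print("per service:", count_connections(CLIENTS, by_service, 5))
```

Under these assumptions, grouping by service produces fewer distinct pools than grouping by application, so the total number of connections held open drops whenever a service runs more than one application.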
Posted Apr 14, 2025 - 17:27 UTC

Resolved

This service degradation is now resolved, and all associated systems have remained stable and fully available.
Posted Mar 31, 2025 - 15:29 UTC

Monitoring

The Employee Experience is now fully restored and accessible. We will be placing the impacted services under monitoring for now.
Posted Mar 26, 2025 - 16:18 UTC

Investigating

We are currently investigating reports where some users are receiving an error message while trying to log into the Employee Experience. We will provide you with an update in 1 hour.
Posted Mar 26, 2025 - 16:11 UTC
This incident affected: Products (Web Experience, Mobile Experience).