Platform Service Degradation - EE Shortcuts and Assistant Intermittently Unavailable
Incident Report for Firstup
Postmortem

Summary:

On September 30th, 2024, beginning at approximately 1:24 PM PDT (20:24 UTC), we started receiving reports of Shortcuts intermittently being unavailable and the Assistant returning an error in the Employee Experience. A platform incident was declared at 2:36 PM PDT (21:36 UTC) after initial investigations revealed the issue to be platform-wide.

Severity:

Sev2

Scope:

Any user on the US platform accessing the Web or Mobile Experiences intermittently experienced missing Shortcuts and/or received an error message while accessing the Assistant. A refresh of the Employee Experience page occasionally restored these endpoints. All other services in the Employee Experience remained available and functional.

Impact:

Shortcuts and the Assistant endpoints in the Employee Experience were intermittently unavailable during the incident.

Root Cause:

The root cause was an unusually high number of new user integrations introduced within a short period of time, which exacerbated a newly uncovered, non-optimized content caching behavior. This caused downstream latency and increased error rates from the web service responsible for rendering Shortcuts and the Assistant notification page.

Mitigation:

The immediate impact was mitigated by restarting the Employee Experience integrations API, and services were restored by 2:42 PM PDT (21:42 UTC). While investigations into the root cause continued, the incident recurred the following day – October 1st, 2024, at 12:54 PM PDT (19:54 UTC). The Employee Experience integrations API and the dependent Employee Experience user-integrations request processing service (Pythia) were restarted, restoring Shortcuts and the Assistant endpoints by 1:46 PM PDT (20:46 UTC). Cache resources for Pythia were increased to mitigate the observed latency.

Recurrence Prevention:

To prevent this incident from recurring, our engineering incident response team:

  • Has developed a fix to optimize how user-integrations requests use the cache, reducing memory consumption and eliminating the resulting latency.
    • This fix will be released during our scheduled Software Release maintenance window on October 15th, 2024.
  • Will add a monitoring and alerting dashboard for the Employee Experience user-integrations request processing service (Pythia).
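
For readers who want a concrete picture of why bounding a cache limits memory growth, the following is a minimal illustrative sketch only. It is not the fix that was developed for Pythia, whose implementation is not described in this report. The class name BoundedCache, the get_or_load method, and the max_entries parameter are all hypothetical, and least-recently-used eviction is an assumed policy chosen for the example.

```python
from collections import OrderedDict
from typing import Any, Callable, Hashable


class BoundedCache:
    """Hypothetical LRU cache that never holds more than max_entries items."""

    def __init__(self, max_entries: int) -> None:
        if max_entries < 1:
            raise ValueError("max_entries must be at least 1")
        self.max_entries = max_entries
        self._entries: "OrderedDict[Hashable, Any]" = OrderedDict()

    def get_or_load(self, key: Hashable, loader: Callable[[Hashable], Any]) -> Any:
        # Cache hit: mark the entry as most recently used and return it.
        if key in self._entries:
            self._entries.move_to_end(key)
            return self._entries[key]
        # Cache miss: load the value, store it, then evict the least
        # recently used entry if the size limit has been exceeded.
        value = loader(key)
        self._entries[key] = value
        if len(self._entries) > self.max_entries:
            self._entries.popitem(last=False)
        return value

    def __len__(self) -> int:
        return len(self._entries)
```

With a fixed entry limit, a sudden burst of new keys (for example, many newly added integrations) displaces older entries instead of growing memory without bound, at the cost of reloading entries that were evicted.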

Posted Oct 09, 2024 - 14:48 UTC

Resolved
Employee Experience Shortcuts and the Assistant have remained available and fully functional throughout the monitoring phase of this incident.

This incident is now resolved.
Posted Oct 09, 2024 - 14:47 UTC
Monitoring
We have restarted the offending backend service to restore the affected functionality. Shortcuts and the Assistant are now available.

We will place these services back under monitoring for now.
Posted Oct 01, 2024 - 21:19 UTC
Investigating
We are currently investigating a recurrence of this issue.

We will provide you with an update in 1 hour.
Posted Oct 01, 2024 - 20:28 UTC
Monitoring
We have identified and restarted the offending backend services to restore the affected services. Shortcuts and the Assistant are now available.

We will place these services under monitoring for now.
Posted Sep 30, 2024 - 21:55 UTC
Investigating
We are currently investigating reports of Shortcuts in the Employee Experience being intermittently unavailable, as well as an error message being returned when trying to access the Assistant.

We will provide you with an update in 1 hour.
Posted Sep 30, 2024 - 21:36 UTC
This incident affected: Products (Web Experience, Mobile Experience).