Web Experience is unavailable

Incident Report for Firstup

Postmortem

Summary:

On October 10th, 2025, we received alerts indicating a service disruption on the Web Experience channel, along with customer reports that impacted users were seeing the following error on the Web Experience: ‘Oops! Something went wrong.’ This behavior was attributed to a change that enabled a verbose log setting on a back-end system, which generated an unexpected amount of load on the database for the US1 region. A verbose log setting records every action and event in detailed, step-by-step form to help with troubleshooting, debugging, and performance optimization. Adding capacity to handle the additional load restored service for the Web Experience. However, some residual slowness remained when navigating the Web Experience, and it was fully mitigated by disabling the verbose log setting.

Impact:

A system error message was returned to any end-user attempting to access/use the Web Experience between 05:00 AM and 05:56 AM PT (56 minutes).

‘Oops! Something went wrong.’

Residual slowness was seen between 05:56 AM and 10:32 AM PT (4 hours 36 minutes).
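As a quick cross-check of the impact windows, the durations can be recomputed from the start and end times reported in this section. This is only an illustrative sketch: the helper name `minutes_between` is hypothetical, and the times are written in 24-hour form (all PT, same day) to keep parsing simple.

```python
from datetime import datetime


def minutes_between(start: str, end: str) -> int:
    """Return whole minutes between two same-day clock times in HH:MM (24-hour) form."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


# Impact windows as reported above (PT).
outage_minutes = minutes_between("05:00", "05:56")
slowness_minutes = minutes_between("05:56", "10:32")

print(outage_minutes)                # 56
print(divmod(slowness_minutes, 60))  # (4, 36), i.e. 4 hours 36 minutes
```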

Root Cause:

The root cause was a change made on October 9th to enable a verbose log setting on a back-end system. The setting generated an unexpected amount of load on the database for the US1 region which, coupled with normal peaks in traffic, sent several services into a restart loop. This restart loop ultimately could not be contained without causing a disruption in service to our Customers.

The verbose log setting was enabled outside of business hours to assist with troubleshooting and debugging a prior incident on October 9th, in which Studio Insights Reports were not available for some users. The setting increased the granularity of the events logged on database activity for audit/compliance, and this unexpectedly generated a high level of load on the database during peak morning traffic hours on the US East Coast.

The frequent application restarts, which caused the most significant impact to users, occurred because the database became I/O (input/output) bound while writing the verbose log data. This blocked other services from interfacing with the database during peak traffic time.

Mitigation:

A platform incident was declared at 05:21 AM PT on Friday, October 10th, and the incident team restored service at 05:56 AM PT. Additional capacity was implemented to handle the unexpected load, which mitigated the unavailability of the Web Experience channel. However, some users continued to experience residual slowness and latency until 10:32 AM PT, when the verbose log setting was disabled. Database load then returned to normal levels.

Recurrence Prevention:

The following actions have been taken or identified as follow-up commitments as part of the formal RCA (Root Cause Assessment) process:

  • A case was opened with AWS, our Cloud Provider, to understand the nature of the load and to investigate how to optimize the logging without causing a platform performance degradation. This setting will not be re-enabled without full confidence that the issue will not recur.
  • Testing and Analysis - If this setting needs to be enabled again, more rigorous testing and analysis will be performed on UAT to detect any errors/irregularities before it is enabled in Production.
  • Improve real-time monitoring for services going into a restart loop and for average latency times so an unexpected spike can be detected faster.
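
To illustrate the kind of check the monitoring item above describes, here is a minimal sketch of restart-loop and latency-spike detection. It is not Firstup's actual monitoring. The class name, function name, thresholds (3 restarts within 300 seconds, latency above 3x the recent average), and service names are all hypothetical assumptions chosen for the example.

```python
from collections import deque


class RestartLoopDetector:
    """Flags a service that restarts too many times within a sliding time window.

    Thresholds are illustrative assumptions, not values from the incident.
    """

    def __init__(self, max_restarts: int = 3, window_seconds: float = 300.0):
        self.max_restarts = max_restarts
        self.window_seconds = window_seconds
        self._restarts = {}  # service name -> deque of restart timestamps (seconds)

    def record_restart(self, service: str, timestamp: float) -> bool:
        """Record one restart; return True if the service now looks stuck in a loop."""
        times = self._restarts.setdefault(service, deque())
        times.append(timestamp)
        # Drop restarts that have fallen out of the sliding window.
        while times and timestamp - times[0] > self.window_seconds:
            times.popleft()
        return len(times) >= self.max_restarts


def latency_spike(recent_ms, latest_ms, factor=3.0):
    """Return True if the latest latency exceeds `factor` times the recent average."""
    if not recent_ms:
        return False  # no baseline yet, so nothing to compare against
    baseline = sum(recent_ms) / len(recent_ms)
    return latest_ms > factor * baseline


# Example usage with made-up data.
detector = RestartLoopDetector(max_restarts=3, window_seconds=300.0)
for t in (0.0, 100.0, 200.0):
    looping = detector.record_restart("web-experience", t)
print(looping)                                    # True: 3 restarts within 300 seconds
print(latency_spike([100.0, 110.0, 90.0], 500.0)) # True: 500 ms > 3 x 100 ms average
```

A sliding window is used rather than a simple counter so that occasional, widely spaced restarts do not trigger an alert; only a burst of restarts does.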
Posted Oct 24, 2025 - 23:52 UTC

Resolved

This incident is resolved.
Posted Oct 24, 2025 - 23:45 UTC

Monitoring

A fix has been implemented and we are monitoring the results.
Posted Oct 20, 2025 - 16:38 UTC

Investigating

Moving back to phase I (unmitigated). Web Experience channel is frequently displaying a blank page followed by a 502 Gateway Error.
Posted Oct 20, 2025 - 16:37 UTC

Update

Residual slowness has been fully addressed and all components are now fully operational. The root cause has been ascertained and the impact has been fully mitigated. Risk of a recurrence is very low, but additional investigation is required for full resolution. This incident will remain in a monitoring status until that investigation is complete and a recurrence prevention plan is committed, at which time an RCA (Root Cause Assessment) will be published.
Posted Oct 10, 2025 - 17:40 UTC

Update

All services have been restored and are fully operational. We will continue to monitor the impacted services.
Posted Oct 10, 2025 - 17:39 UTC

Update

We continue to monitor for any further issues.
Posted Oct 10, 2025 - 16:48 UTC

Monitoring

A fix has been implemented and we are monitoring the results.
Posted Oct 10, 2025 - 13:48 UTC

Update

We are continuing to work on a fix for this issue.
Posted Oct 10, 2025 - 13:28 UTC

Identified

Web Experience is available and the load times have improved - we are continuing to monitor.
Posted Oct 10, 2025 - 13:22 UTC

Update

Work is continuing to fully restore service. Improvements have been seen with Web Experience loading; however, it is still slow to load for some customers.
Posted Oct 10, 2025 - 12:58 UTC

Update

EU customers are not impacted. Investigation is continuing.
Posted Oct 10, 2025 - 12:31 UTC

Investigating

We are currently investigating a new platform incident impacting the Web Experience channel for some users. An 'Oops! Something went wrong.' error is being displayed on some programs. Updates to follow ASAP.
Posted Oct 10, 2025 - 12:27 UTC
This incident affected: Platforms (US Firstup Platform) and Products (Web Experience).