Intermittent unavailability issues are being experienced for both Studio and Web Experience.

Incident Report for Firstup

Postmortem

Summary:

From approximately 11:08 am to 11:38 am PT (18:08 to 18:38 UTC) on Thursday, August 22nd, both Studio and Web Experience were unavailable due to the release of Version 2 of Personalized Fields (PFV2), a new feature in the Q3 quarterly update that was more resource intensive than initially planned. This caused high CPU usage, increased query latency and database connection pool exhaustion.

Impact:

This incident affected users who attempted to access Studio and Web Experience between 11:08 am and 11:38 am PT. Affected users saw the following errors on the frontend of the platform:

We’re sorry, but something went wrong.
502 Bad Gateway.
There was an error processing your request. Please try again.

Root Cause:

The root cause was determined to be the release of Version 2 of Personalized Fields (PFV2), a new feature in the Q3 quarterly update that was more resource intensive than initially planned. The feature caused a significant increase in CPU usage and query latency on the shared database cluster, along with database connection pool exhaustion. This resulted in the Studio and Web Experience unavailability and the error messages observed by impacted users.

Mitigation:

The immediate impact was mitigated by temporarily disabling the newly released feature that was causing excessive resource consumption. The cache Time-To-Live (TTL) was also changed from 1 minute to 3 hours to reduce load and stabilize performance.
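
The effect of the TTL change on backend load can be approximated with simple arithmetic. The sketch below assumes a cache key that is requested continuously, so that every expiry triggers exactly one backend fetch. The function name is hypothetical, and real figures depend on traffic patterns that this report does not describe.

```python
def max_refreshes_per_day(ttl_seconds: int) -> int:
    """Upper bound on how often one cached key is re-fetched in 24 hours.

    Assumes the key is always in demand, so each expiry causes one fetch.
    """
    seconds_per_day = 24 * 60 * 60
    return seconds_per_day // ttl_seconds


before = max_refreshes_per_day(60)            # 1-minute TTL
after = max_refreshes_per_day(3 * 60 * 60)    # 3-hour TTL
reduction_factor = before // after

print(before, after, reduction_factor)  # prints "1440 8 180"
```

Under these assumptions, each hot key is refreshed at most 8 times a day instead of 1,440 times, at the cost of serving data that can be up to 3 hours old.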

After service was restored, we conducted platform tuning and scaled up infrastructure outside business hours to accommodate the increased load introduced by this new feature.

Recurrence Prevention:

To prevent a recurrence of this incident, the following actions have been or are being implemented:

Load Testing and Analysis - More rigorous load testing and analysis to detect N+1 calls or latency spikes before a feature goes live.
Infrastructure Planning and Caching Strategy - Refactor the caching for the affected feature, including pre-warming caches in batches to prevent cache-miss cascades, and optimizing the infrastructure to handle increased load efficiently whilst only caching what is needed.
Data Cleanup - Remove custom attributes for blocked users who have been inactive for a specific period to reduce table size and improve query performance.
Feature Flagging and Gradual Rollouts - Future high-risk changes will be rolled out gradually behind feature flags, with improved monitoring of resource usage and performance before full deployment.
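
Two of the actions above, batched cache pre-warming and gradual rollouts, can be illustrated with a minimal sketch. This is not Firstup's implementation. Every name here (chunks, prewarm, in_rollout, fake_load_many) is hypothetical, the cache is a plain dictionary, and the backend is a stand-in function.

```python
import hashlib
from typing import Callable, Dict, Iterable, Iterator, List


def chunks(keys: Iterable[str], size: int) -> Iterator[List[str]]:
    """Yield successive lists of at most `size` keys."""
    batch: List[str] = []
    for key in keys:
        batch.append(key)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def prewarm(cache: Dict[str, str], keys: Iterable[str],
            load_many: Callable[[List[str]], Dict[str, str]],
            batch_size: int) -> int:
    """Fill `cache` one batch at a time; return the number of backend calls."""
    calls = 0
    for batch in chunks(keys, batch_size):
        cache.update(load_many(batch))  # one backend call per batch, not per key
        calls += 1
    return calls


def in_rollout(user_id: str, percent: int) -> bool:
    """Hash the user into one of 100 stable buckets; enable the first `percent`."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100 < percent


def fake_load_many(keys: List[str]) -> Dict[str, str]:
    """Stand-in for a real bulk database query (assumption for this sketch)."""
    return {key: f"value-for-{key}" for key in keys}


cache: Dict[str, str] = {}
calls = prewarm(cache, [f"user-{i}" for i in range(10)], fake_load_many, batch_size=4)
print(calls, len(cache))  # prints "3 10"
```

Loading keys in batches replaces one query per key with one query per batch, which is also how an N+1 pattern is typically removed. Hashing the user ID keeps each user's rollout decision stable between requests, so raising the percentage only adds users and never flips existing ones back.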

Posted Aug 30, 2024 - 19:28 UTC

Resolved

A fix was put in place and the service disruption to the platform has been resolved.

Thank you for your patience whilst we carried out our investigation. Please contact Support at support.firstup.io should you encounter any further issues.

Posted Aug 23, 2024 - 15:35 UTC

Update

We are continuing to work on a fix for this issue.

Posted Aug 22, 2024 - 18:39 UTC

Identified

Both Studio and Web Experience are now available. The component which is causing the issue has been identified and steps have been taken to mitigate. Further updates to follow.

Posted Aug 22, 2024 - 18:39 UTC

Investigating

We are urgently investigating an issue with intermittent unavailability of the platform being experienced for both Studio and Web Experience. Updates to follow asap.

Posted Aug 22, 2024 - 18:27 UTC

This incident affected: Platforms (US Firstup Platform) and Products (Creator Studio, Web Experience, Mobile Experience).