Platform Service Degradation - Studio Malfunctions
Incident Report for Firstup
Postmortem

Summary:

On May 2nd, 2024, at 10:39 AM EDT, a system monitor alerted us to a potential issue: the disk space on a service used to pass messages between backend workers was approaching its critical “free disk space limit”. As we began investigating the alert, customers started reporting issues with various Studio functions, including but not limited to the following:

  • Unable to send test campaigns
  • Processing error messages
  • Test campaign emails are not being delivered
  • White screens
  • Studio loading issues

A platform incident was declared at 12:41 PM EDT, and the incident response team was engaged to diagnose the reported issues.

Impact:
The incident affected all Studio users who attempted to connect to Studio or initiate new Studio activities.

Root Cause:
The incident response team identified that one of the queues in the impacted service was backed up and consuming too much memory, which led to an out-of-memory condition. As a result, new Studio service requests could not establish connections to this service. This inability to connect surfaced as the customer-reported issues listed above.

Mitigation:

To restore Studio services, the backed-up queue was purged at around 1:00 PM EDT to free up memory, which increased the available disk space for the service. This allowed the other queues to continue processing and new Studio service requests to connect to the service and process successfully. Affected transactions that were stuck during the purge, such as scheduled campaigns that did not publish, were published manually. No customer data was lost as a result of purging the queue.

Recurrence Prevention:

To prevent a recurrence of this incident, we have deployed a hotfix that checks whether the queue size is over a set limit before queueing more messages, which prevents this out-of-memory failure scenario.
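
As an illustration only, the check described above can be sketched as a guard that inspects queue depth before accepting a new message. This is a minimal in-memory sketch, not the deployed hotfix: the names GuardedQueue, QueueFullError, and max_depth are hypothetical, and rejecting the message when the limit is reached is an assumption, since the report does not state how over-limit messages are handled.

```python
from collections import deque


class QueueFullError(Exception):
    """Raised when a publish is refused because the queue is at its limit."""


class GuardedQueue:
    """In-memory stand-in for a message queue with a depth check on publish.

    Hypothetical sketch: the real service and its limit are not named
    in the report.
    """

    def __init__(self, max_depth):
        if max_depth <= 0:
            raise ValueError("max_depth must be a positive integer")
        self._items = deque()
        self._max_depth = max_depth

    def depth(self):
        """Return the number of messages currently waiting."""
        return len(self._items)

    def publish(self, message):
        """Queue a message only if the queue is below its configured limit."""
        # Check the size first, so a stalled consumer cannot cause
        # unbounded growth (assumed behaviour: refuse the message).
        if len(self._items) >= self._max_depth:
            raise QueueFullError(
                f"queue depth {len(self._items)} has reached limit {self._max_depth}"
            )
        self._items.append(message)

    def consume(self):
        """Remove and return the oldest message."""
        return self._items.popleft()
```

A caller in this sketch would catch QueueFullError and decide whether to retry later, shed the message, or raise an alert. Refusing work at the producer keeps the memory use of the queue bounded, at the cost of pushing back-pressure handling onto the publishers.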

Posted May 17, 2024 - 23:54 UTC

Resolved
This incident has been resolved.
Posted May 17, 2024 - 23:54 UTC
Monitoring
This service degradation has been mitigated, and we are working to identify the root cause. Studio and its functions are now available.

Please note that there may be some slight delays in campaign deliveries as tasks catch up within our databases.
Posted May 02, 2024 - 17:05 UTC
Investigating
We are currently investigating reports of various Studio functions not performing as expected.

We will provide you with another update in 30 minutes.
Posted May 02, 2024 - 16:50 UTC
This incident affected: Products (New Studio, Classic Studio) and Platforms (US Firstup Platform).