Summary:
On May 2nd, 2024, at 10:39 AM EDT, a system monitor alerted us to a potential issue: the disk space on a service used to pass messages between backend workers was approaching its free-disk-space limit. As we began investigating the alert, customer reports of issues affecting various Studio functions started coming in.
A platform incident was declared at 12:41 PM EDT, and the incident response team was engaged to diagnose the reported issues.
Impact:
All Studio users who attempted to connect to Studio or initiate new Studio activities were affected.
Root Cause:
The incident response team identified that one of the queues in the impacted service was backed up and consuming excessive memory, which led to the out-of-memory condition. As a result, new Studio service requests could not establish connections to the service, and this inability to connect manifested as the customer-reported issues described above.
Mitigation:
To restore Studio services, the backed-up queue was purged at around 1:00 PM EDT, freeing memory and increasing the available disk space for the service. This allowed the other queues to continue processing and allowed new Studio service requests to connect to the service and process successfully. Affected transactions that were stuck at the time of the purge, such as scheduled campaigns that did not publish, were published manually. No customer data was lost in purging the queue.
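The report does not name the messaging service or its API, so as a minimal sketch only: the mitigation amounts to purging the backlog while recording which stuck transactions need manual follow-up. The names `purge_and_republish` and `is_stuck` are hypothetical, and an in-memory deque stands in for the broker queue.

```python
from collections import deque

def purge_and_republish(queue, is_stuck):
    """Purge a backed-up queue, keeping a record of items that must be
    manually republished afterwards (e.g. scheduled campaigns that did
    not publish). `queue` and `is_stuck` are illustrative stand-ins,
    not the production API.
    """
    stuck = [msg for msg in queue if is_stuck(msg)]
    queue.clear()   # drop the backlog, freeing the memory it held
    return stuck    # hand back for manual republishing

# Example: a backlog in which two messages are stuck campaigns
backlog = deque(["campaign:42", "event:ping", "campaign:7"])
to_republish = purge_and_republish(backlog, lambda m: m.startswith("campaign:"))
```

After the purge the queue is empty and `to_republish` holds the campaigns to publish manually, mirroring the manual publication step described above.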
Recurrence Prevention:
To prevent a recurrence of this incident, we have since deployed a hotfix that checks whether the queue size exceeds a configured limit before enqueueing more messages, preventing this exact out-of-memory failure scenario.
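The actual hotfix code and limit are not included in this report; the following is a minimal sketch of the idea, with the class name `BoundedQueue` and the threshold value assumed for illustration. New messages are rejected once the queue reaches its configured size limit, rather than letting the backlog grow until the service runs out of memory.

```python
from collections import deque

QUEUE_SIZE_LIMIT = 10_000  # assumed threshold; the real limit is not stated

class BoundedQueue:
    """Sketch of the hotfix behaviour: refuse to enqueue past a
    configured size limit so a backed-up consumer cannot exhaust
    the service's memory."""

    def __init__(self, limit=QUEUE_SIZE_LIMIT):
        self.limit = limit
        self._items = deque()

    def enqueue(self, message):
        if len(self._items) >= self.limit:
            return False  # back-pressure: caller must retry, defer, or drop
        self._items.append(message)
        return True

# Example with a tiny limit: the third message is rejected
q = BoundedQueue(limit=2)
results = [q.enqueue(m) for m in ("a", "b", "c")]
```

Returning a rejection (rather than blocking or crashing) pushes the overflow decision back to the producer, which is one common way to implement the size check the hotfix describes.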