Summary:
On Wednesday, January 22, 2025, at around 2:22 PM PT, we began receiving reports of delays in the delivery of campaign emails to end users. After correlating these reports with system monitors and alerts, we declared a platform incident at 3:14 PM PT and published it on our status page.
Severity:
Sev2
Scope:
This service degradation affected all communications (campaign emails, push notifications, and assistant messages) on the US platform.
Impact:
During the service degradation, message delivery on the US platform was delayed by anywhere from a few minutes to as much as 2.5 hours. The delays affected campaign emails, push notifications, and assistant messages.
Root Cause:
The incident response team identified the root cause as a long-running query that over-utilized the resources of the database responding to it. As a result, other database requests remained in the queue and were processed only as database resources became available.
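The queuing effect described above can be illustrated with a toy model. The sketch below is not the platform's database or scheduler: it assumes a single worker that serves requests strictly in arrival order, and every name and number in it (Request, fifo_waits, the 600-second query) is hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Request:
    name: str
    arrival: float  # seconds since the start of the simulation
    service: float  # seconds of database time the request needs


def fifo_waits(requests):
    """Return {name: seconds spent waiting} for a single-worker FIFO queue."""
    free_at = 0.0  # time at which the (hypothetical) database becomes idle
    waits = {}
    for req in sorted(requests, key=lambda r: r.arrival):
        start = max(req.arrival, free_at)  # wait until the worker is free
        waits[req.name] = start - req.arrival
        free_at = start + req.service
    return waits


# Hypothetical workload: one long query followed by three short requests.
requests = [
    Request("long_query", arrival=0.0, service=600.0),
    Request("email_1", arrival=1.0, service=0.5),
    Request("push_1", arrival=2.0, service=0.5),
    Request("assistant_1", arrival=3.0, service=0.5),
]

waits = fifo_waits(requests)
for name, wait in waits.items():
    print(f"{name}: waited {wait:.1f}s")
```

In this model the short requests each wait roughly ten minutes behind the long query, while the same three requests run without it do not wait at all. Real databases serve requests concurrently, so this only approximates how one expensive query can hold up unrelated work.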
Mitigation:
To mitigate this incident, the long-running query was disabled at 3:30 PM PT, which freed up database resources and allowed the queued messages to be processed and delivered. System monitors confirmed normal system performance at 4:39 PM PT, and the incident was declared resolved at 4:43 PM PT, once the database queue was confirmed to have caught up.
Recurrence Prevention:
The following actions have been taken or identified as follow-up items to prevent a recurrence of this incident: