Platform Service Degradation - Platform Communications Slightly Delayed

Incident Report for Firstup

Postmortem

Summary:

On Wednesday, January 22nd, 2025, starting at around 2:22 PM PT, we began receiving reports of delays in the delivery of campaign emails to end users. After correlating these reports with system monitors and alerts, we declared a platform incident at 3:14 PM PT and published it on our status page.

Severity:

Sev2

Scope:

The scope of this service degradation included any communications (campaign emails, push notifications, assistant messages) on the US platform.

Impact:

During the service degradation, message delivery on the US platform was delayed by anywhere from a few minutes to as much as 2.5 hours. Affected messages included campaign emails, push notifications, and assistant messages.

Root Cause:

The incident response team identified the root cause of the incident as a long-running query that over-utilized the resources of the database serving it. As a result, other database requests were held in a queue and were processed only as database resources became available.
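The queuing effect described above can be illustrated with a minimal sketch. This is a simplified, hypothetical model (requests that all arrive at once and are served first-in, first-out by a fixed number of workers), not a description of the actual platform architecture; the function name and the durations are assumptions chosen for illustration.

```python
import heapq


def completion_times(durations, workers=1):
    """Return when each request finishes under first-in, first-out service.

    Hypothetical model: every request arrives at time 0 and is handed, in
    order, to whichever worker becomes free first.
    """
    free_at = [0.0] * workers  # time at which each worker is next available
    heapq.heapify(free_at)
    finished = []
    for duration in durations:
        start = heapq.heappop(free_at)  # earliest available worker
        end = start + duration
        heapq.heappush(free_at, end)
        finished.append(end)
    return finished


# Three short requests finish quickly on a single worker.
print(completion_times([1, 1, 1]))
# One long-running request at the front delays everything queued behind it.
print(completion_times([100, 1, 1]))
```

In this model the short requests are not slower themselves; they simply wait until the long-running one releases the worker, which matches the delayed-but-delivered behavior reported during the incident.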

Mitigation:

To mitigate this incident, the long-running query was disabled at 3:30 PM PT, which freed up database resources. This allowed the queued messages to be processed and delivered. System monitors confirmed normal system performance at 4:39 PM PT, and the incident was declared resolved at 4:43 PM PT after the database queue was confirmed to have caught up.

Recurrence Prevention:

The following actions have been taken or identified as follow-up items to prevent a recurrence of this incident:

The offending query has been optimized to utilize minimal database resources.
Additional database timeout parameters will be added to better address long-running queries that result in resource contention.
The existing system monitors and alerts have been enhanced to include links to runbooks, to aid in quickly mitigating similar issues in the future.
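As an illustration of the timeout item above, the following is a minimal sketch of aborting a query that exceeds a time budget. It uses Python's built-in sqlite3 module purely as a stand-in, since this report does not name the database involved; the helper name, the demo query, and the time limits are assumptions. Production databases usually offer a server-side equivalent (for example, PostgreSQL's statement_timeout setting), which would normally be preferred over client-side logic like this.

```python
import sqlite3
import time

# Demo only: a recursive query that never terminates on its own.
RUNAWAY_SQL = (
    "WITH RECURSIVE c(x) AS (SELECT 1 UNION ALL SELECT x + 1 FROM c) "
    "SELECT count(*) FROM c"
)


def run_with_timeout(conn, sql, timeout_s):
    """Run sql and abort it if it is still running after timeout_s seconds.

    Hypothetical helper. SQLite calls the progress handler every 1000
    virtual-machine instructions; a nonzero return value cancels the query,
    which surfaces as sqlite3.OperationalError.
    """
    deadline = time.monotonic() + timeout_s
    conn.set_progress_handler(
        lambda: 1 if time.monotonic() > deadline else 0, 1000
    )
    try:
        return conn.execute(sql).fetchall()
    finally:
        conn.set_progress_handler(None, 0)  # remove the handler again


conn = sqlite3.connect(":memory:")
print(run_with_timeout(conn, "SELECT 1", 1.0))  # well within the budget
try:
    run_with_timeout(conn, RUNAWAY_SQL, 0.05)
except sqlite3.OperationalError as exc:
    print("query aborted:", exc)
```

The point of such a limit is that a single expensive query fails fast instead of holding resources that queued requests are waiting for.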

Posted Jan 31, 2025 - 21:59 UTC

Resolved

This service degradation is now resolved, and all impacted services are fully available and functional.

Posted Jan 30, 2025 - 20:53 UTC

Monitoring

We identified that a long-running query overutilized the database's processing power, which slowed down the delivery of other campaign communications. This has now been mitigated, and the database is back to a normal state, with all communications being processed as expected.

We will be placing the affected services under monitoring for now.

Posted Jan 23, 2025 - 00:50 UTC

Update

As we continue to investigate this incident, we can confirm that campaign communications continue to be delivered, albeit with slight delays.

Another update will be provided within 1 hour.

Posted Jan 23, 2025 - 00:26 UTC

Investigating

We are investigating reports that campaign communication deliveries on the US platform are slightly delayed. This may include campaign emails, push notifications, or assistant notifications.

An update will be provided within 1 hour.

Posted Jan 22, 2025 - 23:25 UTC

This incident affected: Products (Creator Studio, Classic Studio, Web Experience, Mobile Experience).