Platform Service Degradation - Scheduled Campaigns Not Publishing Or Delayed
Incident Report for Firstup
Postmortem

Summary:

On March 15th, 2024, we began receiving reports that scheduled campaigns were either publishing later than their scheduled time or not publishing at all.

Impact:

The impact was limited to campaigns on the Firstup platform that were scheduled to publish on March 15th, 2024, between 1:00 AM ET (05:00 UTC) and 8:04 PM ET (March 16th, 2024 - 00:04 UTC).

Root Cause:

The root cause was a regression introduced by a software change to the “scheduled campaign callback service”, which was deployed during our scheduled software release window the previous day. The regression caused the callback to the “scheduling service”, which publishes a scheduled campaign at its scheduled time, to fail.

Mitigation:

A hotfix addressing the regression in the campaign scheduling software was deployed by 8:04 PM ET (March 16th, 2024 - 00:04 UTC). Any delayed scheduled campaigns had also been manually published by that time.

Recurrence Prevention:

The Incident Response Team has taken the following actions in an effort to prevent a recurrence of this incident:

  • Implemented additional pre-release regression testing around the “scheduling service”.
  • Documented the SQL rake task used to identify any failed/delayed scheduled campaigns in a runbook to aid in quickly mitigating any future similar incidents.
  • Created monitors that alert us on the first instance of a failed/delayed scheduled campaign, so that we can address campaign scheduling issues proactively and prevent similar platform-wide incidents.
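
For illustration only, the kind of check such a monitor might run can be sketched as follows. This is a minimal sketch under stated assumptions, not the monitor or rake task actually used: the record shape (ScheduledCampaign), the field names, the function names, and the five-minute grace period are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List, Optional


@dataclass
class ScheduledCampaign:
    # Hypothetical record shape; the field names are illustrative only.
    campaign_id: int
    scheduled_at: datetime
    published_at: Optional[datetime] = None


def find_failed_or_delayed(
    campaigns: List[ScheduledCampaign],
    now: datetime,
    grace: timedelta = timedelta(minutes=5),  # assumed threshold
) -> List[ScheduledCampaign]:
    """Return campaigns that published more than `grace` after their
    scheduled time, or that are still unpublished more than `grace`
    after their scheduled time."""
    flagged = []
    for c in campaigns:
        # For unpublished campaigns, measure the delay up to `now`.
        effective = c.published_at if c.published_at is not None else now
        if effective - c.scheduled_at > grace:
            flagged.append(c)
    return flagged


def should_alert(campaigns: List[ScheduledCampaign], now: datetime) -> bool:
    """Alert on the first failed/delayed campaign instead of waiting for a pattern."""
    return len(find_failed_or_delayed(campaigns, now)) > 0
```

Alerting on a single flagged campaign trades some noise for earlier detection, which matches the stated goal of catching the first instance.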
Posted Apr 10, 2024 - 20:03 UTC

Resolved
This incident has been fully resolved and all components remain fully operational.
Posted Mar 19, 2024 - 19:58 UTC
Monitoring
A fix has been developed and deployed to mitigate this service degradation.

We have also manually published any impacted scheduled campaigns to this point, if they were not duplicated or manually published by the customer.

We will place the affected services under monitoring for now.
Posted Mar 16, 2024 - 00:15 UTC
Update
We continue to work on a solution to the potential root cause of this service degradation.

We have also manually published any impacted scheduled campaigns to this point, if they were not duplicated or manually published by the customer.

Another update will be provided within 1 hour.
Posted Mar 15, 2024 - 23:23 UTC
Update
We continue to work on a solution to the potential root cause of this service degradation.

We have also manually published any impacted scheduled campaigns to this point, if they were not duplicated or manually published by the customer.

Another update will be provided within 1 hour.
Posted Mar 15, 2024 - 22:02 UTC
Identified
We have identified a potential backend issue that may be the root cause of this service degradation, and are working to resolve it.

Another update will be provided within 1 hour.
Posted Mar 15, 2024 - 21:22 UTC
Update
We continue to investigate the cause of this service degradation, and will provide another update within 1 hour.
Posted Mar 15, 2024 - 21:15 UTC
Investigating
We are currently investigating reports that some scheduled campaigns did not publish at the scheduled time or were delayed in publishing.

We will provide an update within 1 hour.
Posted Mar 15, 2024 - 20:19 UTC
This incident affected: Products (New Studio).