Platform Services Degradation: Direct Delivery and Audiences stuck loading

Incident Report for Firstup

Postmortem

Summary: 

On July 23rd, 2025, we began receiving customer reports of Audience and Direct Delivery pages failing to load. Audience pages were either returning HTTP 500 errors or hanging during page load. This behavior was attributed to a code change released on July 22nd to improve detection of circular audience references. A circular audience reference occurs when an audience ends up referring to itself by being included in another audience that it references, directly or indirectly. This code change had an unintended impact on page loading performance and audience resolution and was subsequently rolled back.

Impact: 

Performance degradation was observed across multiple areas of Creator Studio, blocking some users from publishing Direct Delivery campaigns. The following symptoms were observed while the problem was present:

  1. The Direct Delivery page failed to load or timed out for users across programs.
  2. Audience pages frequently returned 500 errors or hung during initial loading.

Root Cause: 

The root cause was a code change released on July 22nd that improved detection of circular audience references. When a circular audience reference is recalculated, it may produce errors or leave the audience in an erroneous state, in which it appears as archived and cannot be used in new or existing campaigns. The change made to prevent this condition had an unintended side effect: it lowered the cache hit rate for the audience component, which degraded performance in every Studio feature that relies on audience resolution. A cache hit allows the audience compiler to reuse a stored result instead of recompiling the referenced audience from scratch.
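The two ideas above, circular-reference detection and cached audience compilation, can be illustrated with a minimal sketch. This is not the platform's implementation: the `AudienceCompiler` class, the `CircularAudienceError` exception, and the `members`/`includes` definition format are all hypothetical names chosen for illustration, and the sketch does not attempt to reproduce the specific defect that lowered the hit rate.

```python
from typing import Dict, FrozenSet, Optional, Set


class CircularAudienceError(Exception):
    """Raised when an audience is reached again while it is still being compiled."""


class AudienceCompiler:
    """Hypothetical compiler: resolves an audience into a flat set of member ids."""

    def __init__(self, definitions: Dict[str, dict]) -> None:
        # Assumed format: name -> {"members": [...], "includes": [...]}
        self.definitions = definitions
        self.cache: Dict[str, FrozenSet[str]] = {}
        self.hits = 0
        self.misses = 0

    def compile(self, name: str, _in_progress: Optional[Set[str]] = None) -> FrozenSet[str]:
        in_progress = set() if _in_progress is None else _in_progress
        # Seeing a name that is still being compiled means the audience
        # includes itself, directly or through other audiences.
        if name in in_progress:
            raise CircularAudienceError(f"circular reference involving {name!r}")
        # Cache hit: reuse the stored result instead of recompiling.
        if name in self.cache:
            self.hits += 1
            return self.cache[name]
        # Cache miss: compile this audience and everything it includes.
        self.misses += 1
        in_progress.add(name)
        definition = self.definitions[name]
        members = set(definition.get("members", []))
        for included in definition.get("includes", []):
            members |= self.compile(included, in_progress)
        in_progress.discard(name)
        result = frozenset(members)
        self.cache[name] = result
        return result

    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


if __name__ == "__main__":
    compiler = AudienceCompiler({
        "engineering": {"members": ["ana", "bo"], "includes": []},
        "all_staff": {"members": ["cy"], "includes": ["engineering"]},
        "loop_a": {"members": [], "includes": ["loop_b"]},
        "loop_b": {"members": [], "includes": ["loop_a"]},
    })
    print(sorted(compiler.compile("all_staff")))
    print(sorted(compiler.compile("all_staff")))  # second call is served from the cache
    try:
        compiler.compile("loop_a")
    except CircularAudienceError as exc:
        print("rejected:", exc)
```

In a design like this, every lookup that misses the cache pays the full cost of walking the included audiences again, which is why a falling hit rate shows up as slow or failing pages wherever audiences are resolved.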

Mitigation: 

A platform incident was declared at 11:20 am PT on Wednesday, July 23rd, and the incident team determined the root cause at 11:56 am PT. The issue was mitigated by rolling back the change, after which the cache hit rate returned to >98% from around 30%. The rollback completed by 3:30 pm PT, resolving the page performance issues seen on the Audience and Direct Delivery pages.

Recurrence Prevention: 

The following actions have been taken or identified as follow-ups as part of the formal RCA (Root Cause Assessment) process:

● Conduct more rigorous load testing and analysis to detect performance issues or latency spikes before a change goes live.

● Additional real-time monitoring has been implemented to alert the Engineering team to the overall percentage change in the audience query compilation cache hit rate.

● Enhance our software change management policies to use feature flags when rolling out changes in performance-sensitive areas of the platform, reducing rollback time.
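As an illustration of the cache hit rate monitoring described above, the following sketch compares the hit rate in a current window against a baseline window and produces an alert message when the relative drop exceeds a threshold. It is a sketch under stated assumptions, not the monitoring that was deployed: the `HitRateSample` type, the `hit_rate_alert` function, and the 10% default threshold are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HitRateSample:
    """Hypothetical hit/miss counters collected over one monitoring window."""
    hits: int
    misses: int

    @property
    def rate(self) -> Optional[float]:
        total = self.hits + self.misses
        # No traffic in the window means there is no meaningful rate.
        return self.hits / total if total else None


def hit_rate_alert(
    baseline: HitRateSample,
    current: HitRateSample,
    max_relative_drop: float = 0.10,  # assumed threshold, not a documented value
) -> Optional[str]:
    """Return an alert message if the hit rate fell too far, else None."""
    base_rate, cur_rate = baseline.rate, current.rate
    if base_rate is None or cur_rate is None or base_rate == 0:
        return None
    # Relative change: negative values mean the hit rate dropped.
    change = (cur_rate - base_rate) / base_rate
    if change <= -max_relative_drop:
        return (
            f"cache hit rate changed by {change:.0%} "
            f"({base_rate:.0%} -> {cur_rate:.0%})"
        )
    return None


if __name__ == "__main__":
    # A drop on the scale reported in this incident (about 98% to about 30%).
    print(hit_rate_alert(HitRateSample(98, 2), HitRateSample(30, 70)))
    # A small fluctuation stays below the threshold and produces no alert.
    print(hit_rate_alert(HitRateSample(98, 2), HitRateSample(97, 3)))
```

Alerting on the relative change rather than on a fixed floor is one possible design choice; it lets the same rule apply to caches whose normal hit rates differ.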

Posted Aug 18, 2025 - 08:55 UTC

Resolved

This incident is now resolved, and all associated services have remained available and fully operational.
Posted Jul 28, 2025 - 15:55 UTC

Monitoring

This issue has been mitigated, and the impacted services should now be fully operational.

We will be placing these services under monitoring for now.
Posted Jul 23, 2025 - 22:33 UTC

Update

We continue to work on mitigating this issue.

Another update within 1 hour.
Posted Jul 23, 2025 - 20:49 UTC

Identified

We have identified a potential root cause for this issue, and are working to mitigate it.

Another update within 1 hour.
Posted Jul 23, 2025 - 19:30 UTC

Investigating

We are investigating reports that the Direct Delivery and Audiences pages are stuck in a loading loop.

We will provide an update within 1 hour.
Posted Jul 23, 2025 - 18:32 UTC
This incident affected: Products (Creator Studio).