Summary:
On July 23rd, 2025 we began receiving Customer reports of Audience and Direct Delivery pages failing to load. Audience pages were either returning HTTP 500 errors hanging during the page loading. This behavior was attributed to a code change which was released on July 22nd to improve detection of circular audience references. Circular audience references can occur when one audience refers to itself by being included in another existing audience. This code change had an unintended impact on page loading performance and audience resolution and was subsequently rolled back.
Impact:
Performance degradation was observed across multiple areas of the Creator Studio, resulting in blocking some users from publishing Direct Delivery campaigns. The following symptoms were observed while the problem was present:
Root Cause:
Root cause was determined to be due to a code change which was released on July 22nd that improved detection of circular audience references. Circular audience references when recalculated may result in errors or the audience itself being put into an erroneous state which will appear as archived and not able to be used in new or existing campaigns. The change made to prevent this condition from occurring had an unintended impact which led to drop in caching hit rates for the audience component, causing performance degradation in any Studio features relying on audience resolution. A cache hit allows the audience compiler to reuse a stored result instead of recompiling the audience being referenced from scratch.
Mitigation:
A platform incident was declared at 11:20 am PT on Wednesday, July 23rd and the incident team determined the root cause at 11:56 pm PT. Issue was mitigated by rolling back that change and the cache hit rate returned to >98% from around 30%. Rollback completed by 3:30 pm PT mitigating the negative page performance issues seen on the Audience and Direct Delivery pages.
Recurrence Prevention:
The following actions have been taken or have been identified as follow-up actions to commit to as a part of the formal RCA (Root Cause Assessment) process:
● Load Testing and Analysis - More rigorous load testing and analysis to detect performance issues or latency spikes before a change goes live.
● Additional real-time monitoring has been implemented that will alert the Engineering team of overall % change in audience query compilation cache hit rate.
● Enhance our software change management policies to use feature flags for rollout of changes in performance sensitive areas of the platform to reduce rollback time.