Platform Services Degradation: Unrealistic email campaign delivery/open metrics

Incident Report for Firstup

Postmortem

Summary: 

On July 8th, 2025, we began receiving Customer reports of email campaign metrics showing unusual numbers, including engagement percentages greater than 100%, deliveries well below expected levels, and unique click counts higher than open rates.  This behavior was attributed to a standardized Amazon Web Services (AWS) inbound application firewall rule added as part of an ongoing initiative to improve the security posture of the Firstup platform.  Most of the missing data was recovered and backfilled into the respective Customer reports, with a standard deviation of around 1% in remaining missing events from July 2nd through July 7th.

Impact: 

The scope of the issue was limited to email campaign delivery metrics and any reports or data exports dependent on email engagement data.  While the platform incident remained active, and before data backfill efforts concluded, email metrics including delivery count, engagement percentage, number of unique opens, and number of unique clicks remained inconsistent.

Engagement Percentage in particular showed elevated numbers because delivery events may not have been fully recorded, while subsequent open or click events for the same message ID were correctly recorded.  Because the Insights calculations operate on the data available to them at the time, this allowed percentages in excess of 100%.
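The effect described above can be illustrated with a minimal sketch. The formula, function name, and numbers here are assumptions for illustration only; the actual Insights calculation is not described in this report.

```python
from typing import Optional


def engagement_percentage(recorded_deliveries: int, recorded_unique_opens: int) -> Optional[float]:
    """Hypothetical formula: unique opens as a percentage of recorded deliveries."""
    if recorded_deliveries == 0:
        # No delivery events were recorded, so the rate is undefined.
        return None
    return 100.0 * recorded_unique_opens / recorded_deliveries


# Illustrative numbers only (not taken from the incident data).
actual_deliveries = 1000
unique_opens = 400

# With every delivery event recorded, the rate is 40%.
true_rate = engagement_percentage(actual_deliveries, unique_opens)

# If only 300 of the 1000 delivery events were recorded while all opens
# arrived, the same formula yields a rate above 100%.
observed_rate = engagement_percentage(300, unique_opens)

print(true_rate, observed_rate)
```

The denominator shrinks when delivery events are dropped, while the numerator is unaffected, so the ratio can exceed 100% even though the calculation itself is behaving as designed.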

Root Cause:  

The root cause was the addition of a standardized AWS-recommended default application firewall rule intended to drop inbound Application Programming Interface (API) traffic with unusually large payload sizes.  This new rule, SizeRestrictions_BODY, is outlined in the following document as a part of Amazon’s Baseline rule groups: https://docs.aws.amazon.com/waf/latest/developerguide/aws-managed-rule-groups-baseline.html

This additional rule negatively impacted webhook data containing user email engagement events sent from Firstup’s third-party email provider.  Those requests frequently batched data together for more efficient delivery, which caused the payload size to exceed what was allowed by the new rule (8KB).  The traffic was consequently dropped because it was assumed to be malicious in nature.

The timing of the change, just prior to the US July 4th holiday when total email volume was much lower, delayed discovery of the underlying problem until after a full day of normal campaign volumes resumed the following Monday, July 7th.
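As a rough sketch of why batching tripped the size rule, the following compares the serialized size of a single event against a batch. The event fields, the batch size, and the helper names are hypothetical; only the 8KB limit comes from this report, and the provider's real webhook format is not shown here.

```python
import json

# 8KB limit as stated in this report.
BODY_LIMIT_BYTES = 8 * 1024


def make_event(i: int) -> dict:
    """Build one hypothetical engagement event (fields are illustrative)."""
    return {
        "event": "delivered",
        "message_id": f"msg-{i:06d}",
        "recipient": f"user{i}@example.com",
        "timestamp": 1751500000 + i,
    }


def body_size(events: list) -> int:
    """Size in bytes of the JSON-encoded request body."""
    return len(json.dumps(events).encode("utf-8"))


def would_be_blocked(events: list) -> bool:
    """True if the body exceeds the size limit and would be dropped in block mode."""
    return body_size(events) > BODY_LIMIT_BYTES


single = [make_event(0)]
batch = [make_event(i) for i in range(200)]

print(body_size(single), would_be_blocked(single))
print(body_size(batch), would_be_blocked(batch))
```

A single small event stays well under the limit, while a batch of a few hundred similar events does not, which is consistent with legitimate batched webhooks being treated as oversized traffic.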

Mitigation: 

A platform incident was declared at 10:40am PT on Tuesday, July 8th and the incident team determined the root cause several hours later at 1:15pm PT.  The offending firewall rule was disabled 15 minutes later, which prevented the condition from persisting any further.

All webhook data for events that occurred on July 8th was reprocessed automatically within 24 hours, narrowing the remaining impact window to only July 2nd through July 7th.  Because the data was dropped at the firewall, Firstup reached out to the third-party email provider to see if there was a way to export the missing data presumed to be still stored on their systems.  This initial data extraction took several days to complete, and a program to ingest it safely and update the historic reporting data was developed concurrently.

Over the following days and weeks, there were several iterations of backfill as missing events were found that required further extractions and updates, making the backfill process more complicated and prolonged than it would typically be.

On Monday, August 4th, Firstup’s third-party email provider informed us that any remaining missing events, primarily outstanding delivery events, would not be recoverable due to a software bug on their own systems.  As a result, all available data has now been backfilled, and it is estimated that the standard deviation of irrecoverable events is 1% across the full duration of the platform incident.

Recurrence Prevention: 

The following actions have been taken or identified as follow-up commitments as a part of the formal RCA (Root Cause Assessment) process:

  • Update change management Standard Operating Procedures so that new rules are added in warning/count mode only, instead of block mode, unless explicitly countering a known and active threat.
  • Add real-time monitoring that will alert the Cloud Operations team to the overall percentage change in detections/blocks in the WAF from baseline levels.
  • Continue service segmentation at the API gateway so that impact to specific business processes or production data is easier and faster to isolate.
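
The first two items above could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not Firstup's implementation: the function names and the 50% threshold are hypothetical, and the override dictionary is intended to mirror the Name/ActionToUse shape of an AWS WAFv2 rule action override for a managed rule group.

```python
def count_mode_override(rule_name: str) -> dict:
    """Build an override entry that sets one managed rule to count instead of block.

    Assumption: this entry would sit in the RuleActionOverrides list of a
    managed rule group statement; the surrounding web ACL is not shown.
    """
    return {"Name": rule_name, "ActionToUse": {"Count": {}}}


def percent_change(baseline: float, current: float) -> float:
    """Percentage change of the current block/detection count versus the baseline."""
    if baseline == 0:
        # Any activity against a zero baseline is treated as an unbounded change.
        return float("inf") if current > 0 else 0.0
    return 100.0 * (current - baseline) / baseline


def should_alert(baseline: float, current: float, threshold_pct: float = 50.0) -> bool:
    """True if the change from baseline meets the (hypothetical) alert threshold."""
    return abs(percent_change(baseline, current)) >= threshold_pct


override = count_mode_override("SizeRestrictions_BODY")

# Illustrative counts only: 200 blocks per hour at baseline, 500 after a rule change.
print(override)
print(percent_change(200, 500), should_alert(200, 500))
print(percent_change(200, 210), should_alert(200, 210))
```

Running a new rule in count mode records matches without dropping traffic, so a jump in the match count against the baseline can be reviewed before any legitimate requests are blocked.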
Posted Aug 06, 2025 - 18:12 UTC

Resolved

This incident has now been fully resolved and all products and components affected remain fully operational.
Posted Aug 06, 2025 - 18:10 UTC

Monitoring

This issue has now been mitigated. Email campaigns published after 11:15 AM PT today should now show accurate email delivery and open rates (when available).

We will be placing the impacted services under monitoring for now.

We will also be exploring whether it is possible to correct the impacted reports from July 3rd, 2025, through July 8th, 2025, at 11:15 AM PT.
Posted Jul 08, 2025 - 18:42 UTC

Identified

We have identified a potential root cause of this issue and are working to mitigate the problem. Another update within 1 hour.
Posted Jul 08, 2025 - 18:18 UTC

Update

We continue to investigate this issue. Another update within 1 hour.
Posted Jul 08, 2025 - 17:45 UTC

Update

We continue to investigate this issue and will provide another update within 1 hour. The current working hypothesis includes underreported email deliveries, resulting in reports that may be showing over 100% email open rates, as well as 0% delivery and open rates in some reports, even though the emails have been delivered successfully.
Posted Jul 08, 2025 - 16:45 UTC

Investigating

We have received several reports where Email Campaign metrics are inaccurate and showing over 100% open rates since July 3rd, 2025. We are currently investigating this issue and will provide an update within 1 hour.
Posted Jul 08, 2025 - 15:49 UTC
This incident affected: Products (Insights).