Summary:
On July 8th, 2025 we began receiving Customer reports of email campaign metrics showing unusual numbers including engagement %ages greater than 100%, deliveries well below expected levels, and unique click counts higher than open rates. This behavior was attributed to a standardized Amazon Web Services (AWS) inbound application firewall rule added as a part of a security initiative to continually improve the overall security posture of the Firstup platform on an ongoing basis. Most missing data was able to be recovered and backfilled into respective Customer reports with an average of around 1% standard deviation of remaining missing events from July 2nd through July 7th.
Impact:
Scope of the issue was limited to email campaign delivery metrics and any reports or data exports dependent on email engagement data. While the platform incident remained active, and before data backfill efforts concluded, email metrics including delivery count, engagement percentage, number of unique opens, and number of unique clicks, remained inconsistent.Engagement Percentage in particular showed elevated numbers because delivery events may not have been fully recorded, while subsequent opens or click events for the same message ID were correctly recorded, allowing for %ages in excess of 100%, because the Insights calculations operate on the data available to them at the time.
Root Cause:
Root cause was determined to be due to the addition of a standardized AWS-recommended default application firewall rule intended to drop inbound Application Programming Interface (API) traffic with unusually large payload sizes. This new rule, SizeRestrictions_BODY is outlined in the following document as a part of Amazon’s Baseline rule groups: https://docs.aws.amazon.com/waf/latest/developerguide/aws-managed-rule-groups-baseline.htmlThis additional rule negatively impacted webhook data containing user email engagement events sent from Firstup’s 3rd party email provider as those requests frequently batched data together for more efficient delivery–but caused the payload size to exceed what was allowed by the new rule (8KB), consequently dropping the traffic as it was assumed to be malicious in nature.The timing of the change prior to the US July 4th holiday, where total email volume was much lower, contributed to an overall delayed discovery of the underlying problem until after a full day of normal campaign volumes resumed the following Monday, July 7th.
Mitigation:
A platform incident was declared at 10:40am PT on Tuesday, July 8th and the incident team determined the root cause several hours later at 1:15pm PT. The offending firewall rule was disabled 15 minutes later, effectively mitigating the condition from persisting any further.All webhook data for events that occurred on July 8th were reprocessed automatically within 24 hours, effectively narrowing the remaining impact window to only July 2nd through July 7th. Because the data was dropped at the firewall, Firstup reached out to the third party email provider to see if there was a way to export the missing data presumed to be still stored on their systems. This initial data extraction took several days to complete and a program to ingest it safely and update the historic reporting data was developed concurrently.Over the course of the next several days and weeks, there were several iterations of backfill as missing events were found that required further extractions and updates making the backfill process more complicated and prolonged than it would typically be.On Monday, August 4th, Firstup’s third party email provider informed us that any remaining missing events–primarily outstanding delivery events, would not be recoverable due to a software bug on their own systems. As a result, all available backfill data is complete, and it is estimated that the standard deviation of irrecoverable events is 1% across the full duration of the platform incident.
Recurrence Prevention:
The following actions have been taken or have been identified as follow-up actions to commit to as a part of the formal RCA (Root Cause Assessment) process: