Summary:
On Monday, November 4th, 2024, starting at around 7:53 AM PT, we received reports that some published campaign emails were not being delivered to their intended audiences. While some emails were delivered as expected, others were delayed or appeared to be stuck, even though all emails were handed off from the Firstup platform to the third-party email provider in a timely manner. The problem worsened over the next several hours: email throughput appeared to be highly restricted, and the backlog in the email delivery queue not only continued to grow but also did not drain in the chronological order in which messages were queued. On a joint troubleshooting call with the third-party email provider, it was determined that a large volume of email delivery errors, starting at around 7:11 AM PT, had put our entire pool of email delivery IP addresses into a state of reduced performance. After the highest-volume errors were reviewed jointly with the third-party email provider, the sender IPs were restored to a fully functioning state, and the entire email backlog was drained by 3:30 PM PT.
Severity:
Sev2
Scope:
The scope of this service degradation was restricted to customers who use Firstup campaign email delivery as a channel, as well as any other non-campaign email content sent from the Firstup platform, such as password reset request emails. Push notifications, assistant notifications, and web or mobile experience channels were unaffected and remained fully functional.
Impact:
Within an individual campaign sent to email as a channel, some emails may not have been delivered as expected, while others were stuck in a "processing" state on the third-party email delivery platform. During the incident (7 hrs 58 mins), some of the emails in the "processing" state were successfully delivered but heavily delayed, while others remained stuck in that state. Observed email throughput was reduced to approximately 46k messages per hour, against a theoretical maximum of 30k messages per second. The total outstanding backlog prior to mitigation was over a million email messages.
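To put the throughput figures above on a common scale, the following is a rough back-of-the-envelope sketch. It assumes a backlog of exactly 1,000,000 messages (the report only says "over a million", so this is a lower bound) and treats both rates as constant, which is a simplification.

```python
# Figures taken from the Impact section; BACKLOG is an assumed lower bound.
MAX_PER_SECOND = 30_000      # theoretical maximum throughput
OBSERVED_PER_HOUR = 46_000   # approximate throughput observed during the incident
BACKLOG = 1_000_000          # assumption: "over a million" rounded down

# Convert the theoretical maximum to the same unit as the observed rate.
max_per_hour = MAX_PER_SECOND * 3600

# Share of theoretical capacity that was actually available.
fraction_of_max = OBSERVED_PER_HOUR / max_per_hour

# Time needed to clear the backlog at each rate, ignoring new arrivals.
hours_to_drain_degraded = BACKLOG / OBSERVED_PER_HOUR
seconds_to_drain_at_max = BACKLOG / MAX_PER_SECOND

print(f"Theoretical max: {max_per_hour:,} messages/hour")
print(f"Observed rate as share of max: {fraction_of_max:.4%}")
print(f"Drain time at degraded rate: {hours_to_drain_degraded:.1f} hours")
print(f"Drain time at theoretical max: {seconds_to_drain_at_max:.0f} seconds")
```

Under these assumptions, the degraded rate is a small fraction of one percent of the theoretical maximum, and the backlog alone would have taken most of a day to clear at that rate even with no new mail arriving.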
Root Cause:
The root cause has been attributed to an elevated level of email delivery errors that triggered a protection mechanism on the third-party email provider's platform. This reduced throughput for our entire pool of sender IP addresses to the point where mostly retries of deferrals from earlier delivery errors were being processed, and very few queued emails were delivered. In effect, the queue was running in place.
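The "running in place" behavior can be illustrated with a toy model. This is only a sketch and not a description of the third-party provider's actual mechanism, which this report does not detail. It assumes a fixed per-tick capacity, that retries of deferred messages are attempted before newly queued mail, and that a fixed fraction of retries fail again. The function name `simulate` and all of its parameters are hypothetical.

```python
def simulate(capacity, retry_pool, retry_fail_fraction,
             fresh_backlog, fresh_arrivals, ticks):
    """Toy model: retries are served first, fresh mail gets leftover capacity.

    Returns (retry_pool, fresh_backlog) after the given number of ticks.
    """
    for _ in range(ticks):
        # Retries of earlier deferrals are attempted first (assumption).
        retries_attempted = min(retry_pool, capacity)
        # A fixed fraction of retries fail again and return to the retry pool.
        retries_failed = int(retries_attempted * retry_fail_fraction)
        retry_pool = retry_pool - retries_attempted + retries_failed

        # Newly queued mail only gets whatever capacity is left over.
        leftover = capacity - retries_attempted
        fresh_sent = min(fresh_backlog, leftover)
        fresh_backlog = fresh_backlog - fresh_sent + fresh_arrivals
    return retry_pool, fresh_backlog


# Retries that always fail consume all capacity; the fresh backlog only grows.
stuck = simulate(capacity=100, retry_pool=10_000, retry_fail_fraction=1.0,
                 fresh_backlog=0, fresh_arrivals=50, ticks=10)

# With no failing retries, the same capacity keeps up with new arrivals.
healthy = simulate(capacity=100, retry_pool=0, retry_fail_fraction=1.0,
                   fresh_backlog=0, fresh_arrivals=50, ticks=10)

print("stuck:", stuck)
print("healthy:", healthy)
```

In the first scenario the retry pool never shrinks and no fresh mail is sent, which mirrors the pattern described above: capacity is spent on deferral retries while the queue of new messages keeps growing.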
Mitigation:
After analyzing the top contributors to email delivery errors, which appeared to be correlated with a single misconfigured email security endpoint, all addresses associated with that endpoint were force-unsubscribed until it could be correctly configured, to prevent further delivery errors from adding to the underlying logjam. 80k email errors were attributed to that endpoint in just a couple of hours.
Through a joint incident bridge with the third-party email provider, Firstup demonstrated that the deferral rates did not account for the overall email backlog. A data pipeline engineer was paged and verified that the sender IPs had been relegated to a lower-performing state, which was itself contributing to a circular problem. Backend system changes were made on the third-party platform at 3:09 PM PT to restore the sender IPs to their prior state, and the entire email backlog subsequently drained in less than 25 minutes.
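As a sanity check on the timeline, the timestamps in this report can be compared directly. This sketch makes two assumptions: that the stated incident duration is measured from the first delivery errors (7:11 AM PT) to the backend change (3:09 PM PT), which is one reading that matches the 7 hrs 58 mins figure, and that the backlog was exactly 1,000,000 messages (a lower bound for "over a million").

```python
from datetime import datetime

# Timestamps from the report, all in PT on November 4th, 2024.
errors_started = datetime(2024, 11, 4, 7, 11)   # delivery errors begin
fix_applied = datetime(2024, 11, 4, 15, 9)      # backend change on provider side
backlog_drained = datetime(2024, 11, 4, 15, 30) # backlog fully drained

BACKLOG = 1_000_000  # assumption: lower bound for "over a million"

# Assumed definition of incident duration: first errors to fix applied.
incident_duration = fix_applied - errors_started
drain_duration = backlog_drained - fix_applied

# Average throughput needed to clear the assumed backlog in the drain window.
implied_per_second = BACKLOG / drain_duration.total_seconds()

print("Incident duration:", incident_duration)
print("Drain duration:", drain_duration)
print(f"Implied average drain rate: {implied_per_second:.0f} messages/second")
```

Under these assumptions, the drain window is 21 minutes, consistent with "less than 25 minutes", and the implied average rate after the fix is several hundred messages per second, compared with roughly 13 messages per second (46k per hour) during the incident.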
Recurrence Prevention:
The following actions have been taken or have been identified as follow-up actions to commit to as a part of the formal RCA (Root Cause Assessment) process: