Platform Service Degradation - Some Platform Emails Not Being Delivered
Incident Report for Firstup
Postmortem

Summary:

On Monday, November 4th, 2024, starting at around 7:53 AM PT, we received reports that some published campaign emails were not being delivered to their intended audiences. While some emails were delivered as expected, others were delayed or appeared to be stuck. All emails were handed off in a timely manner from the Firstup platform to the third-party email provider. The problem worsened over the next several hours: email throughput appeared to be highly restricted, and the email delivery queue backlog continued to grow and did not drain in the chronological order in which messages were originally queued. Through a joint troubleshooting call with the third-party email provider, it was determined that a large volume of email delivery errors, starting at around 7:11 AM PT, had put our entire pool of email delivery IP addresses into a state of reduced performance. After we jointly reviewed the highest-volume errors with the third-party email provider, the sender IPs were restored to a fully functioning state, and the entire email backlog had drained by 3:30 PM PT.

Severity:

Sev2

Scope:

The scope of this service degradation was restricted to customers who use Firstup campaign email delivery as a channel, as well as any other non-campaign email content sent from the Firstup platform, such as password reset request emails. Push notifications, assistant notifications, and web or mobile experience channels were unaffected and remained fully functional.

Impact:

Within an individual campaign sent to email as a channel, some emails may not have been delivered as expected, while others became stuck in a "processing" state on the third-party email delivery platform. During the incident (7 hrs 58 mins), some of the emails in the "processing" state were successfully delivered but heavily delayed, while others remained stuck in that state. Observed email throughput was reduced to approximately 46k messages per hour, against a theoretical maximum of 30k messages per second. The total outstanding backlog prior to mitigation was over a million email messages.
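To put the throughput figures above in proportion, the short calculation below converts both rates to the same unit and estimates how long the backlog would take to drain at each rate. It is a back-of-the-envelope sketch only: it treats the backlog as exactly one million messages (the report says "over a million"), and it assumes no new mail arrives and no capacity is spent on retries while draining.

```python
# Figures quoted in the Impact section; the backlog is a lower bound.
OBSERVED_PER_HOUR = 46_000        # observed throughput during the incident
THEORETICAL_PER_SECOND = 30_000   # theoretical maximum throughput
BACKLOG = 1_000_000               # assumed: exactly one million messages

# Convert the theoretical maximum to messages per hour for comparison.
theoretical_per_hour = THEORETICAL_PER_SECOND * 3600

# Share of the theoretical maximum that was actually observed.
fraction_of_max = OBSERVED_PER_HOUR / theoretical_per_hour

# Drain-time estimates (assumes no new arrivals and no retries).
hours_to_drain_degraded = BACKLOG / OBSERVED_PER_HOUR
seconds_to_drain_at_max = BACKLOG / THEORETICAL_PER_SECOND

print(f"Theoretical max:     {theoretical_per_hour:,} messages/hour")
print(f"Observed share:      {fraction_of_max:.3%} of theoretical max")
print(f"Drain at 46k/hour:   {hours_to_drain_degraded:.1f} hours")
print(f"Drain at 30k/second: {seconds_to_drain_at_max:.0f} seconds")
```

Under these assumptions the observed rate is roughly 0.04% of the theoretical maximum, and a million-message backlog would need close to 22 hours to drain at the degraded rate.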

Root Cause:

Root cause has been attributed to an elevated level of email delivery errors that triggered a protection mechanism on the third-party email provider platform. This reduced throughput for our entire pool of sender IP addresses to the point where mostly retries of deferrals from earlier delivery errors were being processed, and very few queued emails were delivered: essentially the queue-processing equivalent of running in place.
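The "running in place" effect can be illustrated with a toy queue model. This is not the third-party provider's actual algorithm, which is not described in this report; it is a minimal sketch that assumes retries are attempted before fresh mail, that a failing message is deferred again on every attempt, and that capacity is a fixed number of send slots per tick. The function name and all parameters are hypothetical.

```python
from collections import deque


def simulate(ticks, capacity, arrivals, bad_every):
    """Toy model of a send queue in which retries are served before fresh mail.

    ticks     -- number of time steps to simulate
    capacity  -- send slots available per tick
    arrivals  -- new messages queued per tick
    bad_every -- every Nth new message targets a failing endpoint
    Returns (delivered, fresh_backlog, retrying).
    """
    fresh = deque()   # FIFO of new messages; True marks one that will defer
    retries = 0       # deferred messages, each retried once per tick
    delivered = 0
    seq = 0
    for _ in range(ticks):
        for _ in range(arrivals):
            seq += 1
            fresh.append(seq % bad_every == 0)
        slots = capacity
        # Retries go first and defer again, so they only consume slots.
        slots -= min(retries, slots)
        # Whatever capacity remains is spent on the fresh queue.
        while slots > 0 and fresh:
            slots -= 1
            if fresh.popleft():
                retries += 1
            else:
                delivered += 1
    return delivered, len(fresh), retries


healthy = simulate(ticks=20, capacity=100, arrivals=10, bad_every=5)
throttled = simulate(ticks=20, capacity=12, arrivals=10, bad_every=5)
print("healthy  (delivered, backlog, retrying):", healthy)
print("throttled (delivered, backlog, retrying):", throttled)
```

In the healthy run, capacity comfortably exceeds the retry load, so every good message is delivered and no backlog forms. In the throttled run, the growing retry set eventually consumes every slot, deliveries stop, and the fresh queue grows without bound, even though the arrival rate never changed.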

Mitigation:

Analysis of the top contributors to email delivery errors showed that they appeared to correlate with a single misconfigured email security endpoint. All addresses associated with that endpoint were force-unsubscribed until it could be correctly configured, to prevent further delivery errors from adding to the underlying logjam. That endpoint accounted for 80k email errors in just a couple of hours.
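The kind of analysis described above, finding the largest contributor to delivery errors and collecting the addresses behind it, can be sketched as follows. The event shape (recipient, status), the status strings, and the choice to group by recipient domain are all assumptions made for illustration; the report does not say how errors were actually correlated to the security endpoint, and the sample data is invented.

```python
from collections import Counter


def domain_of(recipient):
    """Return the lowercased domain part of an email address."""
    return recipient.rsplit("@", 1)[-1].lower()


def top_error_domains(events, min_share=0.5):
    """Return domains responsible for at least min_share of all errors.

    events is an iterable of (recipient, status) pairs; any status other
    than "delivered" is counted as an error (an assumption of this sketch).
    """
    errors = Counter(
        domain_of(recipient)
        for recipient, status in events
        if status != "delivered"
    )
    total = sum(errors.values())
    if total == 0:
        return []
    return [d for d, n in errors.most_common() if n / total >= min_share]


def addresses_to_quarantine(events, domains):
    """Collect the distinct recipients that belong to the flagged domains."""
    flagged = set(domains)
    return sorted({r for r, _ in events if domain_of(r) in flagged})


# Invented sample data using reserved example domains.
sample_events = [
    ("ana@corp.example", "deferred"),
    ("bo@corp.example", "deferred"),
    ("cy@corp.example", "bounced"),
    ("dee@other.example", "delivered"),
    ("eli@other.example", "deferred"),
]

noisy = top_error_domains(sample_events)
print("Domains to review:", noisy)
print("Addresses to quarantine:", addresses_to_quarantine(sample_events, noisy))
```

A share-of-errors threshold is used here rather than an absolute count so that the same check works at different traffic volumes; in practice the threshold and the grouping key would need to be tuned against real delivery data.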

Through a joint incident bridge with the third-party email provider, Firstup demonstrated that the deferral rates alone did not account for the overall email backlog. A data pipeline engineer was paged and verified that the sender IPs had been relegated to a lower-performing state that was itself contributing to a circular problem. Backend system changes were made on the third-party platform at 3:09 PM PT to restore the prior state of the sender IPs, and the entire email backlog subsequently drained in less than 25 minutes.

Recurrence Prevention:

The following actions have been taken, or have been identified as follow-up actions, as part of the formal RCA (Root Cause Analysis) process:

  • Email addresses contributing to elevated error rates will be bulk unsubscribed from the platform (or otherwise quarantined) until underlying conditions can be corrected.
  • Coordinate with the third-party provider to better understand the characteristics of the platform safety mechanism, including why it triggered, how to avoid it entirely, how to improve joint monitoring, and how to prevent elevated error rates from affecting overall delivery.
  • Implement any reasonable recommendations from the third-party RFO.
Posted Nov 13, 2024 - 01:26 UTC

Resolved
Email delivery has remained available and fully functional throughout the monitoring phase of this incident.

This incident is now resolved. Once available, a Root Cause Analysis for this incident will be published here.
Posted Nov 12, 2024 - 17:36 UTC
Monitoring
This issue has now been mitigated, and any emails that had not been delivered should now be hitting their user endpoints.

We will place the impacted services under monitoring.
Posted Nov 04, 2024 - 23:24 UTC
Identified
As we continue to work with our upstream third-party email delivery vendor towards a solution to this issue, we have identified that platform emails such as password reset emails are in the scope of this incident.

Another update within 1 hour.
Posted Nov 04, 2024 - 21:23 UTC
Update
We continue to work with our upstream third-party email delivery vendor towards a solution to this issue.

Another update within 1 hour.
Posted Nov 04, 2024 - 20:11 UTC
Update
We continue to work with our upstream third-party email delivery vendor towards a solution to this issue.

Another update within 1 hour.
Posted Nov 04, 2024 - 19:11 UTC
Update
Our investigations have not revealed any issues on our platform. However, we see a potential issue with our upstream third-party email delivery vendor, and have reached out to them for additional troubleshooting on their end.

We will provide you with another update within 1 hour.
Posted Nov 04, 2024 - 18:11 UTC
Update
As we continue to investigate these reports, our current observation is that email campaigns are being delivered, albeit with some delays.

Another update within 1 hour.
Posted Nov 04, 2024 - 17:11 UTC
Investigating
We are investigating reports where some email campaigns are not being delivered as expected.

We will provide you with an update within 1 hour.
Posted Nov 04, 2024 - 16:40 UTC
This incident affected: Products (Creator Studio, Classic Studio) and 3rd-Party Dependencies (SendGrid API v3).