Summary:
On Tuesday, May 28, 2024, between approximately 8:32 AM PT and 9:32 AM PT, we received multiple reports of email campaigns not being sent. Several campaigns scheduled for delivery were either delayed or not delivered at all. The issue was confirmed in both customer environments and internal test environments, indicating a widespread problem affecting email campaign functionality. It was traced to a code error in a governor hotfix deployed earlier that day.
Impact:
Affected scheduled campaigns were temporarily delayed by one hour. Some customers reported a 10-minute delay before their campaign started sending. Others reported that time-sensitive campaigns were delivered on time to certain recipients but were delayed, or not delivered at all, to others.
Root Cause:
The latest deployment of a hotfix for an audience restrictions feature caused an elevated error rate because an expected default value was not set.
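As a rough illustration of this class of failure, the sketch below contrasts a lookup that assumes a value is always present with one that falls back to an explicit default. It is not the actual service code: the function names, the `audience_restriction_limit` key, and the meaning of the default are all hypothetical.

```python
# Hypothetical illustration only: names, keys, and the default's meaning
# are assumptions, not the code involved in this incident.

DEFAULT_LIMIT = 0  # assumed convention: 0 means "no restriction applied"


def audience_limit_unsafe(settings: dict) -> int:
    # Assumes the key is always present; raises KeyError when it was never set.
    return settings["audience_restriction_limit"]


def audience_limit_safe(settings: dict) -> int:
    # Falls back to an explicit default when the key is missing.
    return settings.get("audience_restriction_limit", DEFAULT_LIMIT)


if __name__ == "__main__":
    campaign_settings = {}  # a campaign created without the new setting
    print(audience_limit_safe(campaign_settings))
    try:
        audience_limit_unsafe(campaign_settings)
    except KeyError as exc:
        print(f"missing setting: {exc}")
```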
Mitigation:
Once the root cause was identified, the software change was rolled back and the intake DLQ (dead-letter queue) was manually re-driven. This prevented further degradation and ensured that the affected campaigns were sent despite the delay.
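For readers unfamiliar with re-driving a dead-letter queue, the idea is to move messages that failed processing back onto the intake queue once the fault has been fixed, so they are processed again. The sketch below uses in-memory queues purely to show the pattern. The `redrive` function and the message shape are assumptions and do not represent the tooling used during this incident.

```python
from collections import deque
from typing import Optional


def redrive(dlq: deque, intake: deque, max_messages: Optional[int] = None) -> int:
    """Move messages from the dead-letter queue back to the intake queue.

    Messages are moved oldest first so the original ordering is preserved.
    Returns the number of messages moved.
    """
    moved = 0
    while dlq and (max_messages is None or moved < max_messages):
        intake.append(dlq.popleft())
        moved += 1
    return moved


if __name__ == "__main__":
    # Hypothetical messages that failed while the faulty hotfix was live.
    dead_letters = deque([{"campaign_id": "c1"}, {"campaign_id": "c2"}])
    intake_queue = deque()
    count = redrive(dead_letters, intake_queue)
    print(f"re-drove {count} message(s)")
```

Re-driving only after the rollback matters: messages returned to intake while the fault is still present would fail again and land back in the DLQ.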
Recurrence Prevention:
The following changes are aimed at preventing campaign delays caused by internal deployment activities:
- Comprehensive Feature Flag Testing: Although we conducted regression tests, they focused primarily on scenarios with the feature flag ON and therefore missed issues that occur when the flag is OFF. Moving forward, we will ensure that each feature flag is rigorously tested in both states. We will also collaborate with other domains on testing to ensure compatibility and robustness across all domains.
- Optimal Deployment Scheduling: Moving forward, we will study platform usage to determine the optimal times for hotfix deployments so that customer impact is minimal. This schedule will be documented and approved to balance the urgency of fixes with operational stability.
- Enhanced Monitoring and Alerting: The initial alert was missed because it was bundled with other warnings. To prevent this, we will enhance our monitoring and alerting systems so that all critical alerts are promptly noticed. We will continue to refine our dashboards and tailored alerts to maintain proactive, continuous monitoring across development, UAT, and production environments.
- High-Risk Multi-Service Hotfixes: High-risk multi-service hotfixes scheduled for deployment outside of full release and regression cycles should be evaluated to determine whether they can be deferred to a scheduled release, so that they undergo thorough regression testing and do not cause widespread issues.
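
As a minimal sketch of the "both states" testing described under Comprehensive Feature Flag Testing, the example below runs the same scenario with a flag ON and OFF and checks each outcome. The `build_send_request` function, its `restrictions_enabled` flag, and the campaign fields are hypothetical stand-ins, not the real campaign code.

```python
# Hypothetical example: the function, flag name, and campaign fields are
# assumptions used only to show the testing pattern.


def build_send_request(campaign: dict, restrictions_enabled: bool) -> dict:
    recipients = list(campaign["recipients"])
    if restrictions_enabled:
        # Default to "no truncation" when the campaign has no limit set.
        limit = campaign.get("audience_limit", len(recipients))
        recipients = recipients[:limit]
    return {"campaign_id": campaign["id"], "recipients": recipients}


def run_in_both_flag_states(campaign: dict) -> dict:
    # Exercise the same scenario with the flag ON and with the flag OFF.
    return {
        flag: build_send_request(campaign, restrictions_enabled=flag)
        for flag in (True, False)
    }


if __name__ == "__main__":
    campaign = {"id": "c1", "recipients": ["a", "b", "c"], "audience_limit": 2}
    results = run_in_both_flag_states(campaign)
    assert len(results[True]["recipients"]) == 2   # restriction applied
    assert len(results[False]["recipients"]) == 3  # flag OFF: everyone included
    print("both flag states passed")
```

In a test framework, the same idea is usually expressed by parameterizing each regression test over the flag values so that neither state can be skipped by accident.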