Post-Mortem Report: Approvals Incident on November 13th
Summary
A release deployed on the evening of November 12th introduced an issue that caused certain approvals to get stuck when they were initiated shortly after the release, primarily on the morning of November 13th. The problem was not caught during testing and required a rollback and further remediation.
Root Cause
- Release Issue: The new code in the release contained an error that prevented some approvals from completing. This affected approvals initiated shortly after the release, primarily on the morning of November 13th.
- Delayed Resolution Due to Logging and Azure Issues: Limited production logging and minor Azure issues delayed both the rollback and investigation, resulting in a few hours of extended impact.
Timeline (Finnish Time)
- 11/12/2024, 10:00 PM – Release deployment began.
- 11/13/2024, 8:20 AM – Monitoring detected increased errors affecting approvals and data saving.
- 11/13/2024, 9:30 AM – Initial mitigation action implemented.
- 11/13/2024, 10:30 AM – Further investigation uncovered ongoing issues, including cache-related timeouts.
- 11/13/2024, 1:27 PM – Rollback completed, stopping further impact on approvals. Cleanup and monitoring updates followed.
- 11/14/2024, 10:30 PM – All pending approvals affected by the issue were successfully fixed.
Resolution
We reverted the approvals service to a stable version. Database adjustments were made to clear stuck approvals, and monitoring was improved. No data was lost.
Action Items
- Enhance Test Coverage: Increase automated testing to cover a broader range of approval scenarios.
- Improve Monitoring: Add more alerting to detect similar issues promptly.
- Cloud Provider Configuration: Improve cloud provider deployment configurations to mitigate similar issues in the future.
- Enhance Logging: Increase detail in production logging.
- Limit Release Scope: Restrict release scope and perform thorough post-release reviews.
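As one illustration of the Improve Monitoring item, the sketch below flags approvals that have stayed pending longer than a threshold, which is the kind of check that could feed an alert. It is a minimal sketch and not the actual approvals service code: the `Approval` record, its field names, the `find_stuck_approvals` helper, and the 30-minute default are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class Approval:
    # Hypothetical record shape; the real approvals schema is not described in this report.
    approval_id: str
    initiated_at: datetime
    completed_at: Optional[datetime] = None


def find_stuck_approvals(
    approvals: List[Approval],
    now: datetime,
    max_pending: timedelta = timedelta(minutes=30),  # assumed threshold
) -> List[Approval]:
    """Return approvals that are still incomplete after max_pending has elapsed."""
    return [
        a
        for a in approvals
        if a.completed_at is None and now - a.initiated_at > max_pending
    ]


# Example data (made up for illustration only).
now = datetime(2024, 11, 13, 9, 0)
approvals = [
    Approval("a-1", datetime(2024, 11, 13, 8, 0)),   # pending for 60 minutes
    Approval("a-2", datetime(2024, 11, 13, 8, 50)),  # pending for 10 minutes
    Approval("a-3", datetime(2024, 11, 13, 7, 0), datetime(2024, 11, 13, 7, 5)),  # completed
]
stuck = find_stuck_approvals(approvals, now)
stuck_ids = [a.approval_id for a in stuck]
```

An alert could fire whenever `stuck_ids` is non-empty. The right threshold depends on how long a normal approval takes, which this report does not state.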
This report summarizes the causes and actions to prevent similar issues in the future.