Approval not working for some customers
Incident Report for Sympa
Postmortem

Post-Mortem Report: Approvals Incident on November 13th

Summary

On November 13th, a release introduced an issue that caused certain approvals to get stuck when initiated immediately after the release. This problem went undetected during testing and required a rollback and further remediation.

Root Cause

  1. Release Issue: The new code in the release contained an error that prevented some approvals from completing. This affected approvals initiated shortly after the release, primarily on the morning of November 13th.
  2. Delayed Resolution Due to Logging and Azure Issues: Limited production logging and minor Azure issues delayed both the rollback and investigation, resulting in a few hours of extended impact.

Timeline (Finnish Time)

  • 11/12/2024, 10:00 PM – Release deployment began.
  • 11/13/2024, 8:20 AM – Monitoring detected increased errors affecting approvals and data saving.
  • 11/13/2024, 9:30 AM – Initial mitigation action implemented.
  • 11/13/2024, 10:30 AM – Further investigation uncovered ongoing issues, including cache-related timeouts.
  • 11/13/2024, 1:27 PM – Rollback completed, stopping further impact on approvals. Cleanup and monitoring updates followed.
  • 11/14/2024, 10:30 PM – All pending approvals affected by the issue were successfully fixed.

Resolution

We reverted the approvals service to a stable version. Database adjustments were made to clear stuck approvals, and monitoring was improved. No data was lost.

Action Items

  • Enhance Test Coverage: Increase automated testing to cover a broader range of approval scenarios.
  • Improve Monitoring: Add more alerting to detect similar issues promptly.
  • Cloud Provider Configuration: Improve cloud provider deployment configurations to mitigate similar issues in the future.
  • Enhance Logging: Increase detail in production logging.
  • Limit Release Scope: Restrict release scope and perform thorough post-release reviews.

This report summarizes the causes and actions to prevent similar issues in the future.

Posted Nov 18, 2024 - 11:56 EET

Resolved
This incident has been resolved.

Workaround for Possible Remaining Issues with Individual Approvals:
If you continue to experience issues with approvals, editing the table row with the approval can restart the approval workflow and effectively resolve the problem.
Posted Nov 13, 2024 - 18:08 EET
Monitoring
The mitigation for the issue affecting the approval process has been implemented, and most customers should now be experiencing normal functionality. We are actively monitoring the service to ensure stability and address any remaining impacted cases.

Workaround for Possible Remaining Issues with Approvals:
If you continue to experience issues with approvals, editing the table row with the approval can restart the approval workflow and effectively resolve the problem.

Thank you for your patience as we confirm full resolution. Please reach out to support if further assistance is needed.
Posted Nov 13, 2024 - 14:43 EET
Identified
We have successfully implemented a mitigation for the issue affecting our service. While most customers should now be experiencing normal service, we are aware that a small number of cases may still be impacted. Our team is actively monitoring and working to resolve these remaining cases as quickly as possible.
Posted Nov 13, 2024 - 13:48 EET
Investigating
We are currently investigating an issue where some customers are not able to use approvals.
Posted Nov 13, 2024 - 10:27 EET
This incident affected: System availability.