Connectivity issues

Incident Report for Sympa

Resolved

This incident has been resolved.

A short recap from Microsoft's point of view:
-----
What went wrong and why?

An inadvertent tenant configuration change within Azure Front Door (AFD) triggered a widespread service disruption affecting both Microsoft services and customer applications dependent on AFD for global content delivery. The change introduced an invalid or inconsistent configuration state that caused a significant number of AFD nodes to fail to load properly, leading to increased latencies, timeouts, and connection errors for downstream services.

As unhealthy nodes dropped out of the global pool, traffic distribution across healthy nodes became imbalanced, amplifying the impact and causing intermittent availability even for regions that were partially healthy. We immediately blocked all further configuration changes to prevent additional propagation of the faulty state and began deploying a ‘last known good’ configuration across the global fleet. Recovery required reloading configurations across a large number of nodes and rebalancing traffic gradually to avoid overload conditions as nodes returned to service. This deliberate, phased recovery was necessary to stabilize the system while restoring scale and ensuring no recurrence of the issue.

The trigger was traced to a faulty tenant configuration deployment process. Our protection mechanisms, which are meant to validate and block erroneous deployments, failed because of a software defect that allowed the deployment to bypass safety validations. Safeguards have since been reviewed, and additional validation and rollback controls have been implemented to prevent similar issues in the future.

Our team will be completing an internal retrospective to understand the incident in more detail. Once the retrospective is complete, generally within 14 days, we will publish a final Post Incident Review (PIR) to all impacted customers.
-----
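
For readers who want a concrete picture of the safeguards described in the recap, the following is a minimal sketch of a validation gate combined with a 'last known good' rollback. It is not Microsoft's implementation: the `ConfigStore` class, the `validate`, `deploy` and `rollback` functions, and the shape of the configuration dictionary are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ConfigStore:
    """Hypothetical store holding the active config and a known-good snapshot."""
    active: dict
    last_known_good: dict = field(default_factory=dict)
    changes_blocked: bool = False

    def __post_init__(self):
        # Assume the initial configuration is healthy and snapshot it.
        self.last_known_good = dict(self.active)


def validate(config: dict) -> list[str]:
    """Return a list of problems; an empty list means the config passed."""
    errors = []
    routes = config.get("routes", [])
    if not routes:
        errors.append("config has no routes")
    for route in routes:
        if not route.get("origin"):
            errors.append(f"route {route.get('path', '?')} has no origin")
    return errors


def deploy(store: ConfigStore, candidate: dict) -> bool:
    """Apply a candidate config only if changes are allowed and it validates."""
    if store.changes_blocked or validate(candidate):
        return False
    store.active = dict(candidate)
    return True


def rollback(store: ConfigStore) -> None:
    """Block further changes and restore the last known good config."""
    store.changes_blocked = True
    store.active = dict(store.last_known_good)


good = {"routes": [{"path": "/", "origin": "origin-1"}]}
bad = {"routes": [{"path": "/"}]}  # missing origin

store = ConfigStore(active=good)
rejected = not deploy(store, bad)   # the gate rejects the invalid config
store.active = bad                  # simulate a defect bypassing the gate
rollback(store)                     # restore the snapshot, freeze changes
```

In this sketch the rollback also freezes further changes, which mirrors the order of actions in the recap: block configuration changes first, then redeploy the last known good state.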

Best regards
Sympa Team
Posted Oct 30, 2025 - 07:53 EET

Update

Microsoft has not yet closed the incident on their side, but they expect the situation to continue to improve. We continue to monitor our system as we wait for mitigation to take effect.

Latest status update from Azure: https://azure.status.microsoft/en-us/status
-----
We initiated the deployment of our ‘last known good’ configuration, which has now successfully been completed. Customers may have begun to see initial signs of recovery. We are currently recovering nodes and routing traffic through healthy nodes, and as we make progress in this workstream, customers will continue to see improvement.

At this stage, we anticipate full mitigation within the next four hours as we continue to recover nodes. This means we expect recovery to happen by 23:20 UTC on 29 October 2025. We will provide another update on our progress within two hours, or sooner if warranted.
-----
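
Microsoft's estimates are given in UTC, while the posting times on this page are in EET. As a small convenience, the sketch below converts the recovery estimate above to EET. It assumes EET is a fixed UTC+2 offset, which holds on the dates of this incident; the `utc_to_eet` helper is a hypothetical name used only here.

```python
from datetime import datetime, timedelta, timezone

# Assumption: EET is modelled as a fixed UTC+2 offset (no summer time).
EET = timezone(timedelta(hours=2), "EET")


def utc_to_eet(year: int, month: int, day: int, hour: int, minute: int) -> datetime:
    """Interpret the arguments as a UTC time and return it in EET."""
    utc_time = datetime(year, month, day, hour, minute, tzinfo=timezone.utc)
    return utc_time.astimezone(EET)


# Microsoft's estimate for full mitigation: 23:20 UTC on 29 October 2025.
estimate = utc_to_eet(2025, 10, 29, 23, 20)
print(estimate.isoformat())  # 2025-10-30T01:20:00+02:00
```

In local terms, the estimate therefore falls shortly after midnight on 30 October.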
Posted Oct 29, 2025 - 21:56 EET

Update

Latest status update from Azure: https://azure.status.microsoft/en-us/status

-----
Current status:

We have pushed our ‘last known good’ configuration, and customers may begin to see initial signs of recovery. We are currently recovering nodes and routing traffic through healthy nodes, and as we make progress in this workstream, customers will continue to see improvement.

Customer configuration changes will remain temporarily blocked while we continue mitigation efforts. We will notify customers once this block has been lifted.

Some customers may also have experienced issues accessing the Azure management portal. We have failed the portal away from AFD to mitigate these access issues. Customers should now be able to access the Azure portal directly, and while most portal extensions are functioning as expected, a small number of endpoints (e.g., Marketplace) may still experience intermittent loading problems.

We are continuing to monitor progress closely and will provide an ETA for full mitigation within the next 20 minutes as we assess recovery across the AFD service.

Although we are seeing signs of recovery, customers may also consider implementing failover strategies using Azure Traffic Manager to redirect traffic from Azure Front Door to their origin servers as an interim measure. https://learn.microsoft.com/en-us/azure/architecture/guide/networking/global-web-applications/overview


This message was last updated at 19:01 UTC on 29 October 2025
-----
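
The failover strategy Microsoft mentions above can be pictured as priority-based routing: send traffic to the preferred endpoint while it is healthy, otherwise fall back to the next one. Azure Traffic Manager offers this at the DNS level through its priority routing method. The sketch below only illustrates the selection logic and does not call any Azure API; the `Endpoint` class, the `pick_endpoint` function, the endpoint names and the `example.com` URLs are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass(frozen=True)
class Endpoint:
    name: str
    url: str
    priority: int  # lower number = preferred


def pick_endpoint(
    endpoints: Iterable[Endpoint],
    is_healthy: Callable[[Endpoint], bool],
) -> Optional[Endpoint]:
    """Return the healthy endpoint with the lowest priority number, if any."""
    for endpoint in sorted(endpoints, key=lambda e: e.priority):
        if is_healthy(endpoint):
            return endpoint
    return None


endpoints = [
    Endpoint("front-door", "https://app.example.com", priority=1),
    Endpoint("origin-direct", "https://origin.example.com", priority=2),
]

# Stubbed health results standing in for real probes during the outage.
health = {"front-door": False, "origin-direct": True}
chosen = pick_endpoint(endpoints, lambda e: health[e.name])
print(chosen.name if chosen else "no healthy endpoint")
```

A real setup would replace the stubbed `health` dictionary with actual health probes, and the origin would need to be reachable without Azure Front Door in front of it.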
Posted Oct 29, 2025 - 21:13 EET

Update

Latest status update from Azure: https://azure.status.microsoft/en-us/status

-----
Starting at approximately 16:00 UTC, customers and Microsoft services that leverage Azure Front Door (AFD) may have experienced issues resulting in latencies, timeouts and errors. We have confirmed an inadvertent configuration change as the trigger event for this issue.

Current status:

We have initiated the deployment of our 'last known good' configuration. This is expected to be fully deployed in about 30 minutes, at which point customers will start to see initial signs of recovery. Once this is completed, the next stage is to recover nodes while we route traffic through the healthy nodes.

Customer configuration changes will remain blocked during this time as we work towards mitigation. We will communicate to customers when this block is reverted.

Customers may have experienced problems accessing the Azure management portal. We have failed the portal away from AFD to mitigate the portal access issues. Customers should be able to access the Azure management portal directly. While most portal extensions are working correctly, a small number of endpoints (e.g. Marketplace) might have problems loading.

We do not have an ETA for full mitigation. We will update this communication within 30 minutes, once the deployment is completed.

Customers can consider implementing failover strategies with Azure Traffic Manager to fail over from Azure Front Door to their origins.

This message was last updated at 18:11 UTC on 29 October 2025
-----
Posted Oct 29, 2025 - 20:16 EET

Identified

We have identified an issue that has impacted Sympa.

This is unfortunately part of a global Azure incident: https://azure.status.microsoft/en-gb/status

--------
Starting at approximately 16:00 UTC, we began experiencing Azure Front Door (AFD) issues resulting in a loss of availability of some services. We suspect an inadvertent configuration change as the trigger event for this issue. We are taking two concurrent actions: blocking all changes to the AFD services and, at the same time, rolling back to our last known good state.

We have failed the portal away from AFD to mitigate the portal access issues. Customers should be able to access the Azure management portal directly.

We do not have an ETA for when the rollback will be completed, but we will update this communication within 30 minutes or when we have an update.

This message was last updated at 17:18 UTC on 29 October 2025
--------

At the moment we can only monitor the situation from Sympa's point of view. We’ll share updates as soon as more information becomes available.
Posted Oct 29, 2025 - 19:35 EET
This incident affected: System availability, Authentication, and Integrations.