Summary
Internal monitoring detected an issue in the CRM Online Service that made fewer than 10% of CRM Online Organizations served from the North American datacenters inaccessible for approximately 2 hours and 35 minutes. Microsoft engineers restored access using known and documented resolution steps.
Customer Impact
During the Service Incident, the CRM Online service was unavailable to fewer than 10% of CRM Online Organizations served from the North American datacenters.
Incident Start Date and Time
August 15, 2013 01:45 AM USA PDT
Date and Time Service was Restored
August 15, 2013 04:20 AM USA PDT
Root Cause
The SQL Server Availability Group (AG) entered an unhealthy state that required manual intervention, following known steps, to mitigate. Mitigation took longer than usual because of a human error in the restoration process.
The root cause of the ongoing AG health issue is already under investigation, and the team is following up with the SQL Server and Windows Cluster teams.
Next Step(s)
Issue | Next Step | Team Owner | Timeline |
Escalation | Follow up on the ongoing discussions with the SQL Server and Windows Cluster teams regarding unhealthy Availability Groups, to ensure they are aware of the issue and to determine whether a patch or hotfix is available. | Microsoft Dynamics CRM Online Service Engineering | Ongoing |
Documentation | Review the documentation for mitigating unhealthy AGs and identify any gaps that, if addressed, would reduce the chance of human error. | Microsoft Dynamics CRM Online Service Engineering | August 30, 2013 |
Resiliency | Look for opportunities to implement scripts or programmatic methods to address AG health issues in a more resilient and efficient manner that is less prone to user error. | Microsoft Dynamics CRM Online Service Engineering | August and September 2013 |
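
As an illustration of the kind of programmatic check the Resiliency item describes, the following is a minimal sketch only, not the tooling used by the service team. It assumes replica states have already been collected (for example from the SQL Server dynamic management view sys.dm_hadr_availability_replica_states) into simple records. The ReplicaState type, the classify_availability_group function, and the CRITICAL/WARNING policy are hypothetical and exist only for this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ReplicaState:
    """Hypothetical record for one AG replica (illustrative only)."""
    replica: str
    role: str                    # "PRIMARY" or "SECONDARY"
    connected: bool
    synchronization_health: str  # "HEALTHY", "PARTIALLY_HEALTHY" or "NOT_HEALTHY"


def classify_availability_group(replicas: List[ReplicaState]) -> str:
    """Return an overall status for one AG under an assumed policy.

    Assumed policy (not taken from the incident report):
      - no data                                   -> "UNKNOWN"
      - not exactly one connected, healthy primary -> "CRITICAL"
      - any other replica disconnected/unhealthy   -> "WARNING"
      - otherwise                                  -> "HEALTHY"
    """
    if not replicas:
        return "UNKNOWN"

    primaries = [r for r in replicas if r.role == "PRIMARY"]
    if len(primaries) != 1:
        return "CRITICAL"

    primary = primaries[0]
    if not primary.connected or primary.synchronization_health != "HEALTHY":
        return "CRITICAL"

    for r in replicas:
        if not r.connected or r.synchronization_health != "HEALTHY":
            return "WARNING"

    return "HEALTHY"


if __name__ == "__main__":
    sample = [
        ReplicaState("node-a", "PRIMARY", True, "HEALTHY"),
        ReplicaState("node-b", "SECONDARY", False, "NOT_HEALTHY"),
    ]
    print(classify_availability_group(sample))
```

A check of this shape could run on a schedule and page an engineer, or trigger a scripted mitigation, only when the status is CRITICAL, which would reduce the number of manual steps in which human error can occur.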