Azure SQL Database disaster recovery

My last blog post was on Azure SQL Database high availability and I would like to continue along that discussion with a blog post about disaster recovery in Azure SQL Database. First, a clarification on the difference between high availability and disaster recovery:

High Availability (HA) – Keeping your database up 100% of the time with no data loss during common problems. Redundancy at system level, focus on failover, addresses single predictable failure, focus is on technology. SQL Server IaaS would handle this with:

Always On Failover cluster instances
Always On Availability Groups (in same Azure region)
SQL Server data files in Azure

Disaster Recovery (DR) – Protection if major disaster or unusual failure wipes out your database. Use of alternate site, focus on re-establishing services, addresses multiple failures, includes people and processes to execute recovery. Usually includes HA also. SQL Server IaaS would handle this with:

Log Shipping
Database Mirroring
Always On Availability Groups (different Azure regions)
Backup to Azure

Azure SQL Database makes setting up disaster recovery so much easier than SQL Server IaaS (in a VM). Disaster recovery is done via active geo-replication, which is an Azure SQL Database feature that allows you to create readable replicas of your database in the same or different data center (region). All it takes is navigating to this page and choosing the region to create a secondary database (this example of active geo-replication is configured with a primary in the North Central US region and secondary in the South Central US region):

Once created, the secondary database is populated with the data copied from the primary database. This process is known as seeding. After the secondary database has been created and seeded, updates to the primary database are asynchronously replicated to the secondary database automatically. Asynchronous replication means that transactions are committed on the primary database before they are replicated to the secondary database.

Compare this to setting up AlwaysOn Availability Groups! And then think about the time it takes to monitor and maintain AlwaysOn Availability Groups, stuff that you won’t have to worry about anymore, and you can see why Azure SQL database is such a pleasure.

Active geo-replication is designed as a business continuity solution that allows an application to perform quick disaster recovery in case of a data center scale outage. If geo-replication is enabled, the application can initiate a failover to a secondary database in a different Azure region. Up to four secondaries are supported in the same or different regions, and the secondaries can also be used for read-only access queries. The failover can be initiated manually by the application or the user. After failover, the new primary has a different connection end point. As each secondary is a discrete database with the same name as the primary but in a different server you will need to reconfigure your application(s) with an updated connection string.

Each geo-replicated database will be on a server in a region with server names you give them such as ServerNameWestUS.database.windows.net (primary) and ServerNameEastUS.database.windows.net (secondary). If you want to load balance between them in your application for read queries, if the application itself is load balanced between regions you can use Azure Traffic Manager. If the application itself is not geo-replicated, then use Azure Load Balancer.

Auto-failover groups is an extension of active geo-replication. It is designed to manage the failover of multiple geo-replicated databases simultaneously using an application initiated failover or by delegating failover to be done by the SQL Database service based on a user defined criteria. The latter allows you to automatically recover multiple related databases in a secondary region after a catastrophic failure or other unplanned event that results in full or partial loss of the SQL Database service’s availability in the primary region. Because auto-failover groups involve multiple databases, these databases must be configured on the primary server. Both primary and secondary servers for the databases in the failover group must be in the same subscription. Auto-failover groups support replication of all databases in the group to only one secondary server in a different region.

If you are using active geo-replication and for any reason your primary database fails, or simply needs to be taken offline, you can initiate failover to any of your secondary databases. When failover is activated to one of the secondary databases, all other secondaries are automatically linked to the new primary. If you are using auto-failover groups to manage database recovery, any outage that impacts one or several of the databases in the group results in automatic failover. You can configure the auto-failover policy that best meets your application needs, or you can opt out and use manual activation. In addition, auto-failover groups provide read-write and read-only listener end-points that remain unchanged during failovers. Whether you use manual or automatic failover activation, failover switches all secondary databases in the group to primary. After the database failover is completed, the DNS record is automatically updated to redirect the end-points to the new region.

Active geo-replication leverages the Always On technology of SQL Server to asynchronously replicate committed transactions on the primary database to a secondary database using snapshot isolation. The primary and secondary instances in a geo-replication relationship have independent HA capabilities, the same as a standalone instance would have. Auto-failover groups provide the group semantics on top of active geo-replication but the same asynchronous replication mechanism is used. While at any given point, the secondary database might be slightly behind the primary database, the secondary data is guaranteed to never have partial transactions. Cross-region redundancy enables applications to quickly recover from a permanent loss of an entire datacenter or parts of a datacenter caused by natural disasters, catastrophic human errors, or malicious acts. The specific recovery point objective (RPO) data can be found at Overview of Business Continuity (The time period of updates that you might lose is under 5 seconds).

More info:

Data Integrity in Azure SQL Database

High-availability and Azure SQL Database

Overview of business continuity with Azure SQL Database

Overview: Active geo-replication and auto-failover groups

Designing globally available services using Azure SQL Database

Spotlight on SQL Database Active Geo-Replication

Azure SQL Database Business Continuity Enhancements