Azure BCDR Fundamentals
This guide covers backup, disaster recovery, and business continuity in Azure. If you haven't already, read the Azure Platform Fundamentals overview first.
Why BCDR Matters
Business Continuity and Disaster Recovery is not just an IT concern — it's a business requirement.
Every organization must answer:
- How long can we be down before it severely impacts the business? (RTO)
- How much data loss is acceptable? (RPO)
- What happens if an Azure region goes offline?
Design for failure. Everything fails eventually.
In Azure:
- Backup protects against data loss (accidental deletion, corruption, ransomware)
- Site Recovery protects against outages (regional disasters, datacenter failures)
- Resilience is built in layers: Resource → Availability Zone → Region → Region Pair
Region Pairs and Replication
What Are Region Pairs?
A region pair is two Azure regions, generally within the same geography, that Azure links for disaster recovery.
Examples:
- East US ↔ West US
- North Europe (Ireland) ↔ West Europe (Netherlands)
- Southeast Asia (Singapore) ↔ East Asia (Hong Kong)
Why pairs exist:
- Physical separation: Typically at least 300 miles apart, where geography allows
- Sequential updates: Planned platform updates roll out to one region in a pair at a time, not both
- Data residency: Pairs generally stay within the same geography, which helps meet compliance requirements
- Recovery priority: In a broad outage, recovery of one region in each pair is prioritized
Important: Many newer Azure regions are not paired. They provide redundancy through multiple Availability Zones within the region. Always verify your target region's pairing status before designing a DR strategy.
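One way to verify pairing status is to inspect region metadata. The sketch below assumes the metadata is shaped like the JSON that `az account list-locations` returns, where each location carries a `metadata.pairedRegion` list. The sample data and the region name `exampleregion` are illustrative only, not a live listing.

```python
import json

# Illustrative sample shaped like `az account list-locations -o json` output.
# Only the fields used below are included; "exampleregion" is hypothetical.
SAMPLE_LOCATIONS = json.loads("""
[
  {"name": "eastus", "metadata": {"pairedRegion": [{"name": "westus"}]}},
  {"name": "northeurope", "metadata": {"pairedRegion": [{"name": "westeurope"}]}},
  {"name": "exampleregion", "metadata": {"pairedRegion": []}}
]
""")


def paired_regions(locations, region_name):
    """Return the names of regions paired with region_name ([] if unpaired)."""
    for location in locations:
        if location.get("name") == region_name:
            pairs = (location.get("metadata") or {}).get("pairedRegion") or []
            return [pair["name"] for pair in pairs]
    raise KeyError(f"unknown region: {region_name}")


if __name__ == "__main__":
    for name in ("eastus", "exampleregion"):
        pairs = paired_regions(SAMPLE_LOCATIONS, name)
        print(name, "->", pairs or "not paired (rely on Availability Zones)")
```

An empty list is the signal to plan DR around Availability Zones or an explicitly chosen secondary region instead of a built-in pair.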
Reference: Cross-Region Replication
Availability Zones vs Region Pairs
| Concept | Availability Zones | Region Pairs |
|---|---|---|
| Scope | Within a region | Between regions |
| Distance | Connected by high-speed fiber (~2 ms latency) | Typically 300+ miles apart |
| Purpose | Datacenter-level failures | Regional disasters |
| Failover | Automatic (seconds–minutes) | Manual or semi-automated (minutes–hours) |
| Cost | Minimal increase | Data transfer costs |
Best practice: Use both — deploy across zones in the primary region, and replicate to a paired region for disaster recovery.
Backup Fundamentals
Backups protect against accidental deletion, data corruption, ransomware, hardware failures, and insider threats.
Azure Backup Service
A fully managed backup service (PaaS):
- No backup infrastructure to deploy
- Automatic backups to a Recovery Services vault
- Retention: 1 day to 99 years
- Supports: VMs, SQL databases, file shares, on-premises servers
Key benefits:
- No backup tape management
- Application-consistent backups (not just file copies)
- Encrypted at rest and in transit
- Immutable backups for ransomware protection
Reference: Azure Backup Overview
Backup Types
| Type | What It Backs Up | Speed | Storage | Restore Speed |
|---|---|---|---|---|
| Full | Complete copy of all data | Slowest | Largest | Fastest |
| Incremental | Only data changed since last backup | Fastest | Smallest | Slower (needs chain) |
| Differential | Only data changed since last full | Medium | Medium | Medium (needs full + last diff) |
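The "Restore Speed" column follows from how many backup sets a restore has to read. The sketch below is a simplified model, not how Azure Backup stores recovery points: it assumes a schedule that mixes full backups with only one other type, and the function name `restore_chain` is hypothetical.

```python
def restore_chain(backups, target):
    """Return indices of the backup sets needed to restore backups[target].

    backups is a chronological list of "full", "incremental", or
    "differential". Assumes the schedule mixes full backups with only
    one other type.
    """
    kind = backups[target]
    if kind == "full":
        return [target]

    # Walk backwards to the most recent full backup before the target.
    last_full = None
    for i in range(target - 1, -1, -1):
        if backups[i] == "full":
            last_full = i
            break
    if last_full is None:
        raise ValueError("no full backup precedes the target")

    if kind == "differential":
        # A differential holds everything changed since the last full.
        return [last_full, target]
    if kind == "incremental":
        # Each incremental holds only changes since the previous backup,
        # so every link from the full to the target is required.
        return list(range(last_full, target + 1))
    raise ValueError(f"unknown backup type: {kind}")


if __name__ == "__main__":
    print(restore_chain(["full"] + ["incremental"] * 6, 4))
    print(restore_chain(["full"] + ["differential"] * 6, 4))
```

Restoring day 4 of an incremental week reads five sets, while the differential schedule reads two, which is the trade-off the table summarizes.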
Common pattern: Full backup weekly + incremental daily. Retain daily for 30 days, weekly for 12 weeks, monthly for 12 months.
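The retention pattern above can be sketched as a small rule. The details are assumptions for illustration: the weekly point is taken to be Sunday's backup, the monthly point is the backup from the 1st, and 12 months is approximated as 365 days. A real Azure Backup policy defines these choices itself.

```python
from datetime import date


def is_retained(backup_day, today, weekly_weekday=6):
    """Decide whether a backup taken on backup_day is still kept on today.

    Rules (from the pattern above): daily points for 30 days, weekly
    points for 12 weeks, monthly points for 12 months.
    weekly_weekday uses date.weekday() numbering (Monday=0, Sunday=6).
    """
    age_days = (today - backup_day).days
    if age_days < 0:
        return False  # backup dated in the future
    if age_days < 30:
        return True  # daily retention
    if backup_day.weekday() == weekly_weekday and age_days < 12 * 7:
        return True  # weekly retention
    if backup_day.day == 1 and age_days < 365:
        return True  # monthly retention
    return False


if __name__ == "__main__":
    today = date(2024, 6, 30)
    for day in (date(2024, 6, 20), date(2024, 5, 19), date(2024, 5, 20)):
        print(day, is_retained(day, today))
```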
What Can Be Backed Up?
| Resource | Backup Method |
|---|---|
| Azure VMs | Azure Backup |
| Azure SQL Database | Built-in automated backups + Azure Backup |
| Azure Files | Azure Backup for Files |
| Azure Blobs | Soft delete + versioning |
| On-premises files | MARS agent → Azure Backup |
| On-premises VMs (Hyper-V/VMware) | Azure Backup Server (MABS) for backup; Azure Site Recovery for DR replication |
Backup Best Practices
3-2-1 Rule:
- 3 copies of data (original + 2 backups)
- 2 different media types (e.g., disk + cloud)
- 1 copy offsite (different location)
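The 3-2-1 rule is easy to check mechanically against an inventory of copies. This is a minimal sketch; the `DataCopy` record and its fields are hypothetical and would need to be mapped onto whatever inventory you actually keep.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataCopy:
    name: str
    media: str      # e.g. "disk", "cloud", "tape"
    location: str   # e.g. a site or region name


def satisfies_3_2_1(copies, primary_location):
    """Check 3 copies, 2 media types, and at least 1 copy offsite."""
    media_types = {c.media for c in copies}
    offsite = [c for c in copies if c.location != primary_location]
    return len(copies) >= 3 and len(media_types) >= 2 and len(offsite) >= 1


if __name__ == "__main__":
    copies = [
        DataCopy("production volume", "disk", "hq-datacenter"),
        DataCopy("local backup", "disk", "hq-datacenter"),
        DataCopy("vault backup", "cloud", "azure-region"),
    ]
    print(satisfies_3_2_1(copies, primary_location="hq-datacenter"))
```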
Test restores: Schedule quarterly restore tests. Measure actual restore time and verify data integrity. You don't have a backup until you've successfully restored from it.
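Immutable backups: Enable vault immutability so that recovery points cannot be deleted before their retention period expires (ransomware protection). While immutability is enabled but not locked it can still be turned off, so protect that operation with Multi-User Authorization (MUA). Once locked, the setting cannot be reversed.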
Reference: Immutable Vault
Site Recovery Patterns
While backups protect data, Site Recovery protects applications (infrastructure + data).
Azure Site Recovery (ASR)
- Disaster recovery service for VMs and physical servers
- Continuous replication to a secondary location
- Orchestrated failover (one-click DR activation)
- Supports: Azure-to-Azure, on-premises-to-Azure
Reference: Site Recovery Overview
Disaster Recovery Patterns
| Pattern | Secondary State | Cost | RTO | RPO | Failover |
|---|---|---|---|---|---|
| Cold Standby | Not running | $ | Hours | Hours | Manual |
| Warm Standby | Running (scaled-down) | $$ | Minutes | Minutes | Semi-automated |
| Hot Standby (Active-Active) | Running (full) | $$$ | Seconds | Near-zero | Automatic |
Cold Standby: Only storage replication, no secondary compute running. Cheapest but slowest to recover.
Warm Standby: Scaled-down secondary environment with real-time replication. Good balance of cost and recovery speed.
Hot Standby: Full production environment in both regions, actively serving traffic. Load balanced with Traffic Manager or Front Door. Most expensive but fastest recovery.
ASR Workflow
- Enable replication: Select source VMs, choose the target region, and configure the replication policy
- Continuous replication: ASR replicates disk changes continuously (lag typically under 5 minutes) and takes application-consistent snapshots at the interval set in the replication policy
- Failover: Initiate failover, VMs start in the target region, then update DNS or Traffic Manager to point at them
- Failback: Once the primary is restored, reverse replication and switch back
Reference: Azure-to-Azure Replication
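The workflow above is an ordered cycle, which matters for runbooks: you cannot fail back before replication has been reversed. The sketch below models that ordering as a tiny state machine. The state names are labels invented for this illustration and are not ASR API values.

```python
from enum import Enum


class DrState(Enum):
    UNPROTECTED = "unprotected"
    REPLICATING = "replicating"              # primary -> secondary
    FAILED_OVER = "failed_over"              # running in the target region
    REVERSE_REPLICATING = "reverse_replicating"  # secondary -> primary


# Allowed next states, following the four workflow steps in order.
ALLOWED = {
    DrState.UNPROTECTED: {DrState.REPLICATING},          # enable replication
    DrState.REPLICATING: {DrState.FAILED_OVER},          # failover
    DrState.FAILED_OVER: {DrState.REVERSE_REPLICATING},  # reverse replication
    DrState.REVERSE_REPLICATING: {DrState.REPLICATING},  # failback completes
}


def advance(current, target):
    """Return target if the transition is allowed, else raise ValueError."""
    if target not in ALLOWED[current]:
        raise ValueError(f"cannot go from {current.value} to {target.value}")
    return target


if __name__ == "__main__":
    state = DrState.UNPROTECTED
    for step in (DrState.REPLICATING, DrState.FAILED_OVER,
                 DrState.REVERSE_REPLICATING, DrState.REPLICATING):
        state = advance(state, step)
        print(state.value)
```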
RTO and RPO Definitions
RTO and RPO are business requirements that drive technical design.
RTO (Recovery Time Objective)
How long can the business tolerate being down?
Measured from disaster occurrence to service restoration.
What affects RTO: Failover automation, secondary environment readiness, DNS propagation, data restoration time, validation steps.
Example calculation:
Disaster occurs: 10:00 AM
+ Detection: 10 minutes → 10:10 AM
+ Decision to fail over: 20 minutes → 10:30 AM
+ Failover execution: 30 minutes → 11:00 AM
+ DNS propagation: 15 minutes → 11:15 AM
+ Validation: 15 minutes → 11:30 AM
Total RTO: 1.5 hours
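The same calculation can be sketched in code, which makes it easy to swap in durations measured during your own DR drills. The phase durations are the ones from the example above; the calendar date is arbitrary.

```python
from datetime import datetime, timedelta

# Phase durations in minutes, taken from the example above.
PHASES = [
    ("Detection", 10),
    ("Decision to fail over", 20),
    ("Failover execution", 30),
    ("DNS propagation", 15),
    ("Validation", 15),
]


def rto_timeline(disaster_time, phases):
    """Return ([(phase, completion_time), ...], total elapsed time)."""
    timeline = []
    current = disaster_time
    for name, minutes in phases:
        current += timedelta(minutes=minutes)
        timeline.append((name, current))
    return timeline, current - disaster_time


if __name__ == "__main__":
    start = datetime(2024, 1, 1, 10, 0)  # arbitrary date, 10:00 AM
    timeline, total = rto_timeline(start, PHASES)
    for name, done_at in timeline:
        print(f"{name}: done at {done_at:%H:%M}")
    print("Total RTO (hours):", total.total_seconds() / 3600)
```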
RPO (Recovery Point Objective)
How much data loss can the business tolerate?
Measured as the time between the last recoverable point (backup or replica) and the disaster.
Example: Last backup at 2:00 AM, disaster at 2:37 AM → 37 minutes of data lost. If the business requires RPO < 30 minutes, daily backups are insufficient: their worst-case loss approaches 24 hours, so you need continuous replication.
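A short sketch of the same check follows. It assumes the RPO target is met when the loss window is at or below the target, and it treats the worst case for scheduled backups as one full backup interval. The function names are hypothetical.

```python
from datetime import datetime, timedelta


def data_loss_window(last_recovery_point, disaster_time):
    """Data written in this window is lost when restoring."""
    return disaster_time - last_recovery_point


def meets_rpo(window, rpo_target):
    """True when the loss window is within the RPO target."""
    return window <= rpo_target


def worst_case_loss(backup_interval):
    """A disaster just before the next backup loses almost a full interval."""
    return backup_interval


if __name__ == "__main__":
    last_backup = datetime(2024, 1, 1, 2, 0)   # 2:00 AM, arbitrary date
    disaster = datetime(2024, 1, 1, 2, 37)     # 2:37 AM
    target = timedelta(minutes=30)

    window = data_loss_window(last_backup, disaster)
    print("Loss window (minutes):", window.total_seconds() / 60)
    print("Meets 30-minute RPO:", meets_rpo(window, target))
    print("Daily backups worst case meets RPO:",
          meets_rpo(worst_case_loss(timedelta(hours=24)), target))
```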
RTO/RPO by Business Impact
| Business Impact | RTO | RPO | DR Strategy | Cost |
|---|---|---|---|---|
| Low | Days | 24 hours | Backups only | $ |
| Medium | 4–24 hours | 1–4 hours | Cold/warm standby | $$ |
| High | 1–4 hours | 15–60 min | Warm standby + replication | $$$ |
| Mission-Critical | < 1 hour | < 15 min | Active-active multi-region | $$$$ |
Lower RTO/RPO = higher cost. The business must weigh recovery requirements against budget.
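The table can be read as a lookup: take the stricter of the tiers implied by the RTO and the RPO. The sketch below encodes the table's ranges. Where ranges touch (for example, an RTO of exactly 4 hours), it picks the stricter tier, which is an assumption of this example and not something the table specifies.

```python
# Tiers ordered from least to most demanding, with strategies from the table.
TIERS = ["Low", "Medium", "High", "Mission-Critical"]
STRATEGY = {
    "Low": "Backups only",
    "Medium": "Cold/warm standby",
    "High": "Warm standby + replication",
    "Mission-Critical": "Active-active multi-region",
}


def tier_index_for_rto(rto_hours):
    if rto_hours < 1:
        return 3
    if rto_hours <= 4:
        return 2
    if rto_hours <= 24:
        return 1
    return 0


def tier_index_for_rpo(rpo_minutes):
    if rpo_minutes < 15:
        return 3
    if rpo_minutes <= 60:
        return 2
    if rpo_minutes <= 4 * 60:
        return 1
    return 0


def recommend(rto_hours, rpo_minutes):
    """Return (tier, DR strategy) for the stricter of the two requirements."""
    index = max(tier_index_for_rto(rto_hours), tier_index_for_rpo(rpo_minutes))
    tier = TIERS[index]
    return tier, STRATEGY[tier]


if __name__ == "__main__":
    # A relaxed RTO does not help if the RPO is tight.
    print(recommend(rto_hours=12, rpo_minutes=30))
```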
Minimum Resiliency Expectations
Not every workload needs five-nines availability, but every production workload needs some level of resilience.
Resiliency Tiers
| Tier | Downtime OK | Data Loss OK | Protection | Example |
|---|---|---|---|---|
| Dev/Test | Hours–days | Complete loss OK | None required | Developer sandbox |
| Non-Critical Production | 4–24 hours | 24 hours | Backup only (GRS) | Internal wiki |
| Standard Production | 1–4 hours | 1–4 hours | Backup + Availability Zones + warm standby | Line-of-business apps |
| Business-Critical | < 1 hour | < 15 minutes | Multi-zone + multi-region + continuous replication | E-commerce, financial |
| Mission-Critical | < 5 minutes | Near-zero | Active-active multi-region + synchronous replication | Banking, healthcare |
Minimum Requirements by Tier
| Tier | Availability Zones | Regional DR | Backup Frequency | Retention | RTO | RPO |
|---|---|---|---|---|---|---|
| Dev/Test | Not required | No | Optional | Optional | N/A | N/A |
| Non-Critical Production | Not required | No | Daily | 30 days | 24 hrs | 24 hrs |
| Standard Production | Recommended | Warm standby | Hourly | 90 days | 4 hrs | 1 hr |
| Business-Critical | Required | Warm standby | Continuous | 12 months | 1 hr | 15 min |
| Mission-Critical | Required | Active-active | Continuous | 7 years | 5 min | Near-zero |
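Minimums like these are easiest to enforce when they live as data. The sketch below encodes the production rows of the table and reports where a workload falls short. Several choices are assumptions for illustration: the dictionary keys are shorthand for the tier names, "Recommended" zones are treated as not required, 12 months and 7 years are approximated in days, and near-zero RPO is modeled as 1 minute.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TierMinimum:
    zones_required: bool
    regional_dr_required: bool
    retention_days: int
    rto_minutes: int
    rpo_minutes: int


MINIMUMS = {
    "non_critical": TierMinimum(False, False, 30, 24 * 60, 24 * 60),
    "standard": TierMinimum(False, True, 90, 4 * 60, 60),
    "business_critical": TierMinimum(True, True, 365, 60, 15),
    "mission_critical": TierMinimum(True, True, 7 * 365, 5, 1),
}


def gaps(tier, *, zones, regional_dr, retention_days, rto_minutes, rpo_minutes):
    """Return a list of ways the workload misses its tier's minimums."""
    m = MINIMUMS[tier]
    problems = []
    if m.zones_required and not zones:
        problems.append("Availability Zones required")
    if m.regional_dr_required and not regional_dr:
        problems.append("regional DR required")
    if retention_days < m.retention_days:
        problems.append(f"retention below {m.retention_days} days")
    if rto_minutes > m.rto_minutes:
        problems.append(f"RTO above {m.rto_minutes} minutes")
    if rpo_minutes > m.rpo_minutes:
        problems.append(f"RPO above {m.rpo_minutes} minutes")
    return problems


if __name__ == "__main__":
    print(gaps("business_critical", zones=False, regional_dr=True,
               retention_days=30, rto_minutes=60, rpo_minutes=15))
```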
Governance Policies
Common policy: "All production workloads must meet at least the Non-Critical Production tier."
Enforce with:
- Azure Policy: Require backup enabled on VMs tagged Environment: Prod
- Azure Policy: Require GRS on storage accounts tagged Criticality: High
- Monthly audit: Review backups and successful restore tests
- Quarterly: Review RTO/RPO against actual incidents
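As a rough illustration, the storage rule above could be expressed as an Azure Policy rule like the one this sketch builds. It follows the general `if`/`then` shape of policy definitions, but treat the specifics as assumptions to verify before use: the tag name and value come from the list above, the `sku.name` alias and the SKU list are this example's choices, and counting GZRS variants as acceptable is a judgment call.

```python
import json

# SKUs treated as geo-redundant in this sketch (an assumption; adjust as needed).
GEO_REDUNDANT_SKUS = [
    "Standard_GRS",
    "Standard_RAGRS",
    "Standard_GZRS",
    "Standard_RAGZRS",
]


def grs_policy_rule(tag_name="Criticality", tag_value="High", effect="deny"):
    """Build a policy rule that flags tagged storage accounts lacking geo-redundancy."""
    return {
        "if": {
            "allOf": [
                {"field": "type", "equals": "Microsoft.Storage/storageAccounts"},
                {"field": f"tags['{tag_name}']", "equals": tag_value},
                {
                    "field": "Microsoft.Storage/storageAccounts/sku.name",
                    "notIn": GEO_REDUNDANT_SKUS,
                },
            ]
        },
        "then": {"effect": effect},
    }


if __name__ == "__main__":
    print(json.dumps(grs_policy_rule(), indent=2))
```

Starting with `effect="audit"` surfaces non-compliant accounts without blocking deployments, which is a common way to roll a rule like this out gradually.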
Key Takeaways
- Design for failure — everything fails eventually
- Use Availability Zones for datacenter resilience, Region Pairs for regional DR
- Follow the 3-2-1 backup rule and test restores quarterly
- Choose DR patterns (cold/warm/hot) based on RTO/RPO requirements and budget
- Enable immutable backups for ransomware protection
- Define minimum resiliency tiers and enforce with Azure Policy
Additional Resources
- Azure Backup Overview
- Site Recovery Overview
- Cross-Region Replication
- Availability Zones
- Disaster Recovery Overview
- Resiliency Design Requirements
This is part of the Azure Fundamentals Series. Return to the main guide to explore other topics.