1. Why High Availability Matters

Imagine waking up at 3 AM to alerts that your production database is down. Your e-commerce site is offline. Every minute costs thousands in lost revenue. Customers are frustrated. Your team is panicking.

This is why High Availability (HA) is not optional for production systems.

2. High Availability Options in RDS

RDS offers three deployment models, each with different availability guarantees.

2.1. Single-AZ (No Standby) ❌

    ┌─────────────────┐
    │  Availability   │
    │    Zone A       │
    │                 │
    │  ┌───────────┐  │
    │  │ Primary   │  │
    │  │ Instance  │  │
    │  └───────────┘  │
    │                 │
    └─────────────────┘

    No redundancy!

What happens if the primary fails:

  • RDS detects the failure
  • RDS provisions new EC2 instance
  • RDS attaches EBS volumes
  • PostgreSQL initializes
  • Database becomes available

⏰ Downtime: Typically 10-30 minutes (sometimes longer)

When to use:

  • Dev/test environments ✅
  • Cost-sensitive non-critical workloads ✅
  • Databases that can tolerate prolonged downtime ✅

⚠️ Single-AZ Risks

  • Hardware failure → 10-30 min downtime
  • AZ failure → Potentially hours of downtime
  • Maintenance windows → Downtime during upgrades
  • No automatic failover

Conclusion: Single-AZ is NOT production-ready.

2.2. Multi-AZ with One Standby (Synchronous) ✅

    ┌─────────────────┐        ┌─────────────────┐
    │   Availability  │        │   Availability  │
    │     Zone A      │        │      Zone B     │
    │                 │        │                 │
    │  ┌───────────┐  │        │  ┌───────────┐  │
    │  │ Primary   │  │◄──────►│  │ Standby   │  │
    │  │ Instance  │  │ Sync   │  │ Instance  │  │
    │  └─────┬─────┘  │ Replic │  └───────────┘  │
    │        │        │        │                 │
    └────────┼────────┘        └─────────────────┘
             │
             │ Apps connect here
             ▼
       DNS Endpoint
       mydb.abc.rds.amazonaws.com

This is the standard production setup for most workloads.

Synchronous Replication Flow:

  1. App sends: INSERT INTO users…
  2. Primary receives the write
  3. Primary sends to standby: “I’m about to commit this”
  4. Standby acknowledges: “Received, persisted”
  5. Primary commits
  6. Primary responds to app: “Success!”

The write is confirmed only after the standby acknowledges it.

Benefits:

  • Zero data loss (RPO = 0)
  • Fast failover (RTO = 1-2 min)
  • Automatic (no manual intervention)
  • Same endpoint (DNS change, no app changes)
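
The synchronous commit flow described above can be sketched as a small in-memory simulation. This is only an illustration of the ordering (standby acknowledgement before the primary commits), not how RDS is implemented; the `Primary` and `Standby` classes and the `events` list are hypothetical names introduced for the example.

```python
events = []  # records the order in which the steps happen


class Standby:
    """In-memory stand-in for the standby instance (hypothetical)."""

    def __init__(self):
        self.log = []

    def persist(self, record):
        # Step 4: store the record, then acknowledge to the primary.
        self.log.append(record)
        events.append("standby_ack")
        return True


class Primary:
    """In-memory stand-in for the primary instance (hypothetical)."""

    def __init__(self, standby):
        self.standby = standby
        self.committed = []

    def write(self, record):
        # Step 2: the primary receives the write.
        events.append("primary_received")
        # Steps 3-4: ship the record and wait for the acknowledgement.
        acknowledged = self.standby.persist(record)
        if not acknowledged:
            raise RuntimeError("no acknowledgement; write not confirmed")
        # Step 5: commit only after the standby has the record.
        self.committed.append(record)
        events.append("primary_commit")
        # Step 6: report success to the application.
        return "Success!"


standby = Standby()
primary = Primary(standby)
result = primary.write("INSERT INTO users ...")
print(events)
```

Because the commit happens after the acknowledgement, every committed record already exists on the standby, which is where the RPO of 0 comes from.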

2.2.1. Failover Scenario 1: Primary instance fails

What happens:

  • Primary crashes
  • RDS health check detects failure
  • RDS initiates failover
  • Standby promoted to primary
  • DNS points to the new primary
  • New connections accepted ✅

⚠️ Total downtime: ~1-2 minutes

Your application sees:

  • Existing connections: Dropped (need to reconnect)
  • New connections: Brief rejection, then success
  • Data loss: ZERO (everything was synchronized)

RDS automatically:

  • Promotes standby to primary
  • Updates DNS (the endpoint name stays the same; the IP address behind it changes)
  • Begins rebuilding new standby in background
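
Since existing connections are dropped and new ones are briefly rejected, the application needs to reconnect with some patience. Below is a minimal sketch of reconnecting with exponential backoff. It assumes the driver's connection failures surface as `ConnectionError` (real database drivers raise their own exception types, so adapt the `except` clause), and `connect_with_retry` and `flaky_connect` are hypothetical names.

```python
import time


def connect_with_retry(connect, attempts=8, base_delay=1.0, max_delay=30.0,
                       sleep=time.sleep):
    """Call connect() until it succeeds, backing off exponentially between tries."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return connect()
        except ConnectionError:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            # Delays grow 1s, 2s, 4s, ... and are capped at max_delay.
            sleep(min(max_delay, base_delay * 2 ** (attempt - 1)))


# Demo: a fake connect() that is refused three times, as during a failover.
calls = {"count": 0}


def flaky_connect():
    calls["count"] += 1
    if calls["count"] <= 3:
        raise ConnectionError("connection refused")
    return "connected"


delays = []  # capture the delays instead of actually sleeping
result = connect_with_retry(flaky_connect, sleep=delays.append)
print(result, delays)
```

With the default settings the waits add up to roughly a minute and a half, which is in the same range as the 1-2 minute failover window. Always reconnect through the endpoint name rather than a cached IP address.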

2.2.2. Scenario 2: Standby instance fails

What happens:

  • Impact: NONE on your application

RDS actions:

  1. Detects standby failure
  2. Primary continues serving traffic normally
  3. RDS provisions new standby in background
  4. Synchronization resumes automatically

2.2.3. Scenario 3: Entire AZ fails

What happens:

  • If Primary’s AZ fails:
    • Standby in different AZ takes over
    • Failover time: ~1-2 minutes
    • Zero data loss
  • If Standby’s AZ fails:
    • Primary not affected
    • RDS rebuilds standby in healthy AZ
    • Zero impact on application

What causes automatic failover?

RDS automatically fails over for:

  • Infrastructure failures:
    • Primary instance hardware failure
    • Underlying storage failure
    • AZ-level outage
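    • Loss of network connectivity to the primary
  • Maintenance operations:
    • OS patching (applied to standby first, then failover)
    • Database engine upgrades (primary and standby are upgraded together, so expect downtime for the length of the upgrade rather than a quick failover)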

Does NOT cause failover ❌ :

  • Long-running queries (PostgreSQL issue, not infrastructure)
  • Deadlocks (application/database logic issue)
  • Out of connections (configuration issue)
  • Full disk (need to increase storage)

For database-level issues: RDS restarts, doesn’t fail over.

🎯 Operational Advantages

  • Backups run on standby
  • OS maintenance applied to standby first
    • Standby receives the OS patch
    • Failover happens (1-2 min downtime)
    • Old primary becomes new standby
    • New standby receives patch
    • Result: Minimal downtime for maintenance
  • SLA guarantee
    • 99.95% monthly uptime SLA
    • Translates to ~22 minutes max downtime per month
  • Same endpoint
    • DNS: mydb.abc.rds.amazonaws.com
    • No application changes after failover
    • Connection string remains the same
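
The downtime figure quoted for the SLA follows from simple arithmetic. A minimal sketch, assuming a 30-day month (`max_downtime_minutes` is a hypothetical helper name):

```python
def max_downtime_minutes(sla_percent, days=30):
    """Downtime budget implied by an uptime percentage over a period of `days`."""
    total_minutes = days * 24 * 60          # 43,200 minutes in a 30-day month
    return total_minutes * (100 - sla_percent) / 100


budget = max_downtime_minutes(99.95)
print(f"{budget:.1f} minutes per month")    # about 21.6, i.e. the ~22 minutes above
```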

2.3. Multi-AZ DB Cluster (Semi-Synchronous) 🚀

    ┌──────────────┐   ┌──────────────┐   ┌──────────────┐
    │     AZ A     │   │     AZ B     │   │     AZ C     │
    │              │   │              │   │              │
    │ ┌──────────┐ │   │ ┌──────────┐ │   │ ┌──────────┐ │
    │ │ Primary  │ │   │ │ Readable │ │   │ │ Readable │ │
    │ │ Instance │◄├────►│ Standby  │ │   │ │ Standby  │ │
    │ │          │ │   │ │    #1    │ │   │ │    #2    │ │
    │ └────┬─────┘ │   │ └────┬─────┘ │   │ └────┬─────┘ │
    │      │NVMe   │   │      │NVMe   │   │      │NVMe   │
    │      │SSD    │   │      │SSD    │   │      │SSD    │
    └──────┼───────┘   └──────┼───────┘   └──────┼───────┘
           │                  │                  │
           ▼                  ▼                  ▼
      Write Endpoint     Reader Endpoint    Reader Endpoint

This is the premium option for ultra-low RTO and integrated read scalability.

Semi-Synchronous Replication:

  1. App sends: INSERT INTO orders…
  2. Primary receives the write
  3. Primary sends to BOTH standbys
  4. Primary waits for ANY ONE standby to acknowledge
  5. Primary commits (doesn’t wait for both)
  6. Primary responds to app: “Success!”

Only one standby needs to confirm, so commits are faster than waiting for both.
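
To see why waiting for any one acknowledgement is faster than waiting for both, consider the commit latency as a function of the standbys' acknowledgement times. This is a simplified model with made-up latencies, and `commit_latency_ms` is a hypothetical helper.

```python
def commit_latency_ms(standby_ack_ms, required_acks=1):
    """Time until the primary can commit, given each standby's ack latency."""
    if not 1 <= required_acks <= len(standby_ack_ms):
        raise ValueError("required_acks must be between 1 and the number of standbys")
    # The primary commits when the required_acks-th fastest ack arrives.
    return sorted(standby_ack_ms)[required_acks - 1]


acks = [4.0, 11.0]  # hypothetical ack latencies (ms) for standby #1 and #2
semi_sync = commit_latency_ms(acks, required_acks=1)  # wait for ANY ONE standby
full_sync = commit_latency_ms(acks, required_acks=2)  # wait for BOTH standbys
print(semi_sync, full_sync)
```

In this model the semi-synchronous commit tracks the fastest standby, so a slow or briefly unreachable second standby does not hold up writes.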

Key differences from Multi-AZ with One Standby:

  • 2 standbys instead of 1 (three AZs total)
  • Standbys are readable (can serve SELECT queries)
  • Faster failover (<35 seconds vs 1-2 minutes)
  • Local NVMe SSDs (better I/O performance)
  • Reader endpoint (automatic load balancing)

🎯 Operational Advantages

  1. Ultra-Fast Failover (<35 seconds)
  2. Integrated Read Scalability
  3. Automatic Reader Endpoint
  4. Better Performance

When to Use Multi-AZ Cluster:

  • RTO < 35 seconds required
  • Mission-critical applications
  • Financial services, healthcare
  • E-commerce during seasonal peaks
  • Heavy read workload
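
| Feature            | Single-AZ        | Multi-AZ (1 Standby)  | Multi-AZ Cluster    |
|--------------------|------------------|-----------------------|---------------------|
| Availability Zones | 1                | 2                     | 3                   |
| Replication        | None             | Synchronous           | Semi-synchronous    |
| RTO                | 10-30 min        | 1-2 min               | <35 sec             |
| RPO                | Minutes to hours | 0 (no data loss)      | 0 (no data loss)    |
| Readable Standbys  | N/A              | ❌ No                 | ✅ Yes (2)          |
| Automatic Failover | ❌ No            | ✅ Yes                | ✅ Yes              |
| Storage Type       | EBS              | EBS                   | Local NVMe SSD      |
| SLA                | None             | 99.95%                | 99.99%              |
| Cost               | $                | $$ (2x Single-AZ)     | $$$$ (4x Single-AZ) |
| Best For           | Dev/test         | Production (standard) | Mission-critical    |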

3. Disaster Recovery Strategies

High Availability protects against infrastructure failures. Disaster Recovery protects against catastrophic events: region-wide outages, accidental deletions, data corruption, ransomware.

3.1. RDS Snapshots

Snapshots are point-in-time backups of your entire database.

Automated snapshots

Day 1:
├─ 00:00 - Full snapshot taken
└─ During the day: Transaction logs captured

Day 2:
├─ 00:00 - Incremental snapshot (only changes)
└─ During the day: Transaction logs captured

Day 3:
├─ 00:00 - Incremental snapshot
└─ And so on...

Features:

  • Automatic daily backups
  • Incremental (only changed blocks)
  • Includes transaction logs (for point-in-time restore)
  • Default retention: 7 days (configurable up to 35 days)
  • Backup window: Specify preferred time (low-traffic period)

Point-in-Time Restore (PITR). Scenario: an accidental DELETE runs at 14:47. You can restore to:

  • 14:46 (before the DELETE)
  • 14:30 (30 min before)
  • 10:00 (this morning)
  • Any point in time within the retention period, up to the latest restorable time (typically within the last 5 minutes)
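
The restore scenario above could be prepared along these lines. This sketch only validates the target time and builds the keyword arguments for boto3's `restore_db_instance_to_point_in_time` call; it does not call AWS. The identifiers, the dates, and the `build_pitr_request` helper are hypothetical, and computing the earliest restorable time from the retention period is a simplification.

```python
from datetime import datetime, timedelta, timezone


def build_pitr_request(source_id, target_id, restore_time,
                       latest_restorable, retention_days):
    """Validate a restore target and return kwargs for a point-in-time restore."""
    # Simplification: assume the window spans the full retention period.
    earliest = latest_restorable - timedelta(days=retention_days)
    if not earliest <= restore_time <= latest_restorable:
        raise ValueError(
            f"restore_time must be between {earliest} and {latest_restorable}"
        )
    return {
        "SourceDBInstanceIdentifier": source_id,
        "TargetDBInstanceIdentifier": target_id,  # PITR always creates a NEW instance
        "RestoreTime": restore_time,
    }


# Accidental DELETE at 14:47, so restore to 14:46 (hypothetical date).
latest = datetime(2024, 5, 1, 14, 55, tzinfo=timezone.utc)
target = datetime(2024, 5, 1, 14, 46, tzinfo=timezone.utc)
request = build_pitr_request("mydb", "mydb-restored", target, latest,
                             retention_days=7)
print(request)
```

With a real client, the dictionary would be passed as `rds.restore_db_instance_to_point_in_time(**request)`, where `rds` is a `boto3.client("rds")`.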

Cost:

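  • Backup storage up to the size of your provisioned database storage is included in RDS pricing
  • Backup storage beyond that allocation is billed per GB-month
  • Example: 100 GB of additional backup storage = ~$10/month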

Manual snapshots

Automated Snapshots:
├─ Happen automatically daily
├─ Deleted after retention period
├─ Support point-in-time restore
└─ Tied to instance

Manual Snapshots:
├─ You trigger them
├─ NEVER automatically deleted
├─ NO point-in-time restore (only snapshot moment)
└─ Independent of instance (persist after deletion)

When to use manual snapshots:

  • Before major changes
  • Compliance/audit requirements
  • Before deleting instance

Cost:

  • Pay only for storage
  • $0.095/GB/month in us-east-1
  • 100 GB snapshot = $9.50/month
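
The snapshot cost above is a straight multiplication, sketched here with the quoted us-east-1 rate as the default. The helper name is hypothetical, prices change over time, and because snapshots are incremental the billed size is often smaller than the database size.

```python
def monthly_snapshot_cost(snapshot_gb, price_per_gb_month=0.095):
    """Monthly storage cost in USD for a given amount of snapshot data."""
    return snapshot_gb * price_per_gb_month


cost = monthly_snapshot_cost(100)
print(f"${cost:.2f}/month")
```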

🎯 Production Snapshot Strategy

  • Configure automated backups
  • Take manual snapshots before changes:
    • Schema migrations
    • Major version upgrades
    • Configuration changes
    • Before deleting instance
  • Copy snapshots cross-region
  • Test restores regularly
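
Taking a manual snapshot before a risky change could be scripted roughly as follows. The function expects an RDS client such as `boto3.client("rds")` and uses its `create_db_snapshot` call; the naming scheme, the `snapshot_before_change` helper, and the `FakeRds` stand-in (used so the sketch runs without AWS credentials) are assumptions for illustration.

```python
from datetime import datetime, timezone


def snapshot_before_change(rds, instance_id, reason, now=None):
    """Take a manual snapshot named after the instance, reason, and UTC time."""
    now = now or datetime.now(timezone.utc)
    snapshot_id = f"{instance_id}-{reason}-{now:%Y%m%d-%H%M%S}"
    rds.create_db_snapshot(
        DBSnapshotIdentifier=snapshot_id,
        DBInstanceIdentifier=instance_id,
    )
    return snapshot_id


class FakeRds:
    """Stand-in client that records calls instead of contacting AWS."""

    def __init__(self):
        self.calls = []

    def create_db_snapshot(self, **kwargs):
        self.calls.append(kwargs)


fake = FakeRds()
snapshot_id = snapshot_before_change(
    fake, "mydb", "pre-migration",
    now=datetime(2024, 5, 1, 9, 30, 0, tzinfo=timezone.utc),
)
print(snapshot_id)
```

Putting the reason and timestamp in the identifier makes it easier to find the right snapshot later, since manual snapshots are never deleted automatically.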

3.1.1. Snapshot Restore Process

Timeline:

  1. Initiate restore (API call): 1 minute
  2. RDS provisions new instance: 5-10 minutes
  3. Restore data from snapshot: 10-60 minutes (depends on size)
  4. Instance becomes available: Total 15-70 minutes

Factors affecting time:

  • Database size (larger = longer)
  • Instance class (larger = faster restore)
  • Region load (peak times may be slower)

After the restore, you have a new instance:

  • Different endpoint (need to update connection strings)
  • Same data from snapshot moment
  • Same configuration (instance class, parameters)
  • Original instance still running (you choose which to keep)

3.2. Read Replicas for DR

Read replicas serve TWO purposes:

  • Read scalability (offload SELECT queries)
  • Disaster recovery (can be promoted to standalone)

    ┌─────────────────┐        ┌─────────────────┐
    │  us-east-1      │        │  us-west-2      │
    │                 │        │                 │
    │  ┌───────────┐  │        │  ┌───────────┐  │
    │  │ Primary   │  │───────►│  │  Read     │  │
    │  │ (Source)  │  │ Async  │  │  Replica  │  │
    │  └─────┬─────┘  │ Replic │  └───────────┘  │
    │        │        │        │                 │
    └────────┼────────┘        └─────────────────┘
             │
      Write traffic

Asynchronous Replication:

  1. App writes to primary
  2. Primary commits immediately
  3. Primary sends change to replica
  4. Replica applies change (eventually)
  5. Replica may lag behind primary (seconds to minutes) ⚠️

Key characteristics:

  • Async replication (no impact on write latency)
  • Can be in same region or cross-region
  • Can have different instance class from primary
  • Can be promoted to standalone instance
  • Replication lag possible (monitor closely)
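
The consequence of asynchronous replication for DR is easiest to see in a toy model: the primary commits without waiting, so anything not yet applied on the replica is lost if the replica is promoted at that moment. The classes below are hypothetical in-memory stand-ins, not a description of how RDS ships changes.

```python
from collections import deque


class Primary:
    def __init__(self):
        self.committed = []
        self.unreplicated = deque()  # committed changes the replica has not applied

    def write(self, change):
        # Commit immediately; replication happens afterwards.
        self.committed.append(change)
        self.unreplicated.append(change)
        return "Success!"


class Replica:
    def __init__(self):
        self.applied = []

    def apply(self, primary, max_changes):
        # Apply up to max_changes pending changes, oldest first.
        for _ in range(min(max_changes, len(primary.unreplicated))):
            self.applied.append(primary.unreplicated.popleft())


primary = Primary()
replica = Replica()
for i in range(5):
    primary.write(f"change-{i}")
replica.apply(primary, max_changes=3)  # the replica only keeps up partially

lag_changes = len(primary.committed) - len(replica.applied)
at_risk = list(primary.unreplicated)   # lost if the replica is promoted now
print(lag_changes, at_risk)
```

This is why the replication lag at the moment of failure is effectively the RPO of a replica-based DR setup.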

3.2.1. In-Region vs Cross-Region Read Replicas

| Characteristic   | In-Region                                   | Cross-Region                                 |
|------------------|---------------------------------------------|----------------------------------------------|
| Topology         | Primary (us-east-1a) → Replica (us-east-1b) | Primary (us-east-1) → Replica (eu-west-1)    |
| Replication Lag  | <1 second (typically)                       | 1-5 seconds (depends on distance)            |
| Cost             | ~2x instance cost                           | 2x instance + data transfer ($0.02/GB)       |
| Data Transfer    | ✅ No charge                                | ❌ Charged ($0.02/GB out)                    |
| Latency          | Very low (~ms)                              | Variable (10-200ms depending on distance)    |
| Primary Use Case | Read scalability                            | Disaster recovery + global reads             |
| Best For         | Offload reads from primary                  | Cross-region DR, compliance, global users    |

When to use cross-region:

  • Disaster recovery (protection against region-wide failure)
  • Compliance (data residency requirements)
  • Global application (serve users from nearest region)
  • Lower read latency for geographically distributed users

3.2.2. Read Replica Sizing

A replica can use a different instance class from the primary.

Benefits of larger replica:

  • Handles heavy read workload easily
  • No performance degradation
  • Good for analytical queries

Risk of smaller replica:

  • Replica can’t keep up with primary
  • Replication lag increases
  • Eventually replica falls too far behind
  • Bad for disaster recovery!

Rule of thumb: For DR purposes, the replica should be at least the same instance class as the primary.

4. High Availability vs Disaster Recovery

| Aspect           | High Availability (HA)                 | Disaster Recovery (DR)                     |
|------------------|----------------------------------------|--------------------------------------------|
| Purpose          | Protect against infrastructure failures | Protect against catastrophic events       |
| Technology       | Multi-AZ (sync replication)            | Snapshots + Read Replicas (async)          |
| RPO              | 0 (no data loss)                       | Minutes to hours (depends on backup)       |
| RTO              | 1-2 min (<35 sec for cluster)          | Hours (snapshot restore)                   |
| Scope            | Same region, different AZs             | Cross-region                               |
| Replication      | Synchronous                            | Asynchronous                               |
| Instance Class   | Same as primary                        | Can differ                                 |
| Cost             | 2x instance cost                       | 2x + storage + data transfer               |
| Failover         | Automatic                              | Manual (promote replica)                   |
| Protects Against | AZ failure, hardware issues            | Region failure, data corruption, accidents |

5. Conclusion

High Availability and Disaster Recovery are not optional for production databases: they’re essential insurance against inevitable failures.

5.1. Key Takeaways

High Availability:

  • Use Multi-AZ for all production (RTO: 1-2 min, RPO: 0) ✅
  • Use Multi-AZ Cluster for mission-critical (RTO: <35 sec) 🚀
  • Never use Single-AZ for production ❌

Disaster Recovery:

  • Enable automated backups (35-day retention) ✅
  • Take manual snapshots before major changes ✅
  • Use cross-region read replica for critical systems ✅
  • Test DR quarterly (restore, promote, measure) ✅

Application Design:

  • Retry logic for connections ✅
  • Idempotent transactions ✅
  • Health checks and monitoring ✅

Costs:

  • Multi-AZ: 2x Single-AZ cost ($300/month for typical setup)
  • ROI: Pays for itself preventing <1 hour downtime/month

Critical Metrics:

  • Monitor: DatabaseConnections, ReplicaLag, FreeStorageSpace
  • Alert: Lag >60s, Storage <10 GB, Connections >80%
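
The alert thresholds above can be expressed as a small check. This is a sketch that assumes the metric values have already been fetched (for example from CloudWatch, which reports `FreeStorageSpace` in bytes); `evaluate_alerts` and the sample values are hypothetical.

```python
GIB = 1024 ** 3


def evaluate_alerts(replica_lag_s, free_storage_bytes, connections, max_connections):
    """Return the names of the metrics that breach the thresholds listed above."""
    alerts = []
    if replica_lag_s > 60:                      # Lag > 60s
        alerts.append("ReplicaLag")
    if free_storage_bytes < 10 * GIB:           # Storage < 10 GB
        alerts.append("FreeStorageSpace")
    if connections > 0.8 * max_connections:     # Connections > 80% of the limit
        alerts.append("DatabaseConnections")
    return alerts


# Hypothetical sample: lagging replica, 8 GiB free, 50 of 100 connections in use.
firing = evaluate_alerts(replica_lag_s=75, free_storage_bytes=8 * GIB,
                         connections=50, max_connections=100)
print(firing)
```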