RDS PostgreSQL: High Availability and Disaster Recovery
Guide to RDS availability: Multi-AZ configurations, failover mechanisms, snapshots, read replicas, and disaster recovery strategies for mission-critical databases.
1. Why High Availability Matters
Imagine waking up at 3 AM to alerts that your production database is down. Your e-commerce site is offline. Every minute costs thousands in lost revenue. Customers are frustrated. Your team is panicking.
This is why High Availability (HA) is not optional for production systems.
2. High Availability Options in RDS
RDS offers three deployment models, each with different availability guarantees.
2.1. Single-AZ (No Standby) ❌
┌───────────────────┐
│   Availability    │
│      Zone A       │
│                   │
│   ┌───────────┐   │
│   │  Primary  │   │
│   │ Instance  │   │
│   └───────────┘   │
│                   │
└───────────────────┘
No redundancy!
What happens if the primary fails:
- RDS detects the failure
- RDS provisions new EC2 instance
- RDS attaches EBS volumes
- PostgreSQL initializes
- Database becomes available
⏰ Downtime: Typically 10-30 minutes (sometimes longer)
When to use:
- Dev/test environments ✅
- Cost-sensitive non-critical workloads ✅
- Databases that can tolerate prolonged downtime ✅
⚠️ Single-AZ Risks
- Hardware failure → 10-30 min downtime
- AZ failure → Potentially hours of downtime
- Maintenance windows → Downtime during upgrades
- No automatic failover
Conclusion: Single-AZ is NOT production-ready.
2.2. Multi-AZ with One Standby (Synchronous) ✅
┌─────────────────┐          ┌─────────────────┐
│  Availability   │          │  Availability   │
│     Zone A      │          │     Zone B      │
│                 │          │                 │
│  ┌───────────┐  │   Sync   │  ┌───────────┐  │
│  │  Primary  │──┼─────────►│  │  Standby  │  │
│  │ Instance  │  │  Replic  │  │ Instance  │  │
│  └─────┬─────┘  │          │  └───────────┘  │
│        │        │          │                 │
└────────┼────────┘          └─────────────────┘
         │
         │ Apps connect here
         ▼
    DNS Endpoint
    mydb.abc.rds.amazonaws.com
This is the standard production setup for most workloads.
Synchronous Replication Flow:
- App sends: INSERT INTO users…
- Primary receives the write
- Primary sends to standby: “I’m about to commit this”
- Standby acknowledges: “Received, persisted”
- Primary commits
- Primary responds to app: “Success!”
The write is only confirmed after the standby acknowledges it.
Benefits:
- Zero data loss (RPO = 0)
- Fast failover (RTO = 1-2 min)
- Automatic (no manual intervention)
- Same endpoint (DNS change, no app changes)
2.2.1. Failover Scenario 1: Primary instance fails
What happens:
- Primary crashes
- RDS health check detects failure
- RDS initiates failover
- DNS points to standby
- Standby promoted to primary
- New connections accepted ✅
⚠️ Total downtime: ~1-2 minutes
Your application sees:
- Existing connections: Dropped (need to reconnect)
- New connections: Brief rejection, then success
- Data loss: ZERO (everything was synchronized)
RDS automatically:
- Promotes standby to primary
- Updates DNS (no IP change for the app)
- Begins rebuilding new standby in background
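Because existing connections are dropped during failover, clients need reconnect logic. Below is a minimal sketch of exponential-backoff retry; `fake_connect` is a stand-in for your driver's connect call (e.g. psycopg2's `connect`), used here only to simulate the brief rejection window while DNS flips to the standby:

```python
import time

def connect_with_retry(connect, retries=5, base_delay=1.0):
    """Retry a database connection with exponential backoff.

    During a Multi-AZ failover (~1-2 min), new connections are briefly
    rejected while DNS points to the standby; retrying bridges the gap.
    """
    for attempt in range(retries):
        try:
            return connect()
        except ConnectionError:
            if attempt == retries - 1:
                raise
            # 1s, 2s, 4s, 8s... capped at 30s
            time.sleep(min(base_delay * 2 ** attempt, 30))

# Simulated driver: fails twice (failover in progress), then succeeds.
attempts = {"n": 0}
def fake_connect():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ConnectionError("could not connect to server")
    return "connection-ok"

print(connect_with_retry(fake_connect, base_delay=0.01))  # connection-ok
```

The same pattern applies to query execution: catch the connection error, reconnect through the unchanged DNS endpoint, and retry the (idempotent) operation.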
2.2.2. Scenario 2: Standby instance fails
What happens:
- Impact: NONE on your application
RDS actions:
- Detects standby failure
- Primary continues serving traffic normally
- RDS provisions new standby in background
- Synchronization resumes automatically
2.2.3. Scenario 3: Entire AZ fails
What happens:
- If Primary’s AZ fails:
- Standby in different AZ takes over
- Failover time: ~1-2 minutes
- Zero data loss
- If Standby’s AZ fails:
- Primary not affected
- RDS rebuilds standby in healthy AZ
- Zero impact on application
What causes automatic failover?
RDS automatically fails over for:
- Infrastructure failures:
- Primary instance hardware failure
- Underlying storage failure
- AZ-level outage
- Network connectivity loss between AZs
- Maintenance operations:
- OS patching (applied to standby first, then failover)
- Database engine upgrades (minimizes downtime)
Does NOT cause failover ❌:
- Long-running queries (PostgreSQL issue, not infrastructure)
- Deadlocks (application/database logic issue)
- Out of connections (configuration issue)
- Full disk (need to increase storage)
For database-level issues: RDS restarts, doesn’t fail over.
🎯 Operational Advantages
- Backups run on standby
- Maintenance applied to standby first
- Standby receives patch/upgrade
- Failover happens (1-2 min downtime)
- Old primary becomes new standby
- New standby receives patch
- Result: Minimal downtime for maintenance
- SLA guarantee
- 99.95% monthly uptime SLA
- Translates to ~22 minutes max downtime per month
- Same endpoint
- DNS: mydb.abc.rds.amazonaws.com
- No application changes after failover
- Connection string remains the same
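The SLA figures above translate directly into a monthly downtime budget. A quick check of the arithmetic (assuming a 30-day month):

```python
def monthly_downtime_budget_minutes(sla_percent, days=30):
    """Max allowed downtime per month for a given uptime SLA."""
    minutes_in_month = days * 24 * 60          # 43,200 for a 30-day month
    return minutes_in_month * (1 - sla_percent / 100)

print(monthly_downtime_budget_minutes(99.95))  # ~21.6 min (Multi-AZ)
print(monthly_downtime_budget_minutes(99.99))  # ~4.3 min (Multi-AZ Cluster)
```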
2.3. Multi-AZ DB Cluster (Semi-Synchronous) 🏆
┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│     AZ A     │     │     AZ B     │     │     AZ C     │
│              │     │              │     │              │
│ ┌──────────┐ │     │ ┌──────────┐ │     │ ┌──────────┐ │
│ │ Primary  │ │     │ │ Readable │ │     │ │ Readable │ │
│ │ Instance │─┼────►│ │ Standby  │ │     │ │ Standby  │ │
│ │          │ │     │ │    #1    │ │     │ │    #2    │ │
│ └────┬─────┘ │     │ └────┬─────┘ │     │ └────┬─────┘ │
│      │ NVMe  │     │      │ NVMe  │     │      │ NVMe  │
│      │ SSD   │     │      │ SSD   │     │      │ SSD   │
└──────┼───────┘     └──────┼───────┘     └──────┼───────┘
       │                    │                    │
       ▼                    ▼                    ▼
 Write Endpoint      Reader Endpoint      Reader Endpoint
This is the premium option for ultra-low RTO and integrated read scalability.
Semi-Synchronous Replication:
- App sends: INSERT INTO orders…
- Primary receives the write
- Primary sends to BOTH standbys
- Primary waits for ANY ONE standby to acknowledge
- Primary commits (doesn’t wait for both)
- Primary responds to app: “Success!”
- Only one standby needs to confirm (faster than full sync)
Key differences from Multi-AZ with One Standby:
- 2 standbys instead of 1 (three AZs total)
- Standbys are readable (can serve SELECT queries)
- Faster failover (<35 seconds vs 1-2 minutes)
- Local NVMe SSDs (better I/O performance)
- Reader endpoint (automatic load balancing)
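In application code, the reader endpoint is typically used by routing read-only statements to it and everything else to the writer. A minimal sketch; the endpoint hostnames below are hypothetical placeholders, not real cluster endpoints:

```python
# Hypothetical endpoints for illustration only.
WRITER_ENDPOINT = "mydb.cluster-abc.us-east-1.rds.amazonaws.com"
READER_ENDPOINT = "mydb.cluster-ro-abc.us-east-1.rds.amazonaws.com"

def pick_endpoint(sql: str) -> str:
    """Send SELECTs to the readable standbys, everything else to the writer.

    Note: reads that must see their own just-committed writes should still
    go to the writer, since standby replication is semi-synchronous and a
    given standby may not have applied the change yet.
    """
    first_word = sql.lstrip().split(None, 1)[0].upper()
    return READER_ENDPOINT if first_word == "SELECT" else WRITER_ENDPOINT

print(pick_endpoint("SELECT * FROM orders"))   # reader endpoint
print(pick_endpoint("UPDATE orders SET ..."))  # writer endpoint
```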
🎯 Operational Advantages
- Ultra-Fast Failover (<35 seconds)
- Integrated Read Scalability
- Automatic Reader Endpoint
- Better Performance
When to Use Multi-AZ Cluster:
- RTO < 35 seconds required
- Mission-critical applications
- Financial services, healthcare
- E-commerce during seasonal peaks
- Heavy read workload
| Feature | Single-AZ | Multi-AZ (1 Standby) | Multi-AZ Cluster |
|---|---|---|---|
| Availability Zones | 1 | 2 | 3 |
| Replication | None | Synchronous | Semi-synchronous |
| RTO | 10-30 min | 1-2 min | <35 sec |
| RPO | Minutes to hours | 0 (no data loss) | 0 (no data loss) |
| Readable Standbys | N/A | ❌ No | ✅ Yes (2) |
| Automatic Failover | ❌ No | ✅ Yes | ✅ Yes |
| Storage Type | EBS | EBS | Local NVMe SSD |
| SLA | None | 99.95% | 99.99% |
| Cost | $ | $$ (2x Single-AZ) | $$$$ (4x Single-AZ) |
| Best For | Dev/test | Production (standard) | Mission-critical |
3. Disaster Recovery Strategies
High Availability protects against infrastructure failures. Disaster Recovery protects against catastrophic events: region-wide outages, accidental deletions, data corruption, ransomware.
3.1. RDS Snapshots
Snapshots are point-in-time backups of your entire database.
Automated snapshots
Day 1:
├─ 00:00 - Full snapshot taken
└─ During the day: Transaction logs captured
Day 2:
├─ 00:00 - Incremental snapshot (only changes)
└─ During the day: Transaction logs captured
Day 3:
├─ 00:00 - Incremental snapshot
└─ And so on...
Features:
- Automatic daily backups
- Incremental (only changed blocks)
- Includes transaction logs (for point-in-time restore)
- Default retention: 7 days (configurable up to 35 days)
- Backup window: Specify preferred time (low-traffic period)
Point-in-Time Restore (PITR). Scenario: an accidental DELETE at 14:47. You can restore to:
- 14:46 (before the DELETE)
- 14:30 (30 min before)
- 10:00 (this morning)
- Any 5-minute increment within retention period
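Under the hood, a PITR is a single API call. A sketch using boto3's `restore_db_instance_to_point_in_time`; the instance identifiers and timestamp are hypothetical, and the actual API invocation is left commented out since it requires AWS credentials:

```python
from datetime import datetime, timezone

def pitr_request(source_id: str, target_id: str, restore_time: datetime) -> dict:
    """Build the parameters for an RDS point-in-time restore."""
    return {
        "SourceDBInstanceIdentifier": source_id,
        "TargetDBInstanceIdentifier": target_id,  # a NEW instance is created
        "RestoreTime": restore_time,
        # Alternative: UseLatestRestorableTime=True instead of RestoreTime
    }

# Hypothetical: restore to just before the accidental DELETE at 14:47 UTC.
params = pitr_request(
    "prod-db", "prod-db-restored",
    datetime(2024, 5, 14, 14, 46, tzinfo=timezone.utc),
)
# import boto3
# boto3.client("rds").restore_db_instance_to_point_in_time(**params)
print(params["TargetDBInstanceIdentifier"])  # prod-db-restored
```

Note the restore never overwrites the source: it always produces a new instance, which you then point your application at.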
Cost:
- Backup storage up to the size of your provisioned storage is included in RDS pricing
- Additional backup storage is billed (e.g. ~$0.095/GB/month in us-east-1)
- Example: 100 GB of backup storage beyond the free allocation ≈ $9.50/month
Manual snapshots
Automated Snapshots:
├─ Happen automatically daily
├─ Deleted after retention period
├─ Support point-in-time restore
└─ Tied to instance
Manual Snapshots:
├─ You trigger them
├─ NEVER automatically deleted
├─ NO point-in-time restore (only snapshot moment)
└─ Independent of instance (persist after deletion)
When to use manual snapshots:
- Before major changes
- Compliance/audit requirements
- Before deleting instance
Cost:
- Pay only for storage
- $0.095/GB/month in us-east-1
- 100 GB snapshot = $9.50/month
🎯 Production Snapshot Strategy
- Configure automated backups
- Take manual snapshots before changes:
- Schema migrations
- Major version upgrades
- Configuration changes
- Before deleting instance
- Copy snapshots cross-region
- Test restores regularly
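Pre-change snapshots are easy to script so they actually happen. A sketch with boto3's `create_db_snapshot`; the identifiers are hypothetical and the API call is commented out:

```python
from datetime import datetime, timezone

def pre_change_snapshot_params(instance_id: str, change: str) -> dict:
    """Build params for a manual snapshot named after the pending change.

    Manual snapshots are never auto-deleted, so a descriptive,
    timestamped name matters for later housekeeping.
    """
    stamp = datetime.now(timezone.utc).strftime("%Y%m%d-%H%M")
    return {
        "DBInstanceIdentifier": instance_id,
        "DBSnapshotIdentifier": f"{instance_id}-pre-{change}-{stamp}",
    }

params = pre_change_snapshot_params("prod-db", "schema-migration")
# import boto3
# boto3.client("rds").create_db_snapshot(**params)
print(params["DBSnapshotIdentifier"])
```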
3.1.1. Snapshot Restore Process
Timeline:
- Initiate restore (API call): 1 minute
- RDS provisions new instance: 5-10 minutes
- Restore data from snapshot: 10-60 minutes (depends on size)
- Instance becomes available: Total 15-70 minutes
Factors affecting time:
- Database size (larger = longer)
- Instance class (larger = faster restore)
- Region load (peak times may be slower)
After the restore, you get a new instance:
- Different endpoint (need to update connection strings)
- Same data from snapshot moment
- Same configuration (instance class, parameters)
- Original instance still running (you choose which to keep)
3.2. Read Replicas for DR
Read replicas serve TWO purposes:
- Read scalability (offload SELECT queries)
- Disaster recovery (can be promoted to standalone)
┌─────────────────┐           ┌─────────────────┐
│    us-east-1    │           │    us-west-2    │
│                 │           │                 │
│  ┌───────────┐  │   Async   │  ┌───────────┐  │
│  │  Primary  │──┼──────────►│  │   Read    │  │
│  │ (Source)  │  │   Replic  │  │  Replica  │  │
│  └─────┬─────┘  │           │  └───────────┘  │
│        │        │           │                 │
└────────┼────────┘           └─────────────────┘
         │
   Write traffic
Asynchronous Replication:
- App writes to primary
- Primary commits immediately
- Primary sends change to replica
- Replica applies change (eventually)
- Replica may lag behind primary (seconds to minutes) ⚠️
Key characteristics:
- Async replication (no impact on write latency)
- Can be in same region or cross-region
- Can have different instance class from primary
- Can be promoted to standalone instance
- Replication lag possible (monitor closely)
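Lag can be measured on the replica itself: PostgreSQL's `pg_last_xact_replay_timestamp()` returns the commit time of the last replayed transaction, so lag is simply "now" minus that. A sketch; the SQL string is what you would run against the replica, while the Python helper (with hypothetical timestamps) shows the computation itself:

```python
from datetime import datetime, timezone, timedelta

# Run against the REPLICA; returns the lag in seconds.
LAG_SQL = "SELECT EXTRACT(EPOCH FROM now() - pg_last_xact_replay_timestamp())"

def lag_seconds(now: datetime, last_replayed: datetime) -> float:
    """Seconds the replica is behind, given the last replayed commit time."""
    return (now - last_replayed).total_seconds()

# Hypothetical values for illustration.
now = datetime(2024, 5, 14, 12, 0, 5, tzinfo=timezone.utc)
replayed = now - timedelta(seconds=3)
print(lag_seconds(now, replayed))  # 3.0 -> fine; alert if this exceeds ~60s
```

RDS also publishes this as the CloudWatch `ReplicaLag` metric, which is usually the easier thing to alarm on.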
3.2.1. In-Region vs Cross-Region Read Replicas
| Characteristic | In-Region | Cross-Region |
|---|---|---|
| Topology | Primary (us-east-1a) → Replica (us-east-1b) | Primary (us-east-1) → Replica (eu-west-1) |
| Replication Lag | <1 second (typically) | 1-5 seconds (depends on distance) |
| Cost | ~2x instance cost | 2x instance + data transfer ($0.02/GB) |
| Data Transfer | ✅ No charge | ❌ Charged ($0.02/GB out) |
| Latency | Very low (~ms) | Variable (10-200ms depending on distance) |
| Primary Use Case | Read scalability | Disaster recovery + global reads |
| Best For | Offload reads from primary | Cross-region DR, compliance, global users |
When to use cross-region:
- Disaster recovery (protection against region-wide failure)
- Compliance (data residency requirements)
- Global application (serve users from nearest region)
- Lower read latency for geographically distributed users
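In a region-wide outage, DR means promoting the cross-region replica to a standalone primary, which is a single API call (`promote_read_replica` in boto3). A sketch; the identifier is hypothetical and the call is commented out. Note that promotion is one-way: replication from the old primary stops permanently.

```python
def promote_params(replica_id: str, retention_days: int = 7) -> dict:
    """Params to promote a read replica to a standalone instance.

    Enabling backups at promotion time matters: the promoted instance
    becomes your new primary and needs its own automated snapshots.
    """
    return {
        "DBInstanceIdentifier": replica_id,
        "BackupRetentionPeriod": retention_days,
    }

params = promote_params("prod-db-replica-usw2")
# import boto3
# boto3.client("rds").promote_read_replica(**params)
print(params)
```

After promotion you still have to repoint application connection strings at the promoted instance, which is why DR failover is described as manual later in this guide.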
3.2.2. Read Replica Sizing
Replica can be different size from primary
Benefits of larger replica:
- Handles heavy read workload easily
- No performance degradation
- Good for analytical queries
Risk of smaller replica:
- Replica can’t keep up with primary
- Replication lag increases
- Eventually replica falls too far behind
- Bad for disaster recovery!
Rule of thumb: for DR purposes, the replica should be ≥ the same instance class as the primary.
4. High Availability vs Disaster Recovery
| Aspect | High Availability (HA) | Disaster Recovery (DR) |
|---|---|---|
| Purpose | Protect against infrastructure failures | Protect against catastrophic events |
| Technology | Multi-AZ (sync replication) | Snapshots + Read Replicas (async) |
| RPO | 0 (no data loss) | Minutes to hours (depends on backup) |
| RTO | 1-2 min (35 sec for cluster) | Hours (snapshot restore) |
| Scope | Same region, different AZs | Cross-region |
| Replication | Synchronous | Asynchronous |
| Instance Class | Same as primary | Can differ |
| Cost | 2x instance cost | 2x + storage + data transfer |
| Failover | Automatic | Manual (promote replica) |
| Protects Against | AZ failure, hardware issues | Region failure, data corruption, accidents |
5. Conclusion
High Availability and Disaster Recovery are not optional for production databases: they're essential insurance against inevitable failures.
5.1. Key Takeaways
High Availability:
- Use Multi-AZ for all production (RTO: 1-2 min, RPO: 0) ✅
- Use Multi-AZ Cluster for mission-critical (RTO: <35 sec) 🏆
- Never use Single-AZ for production ❌
Disaster Recovery:
- Enable automated backups (35-day retention) ✅
- Take manual snapshots before major changes ✅
- Use cross-region read replica for critical systems ✅
- Test DR quarterly (restore, promote, measure) ✅
Application Design:
- Retry logic for connections ✅
- Idempotent transactions ✅
- Health checks and monitoring ✅
Costs:
- Multi-AZ:
2x Single-AZ cost ($300/month for typical setup) - ROI: Pays for itself preventing <1 hour downtime/month
Critical Metrics:
- Monitor: DatabaseConnections, ReplicaLag, FreeStorageSpace
- Alert: Lag >60s, Storage <10 GB, Connections >80%
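These alert thresholds can be expressed as a small check fed with CloudWatch metric values. The threshold numbers are the ones stated above; the metric names follow CloudWatch's RDS metrics:

```python
def check_alerts(replica_lag_s: float, free_storage_gb: float,
                 connections: int, max_connections: int) -> list[str]:
    """Return which of this guide's alert conditions are breached."""
    alerts = []
    if replica_lag_s > 60:
        alerts.append("ReplicaLag > 60s")
    if free_storage_gb < 10:
        alerts.append("FreeStorageSpace < 10 GB")
    if connections > 0.8 * max_connections:
        alerts.append("DatabaseConnections > 80%")
    return alerts

print(check_alerts(replica_lag_s=75, free_storage_gb=42,
                   connections=95, max_connections=100))
# ['ReplicaLag > 60s', 'DatabaseConnections > 80%']
```

In practice you would configure these as CloudWatch alarms rather than polling in application code, but the logic is the same.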