Cross-Region RDS Disaster Recovery: Production Failover Architecture


1. Overview

This post documents how I designed and implemented cross-region disaster recovery for a production MySQL database running on Amazon RDS.

The requirement was straightforward:

If the primary region (ap-south-1) becomes unavailable, the database must recover automatically from another region with minimal downtime and no manual intervention.

The approach relies entirely on native AWS services and focuses on controlled automation rather than introducing additional orchestration layers.


2. Problem Statement

The application was running with:

  • A single primary RDS instance in ap-south-1

  • No cross-region replica

  • A manual disaster recovery procedure

In the event of a regional failure:

  • The database endpoint would become unreachable

  • The application would lose connectivity

  • Recovery would depend on human response time

This created an unacceptable recovery model for a production-facing workload.

The goal was to introduce regional resilience while keeping the system operationally predictable and architecturally simple.


3. Architecture After DR Implementation

Data path:

Application
→ Primary RDS (ap-south-1)
→ Cross-Region Read Replica (us-east-1)

Automation Flow:

CloudWatch Alarm
→ SNS Topic
→ Lambda (us-east-1)
→ Promote Read Replica

After promotion, the replica becomes a standalone primary database.


4. Design Decisions

4.1 Cross-Region Read Replica

A read replica was created in us-east-1 from the primary RDS instance.

Replication is managed natively by RDS.

Important consideration:

Cross-region replication is asynchronous. This introduces a non-zero Recovery Point Objective (RPO). A small amount of recent data may not be replicated during failover.

For this workload, that trade-off was acceptable.
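
For reference, creating the replica is a single API call issued in the destination region, with the source referenced by its full ARN. A minimal boto3 sketch; the instance identifiers, account ID, and instance class are placeholders:

import boto3

# Issued in the destination region; the source is referenced by its full ARN.
rds = boto3.client('rds', region_name='us-east-1')

rds.create_db_instance_read_replica(
    DBInstanceIdentifier='prod-mysql-replica',
    SourceDBInstanceIdentifier='arn:aws:rds:ap-south-1:123456789012:db:prod-mysql-primary',
    DBInstanceClass='db.r6g.large',   # size for full production load
    SourceRegion='ap-south-1'         # lets boto3 pre-sign the cross-region request
)

Note that the replica should be sized to carry full production traffic, since it becomes the primary after promotion.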


4.2 Failure Detection

Failure detection was implemented using Amazon CloudWatch.

A CloudWatch alarm was configured on the primary instance. The initial metric used was:

DatabaseConnections < 1

If the instance stops accepting connections, the alarm transitions to the ALARM state.

On its own, this metric is a coarse signal: connection counts can also drop to zero during legitimately idle periods. In more mature setups, detection can be improved using:

  • RDS instance status

  • RDS event notifications

  • Composite alarms

The alarm publishes its state change to an SNS topic.
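
As a sketch, the alarm described above can be defined as follows. The alarm name, instance identifier, thresholds, and topic ARN are illustrative; the alarm lives in the primary's region, since alarm actions target same-region SNS topics:

import boto3

# The alarm lives in the primary's region; its action targets a same-region SNS topic.
cloudwatch = boto3.client('cloudwatch', region_name='ap-south-1')

cloudwatch.put_metric_alarm(
    AlarmName='rds-primary-unreachable',
    Namespace='AWS/RDS',
    MetricName='DatabaseConnections',
    Dimensions=[{'Name': 'DBInstanceIdentifier', 'Value': 'prod-mysql-primary'}],
    Statistic='Average',
    Period=60,                    # one-minute datapoints
    EvaluationPeriods=3,          # three consecutive breaches before ALARM
    Threshold=1,
    ComparisonOperator='LessThanThreshold',
    TreatMissingData='breaching', # a dead instance stops emitting metrics entirely
    AlarmActions=['arn:aws:sns:ap-south-1:123456789012:rds-dr-failover']
)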


4.3 Event Routing

Amazon Simple Notification Service (SNS) was introduced to decouple monitoring from execution.

The flow:

CloudWatch → SNS → Lambda

This separation provides:

  • Clear event-driven architecture

  • Loose coupling

  • Extensibility for future automation
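
Wiring the topic to the promotion function takes two calls: a subscription and an invoke permission. A sketch with placeholder ARNs (this relies on SNS cross-region delivery to Lambda, since the topic and function sit in different regions):

import boto3

sns = boto3.client('sns', region_name='ap-south-1')
lambda_client = boto3.client('lambda', region_name='us-east-1')

topic_arn = 'arn:aws:sns:ap-south-1:123456789012:rds-dr-failover'
function_arn = 'arn:aws:lambda:us-east-1:123456789012:function:promote-rds-replica'

# Subscribe the promotion function to the topic.
sns.subscribe(TopicArn=topic_arn, Protocol='lambda', Endpoint=function_arn)

# Grant SNS permission to invoke the function.
lambda_client.add_permission(
    FunctionName=function_arn,
    StatementId='AllowSNSInvoke',
    Action='lambda:InvokeFunction',
    Principal='sns.amazonaws.com',
    SourceArn=topic_arn
)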


4.4 Automated Promotion

Promotion logic was implemented using AWS Lambda in us-east-1.

The Lambda role was granted minimal permission:

rds:PromoteReadReplica
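
Scoped to the single replica, that grant can be expressed as an inline role policy. A sketch; the role name, account ID, and resource ARN are placeholders:

import json
import boto3

iam = boto3.client('iam')

iam.put_role_policy(
    RoleName='promote-rds-replica-role',
    PolicyName='AllowReplicaPromotion',
    PolicyDocument=json.dumps({
        'Version': '2012-10-17',
        'Statement': [{
            'Effect': 'Allow',
            'Action': 'rds:PromoteReadReplica',
            # Scoped to the single replica rather than all instances
            'Resource': 'arn:aws:rds:us-east-1:123456789012:db:your-read-replica-id'
        }]
    })
)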

Core logic:

import boto3

# The replica identifier is a placeholder; replace with the real instance ID.
REPLICA_REGION = 'us-east-1'
REPLICA_ID = 'your-read-replica-id'

def lambda_handler(event, context):
    # The client must target the region where the replica lives.
    client = boto3.client('rds', region_name=REPLICA_REGION)

    # PromoteReadReplica is asynchronous: this call only starts the
    # promotion, which RDS completes in the background.
    client.promote_read_replica(
        DBInstanceIdentifier=REPLICA_ID
    )

    return {
        "statusCode": 200,
        "message": f"Promotion initiated for {REPLICA_ID}"
    }

When triggered, Lambda initiates promotion of the replica. Promotion typically completes within a few minutes.
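
Since the API call only initiates promotion, completion can be confirmed by polling. A sketch using a boto3 waiter, better suited to a runbook script than to the Lambda itself (which would otherwise need a long timeout); the identifier is the same placeholder as above:

import boto3

rds = boto3.client('rds', region_name='us-east-1')

# Poll every 30 seconds, up to 10 minutes, until the instance is available again.
waiter = rds.get_waiter('db_instance_available')
waiter.wait(
    DBInstanceIdentifier='your-read-replica-id',
    WaiterConfig={'Delay': 30, 'MaxAttempts': 20}
)
print('Promotion complete; instance is a standalone primary.')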


5. DNS Consideration

Promotion alone does not restore application traffic.

The application must connect to the new primary endpoint.

This is typically handled using Amazon Route 53 with:

  • Failover routing policies

  • Health checks

  • Low TTL configuration

Without DNS integration, traffic would continue to target the failed endpoint.

This is a critical part of the disaster recovery strategy.
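
As an illustration of that Route 53 setup, a failover record pair can be created like this. The hosted zone ID, record name, endpoints, and health check ID are all placeholders:

import boto3

route53 = boto3.client('route53')

def failover_record(set_id, role, endpoint, health_check_id=None):
    # Build one half of the PRIMARY/SECONDARY failover pair.
    record = {
        'Name': 'db.example.internal.',
        'Type': 'CNAME',
        'TTL': 60,               # low TTL so clients re-resolve quickly
        'SetIdentifier': set_id,
        'Failover': role,        # 'PRIMARY' or 'SECONDARY'
        'ResourceRecords': [{'Value': endpoint}]
    }
    if health_check_id:
        record['HealthCheckId'] = health_check_id
    return {'Action': 'UPSERT', 'ResourceRecordSet': record}

route53.change_resource_record_sets(
    HostedZoneId='Z0000000000EXAMPLE',
    ChangeBatch={'Changes': [
        failover_record('primary', 'PRIMARY',
                        'prod-mysql-primary.example.ap-south-1.rds.amazonaws.com',
                        health_check_id='11111111-2222-3333-4444-555555555555'),
        failover_record('replica', 'SECONDARY',
                        'prod-mysql-replica.example.us-east-1.rds.amazonaws.com')
    ]}
)

Because RDS instances typically sit in private subnets, the primary's health check is often backed by a CloudWatch alarm rather than a direct TCP probe from Route 53.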


6. Validation Process

To validate the setup:

  • The primary instance was stopped

  • CloudWatch alarm triggered

  • SNS published the event

  • Lambda executed

  • The read replica was promoted

Post-promotion validation included:

  • Verifying instance state

  • Testing database connectivity

  • Executing queries

  • Confirming application behavior

Failover occurred without manual intervention.
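
A minimal sketch of the post-promotion instance checks above (the identifier is a placeholder):

import boto3

rds = boto3.client('rds', region_name='us-east-1')

instance = rds.describe_db_instances(
    DBInstanceIdentifier='your-read-replica-id'
)['DBInstances'][0]

# A promoted instance reports 'available' and no longer lists a replication source.
assert instance['DBInstanceStatus'] == 'available'
assert 'ReadReplicaSourceDBInstanceIdentifier' not in instance

print('Standalone primary endpoint:', instance['Endpoint']['Address'])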


7. Cost and Trade-offs

Cross-region disaster recovery increases infrastructure cost:

  • An additional RDS instance

  • Cross-region replication traffic

  • Additional storage

This cost is deliberate and justified by the reduction in downtime risk.

RTO: A few minutes (detection + promotion time)
RPO: A few seconds (due to asynchronous replication)

These expectations must be clearly defined with stakeholders before implementation.


8. When This Approach Makes Sense

This architecture is suitable when:

  • A few seconds of data loss is acceptable

  • Automated regional failover is required

  • Operational simplicity is preferred

  • The workload is not using Aurora Global Database

For workloads requiring near-zero RPO and faster failover, managed global database architectures may be more appropriate.


9. Final Takeaway

I implemented a cross-region disaster recovery architecture for Amazon RDS using native AWS services.

The system:

  • Detects primary failure automatically

  • Triggers event-driven automation

  • Promotes a cross-region replica

  • Reduces human dependency during outages

The design balances simplicity, cost, and resilience. It introduces regional fault tolerance without adding unnecessary operational complexity.
