Designing Multi-Region Active-Active on AWS
Last year we were tasked with re-architecting an e-commerce platform that had outgrown its single-region deployment in us-east-1. The business had expanded into Europe and Asia-Pacific, and customers in those markets were experiencing 400-600ms page loads. Worse, a 2023 outage in us-east-1 had taken the entire platform offline for four hours, costing an estimated $2.3M in lost revenue. The mandate was clear: build a globally distributed, active-active architecture that could survive a full regional failure with zero customer-visible impact.
Why Active-Active Is Harder Than It Sounds
Most teams conflate active-active with simple geographic load balancing. The reality is far more nuanced. In an active-passive setup, your secondary region is a warm standby that receives replicated data but serves no traffic. Failover is a single DNS cut. Active-active means both regions are serving production traffic simultaneously, which introduces a fundamental challenge: distributed writes against shared state.
Consider a user in Frankfurt adding items to their cart while the same account is accessed from a mobile device in Singapore. Both regions need to reflect the same cart state within seconds. This is the kind of consistency problem that single-region architectures never have to solve, and it touches every layer of the stack.
The Database Layer: Aurora Global Database
We chose Aurora Global Database as the foundation. Aurora replicates data from a primary region to up to five secondary regions with typical lag under one second. Reads in secondary regions are served locally, which gave us the latency improvement we needed. However, all writes still flow to the primary region.
To handle writes from non-primary regions, we implemented a write-forwarding pattern. Application instances in eu-west-1 and ap-southeast-1 detect write operations and route them to the Aurora primary in us-east-1 through a dedicated internal API gateway. This adds 60-80ms to write latency for remote users, but for our use case (cart updates, order placement), that was acceptable.
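The routing decision at the heart of this pattern can be sketched in a few lines. This is an illustrative model only; the endpoint hostnames, the `ENDPOINTS` table, and the `endpoint_for` helper are assumptions, not our actual implementation (which routes through an internal API gateway).

```python
# Sketch of the write-forwarding pattern: reads hit the local Aurora
# reader endpoint, writes always flow to the primary region's writer.
# All hostnames below are hypothetical placeholders.
PRIMARY_REGION = "us-east-1"

ENDPOINTS = {
    "us-east-1": {
        "reader": "aurora-us.cluster-ro.example.internal",
        "writer": "aurora-us.cluster.example.internal",
    },
    "eu-west-1": {"reader": "aurora-eu.cluster-ro.example.internal", "writer": None},
    "ap-southeast-1": {"reader": "aurora-ap.cluster-ro.example.internal", "writer": None},
}

def endpoint_for(operation: str, region: str) -> tuple[str, str]:
    """Return (region, endpoint) that a query should be sent to."""
    if operation == "read":
        # Secondary regions serve reads locally with sub-second lag.
        return region, ENDPOINTS[region]["reader"]
    # Writes are forwarded to the primary, in practice via a dedicated
    # internal API gateway rather than a direct database connection.
    return PRIMARY_REGION, ENDPOINTS[PRIMARY_REGION]["writer"]
```

The 60-80ms write penalty for remote users falls out of this shape: every write from eu-west-1 or ap-southeast-1 pays one cross-region round trip.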
For data that needed true multi-master semantics, such as session state and user preferences, we used DynamoDB Global Tables. DynamoDB provides last-writer-wins conflict resolution with sub-second replication, which works well for data where conflicts are rare and the latest value is always correct.
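Last-writer-wins is simple enough to model directly, which is also a good way to see its failure mode: concurrent writes are silently resolved by timestamp, so it only suits data where the latest value is always correct. A minimal sketch (the `Item` type and tie-breaking rule are illustrative, not DynamoDB's internal representation):

```python
from dataclasses import dataclass

@dataclass
class Item:
    value: str
    timestamp: float  # write time used for last-writer-wins comparison

def resolve_lww(local: Item, replicated: Item) -> Item:
    """Last-writer-wins conflict resolution, as in DynamoDB Global
    Tables' default behavior: the most recent write survives, and the
    losing write is discarded without any application-level merge."""
    return replicated if replicated.timestamp > local.timestamp else local
```

For session state and user preferences this discard-the-loser behavior is acceptable; for a shopping cart it would not be, which is why cart writes go through the Aurora primary instead.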
Compute and Service Discovery
Each region runs an identical ECS Fargate cluster behind an Application Load Balancer. We use a shared ECR repository in us-east-1 with cross-region replication to eu-west-1 and ap-southeast-1, ensuring that container images are always available locally. Deployments are coordinated through CodePipeline with a fan-out pattern: the pipeline builds once, then deploys to all three regions in parallel.
Service-to-service communication within a region uses AWS Cloud Map for service discovery. Cross-region service calls are avoided wherever possible, but when they are necessary (for example, checking inventory in real time against the primary database), we route through PrivateLink endpoints to keep traffic on the AWS backbone and avoid the public internet.
# ECS service definition with Cloud Map integration
resource "aws_ecs_service" "api" {
  name            = "api"
  cluster         = aws_ecs_cluster.main.id
  task_definition = aws_ecs_task_definition.api.arn
  desired_count   = 6
  launch_type     = "FARGATE"

  service_registries {
    registry_arn = aws_service_discovery_service.api.arn
  }

  network_configuration {
    subnets         = var.private_subnets
    security_groups = [aws_security_group.api.id]
  }
}
Traffic Routing with CloudFront and Route 53
CloudFront sits at the edge and provides two critical functions: static asset caching and intelligent origin routing. We configured CloudFront with origin groups; the primary origin's domain name resolves through a Route 53 latency-based record, so each viewer reaches the lowest-latency region. If that origin returns a 5xx error or times out, CloudFront automatically fails over to the origin group's secondary, the next-closest region.
Route 53 health checks monitor each region's ALB endpoint every 10 seconds. When a region fails three consecutive health checks, Route 53 removes it from the DNS response pool. Combined with CloudFront's origin failover, this gives us a two-layer failover mechanism. In our testing, the total failover time from regional failure to traffic re-routing was under 45 seconds.
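The DNS-layer half of this mechanism reduces to a small state machine: three consecutive failed checks eject a region from the pool, and a single passing check resets the counter. A sketch of that logic (the `RegionHealth` class is illustrative; Route 53 implements this internally):

```python
# Illustrative model of Route 53's consecutive-failure logic: a region
# leaves the DNS response pool after FAILURE_THRESHOLD failed checks
# and returns once a check passes again.
FAILURE_THRESHOLD = 3

class RegionHealth:
    def __init__(self, regions: list[str]) -> None:
        self.failures = {r: 0 for r in regions}

    def record(self, region: str, healthy: bool) -> None:
        # A passing check resets the counter; a failure increments it.
        self.failures[region] = 0 if healthy else self.failures[region] + 1

    def in_dns_pool(self, region: str) -> bool:
        return self.failures[region] < FAILURE_THRESHOLD
```

With 10-second check intervals, three consecutive failures put the DNS-layer detection window at roughly 30 seconds, which is consistent with the sub-45-second total failover we measured once CloudFront's origin failover is layered on top.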
Data Consistency Patterns
We adopted an eventual consistency model for most read paths and reserved strong consistency for operations where correctness is critical. For example, product catalog reads are eventually consistent with a typical lag of 200-500ms. But order placement requires a synchronous write to the primary region, and the order confirmation is only returned after the write is acknowledged.
To handle edge cases where a user reads their own recent write from a different region (the "read-your-writes" problem), we implemented a session affinity mechanism. After a write operation, the user's session cookie is tagged with a timestamp, and subsequent reads are routed to the primary region until the replication lag has elapsed. This adds complexity but prevents the confusing experience of a user placing an order and not seeing it in their order history.
Failover Testing and Chaos Engineering
An active-active architecture is only as good as its failover testing. We run monthly game days where we simulate a full regional failure by injecting faults at the ALB level using AWS Fault Injection Simulator. We measure three things: time to detect the failure (target: under 30 seconds), time to re-route traffic (target: under 60 seconds), and error rate during the transition (target: under 0.1%).
Our first game day was humbling. The failover worked, but the surge of traffic to the surviving regions caused ECS tasks to scale beyond the available Fargate capacity. We now maintain 30% headroom in each region and use Capacity Providers with a mix of Fargate and Fargate Spot to handle burst scenarios. We also pre-warm ALBs before game days by gradually increasing traffic, and we have documented runbooks for the operations team.
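The capacity lesson can be sanity-checked with simple arithmetic, assuming traffic is split evenly across regions (an idealization):

```python
def surge_factor(total_regions: int, failed_regions: int = 1) -> float:
    """Traffic multiplier each surviving region must absorb when
    failed_regions go offline, assuming an even traffic split."""
    surviving = total_regions - failed_regions
    return total_regions / surviving

# With three regions, each survivor must absorb 1.5x its normal load
# after a single-region failure. Static 30% headroom alone does not
# cover that, which is why autoscaling plus Fargate Spot burst
# capacity has to make up the difference.
```

This is the gap our first game day exposed: headroom buys time for autoscaling to react, but it does not replace it.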
Cost Considerations
Running active-active across three regions roughly doubles your compute and database costs compared to a single-region deployment. Aurora Global Database charges for storage replication, and you pay for compute in every region. Our monthly AWS bill went from approximately $45K to $95K. However, when weighed against the $2.3M outage cost and the revenue gained from improved latency in international markets, the ROI was clear within three months.
We reduced costs by running development and staging environments in a single region and only deploying the multi-region setup in production. We also aggressively use CloudFront caching for static and semi-static content, which offloads approximately 70% of requests from the origin and keeps compute costs manageable.
Key Takeaways
- Start with your data model. The database layer is the hardest part of active-active; get this right first.
- Separate read and write paths early. Not all data needs multi-master semantics.
- Invest in failover testing from day one. An untested failover is not a failover.
- Accept the cost premium. Multi-region is an insurance policy against catastrophic outages and a performance multiplier for global users.
- Monitor replication lag as a first-class metric. Set alarms on Aurora replication lag and DynamoDB Global Tables propagation delay.
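To make the last takeaway concrete, here is a minimal sketch of the alarm evaluation logic. The 1-second threshold and three evaluation periods are illustrative choices, not values from our setup; in practice this lives in CloudWatch (for Aurora, alarming on a metric such as `AuroraGlobalDBReplicationLag`) rather than in application code.

```python
LAG_ALARM_THRESHOLD_MS = 1000.0  # assumed bound: typical lag is sub-second
EVAL_PERIODS = 3                 # consecutive breaching samples before firing

def should_alarm(lag_samples_ms: list[float]) -> bool:
    """Fire only when the last EVAL_PERIODS samples all breach the
    threshold, mirroring a CloudWatch alarm with N evaluation periods
    so that a single transient spike does not page anyone."""
    recent = lag_samples_ms[-EVAL_PERIODS:]
    return (len(recent) == EVAL_PERIODS
            and all(s > LAG_ALARM_THRESHOLD_MS for s in recent))
```

Requiring consecutive breaches is the design choice that matters: replication lag is spiky by nature, and alarming on a single sample trains the on-call rotation to ignore the alarm.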
Active-active is not a pattern you adopt lightly, but for businesses with a global user base and strict availability requirements, it is the only architecture that delivers both performance and resilience. The key is to be deliberate about which data paths require strong consistency and which can tolerate eventual consistency, then design your system accordingly.