Terraform at Scale: Modular Patterns for Multi-Account AWS

Kloudpath Team

When a single team manages five AWS accounts, Terraform is straightforward. When an organization grows to 50, 100, or 200 accounts across multiple business units, the same patterns that worked at small scale become a liability. State files balloon, plan times stretch to minutes, blast radius expands, and a single misconfigured module can cascade across environments. This post describes the patterns we have refined over dozens of multi-account engagements.

The Directory Strategy vs. Workspaces Debate

The first decision in any multi-account Terraform setup is how to isolate configuration between accounts. Terraform workspaces allow you to maintain a single set of configuration files and switch between state files using terraform workspace select. The directory strategy uses separate directory trees for each account, each with its own backend configuration and state file.

We almost always recommend the directory strategy. Workspaces share a single backend configuration, so every workspace's state lives in the same bucket under the same credentials, and nothing but the currently selected workspace name stands between you and applying a production change to staging. Workspaces also make it tempting to scatter terraform.workspace conditionals throughout your code, leading to sprawling ternary expressions that are difficult to review and debug. With directories, each account is a self-contained unit with its own backend, provider configuration, and variable values. The tradeoff is more files and some duplication, but the isolation is worth it.

infrastructure/
  modules/               # Shared reusable modules
    networking/
    eks-cluster/
    rds/
    iam-baseline/
  accounts/
    production/
      us-east-1/
        main.tf
        backend.tf       # S3 backend in production account
        terraform.tfvars
      eu-west-1/
        main.tf
        backend.tf
        terraform.tfvars
    staging/
      us-east-1/
        main.tf
        backend.tf
        terraform.tfvars
    shared-services/
      main.tf
      backend.tf

Module Composition Patterns

We organize modules into three tiers. The first tier is resource modules that wrap a single AWS resource or a tightly coupled set of resources (for example, an ALB with its target group and listener). These modules expose every relevant configuration parameter and make no assumptions about naming conventions or tags.

The second tier is service modules that compose multiple resource modules into a deployable service. An EKS service module, for example, composes the VPC module, the EKS cluster module, the node group module, and the IAM roles module. Service modules enforce organizational standards: naming conventions, required tags, encryption defaults, and logging configuration.
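As a sketch of what a tier-two service module might look like (the module paths and input names are illustrative; encryption_enabled and enabled_cluster_log_types are hypothetical inputs of the underlying resource modules, not confirmed by this post):

```hcl
# modules/eks-service/main.tf (hypothetical tier-two service module)
# Composes resource modules and bakes in organizational standards.

variable "environment" { type = string }
variable "cluster_name" { type = string }
variable "vpc_cidr" { type = string }

locals {
  # Tags every resource created by this service must carry.
  required_tags = {
    Environment = var.environment
    ManagedBy   = "terraform"
  }
}

module "vpc" {
  source = "../networking"

  vpc_cidr = var.vpc_cidr
  tags     = local.required_tags
}

module "cluster" {
  source = "../eks-cluster"

  cluster_name              = var.cluster_name   # naming convention enforced here, not by callers
  vpc_id                    = module.vpc.vpc_id
  subnet_ids                = module.vpc.private_subnet_ids
  encryption_enabled        = true               # encryption on by default
  enabled_cluster_log_types = ["api", "audit"]   # logging on by default
  tags                      = local.required_tags
}
```

The point of the tier: callers cannot forget the tags or turn off encryption, because the service module never exposes those choices.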

The third tier is account modules (sometimes called root modules or compositions) that define the complete infrastructure for a specific account and region. These are the modules that teams actually apply. They call service modules with account-specific variable values and wire outputs between them.

# accounts/production/us-east-1/main.tf
module "network" {
  source = "../../../modules/networking"

  vpc_cidr           = "10.1.0.0/16"
  availability_zones = ["us-east-1a", "us-east-1b", "us-east-1c"]
  environment        = "production"
  enable_nat_gateway = true
}

module "eks" {
  source = "../../../modules/eks-cluster"

  cluster_name    = "prod-us-east-1"
  vpc_id          = module.network.vpc_id
  subnet_ids      = module.network.private_subnet_ids
  cluster_version = "1.29"
  node_groups = {
    general = {
      instance_types = ["m6i.xlarge"]
      min_size       = 3
      max_size       = 20
      desired_size   = 6
    }
  }
}

State Isolation and Backend Configuration

Every account gets its own S3 bucket for Terraform state, in its own AWS account. We never store state for account A in the S3 bucket of account B. The reason is blast radius: if the state bucket is compromised or accidentally deleted, only that account's state is affected. Additionally, within each account, we further isolate state by service. The networking stack, the EKS cluster, and the application layer each have their own state file.
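In practice, per-service isolation shows up in the backend key. A sketch, with hypothetical bucket, table, and key names:

```hcl
# accounts/production/us-east-1/backend.tf -- one state file per service
terraform {
  backend "s3" {
    bucket         = "prod-terraform-state"                    # lives in the production account itself
    key            = "us-east-1/networking/terraform.tfstate"  # separate key per stack (networking, eks, app)
    region         = "us-east-1"
    dynamodb_table = "prod-terraform-locks"                    # per-account lock table
    encrypt        = true
    kms_key_id     = "alias/terraform-state"                   # hypothetical KMS alias
  }
}
```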

The DynamoDB table for state locking follows the same pattern: one table per account. We provision the backend infrastructure (S3 bucket, DynamoDB table, KMS key for encryption) using a separate bootstrap Terraform configuration that is applied manually when a new account is onboarded. This bootstrap config is the only Terraform that does not use remote state, and it is stored in a dedicated repository with restricted access.
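A minimal bootstrap configuration along these lines (resource names are hypothetical) might provision:

```hcl
# bootstrap/main.tf -- applied once per new account, local state only
resource "aws_kms_key" "state" {
  description         = "Encrypts Terraform state"
  enable_key_rotation = true
}

resource "aws_s3_bucket" "state" {
  bucket = "prod-terraform-state" # hypothetical name
}

resource "aws_s3_bucket_versioning" "state" {
  bucket = aws_s3_bucket.state.id
  versioning_configuration {
    status = "Enabled"
  }
}

resource "aws_dynamodb_table" "locks" {
  name         = "prod-terraform-locks"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID" # the attribute name Terraform's S3 backend expects
  attribute {
    name = "LockID"
    type = "S"
  }
}
```

Versioning on the state bucket doubles as a recovery mechanism: a corrupted state file can be rolled back to a previous version.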

CI/CD with Plan and Apply

Our CI/CD pipeline for Terraform runs in GitHub Actions and follows a strict plan-then-apply workflow. On every pull request, the pipeline runs terraform plan for every affected account directory and posts the plan output as a PR comment. Reviewers can see exactly what will change before approving. After the PR is merged, the pipeline runs terraform apply with the saved plan file, ensuring that what was reviewed is exactly what gets applied.

Determining which account directories are "affected" by a PR is non-trivial. We wrote a script that uses git diff to identify changed files, then traces module dependencies to find all account directories that reference a changed module. If a change is made to the networking module, the script identifies every account directory that uses that module and includes them all in the plan.

# .github/workflows/terraform.yml (simplified)
- name: Detect affected directories
  id: detect
  run: |
    changed=$(git diff --name-only origin/main...HEAD)
    affected=$(python scripts/find_affected.py $changed)
    echo "dirs=$affected" >> "$GITHUB_OUTPUT"

- name: Terraform Plan
  run: |
    for dir in ${{ steps.detect.outputs.dirs }}; do
      (
        cd "$dir"
        terraform init -backend-config=backend.hcl
        terraform plan -out=plan.tfplan
        terraform show -no-color plan.tfplan >> "$GITHUB_STEP_SUMMARY"
      )
    done
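The dependency-tracing logic can be sketched roughly as follows. This is a simplified, hypothetical version of find_affected.py (the real script also follows transitive module-to-module references); it maps changed files to the account directories whose configuration mentions a changed module:

```python
#!/usr/bin/env python3
"""Hypothetical sketch of scripts/find_affected.py.

Given the list of changed files from git diff, emit the account
directories whose Terraform configuration directly references a
changed module, or that were themselves edited.
"""
import re
import sys
from pathlib import Path

# Matches the source argument of a module block, e.g. source = "../../../modules/networking"
MODULE_RE = re.compile(r'source\s*=\s*"([^"]+)"')


def find_affected(changed_files, repo_root="."):
    root = Path(repo_root)
    affected = set()
    # Module names touched directly by the diff, e.g. "networking".
    changed_modules = {
        f.split("/")[1] for f in changed_files if f.startswith("modules/")
    }
    for main_tf in root.glob("accounts/**/main.tf"):
        account_dir = str(main_tf.parent.relative_to(root))
        # The account directory itself was edited.
        if any(f.startswith(account_dir) for f in changed_files):
            affected.add(account_dir)
            continue
        # The account references a changed module via a source path.
        for source in MODULE_RE.findall(main_tf.read_text()):
            if source.split("/")[-1] in changed_modules:
                affected.add(account_dir)
                break
    return sorted(affected)


if __name__ == "__main__":
    print(" ".join(find_affected(sys.argv[1:])))
```

A change to modules/networking thus fans out to every accounts/*/*/ directory whose main.tf sources that module, which is exactly the set the Plan step iterates over.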

Drift Detection

Configuration drift is the silent killer of infrastructure-as-code. Someone makes a quick change in the AWS console during an incident, forgets to back-port it to Terraform, and the next terraform apply reverts the change, potentially causing another incident. We run drift detection as a scheduled pipeline that executes terraform plan -detailed-exitcode against every account directory nightly; the command exits with code 2 when the plan contains changes, which the pipeline treats as drift and alerts on.

The drift detection pipeline is read-only and uses an IAM role with minimal permissions. It does not apply changes; it only reports them. When drift is detected, a Slack notification is sent to the infrastructure team with the plan output, and a ticket is automatically created in the team's backlog. We track drift rate as a metric, and our target is zero planned changes outside of active PRs.
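A simplified sketch of such a scheduled workflow (names and the notification wiring are illustrative):

```yaml
# .github/workflows/drift.yml (simplified)
on:
  schedule:
    - cron: "0 3 * * *"   # nightly

jobs:
  drift:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Plan every account directory
        run: |
          for dir in accounts/*/*/; do
            (
              cd "$dir"
              terraform init -backend-config=backend.hcl
              # exit code 2 means the plan contains changes, i.e. drift
              terraform plan -detailed-exitcode -lock=false \
                || echo "$dir" >> "$GITHUB_WORKSPACE/drifted.txt"
            )
          done
```

Running with -lock=false keeps the read-only drift job from blocking real applies; a later step would post drifted.txt to Slack and open the backlog ticket.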

Code Review for Infrastructure

Reviewing Terraform code is fundamentally different from reviewing application code. A one-line change to a module can trigger the recreation of hundreds of resources across dozens of accounts. We enforce review practices that account for this asymmetry: every PR carries the posted plan output for each affected account so reviewers approve concrete changes rather than intent, and changes to shared modules require approval from the platform team that maintains them.

Scaling the Team

The organizational challenge of Terraform at scale is as significant as the technical one. We use a platform team model where a central infrastructure team maintains the module library and the CI/CD pipeline, while product teams own their account configurations. Product teams compose existing modules to build their infrastructure and submit PRs that the platform team reviews. This balance gives product teams autonomy while maintaining guardrails through module design and policy enforcement.

Documentation is embedded in the modules themselves. Every module has a README.md generated by terraform-docs, with examples, input descriptions, and output descriptions. When a product team needs to provision a new service, they can find a module, read its documentation, and compose it into their account configuration without needing to involve the platform team. This self-service model is essential for scaling beyond a handful of accounts.
