Terraform · IaC · State Management · Platform Engineering · DevSecOps

Terraform State Architecture Rollout Kit

Multi-environment isolation without the foot-guns

A practical rollout kit for fixing Terraform state architecture at scale: environment isolation patterns, shared-vs-service state boundaries, CI-only apply guardrails, drift detection routines, Terragrunt migration checks, and a runbook for accidental prod applies.

19 min read
Verified against Terraform/Terragrunt primary sources on Feb 24, 2026

Download the Playbook

Get the standalone version for internal reviews, architecture working sessions, and offline reading. The PDF is generated from the Markdown source and both were updated on February 24, 2026.

What This Playbook Gives You

A reference state architecture for multi-environment Terraform repos

Boundary rules for shared platform vs service stacks

CI-only apply and approval guardrail patterns

Drift detection workflow and owner routing pattern

Terragrunt upgrade/migration checklist for CLI changes

Incident response runbook for accidental prod applies

Non-Negotiables

Prod apply requires GitHub environment approval (reviewers) and protected workflow path.

Prod backend credentials are not available on developer laptops by default.

Plans are saved as artifacts and reviewed before apply.

Each stack has a documented owner and rollback path.

No routine use of `-target` except break-glass situations with explicit review.

CLI versions pinned and upgrade-tested before rollout.
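The version-pinning rule can be enforced mechanically in CI. A minimal sketch, assuming a repo-root `.terraform-version` file as the single source of truth (the filename is a tfenv convention, not a Terraform feature):

```shell
#!/usr/bin/env bash
# Fail the pipeline when the CLI in use does not match the pinned version.
check_pinned_version() {
  local pinned="$1" actual="$2"
  if [ "$actual" != "$pinned" ]; then
    echo "version drift: pinned=$pinned actual=$actual" >&2
    return 1
  fi
  echo "version ok: $actual"
}

# In CI you would feed real values, for example:
#   check_pinned_version "$(cat .terraform-version)" \
#     "$(terraform version -json | jq -r .terraform_version)"
check_pinned_version "1.7.5" "1.7.5"   # → version ok: 1.7.5
```

The same gate works for Terragrunt; the point is that a version mismatch fails loudly in CI instead of silently producing a plan from a different CLI.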

Reference Repository Shape

This is a pragmatic starting point for teams managing many services across dev/staging/prod. It optimizes for explicit boundaries and predictable ownership over cleverness.

Reference layout (conceptual)
infra/
  modules/
    networking/
    dns/
    service-api/
    service-worker/

  live/
    prod/
      platform/
        networking/
        dns/
        observability/
      services/
        payments-api/
        orders-api/
        web/
    staging/
      platform/
        networking/
        dns/
      services/
        payments-api/
        orders-api/
        web/
    dev/
      platform/
        networking/
        dns/
      services/
        payments-api/
        orders-api/
        web/

  ci/
    drift-check/
    policy-check/
Boundary rules
State boundary rules (recommended)

1. Split by environment first (dev/staging/prod are separate state boundaries)
2. Split by owner / blast radius second (platform vs service)
3. Split by change cadence third (rarely changed shared infra vs fast-moving app stacks)
4. Keep state small enough that locking contention is rare and plans are reviewable
5. Avoid mixing shared DNS/networking and app compute in the same state file
6. Document dependencies explicitly (outputs/data sources/orchestration), not implicitly via tribal knowledge

Backend Key and Naming Conventions

Your backend naming convention is not cosmetic. It drives IAM scoping, audit clarity, and incident response speed.

Backend key naming pattern (conceptual)
# Example backend key conventions (conceptual)
# The exact syntax depends on your backend/provider.

org/platform/prod/networking.tfstate
org/platform/prod/dns.tfstate
org/services/prod/payments-api.tfstate
org/services/staging/payments-api.tfstate
org/services/dev/web.tfstate

# Naming goals:
# - environment is obvious
# - ownership category is obvious
# - resource stack is obvious
# - easy to scope IAM permissions and audits
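If wrapper scripts generate backend keys, derive them from a single helper so the convention cannot drift team by team. A sketch; the `org/` prefix and category names are placeholders for your own scheme:

```shell
#!/usr/bin/env bash
# Derive a backend key from category, environment, and stack name,
# following the naming convention above.
backend_key() {
  local category="$1" env="$2" stack="$3"
  printf 'org/%s/%s/%s.tfstate\n' "$category" "$env" "$stack"
}

backend_key services prod payments-api   # → org/services/prod/payments-api.tfstate
backend_key platform prod dns            # → org/platform/prod/dns.tfstate
```

A single key builder also gives you one place to validate inputs (e.g., reject unknown environments) before any state object is created.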

CI-Only Apply Pattern (Production)

Plan Artifact Then Apply (with Environment Approval)

This pattern prevents ad hoc production applies and makes approvals concrete. Reviewers approve a plan-backed apply path, not an operator promise.

GitHub Actions pattern (conceptual)
name: Terraform Apply (Prod)

on:
  workflow_dispatch:
    inputs:
      stack:
        description: "Stack path (e.g., live/prod/services/payments-api)"
        required: true

permissions:
  contents: read
  id-token: write
  pull-requests: read

jobs:
  plan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v3
      - name: Terraform Init
        working-directory: infra/
        run: terraform -chdir="${{ inputs.stack }}" init -input=false
      - name: Terraform Plan
        working-directory: infra/
        run: terraform -chdir="${{ inputs.stack }}" plan -input=false -out=tfplan
      - name: Upload plan artifact
        uses: actions/upload-artifact@v4
        with:
          # artifact names may not contain "/"; each dispatch handles one
          # stack, so a fixed name is sufficient
          name: tfplan
          path: infra/${{ inputs.stack }}/tfplan

  apply:
    needs: plan
    runs-on: ubuntu-latest
    environment: production
    # protect this environment with required reviewers in GitHub
    steps:
      - uses: actions/checkout@v4
      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v3
      - name: Download plan artifact
        uses: actions/download-artifact@v4
        with:
          name: tfplan
          path: infra/${{ inputs.stack }}
      - name: Terraform Init
        working-directory: infra/
        run: terraform -chdir="${{ inputs.stack }}" init -input=false
      - name: Terraform Apply
        working-directory: infra/
        run: terraform -chdir="${{ inputs.stack }}" apply -input=false tfplan

Drift Detection SOP

Workflow Pattern

Scheduled drift check (conceptual)
name: Drift Detection

on:
  schedule:
    - cron: "0 4 * * 1"
  workflow_dispatch:

jobs:
  drift:
    runs-on: ubuntu-latest
    strategy:
      fail-fast: false
      matrix:
        stack:
          - live/prod/platform/networking
          - live/prod/platform/dns
          - live/prod/services/payments-api
          - live/staging/services/payments-api
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      - name: Init
        run: terraform -chdir=infra/${{ matrix.stack }} init -input=false
      - name: Refresh-only plan
        id: drift
        run: |
          set +e
          terraform -chdir=infra/${{ matrix.stack }} plan -refresh-only -detailed-exitcode -out=drift.tfplan
          code=$?
          echo "exit_code=$code" >> $GITHUB_OUTPUT
          exit 0
      - name: Compute artifact name
        if: steps.drift.outputs.exit_code == '2'
        id: slug
        # artifact names may not contain "/", so replace slashes with dashes
        run: echo "value=${STACK//\//-}" >> "$GITHUB_OUTPUT"
        env:
          STACK: ${{ matrix.stack }}
      - name: Upload drift evidence
        if: steps.drift.outputs.exit_code == '2'
        uses: actions/upload-artifact@v4
        with:
          name: drift-${{ steps.slug.outputs.value }}
          path: infra/${{ matrix.stack }}/drift.tfplan
      - name: Fail only on real errors
        if: steps.drift.outputs.exit_code == '1'
        run: exit 1

# Post-process artifacts into owner-routed tickets or reports
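The post-processing step needs a stack-to-owner mapping. A minimal sketch of owner routing; the team names and patterns below are hypothetical, and many teams keep this mapping in a CODEOWNERS-style file instead of a script:

```shell
#!/usr/bin/env bash
# Map a stack path to the team that should receive its drift report.
owner_for_stack() {
  case "$1" in
    live/*/platform/*)            echo "platform-team" ;;
    live/*/services/payments-api) echo "payments-team" ;;
    live/*/services/orders-api)   echo "orders-team" ;;
    *)                            echo "unrouted" ;;   # surface gaps loudly
  esac
}

owner_for_stack live/prod/platform/dns              # → platform-team
owner_for_stack live/staging/services/payments-api  # → payments-team
```

The "unrouted" fallback matters: a drift report with no owner is itself a finding, and should open a ticket against the platform team rather than disappear.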

Operating Rules

Route drift results to stack owners; do not dump a giant weekly plan in a shared channel.

Treat exit code 2 (changes present) as a signal, not an immediate failure of the entire pipeline.

Fail hard on exit code 1 (real errors) so broken checks are fixed quickly.

Keep matrix scope manageable and parallelized by state boundary, not by monolithic repo.

Track recurring drift sources (manual console changes, autoscaling drift, provider defaults, policy tools).

Wrong Prod Apply Runbook

Contain First, Then Recover

The worst response to a wrong-environment apply is multiple operators trying to “fix it quickly” from different terminals. Freeze, collect evidence, then choose a recovery path deliberately.

Incident response runbook
# Runbook: Accidental apply against production

1. Stop making it worse
- Freeze further applies for the affected stack (disable CI job / revoke temporary session / announce freeze)
- Do NOT start "quick fixes" from multiple laptops

2. Capture evidence
- Identify exact commit, operator, timestamp, workspace/directory/stack path
- Pull the plan/apply logs (CI or local shell history if available)
- Snapshot current state and backend object version metadata

3. Establish impact
- Which resources changed?
- Was data deleted, replaced, or only metadata/tags updated?
- Are downstream systems impacted (DNS, networking, IAM, secrets references)?

4. Decide recovery path
- Forward-fix (preferred when safe and understood)
- Rollback via reviewed Terraform change
- Provider-native emergency restore only if Terraform path is too slow and incident severity requires it

5. Restore control plane discipline
- Move all further changes into CI-only reviewed workflow
- Add or tighten guardrails that allowed the event

6. Post-incident action items
- State boundary redesign if blast radius was too large
- Credential / IAM scope reduction
- CI enforcement + environment protections
- Team training + checklist updates

Terragrunt Upgrade Safety Checklist

Especially if You Use `run --all` and `--filter`

Terragrunt CLI changes can alter execution scope assumptions. Treat upgrades like infrastructure changes, not developer tooling updates.

Terragrunt migration checklist
Terragrunt upgrade / CLI migration checklist

- Pin version in CI (do not float latest)
- Read release notes before upgrade (especially CLI redesign changes)
- Validate --filter semantics in test pipelines
- Confirm which commands now imply broader scope (e.g., --all behavior)
- Review wrapper scripts and docs used by engineers
- Update runbooks with explicit examples
- Roll out to one team/repo first
- Monitor for unexpected multi-unit executions

Rollout Phases

Phase 0: Inventory + Risk Mapping

Catalog current states, backends, owners, and environments.

Map each state to blast radius (what breaks if changed/deleted?).

Identify mixed states (shared infra + service resources together).

Document where production applies are happening today (CI vs laptops).
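The cataloging step can start from the directory tree itself. A sketch, assuming the reference layout above (env/category/stack, exactly three levels below `live/`):

```shell
#!/usr/bin/env bash
# List every stack directory exactly three levels below the given root,
# one line per stack, sorted for stable diffing between inventory runs.
list_stacks() {
  find "$1" -mindepth 3 -maxdepth 3 -type d | sort
}

# Example against the reference layout:
#   list_stacks infra/live
#   → infra/live/dev/platform/dns
#     infra/live/prod/services/payments-api
#     ...
```

Diffing two inventory runs is a cheap way to catch stacks created outside the agreed layout before Phase 2 migration begins.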

Phase 1: Guardrails Before Migration

Enforce CI-only apply for production stacks.

Enable environment/reviewer protections in GitHub Actions for prod.

Tighten backend/IAM permissions by environment and stack category.

Pin Terraform and Terragrunt versions in CI.

Phase 2: State Boundary Refactor

Split platform/shared stacks from service stacks.

Split by environment with explicit directory or orchestration paths.

Migrate one non-prod service first and validate workflow.

Stage platform migrations separately with extra review depth.

Phase 3: Drift + Operations

Add scheduled drift detection by stack/owner.

Create owner-routed remediation process and SLA.

Review lock contention and resize boundaries if needed.

Add incident runbook drills for wrong-environment apply events.

Companion Deep Dive

Read the companion article for the decision framework, pattern trade-offs, and why remote state alone is not a state architecture.

Read the Blog Deep Dive