Infrastructure as Code Best Practices with Terraform
Why Infrastructure as Code Matters
I’ve been managing infrastructure with Terraform for over five years, across AWS, GCP, and Azure environments ranging from small startups to enterprise platforms running thousands of resources. Infrastructure as Code (IaC) transforms infrastructure management from manual, error-prone clicking in web consoles to version-controlled, reviewable, automated deployments.
But Terraform’s power comes with complexity. Poor patterns lead to broken state files, drift between environments, and deployments that work on your laptop but fail in CI. This guide distills hard-won lessons into actionable best practices.
State Management: Get This Right First
Terraform’s state file is the source of truth for what infrastructure exists. Mismanage state, and you’ll have a bad time.
Remote State is Non-Negotiable
Never use local state for production. Store state remotely with locking enabled.
AWS:
terraform {
  backend "s3" {
    bucket         = "mycompany-terraform-state"
    key            = "production/infrastructure.tfstate"
    region         = "us-west-2"
    encrypt        = true
    dynamodb_table = "terraform-state-lock"
  }
}
Terraform Cloud:
terraform {
  cloud {
    organization = "mycompany"

    workspaces {
      name = "production-infra"
    }
  }
}
Why this matters: Local state files get lost, corrupted, or create conflicts when multiple engineers work on the same infrastructure. Remote state with locking prevents concurrent modifications that can corrupt state.
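If you bootstrap the backend yourself, there is a chicken-and-egg problem: the bucket and lock table must exist before any state can live in them, so they are typically created once, by hand or in a tiny standalone configuration. A minimal sketch, reusing the names from the backend block above (the S3 backend requires the lock table to have a string hash key named exactly LockID):
resource "aws_s3_bucket" "terraform_state" {
  bucket = "mycompany-terraform-state"
}

resource "aws_dynamodb_table" "terraform_state_lock" {
  name         = "terraform-state-lock"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID" # The S3 backend requires exactly this key name

  attribute {
    name = "LockID"
    type = "S"
  }
}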
State Should Be Scoped Appropriately
Don’t manage all infrastructure in one giant state file. Split by:
- Environment: dev/, staging/, production/
- Layer: network/, data/, compute/, security/
- Team ownership: platform/, data-engineering/, ml-infra/
Example structure:
infrastructure/
├── production/
│ ├── network/
│ │ └── main.tf
│ ├── database/
│ │ └── main.tf
│ └── compute/
│ └── main.tf
├── staging/
│ └── ...
└── shared/
└── route53/
└── main.tf
Benefits:
- Faster plan/apply (smaller blast radius)
- Easier to reason about
- Less risk: changes to network layer don’t touch database layer
- Team boundaries map to state boundaries
When to split state:
- Different deployment cadences (DNS changes weekly, compute changes daily)
- Different change approval requirements
- Different team ownership
- When terraform plan takes >2 minutes
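Split states still need to share data: the compute layer has to find the subnets the network layer created. A sketch of the usual wiring, assuming the backend bucket from earlier and that the network state exports a private_subnet_ids output:
data "terraform_remote_state" "network" {
  backend = "s3"

  config = {
    bucket = "mycompany-terraform-state"
    key    = "production/network.tfstate"
    region = "us-west-2"
  }
}

resource "aws_instance" "app" {
  # ... ami, instance_type, etc.
  subnet_id = data.terraform_remote_state.network.outputs.private_subnet_ids[0]
}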
Use State File Versioning
Enable versioning on your state bucket. When something goes wrong (and it will), you can recover.
resource "aws_s3_bucket_versioning" "terraform_state" {
bucket = aws_s3_bucket.terraform_state.id
versioning_configuration {
status = "Enabled"
}
}
I once had a junior engineer accidentally run terraform destroy on production. We recovered because S3 versioning let us restore the previous state file and resurrect the resources.
Module Design Patterns
Modules are how you create reusable infrastructure components. Good module design is an art.
Write Composable Modules
A module should do one thing well.
Good module:
# modules/vpc/main.tf
variable "cidr_block" { type = string }
variable "azs" { type = list(string) }
variable "environment" { type = string }

resource "aws_vpc" "main" {
  cidr_block           = var.cidr_block
  enable_dns_hostnames = true

  tags = {
    Environment = var.environment
  }
}

# One private subnet per AZ, so the subnet output below is defined
resource "aws_subnet" "private" {
  count             = length(var.azs)
  vpc_id            = aws_vpc.main.id
  cidr_block        = cidrsubnet(var.cidr_block, 8, count.index)
  availability_zone = var.azs[count.index]
}

output "vpc_id" { value = aws_vpc.main.id }
output "private_subnet_ids" { value = aws_subnet.private[*].id }
Using it:
module "vpc" {
source = "./modules/vpc"
cidr_block = "10.0.0.0/16"
azs = ["us-west-2a", "us-west-2b", "us-west-2c"]
environment = "production"
}
module "rds" {
source = "./modules/rds"
subnet_ids = module.vpc.private_subnet_ids
vpc_id = module.vpc.vpc_id
}
Input Variables Should Have Good Defaults
Make modules easy to use by providing sensible defaults.
variable "instance_type" {
type = string
default = "t3.medium"
description = "EC2 instance type"
}
variable "enable_monitoring" {
type = bool
default = true
description = "Enable detailed CloudWatch monitoring"
}
variable "backup_retention_days" {
type = number
default = 7
description = "Number of days to retain automated backups"
validation {
condition = var.backup_retention_days >= 1 && var.backup_retention_days <= 35
error_message = "Backup retention must be between 1 and 35 days."
}
}
Use validation blocks to catch invalid inputs early.
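Validation also works well for enumerated values; a small sketch for a hypothetical environment input:
variable "environment" {
  type        = string
  description = "Deployment environment"

  validation {
    condition     = contains(["dev", "staging", "production"], var.environment)
    error_message = "Environment must be one of: dev, staging, production."
  }
}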
Outputs Should Expose What Consumers Need
Don’t expose internal implementation details. Do expose identifiers and connection information.
output "cluster_endpoint" {
value = aws_rds_cluster.main.endpoint
description = "RDS cluster writer endpoint"
}
output "cluster_reader_endpoint" {
value = aws_rds_cluster.main.reader_endpoint
description = "RDS cluster reader endpoint for read replicas"
}
output "cluster_identifier" {
value = aws_rds_cluster.main.cluster_identifier
description = "RDS cluster identifier"
}
# Don't output sensitive values without marking them
output "master_password" {
value = aws_rds_cluster.main.master_password
sensitive = true
description = "RDS master password (sensitive)"
}
Module Versioning
Pin module versions to avoid surprises.
module "vpc" {
source = "terraform-aws-modules/vpc/aws"
version = "5.1.2" # Pin exact version
name = "my-vpc"
cidr = "10.0.0.0/16"
}
Use semantic versioning for your own modules:
module "internal_app" {
source = "git::https://github.com/myorg/terraform-modules.git//app?ref=v2.1.0"
}
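For registry modules, a pessimistic constraint is a reasonable middle ground when you want patch and minor updates but never a breaking major:
module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "~> 5.1" # Any 5.x release at or above 5.1, never 6.0
}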
Workspace vs Directory Strategy
Terraform Workspaces let you manage multiple environments with the same configuration:
terraform workspace new production
terraform workspace new staging
terraform workspace select production
terraform apply
Directory-per-environment:
├── production/
│ └── main.tf
├── staging/
│ └── main.tf
└── dev/
└── main.tf
My recommendation: Use directories, not workspaces.
Why?
- Clearer: you can see what’s deployed where in version control
- Different state files (isolation)
- Can have environment-specific configurations
- Easier to reason about in CI/CD
Workspaces are useful for temporary environments (feature branches, testing), not long-lived production vs staging.
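Directory-per-environment doesn't have to mean copy-paste. A common pattern, sketched here with a hypothetical modules/stack composition module, is to keep each environment directory a thin wrapper that differs only in its inputs:
# production/main.tf
module "stack" {
  source      = "../modules/stack" # hypothetical shared composition module
  environment = "production"
  cidr_block  = "10.0.0.0/16"
}

# staging/main.tf
module "stack" {
  source      = "../modules/stack"
  environment = "staging"
  cidr_block  = "10.1.0.0/16"
}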
Managing Secrets
Never commit secrets to version control. Use one of these approaches:
1. AWS Secrets Manager / Parameter Store
data "aws_secretsmanager_secret_version" "db_password" {
secret_id = "production/db/master-password"
}
resource "aws_db_instance" "main" {
password = data.aws_secretsmanager_secret_version.db_password.secret_string
}
2. Terraform Cloud Variables
Store secrets as sensitive variables in Terraform Cloud. They’re encrypted and never appear in logs.
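On the code side, declare the corresponding input as sensitive so Terraform redacts it from plan and apply output; the value itself comes from the workspace variable:
variable "db_password" {
  type        = string
  sensitive   = true # Value is injected from a Terraform Cloud sensitive variable
  description = "Database master password"
}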
3. External Secret Management
data "external" "vault" {
program = ["vault", "kv", "get", "-format=json", "secret/db"]
}
resource "aws_db_instance" "main" {
password = data.external.vault.result.password
}
4. Random Passwords Stored in State
resource "random_password" "db_master" {
length = 32
special = true
}
resource "aws_db_instance" "main" {
password = random_password.db_master.result
}
output "db_password" {
value = random_password.db_master.result
sensitive = true
}
The password lives in state, which should be encrypted. Retrieve it with terraform output -raw db_password.
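If you'd rather not have state be the only copy, one option (a sketch reusing the secret path from approach 1) is to write the generated password into Secrets Manager from the same configuration:
resource "aws_secretsmanager_secret" "db_master" {
  name = "production/db/master-password"
}

resource "aws_secretsmanager_secret_version" "db_master" {
  secret_id     = aws_secretsmanager_secret.db_master.id
  secret_string = random_password.db_master.result
}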
Resource Naming and Tagging
Consistent naming prevents conflicts and improves observability.
Naming Convention
locals {
  name_prefix = "${var.project}-${var.environment}"

  common_tags = {
    Project     = var.project
    Environment = var.environment
    ManagedBy   = "terraform"
    Owner       = var.owner
  }
}

resource "aws_vpc" "main" {
  cidr_block = var.cidr_block

  tags = merge(local.common_tags, {
    Name = "${local.name_prefix}-vpc"
  })
}

resource "aws_subnet" "private" {
  count  = length(var.azs)
  vpc_id = aws_vpc.main.id

  tags = merge(local.common_tags, {
    Name = "${local.name_prefix}-private-${var.azs[count.index]}"
    Type = "private"
  })
}
This ensures:
- All resources are tagged consistently
- You can filter resources by environment, project, owner
- Cost allocation by tag works automatically
- It’s clear what Terraform manages
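On AWS you can go one step further: provider-level default_tags (AWS provider v3.38+) applies the common set to every resource automatically, leaving merge() for per-resource additions like Name:
provider "aws" {
  region = "us-west-2"

  default_tags {
    tags = local.common_tags
  }
}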
Preventing Destructive Changes
Some resources should never be destroyed accidentally.
resource "aws_db_instance" "main" {
lifecycle {
prevent_destroy = true
}
}
resource "aws_s3_bucket" "data" {
lifecycle {
prevent_destroy = true
}
}
Now terraform destroy will fail unless you remove the prevent_destroy block.
For resources that need to be replaced carefully:
resource "aws_instance" "app" {
lifecycle {
create_before_destroy = true
}
}
This ensures the new instance is created and healthy before the old one is destroyed.
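A third lifecycle meta-argument worth knowing is ignore_changes, for attributes that some other system legitimately manages. A sketch using an ECS service whose desired_count is adjusted by autoscaling:
resource "aws_ecs_service" "app" {
  # ... cluster, task_definition, etc.
  desired_count = 2 # Initial value only

  lifecycle {
    ignore_changes = [desired_count] # Autoscaling changes this outside Terraform
  }
}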
Terraform in CI/CD
Automate Terraform to avoid “works on my machine” issues.
GitHub Actions Example
name: Terraform

on:
  pull_request:
    paths:
      - 'infrastructure/**'
  push:
    branches:
      - main

jobs:
  terraform:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: hashicorp/setup-terraform@v2
        with:
          terraform_version: 1.6.0

      - name: Terraform Format Check
        run: terraform fmt -check -recursive

      - name: Terraform Init
        run: terraform init
        working-directory: infrastructure/production
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}

      - name: Terraform Validate
        run: terraform validate
        working-directory: infrastructure/production

      - name: Terraform Plan
        if: github.event_name == 'pull_request'
        run: terraform plan -no-color
        working-directory: infrastructure/production
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}

      - name: Terraform Apply
        if: github.ref == 'refs/heads/main' && github.event_name == 'push'
        run: terraform apply -auto-approve
        working-directory: infrastructure/production
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
Best Practices for CI/CD
- Always run terraform plan on pull requests and post the output as a comment
- Run terraform apply only on the main branch after PR approval
- Use separate credentials for CI with minimal required permissions
- Enable state locking to prevent concurrent applies
- Run terraform fmt and terraform validate in CI as quality gates
Atlantis for PR-based Workflows
Atlantis is a tool that listens to your repository and runs Terraform commands in response to pull request comments.
# Pull request opened
Atlantis runs: terraform plan
# Developer comments: "atlantis apply"
Atlantis runs: terraform apply
# Posts results as PR comment
This creates a GitOps workflow for infrastructure changes.
Handling State Drift
Drift happens when infrastructure is modified outside Terraform (manual changes in console, other automation, etc.).
Detect Drift Regularly
# Run this in a cron job or CI pipeline
terraform plan -detailed-exitcode
# Exit codes:
# 0 = no changes
# 1 = error
# 2 = changes detected (drift)
Alert when drift is detected.
Import Existing Resources
If resources were created manually, import them into Terraform:
terraform import aws_instance.example i-1234567890abcdef0
Then define the resource in your .tf files to match the existing configuration.
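On Terraform 1.5+, imports can also be declared in configuration, which makes them reviewable in a pull request and repeatable across machines:
import {
  to = aws_instance.example
  id = "i-1234567890abcdef0"
}

# Then let Terraform draft the matching resource block:
# terraform plan -generate-config-out=generated.tf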
Refresh State
terraform refresh             # Legacy command; deprecated in favor of the flag below
terraform apply -refresh-only # Terraform 0.15.4+
Use -refresh-only to review how real infrastructure has drifted and update state to match, without changing any resources.
Testing Terraform Code
1. Pre-commit Hooks
# .pre-commit-config.yaml
repos:
  - repo: https://github.com/antonbabenko/pre-commit-terraform
    rev: v1.83.5
    hooks:
      - id: terraform_fmt
      - id: terraform_validate
      - id: terraform_docs
      - id: terraform_tflint
2. Terratest for Integration Tests
package test

import (
    "testing"

    "github.com/gruntwork-io/terratest/modules/terraform"
    "github.com/stretchr/testify/assert"
)

func TestVPCCreation(t *testing.T) {
    opts := &terraform.Options{
        TerraformDir: "../examples/vpc",
    }
    defer terraform.Destroy(t, opts)

    terraform.InitAndApply(t, opts)

    vpcId := terraform.Output(t, opts, "vpc_id")
    assert.NotEmpty(t, vpcId)
}
3. Sentinel for Policy as Code (Terraform Cloud)
# Require all S3 buckets to have encryption
import "tfplan/v2" as tfplan
main = rule {
all tfplan.resource_changes as _, rc {
rc.type is "aws_s3_bucket" implies
rc.change.after.server_side_encryption_configuration != null
}
}
Common Pitfalls and How to Avoid Them
1. Count vs For_Each
Don’t use count for resources that might be reordered:
# BAD: If you remove the first item, all resources get recreated
resource "aws_instance" "app" {
  count         = length(var.instances)
  instance_type = var.instances[count.index]
}

# GOOD: Resources are keyed by their value, so reordering doesn't matter
resource "aws_instance" "app" {
  for_each      = toset(var.instances)
  instance_type = each.value
}
2. Depends_On Overuse
Terraform infers most dependencies automatically. Only use depends_on when there’s an implicit dependency Terraform can’t detect.
# Usually unnecessary
resource "aws_instance" "app" {
  depends_on = [aws_vpc.main] # Terraform knows this already
}

# Necessary when the dependency is implicit
resource "aws_iam_role_policy_attachment" "lambda" {
  role       = aws_iam_role.lambda.name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AWSLambdaVPCAccessExecutionRole"
}

resource "aws_lambda_function" "main" {
  depends_on = [aws_iam_role_policy_attachment.lambda] # Needed: ensure policy is attached before Lambda runs
}
3. String Interpolation in Resource Names
# BAD: Creates a circular dependency
resource "aws_security_group" "app" {
  name = "${aws_security_group.app.id}-sg" # Circular!
}

# GOOD
resource "aws_security_group" "app" {
  name = "${var.app_name}-sg"
}
4. Hardcoded Values
# BAD
resource "aws_instance" "app" {
  ami = "ami-0c55b159cbfafe1f0" # What region? What AMI is this?
}

# GOOD
data "aws_ami" "ubuntu" {
  most_recent = true
  owners      = ["099720109477"] # Canonical

  filter {
    name   = "name"
    values = ["ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-*"]
  }
}

resource "aws_instance" "app" {
  ami = data.aws_ami.ubuntu.id
}
Advanced Patterns
Dynamic Blocks
variable "ingress_rules" {
type = list(object({
from_port = number
to_port = number
protocol = string
cidr_blocks = list(string)
}))
}
resource "aws_security_group" "app" {
dynamic "ingress" {
for_each = var.ingress_rules
content {
from_port = ingress.value.from_port
to_port = ingress.value.to_port
protocol = ingress.value.protocol
cidr_blocks = ingress.value.cidr_blocks
}
}
}
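A caller might then pass the rules as plain data, e.g. in a tfvars file:
ingress_rules = [
  {
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  },
  {
    from_port   = 80
    to_port     = 80
    protocol    = "tcp"
    cidr_blocks = ["10.0.0.0/8"]
  }
]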
Moved Blocks (Terraform 1.1+)
Rename or restructure resources without recreating them:
resource "aws_instance" "new_name" {
# config
}
moved {
from = aws_instance.old_name
to = aws_instance.new_name
}
Using Functions
locals {
  # Merge tags from multiple sources
  all_tags = merge(
    var.default_tags,
    var.environment_tags,
    { Name = "${var.prefix}-${var.name}" }
  )

  # Conditional logic
  instance_type = var.environment == "production" ? "m5.large" : "t3.medium"

  # List manipulation
  private_subnet_cidrs = [for i, az in var.azs : cidrsubnet(var.vpc_cidr, 8, i)]
}
The Workflow
Here’s my daily Terraform workflow:
# 1. Initialize backend and providers (ensures remote state config is current)
terraform init -reconfigure
# 2. Format code
terraform fmt -recursive
# 3. Validate syntax
terraform validate
# 4. Plan changes
terraform plan -out=tfplan
# 5. Review plan carefully
# - What's being created/modified/destroyed?
# - Any unexpected changes?
# - Any sensitive data in the plan output?
# 6. Apply changes
terraform apply tfplan
# 7. Commit and push
git add .
git commit -m "feat: add RDS read replica"
git push
Conclusion
Terraform is powerful but unforgiving. The patterns in this guide are learned from years of managing production infrastructure across multiple cloud providers. Key takeaways:
- Remote state with locking is mandatory
- Modular design keeps infrastructure maintainable
- Automate in CI/CD to catch errors early
- Tag everything for visibility and cost tracking
- Test your infrastructure code like application code
- Plan frequently, apply carefully
Infrastructure as Code transforms infrastructure management from a manual, error-prone process into a repeatable, auditable, collaborative workflow. With these patterns, you can manage infrastructure at scale without losing your mind.
Infrastructure is code. Treat it like code: version control, code review, testing, CI/CD. Your future self will thank you.