Infrastructure as Code Best Practices with Terraform
Why Infrastructure as Code Matters
I’ve been managing infrastructure with Terraform for over five years, across AWS, GCP, and Azure environments ranging from small startups to enterprise platforms running thousands of resources. Infrastructure as Code (IaC) transforms infrastructure management from manual, error-prone clicking in web consoles to version-controlled, reviewable, automated deployments.
But Terraform’s power comes with complexity. Poor patterns lead to broken state files, drift between environments, and deployments that work on your laptop but fail in CI. This guide distills hard-won lessons into actionable best practices.
State Management: Get This Right First
Terraform’s state file is the source of truth for what infrastructure exists. Mismanage state, and you’ll have a bad time.
Remote State is Non-Negotiable
Never use local state for production. Store state remotely with locking enabled.
AWS:
terraform {
  backend "s3" {
    bucket         = "mycompany-terraform-state"
    key            = "production/infrastructure.tfstate"
    region         = "us-west-2"
    encrypt        = true
    dynamodb_table = "terraform-state-lock"
  }
}
Terraform Cloud:
terraform {
  cloud {
    organization = "mycompany"

    workspaces {
      name = "production-infra"
    }
  }
}
Why this matters: Local state files get lost, corrupted, or create conflicts when multiple engineers work on the same infrastructure. Remote state with locking prevents concurrent modifications that can corrupt state.
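If you bootstrap the backend yourself, there is a chicken-and-egg problem: the bucket and lock table must exist before any state can live in them, so they are typically created once, by hand or in a tiny standalone configuration. A minimal sketch, reusing the names from the backend block above (the S3 backend requires the lock table to have a string hash key named exactly LockID):
resource "aws_s3_bucket" "terraform_state" {
  bucket = "mycompany-terraform-state"
}

resource "aws_dynamodb_table" "terraform_state_lock" {
  name         = "terraform-state-lock"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID" # The S3 backend requires exactly this key name

  attribute {
    name = "LockID"
    type = "S"
  }
}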
State Should Be Scoped Appropriately
Don’t manage all infrastructure in one giant state file. Split by:
- Environment: dev/, staging/, production/
- Layer: network/, data/, compute/, security/
- Team ownership: platform/, data-engineering/, ml-infra/
Example structure:
infrastructure/
├── production/
│ ├── network/
│ │ └── main.tf
│ ├── database/
│ │ └── main.tf
│ └── compute/
│ └── main.tf
├── staging/
│ └── ...
└── shared/
└── route53/
└── main.tf
Benefits:
- Faster plan/apply (smaller blast radius)
- Easier to reason about
- Less risk: changes to network layer don’t touch database layer
- Team boundaries map to state boundaries
When to split state:
- Different deployment cadences (DNS changes weekly, compute changes daily)
- Different change approval requirements
- Different team ownership
- When terraform plan takes >2 minutes
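Split states still need to share data: the compute layer has to find the subnets the network layer created. A sketch of the usual wiring, assuming the backend bucket from earlier and that the network state exports a private_subnet_ids output:
data "terraform_remote_state" "network" {
  backend = "s3"

  config = {
    bucket = "mycompany-terraform-state"
    key    = "production/network.tfstate"
    region = "us-west-2"
  }
}

resource "aws_instance" "app" {
  # ... ami, instance_type, etc.
  subnet_id = data.terraform_remote_state.network.outputs.private_subnet_ids[0]
}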
Use State File Versioning
Enable versioning on your state bucket. When something goes wrong (and it will), you can recover.
resource "aws_s3_bucket_versioning" "terraform_state" {
bucket = aws_s3_bucket.terraform_state.id
versioning_configuration {
status = "Enabled"
}
}
I once had a junior engineer accidentally run terraform destroy on production. We recovered because S3 versioning let us restore the previous state file and resurrect the resources.
Module Design Patterns
Modules are how you create reusable infrastructure components. Good module design is an art.
Write Composable Modules
A module should do one thing well.
Good module:
# modules/vpc/main.tf
variable "cidr_block" { type = string }
variable "azs" { type = list(string) }
variable "environment" { type = string }

resource "aws_vpc" "main" {
  cidr_block           = var.cidr_block
  enable_dns_hostnames = true

  tags = {
    Environment = var.environment
  }
}

# One private subnet per AZ, so the subnet output below is defined
resource "aws_subnet" "private" {
  count             = length(var.azs)
  vpc_id            = aws_vpc.main.id
  cidr_block        = cidrsubnet(var.cidr_block, 8, count.index)
  availability_zone = var.azs[count.index]
}

output "vpc_id" { value = aws_vpc.main.id }
output "private_subnet_ids" { value = aws_subnet.private[*].id }
Using it:
module "vpc" {
source = "./modules/vpc"
cidr_block = "10.0.0.0/16"
azs = ["us-west-2a", "us-west-2b", "us-west-2c"]
environment = "production"
}
module "rds" {
source = "./modules/rds"
subnet_ids = module.vpc.private_subnet_ids
vpc_id = module.vpc.vpc_id
}
Input Variables Should Have Good Defaults
Make modules easy to use by providing sensible defaults.
variable "instance_type" {
type = string
default = "t3.medium"
description = "EC2 instance type"
}
variable "enable_monitoring" {
type = bool
default = true
description = "Enable detailed CloudWatch monitoring"
}
variable "backup_retention_days" {
type = number
default = 7
description = "Number of days to retain automated backups"
validation {
condition = var.backup_retention_days >= 1 && var.backup_retention_days <= 35
error_message = "Backup retention must be between 1 and 35 days."
}
}
Use validation blocks to catch invalid inputs early.
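Validation also works well for enumerated values; a small sketch for a hypothetical environment input:
variable "environment" {
  type        = string
  description = "Deployment environment"

  validation {
    condition     = contains(["dev", "staging", "production"], var.environment)
    error_message = "Environment must be one of: dev, staging, production."
  }
}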
Outputs Should Expose What Consumers Need
Don’t expose internal implementation details. Do expose identifiers and connection information.
output "cluster_endpoint" {
value = aws_rds_cluster.main.endpoint
description = "RDS cluster writer endpoint"
}
output "cluster_reader_endpoint" {
value = aws_rds_cluster.main.reader_endpoint
description = "RDS cluster reader endpoint for read replicas"
}
output "cluster_identifier" {
value = aws_rds_cluster.main.cluster_identifier
description = "RDS cluster identifier"
}
# Don't output sensitive values without marking them
output "master_password" {
value = aws_rds_cluster.main.master_password
sensitive = true
description = "RDS master password (sensitive)"
}
Module Versioning
Pin module versions to avoid surprises.
module "vpc" {
source = "terraform-aws-modules/vpc/aws"
version = "5.1.2" # Pin exact version
name = "my-vpc"
cidr = "10.0.0.0/16"
}
Use semantic versioning for your own modules:
module "internal_app" {
source = "git::https://github.com/myorg/terraform-modules.git//app?ref=v2.1.0"
}
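For registry modules, a pessimistic constraint is a reasonable middle ground when you want patch and minor updates but never a breaking major:
module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "~> 5.1" # Any 5.x release at or above 5.1, never 6.0
}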
Workspace vs Directory Strategy
Terraform Workspaces let you manage multiple environments with the same configuration:
terraform workspace new production
terraform workspace new staging
terraform workspace select production
terraform apply
Directory-per-environment:
├── production/
│ └── main.tf
├── staging/
│ └── main.tf
└── dev/
└── main.tf
My recommendation: Use directories, not workspaces.
Why?
- Clearer: you can see what’s deployed where in version control
- Different state files (isolation)
- Can have environment-specific configurations
- Easier to reason about in CI/CD
Workspaces are useful for temporary environments (feature branches, testing), not long-lived production vs staging.
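Directory-per-environment doesn't have to mean copy-paste. A common pattern, sketched here with a hypothetical modules/stack composition module, is to keep each environment directory a thin wrapper that differs only in its inputs:
# production/main.tf
module "stack" {
  source      = "../modules/stack" # hypothetical shared composition module
  environment = "production"
  cidr_block  = "10.0.0.0/16"
}

# staging/main.tf
module "stack" {
  source      = "../modules/stack"
  environment = "staging"
  cidr_block  = "10.1.0.0/16"
}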
Managing Secrets
Never commit secrets to version control. Use one of these approaches:
1. AWS Secrets Manager / Parameter Store
data "aws_secretsmanager_secret_version" "db_password" {
secret_id = "production/db/master-password"
}
resource "aws_db_instance" "main" {
password = data.aws_secretsmanager_secret_version.db_password.secret_string
}
2. Terraform Cloud Variables
Store secrets as sensitive variables in Terraform Cloud. They’re encrypted and never appear in logs.
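On the code side, declare the corresponding input as sensitive so Terraform redacts it from plan and apply output; the value itself comes from the workspace variable:
variable "db_password" {
  type        = string
  sensitive   = true # Value is injected from a Terraform Cloud sensitive variable
  description = "Database master password"
}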
3. External Secret Management
data "external" "vault" {
program = ["vault", "kv", "get", "-format=json", "secret/db"]
}
resource "aws_db_instance" "main" {
password = data.external.vault.result.password
}
4. Random Passwords Stored in State
resource "random_password" "db_master" {
length = 32
special = true
}
resource "aws_db_instance" "main" {
password = random_password.db_master.result
}
output "db_password" {
value = random_password.db_master.result
sensitive = true
}
The password lives in state, which should be encrypted. Retrieve it with terraform output -raw db_password.
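If you'd rather not have state be the only copy, one option (a sketch reusing the secret path from approach 1) is to write the generated password into Secrets Manager from the same configuration:
resource "aws_secretsmanager_secret" "db_master" {
  name = "production/db/master-password"
}

resource "aws_secretsmanager_secret_version" "db_master" {
  secret_id     = aws_secretsmanager_secret.db_master.id
  secret_string = random_password.db_master.result
}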
Resource Naming and Tagging
Consistent naming prevents conflicts and improves observability.
Naming Convention
locals {
  name_prefix = "${var.project}-${var.environment}"

  common_tags = {
    Project     = var.project
    Environment = var.environment
    ManagedBy   = "terraform"
    Owner       = var.owner
  }
}

resource "aws_vpc" "main" {
  cidr_block = var.cidr_block

  tags = merge(local.common_tags, {
    Name = "${local.name_prefix}-vpc"
  })
}

resource "aws_subnet" "private" {
  count  = length(var.azs)
  vpc_id = aws_vpc.main.id

  tags = merge(local.common_tags, {
    Name = "${local.name_prefix}-private-${var.azs[count.index]}"
    Type = "private"
  })
}
This ensures:
- All resources are tagged consistently
- You can filter resources by environment, project, owner
- Cost allocation by tag works automatically
- It’s clear what Terraform manages
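On AWS you can go one step further: provider-level default_tags (AWS provider v3.38+) applies the common set to every resource automatically, leaving merge() for per-resource additions like Name:
provider "aws" {
  region = "us-west-2"

  default_tags {
    tags = local.common_tags
  }
}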
Preventing Destructive Changes
Some resources should never be destroyed accidentally.
resource "aws_db_instance" "main" {
lifecycle {
prevent_destroy = true
}
}
resource "aws_s3_bucket" "data" {
lifecycle {
prevent_destroy = true
}
}
Now terraform destroy will fail unless you remove the prevent_destroy block.
For resources that need to be replaced carefully:
resource "aws_instance" "app" {
lifecycle {
create_before_destroy = true
}
}
This ensures the new instance is created and healthy before the old one is destroyed.
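A third lifecycle meta-argument worth knowing is ignore_changes, for attributes that some other system legitimately manages. A sketch using an ECS service whose desired_count is adjusted by autoscaling:
resource "aws_ecs_service" "app" {
  # ... cluster, task_definition, etc.
  desired_count = 2 # Initial value only

  lifecycle {
    ignore_changes = [desired_count] # Autoscaling changes this outside Terraform
  }
}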
Terraform in CI/CD
Automate Terraform to avoid “works on my machine” issues.
GitHub Actions Example
name: Terraform

on:
  pull_request:
    paths:
      - 'infrastructure/**'
  push:
    branches:
      - main

jobs:
  terraform:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: hashicorp/setup-terraform@v2
        with:
          terraform_version: 1.6.0

      - name: Terraform Format Check
        run: terraform fmt -check -recursive

      - name: Terraform Init
        run: terraform init
        working-directory: infrastructure/production
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}

      - name: Terraform Validate
        run: terraform validate
        working-directory: infrastructure/production

      - name: Terraform Plan
        if: github.event_name == 'pull_request'
        run: terraform plan -no-color
        working-directory: infrastructure/production
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}

      - name: Terraform Apply
        if: github.ref == 'refs/heads/main' && github.event_name == 'push'
        run: terraform apply -auto-approve
        working-directory: infrastructure/production
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
Best Practices for CI/CD
- Always run terraform plan on pull requests and post the output as a comment
- Run terraform apply only on the main branch after PR approval
- Use separate credentials for CI with minimal required permissions
- Enable state locking to prevent concurrent applies
- Run terraform fmt and terraform validate in CI as quality gates
Atlantis for PR-based Workflows
Atlantis is a tool that listens to your repository and runs Terraform commands in response to pull request comments.
# Pull request opened
Atlantis runs: terraform plan
# Developer comments: "atlantis apply"
Atlantis runs: terraform apply
# Posts results as PR comment
This creates a GitOps workflow for infrastructure changes.
Handling State Drift
Drift happens when infrastructure is modified outside Terraform (manual changes in console, other automation, etc.).
Detect Drift Regularly
# Run this in a cron job or CI pipeline
terraform plan -detailed-exitcode
# Exit codes:
# 0 = no changes
# 1 = error
# 2 = changes detected (drift)
Alert when drift is detected.
Import Existing Resources
If resources were created manually, import them into Terraform:
terraform import aws_instance.example i-1234567890abcdef0
Then define the resource in your .tf files to match the existing configuration.
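On Terraform 1.5+, imports can also be declared in configuration, which makes them reviewable in a pull request and repeatable across machines:
import {
  to = aws_instance.example
  id = "i-1234567890abcdef0"
}

# Then let Terraform draft the matching resource block:
# terraform plan -generate-config-out=generated.tf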
Refresh State
terraform refresh             # Legacy command; deprecated in favor of the flag below
terraform apply -refresh-only # Terraform 0.15.4+
Use -refresh-only to review how real infrastructure has drifted and update state to match, without changing any resources.
Testing Terraform Code
1. Pre-commit Hooks
# .pre-commit-config.yaml
repos:
  - repo: https://github.com/antonbabenko/pre-commit-terraform
    rev: v1.83.5
    hooks:
      - id: terraform_fmt
      - id: terraform_validate
      - id: terraform_docs
      - id: terraform_tflint
2. Terratest for Integration Tests
package test

import (
    "testing"

    "github.com/gruntwork-io/terratest/modules/terraform"
    "github.com/stretchr/testify/assert"
)

func TestVPCCreation(t *testing.T) {
    opts := &terraform.Options{
        TerraformDir: "../examples/vpc",
    }
    defer terraform.Destroy(t, opts)

    terraform.InitAndApply(t, opts)

    vpcId := terraform.Output(t, opts, "vpc_id")
    assert.NotEmpty(t, vpcId)
}
3. Sentinel for Policy as Code (Terraform Cloud)
# Require all S3 buckets to have encryption
import "tfplan/v2" as tfplan
main = rule {
all tfplan.resource_changes as _, rc {
rc.type is "aws_s3_bucket" implies
rc.change.after.server_side_encryption_configuration != null
}
}
Common Pitfalls and How to Avoid Them
1. Count vs For_Each
Don’t use count for resources that might be reordered:
# BAD: If you remove the first item, all resources get recreated
resource "aws_instance" "app" {
  count         = length(var.instances)
  instance_type = var.instances[count.index]
}

# GOOD: Resources are keyed by their value, so reordering doesn't matter
resource "aws_instance" "app" {
  for_each      = toset(var.instances)
  instance_type = each.value
}
2. Depends_On Overuse
Terraform infers most dependencies automatically. Only use depends_on when there’s an implicit dependency Terraform can’t detect.
# Usually unnecessary
resource "aws_instance" "app" {
  depends_on = [aws_vpc.main] # Terraform knows this already
}

# Necessary when the dependency is implicit
resource "aws_iam_role_policy_attachment" "lambda" {
  role       = aws_iam_role.lambda.name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AWSLambdaVPCAccessExecutionRole"
}

resource "aws_lambda_function" "main" {
  depends_on = [aws_iam_role_policy_attachment.lambda] # Needed: ensure policy is attached before Lambda runs
}
3. String Interpolation in Resource Names
# BAD: Creates a circular dependency
resource "aws_security_group" "app" {
  name = "${aws_security_group.app.id}-sg" # Circular!
}

# GOOD
resource "aws_security_group" "app" {
  name = "${var.app_name}-sg"
}
4. Hardcoded Values
# BAD
resource "aws_instance" "app" {
  ami = "ami-0c55b159cbfafe1f0" # What region? What AMI is this?
}

# GOOD
data "aws_ami" "ubuntu" {
  most_recent = true
  owners      = ["099720109477"] # Canonical

  filter {
    name   = "name"
    values = ["ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-*"]
  }
}

resource "aws_instance" "app" {
  ami = data.aws_ami.ubuntu.id
}
Advanced Patterns
Dynamic Blocks
variable "ingress_rules" {
type = list(object({
from_port = number
to_port = number
protocol = string
cidr_blocks = list(string)
}))
}
resource "aws_security_group" "app" {
dynamic "ingress" {
for_each = var.ingress_rules
content {
from_port = ingress.value.from_port
to_port = ingress.value.to_port
protocol = ingress.value.protocol
cidr_blocks = ingress.value.cidr_blocks
}
}
}
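A caller might then pass the rules as plain data, e.g. in a tfvars file:
ingress_rules = [
  {
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  },
  {
    from_port   = 80
    to_port     = 80
    protocol    = "tcp"
    cidr_blocks = ["10.0.0.0/8"]
  }
]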
Moved Blocks (Terraform 1.1+)
Rename or restructure resources without recreating them:
resource "aws_instance" "new_name" {
# config
}
moved {
from = aws_instance.old_name
to = aws_instance.new_name
}
Using Functions
locals {
  # Merge tags from multiple sources
  all_tags = merge(
    var.default_tags,
    var.environment_tags,
    { Name = "${var.prefix}-${var.name}" }
  )

  # Conditional logic
  instance_type = var.environment == "production" ? "m5.large" : "t3.medium"

  # List manipulation
  private_subnet_cidrs = [for i, az in var.azs : cidrsubnet(var.vpc_cidr, 8, i)]
}
The Workflow
Here’s my daily Terraform workflow:
# 1. Initialize backend and providers (ensures remote state config is current)
terraform init -reconfigure
# 2. Format code
terraform fmt -recursive
# 3. Validate syntax
terraform validate
# 4. Plan changes
terraform plan -out=tfplan
# 5. Review plan carefully
# - What's being created/modified/destroyed?
# - Any unexpected changes?
# - Any sensitive data in the plan output?
# 6. Apply changes
terraform apply tfplan
# 7. Commit and push
git add .
git commit -m "feat: add RDS read replica"
git push
Conclusion
Terraform is powerful but unforgiving. The patterns in this guide are learned from years of managing production infrastructure across multiple cloud providers. Key takeaways:
- Remote state with locking is mandatory
- Modular design keeps infrastructure maintainable
- Automate in CI/CD to catch errors early
- Tag everything for visibility and cost tracking
- Test your infrastructure code like application code
- Plan frequently, apply carefully
Infrastructure as Code transforms infrastructure management from a manual, error-prone process into a repeatable, auditable, collaborative workflow. With these patterns, you can manage infrastructure at scale without losing your mind.
Infrastructure is code. Treat it like code: version control, code review, testing, CI/CD. Your future self will thank you.