Infrastructure as Code: Lessons from the Trenches
Infrastructure as Code sounds simple in theory: describe your infrastructure in files, version control those files, and let automation handle the rest. In practice, it’s one of the most nuanced areas of software engineering.
After managing Terraform codebases for multiple organizations, here’s what I wish someone had told me on day one.
Lesson 1: State Is Everything
Terraform state is the source of truth for what exists in your infrastructure. Mismanaging it is the fastest path to a very bad day.
Rules I follow:
- Remote state backend from day one (S3 + DynamoDB locking)
- State file per environment, no exceptions
- Never manually edit state unless you absolutely must (and document it when you do)
- Regular state backups, even with remote backends
```hcl
terraform {
  backend "s3" {
    bucket         = "myorg-terraform-state"
    key            = "production/core/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-locks"
    encrypt        = true
  }
}
```
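Under the "state file per environment" rule, each environment's root configuration gets its own backend key. A sketch for staging, reusing the bucket and lock table from the example above (the `staging/core` path is illustrative):

```hcl
# environments/staging/backend.tf -- same bucket and lock table,
# but a distinct key so staging and production state never collide
terraform {
  backend "s3" {
    bucket         = "myorg-terraform-state"
    key            = "staging/core/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-locks"
    encrypt        = true
  }
}
```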
Lesson 2: Small, Composable Modules
The temptation is to build one massive module that creates your entire stack. Resist it.
Good module design follows the same principles as good software design:
- Single responsibility: One module does one thing
- Clear interfaces: Inputs and outputs are well-defined
- No hidden side effects: A module shouldn’t create resources you don’t expect
- Versioned: Pin module versions in your root configurations
A module for a “web service” should create the load balancer, target group, and service definition — not the VPC, database, and monitoring stack.
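A minimal interface for such a module might look like the sketch below. The module name and variables are illustrative, not from a real codebase; the point is that callers can only see declared inputs and outputs, so the module's boundaries stay explicit:

```hcl
# modules/api-service/variables.tf -- the module's explicit inputs
variable "instance_type" {
  description = "EC2 instance type for the service"
  type        = string
}

variable "min_capacity" {
  description = "Minimum number of running instances"
  type        = number
  default     = 1
}

variable "max_capacity" {
  description = "Maximum number of running instances"
  type        = number
}

# modules/api-service/outputs.tf -- the only values callers may depend on
output "load_balancer_dns" {
  description = "DNS name of the service load balancer"
  value       = aws_lb.this.dns_name # assumes an aws_lb named "this" in the module body
}
```

Everything not declared here is an implementation detail the module is free to change.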
Lesson 3: Environments Should Be Identical (But They Won’t Be)
The dream is that staging and production are identical except for scale. In reality, there are always differences: different instance sizes, different third-party integrations, different data volumes.
My approach: use the same modules everywhere, but accept that variables will differ. Keep those differences explicit and minimal.
```hcl
# environments/production/main.tf
module "api" {
  source        = "../../modules/api-service"
  instance_type = "c5.xlarge"
  min_capacity  = 3
  max_capacity  = 20
}
```

```hcl
# environments/staging/main.tf
module "api" {
  source        = "../../modules/api-service"
  instance_type = "t3.medium"
  min_capacity  = 1
  max_capacity  = 3
}
```
Lesson 4: Plan Before Apply (Always)
It sounds obvious, but I’ve seen teams set up CI/CD pipelines that auto-apply Terraform changes on merge. This is terrifying.
My workflow:
- terraform plan runs on every PR
- Plan output is posted as a PR comment for review
- Apply requires manual approval, even in CI
- Applies only happen from the main branch
The few minutes spent reviewing a plan have saved me from countless accidental resource deletions.
Lesson 5: Imports and Adoption Are Messy
Bringing existing infrastructure under Terraform management is one of the hardest tasks in DevOps. The terraform import command is your friend, but it only handles state — you still need to write the matching configuration by hand.
Tips for adoption:
- Start with non-critical infrastructure
- Import one resource type at a time
- Run terraform plan after every import to check for drift
- Accept that it will take weeks, not days
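Since Terraform 1.5, imports can also be declared in configuration with an import block, which makes the adoption step reviewable in a PR like any other change. A sketch, with illustrative resource and bucket names:

```hcl
# Declarative import: on the next apply, Terraform adopts the existing
# bucket into state instead of trying to create it
import {
  to = aws_s3_bucket.logs
  id = "myorg-existing-logs-bucket"
}

# The matching configuration still has to exist (written by hand,
# or drafted with: terraform plan -generate-config-out=generated.tf)
resource "aws_s3_bucket" "logs" {
  bucket = "myorg-existing-logs-bucket"
}
```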
The Bigger Picture
Infrastructure as Code isn’t just about automation — it’s about communication. When your infrastructure is defined in code, anyone on the team can understand what exists, why it exists, and how it’s configured.
That transparency is worth more than any time savings from automation alone.