Infrastructure as Code: Lessons from the Trenches
Infrastructure as Code sounds simple in theory: describe your infrastructure in files, version control those files, and let automation handle the rest. In practice, it’s one of the most nuanced areas of software engineering.
After managing Terraform codebases for multiple organizations, here’s what I wish someone had told me on day one.
Lesson 1: State Is Everything
Terraform state is the source of truth for what exists in your infrastructure. Mismanaging it is the fastest path to a very bad day.
Rules I follow:
- Remote state backend from day one (S3 + DynamoDB locking)
- State file per environment, no exceptions
- Never manually edit state unless you absolutely must (and document it when you do)
- Regular state backups, even with remote backends
```hcl
terraform {
  backend "s3" {
    bucket         = "myorg-terraform-state"
    key            = "production/core/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-locks"
    encrypt        = true
  }
}
```
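Under the "state file per environment" rule, each environment's root configuration gets its own backend key. A sketch for staging, reusing the bucket and lock table from the example above (the `staging/core` path is illustrative):

```hcl
# environments/staging/backend.tf -- same bucket and lock table,
# but a distinct key so staging and production state never collide
terraform {
  backend "s3" {
    bucket         = "myorg-terraform-state"
    key            = "staging/core/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-locks"
    encrypt        = true
  }
}
```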
Lesson 2: Small, Composable Modules
The temptation is to build one massive module that creates your entire stack. Resist it.
Good module design follows the same principles as good software design:
- Single responsibility: One module does one thing
- Clear interfaces: Inputs and outputs are well-defined
- No hidden side effects: A module shouldn’t create resources you don’t expect
- Versioned: Pin module versions in your root configurations
A module for a “web service” should create the load balancer, target group, and service definition — not the VPC, database, and monitoring stack.
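A minimal interface for such a module might look like the sketch below. The module name and variables are illustrative, not from a real codebase; the point is that callers can only see declared inputs and outputs, so the module's boundaries stay explicit:

```hcl
# modules/api-service/variables.tf -- the module's explicit inputs
variable "instance_type" {
  description = "EC2 instance type for the service"
  type        = string
}

variable "min_capacity" {
  description = "Minimum number of running instances"
  type        = number
  default     = 1
}

variable "max_capacity" {
  description = "Maximum number of running instances"
  type        = number
}

# modules/api-service/outputs.tf -- the only values callers may depend on
output "load_balancer_dns" {
  description = "DNS name of the service load balancer"
  value       = aws_lb.this.dns_name # assumes an aws_lb named "this" in the module body
}
```

Everything not declared here is an implementation detail the module is free to change.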
Lesson 3: Environments Should Be Identical (But They Won’t Be)
The dream is that staging and production are identical except for scale. In reality, there are always differences: different instance sizes, different third-party integrations, different data volumes.
My approach: use the same modules everywhere, but accept that variables will differ. Keep those differences explicit and minimal.
```hcl
# environments/production/main.tf
module "api" {
  source        = "../../modules/api-service"
  instance_type = "c5.xlarge"
  min_capacity  = 3
  max_capacity  = 20
}
```

```hcl
# environments/staging/main.tf
module "api" {
  source        = "../../modules/api-service"
  instance_type = "t3.medium"
  min_capacity  = 1
  max_capacity  = 3
}
```
Lesson 4: Plan Before Apply (Always)
It sounds obvious, but I’ve seen teams set up CI/CD pipelines that auto-apply Terraform changes on merge. This is terrifying.
My workflow:
- terraform plan runs on every PR
- Plan output is posted as a PR comment for review
- Apply requires manual approval, even in CI
- Applies only happen from the main branch
The few minutes spent reviewing a plan have saved me from countless accidental resource deletions.
Lesson 5: Imports and Adoption Are Messy
Bringing existing infrastructure under Terraform management is one of the hardest tasks in DevOps. The terraform import command is your friend, but it only handles state — you still need to write the matching configuration by hand.
Tips for adoption:
- Start with non-critical infrastructure
- Import one resource type at a time
- Run terraform plan after every import to check for drift
- Accept that it will take weeks, not days
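Since Terraform 1.5, imports can also be declared in configuration with an import block, which makes the adoption step reviewable in a PR like any other change. A sketch, with illustrative resource and bucket names:

```hcl
# Declarative import: on the next apply, Terraform adopts the existing
# bucket into state instead of trying to create it
import {
  to = aws_s3_bucket.logs
  id = "myorg-existing-logs-bucket"
}

# The matching configuration still has to exist (written by hand,
# or drafted with: terraform plan -generate-config-out=generated.tf)
resource "aws_s3_bucket" "logs" {
  bucket = "myorg-existing-logs-bucket"
}
```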
The Bigger Picture
Infrastructure as Code isn’t just about automation — it’s about communication. When your infrastructure is defined in code, anyone on the team can understand what exists, why it exists, and how it’s configured.
That transparency is worth more than any time savings from automation alone.