Database Migrations Without Downtime
There’s a special kind of anxiety that comes with running ALTER TABLE on a production database with millions of rows. Will it lock the table? How long will it take? What happens to in-flight requests?
Over the years, I’ve developed a set of patterns for evolving database schemas without downtime. Here’s the playbook.
The Expand-and-Contract Pattern
The core idea is simple: never make a breaking change in a single step. Instead, expand the schema to support both old and new formats, migrate the data, then contract by removing the old format.
Example: Renaming a column
You can’t just ALTER TABLE users RENAME COLUMN name TO full_name. That breaks every query referencing name instantly.
Instead:
- Expand: Add the new column full_name
- Backfill: Copy data from name to full_name
- Dual-write: Update application code to write to both columns
- Switch reads: Update application to read from full_name
- Contract: Drop the old name column
Yes, it’s five steps instead of one. But zero users experience an error.
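During the dual-write phase, the application writes both columns in a single statement. A minimal sketch, assuming a DB-API-style connection and the users table from above (save_user is a hypothetical helper, not from any particular framework):

```python
def save_user(db, user_id, name):
    # Dual-write: keep the old and new columns in sync, so code
    # reading either column sees consistent data during the
    # migration window.
    db.execute(
        "UPDATE users SET name = %s, full_name = %s WHERE id = %s",
        (name, name, user_id),
    )
```

Once all reads have switched to full_name, this collapses back to a single-column write and the old column can be dropped.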
Safe Operations vs. Dangerous Operations
In PostgreSQL, some DDL operations are safe to run on large tables, and some are not:
Safe (no long locks):
- CREATE INDEX CONCURRENTLY — builds the index without blocking writes
- ADD COLUMN with no default (since PostgreSQL 11, with a default too)
- DROP COLUMN — only marks the column as invisible, doesn’t rewrite
- CREATE TABLE, DROP TABLE
Dangerous (acquires heavy locks):
- ADD COLUMN with a volatile default (pre-11)
- ALTER COLUMN SET NOT NULL — requires a full table scan
- ALTER COLUMN TYPE — rewrites the entire table
- Any operation that rewrites the table
-- Safe: concurrent index creation
CREATE INDEX CONCURRENTLY idx_users_email ON users (email);
-- Dangerous: this locks the table for the entire rewrite
ALTER TABLE users ALTER COLUMN status TYPE integer USING status::integer;
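When a dangerous operation is unavoidable, it helps to cap how long the statement may wait for its lock. The sketch below is a hypothetical helper (lock_timeout itself is a standard PostgreSQL setting): the DDL fails fast and can be retried off-peak, instead of queueing behind traffic and blocking every query that arrives after it.

```python
def guarded_ddl(statement, lock_timeout_ms=2000):
    # Return the statements to run in one session: cap lock waits
    # first, then run the DDL. If the ALTER cannot acquire its lock
    # within the timeout, it errors out instead of silently blocking
    # all subsequent queries on the table.
    return [
        f"SET lock_timeout = '{lock_timeout_ms}ms'",
        statement,
    ]
```

Usage: run each returned statement in order on the same connection, e.g. guarded_ddl("ALTER TABLE users ALTER COLUMN status TYPE integer USING status::integer").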
Adding a NOT NULL Constraint Safely
Adding NOT NULL directly on a large table can lock it for minutes. Here’s the safe approach:
-- Step 1: Add a CHECK constraint as NOT VALID (instant, no scan)
ALTER TABLE users
ADD CONSTRAINT users_email_not_null
CHECK (email IS NOT NULL) NOT VALID;
-- Step 2: Validate the constraint (scans table but doesn't lock writes)
ALTER TABLE users
VALIDATE CONSTRAINT users_email_not_null;
-- Step 3 (optional): Convert to a proper NOT NULL
-- PostgreSQL recognizes the CHECK constraint and skips the scan
ALTER TABLE users ALTER COLUMN email SET NOT NULL;
ALTER TABLE users DROP CONSTRAINT users_email_not_null;
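The three steps lend themselves to automation. A sketch of a generator that emits the statements in order (a hypothetical helper, assuming PostgreSQL and the constraint-naming convention above):

```python
def safe_not_null_steps(table, column):
    # Emit the safe NOT NULL migration as ordered statements:
    # a NOT VALID check (instant), validation (scans without
    # blocking writes), then the real NOT NULL and cleanup.
    constraint = f"{table}_{column}_not_null"
    return [
        f"ALTER TABLE {table} ADD CONSTRAINT {constraint} "
        f"CHECK ({column} IS NOT NULL) NOT VALID",
        f"ALTER TABLE {table} VALIDATE CONSTRAINT {constraint}",
        f"ALTER TABLE {table} ALTER COLUMN {column} SET NOT NULL",
        f"ALTER TABLE {table} DROP CONSTRAINT {constraint}",
    ]
```

Running each statement as its own transaction keeps every individual lock short.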
Backfilling Large Tables
When you need to update millions of rows, don’t do it in one transaction. That creates a massive transaction log entry and can cause replication lag.
Instead, batch it:
import time

def backfill_column(db, table, source_col, target_col, batch_size=1000):
    while True:
        # PostgreSQL's UPDATE has no LIMIT clause; bound each batch
        # by selecting a limited set of row IDs in a subquery.
        updated = db.execute(f"""
            UPDATE {table}
            SET {target_col} = {source_col}
            WHERE ctid IN (
                SELECT ctid FROM {table}
                WHERE {target_col} IS NULL
                LIMIT {batch_size}
            )
        """)
        if updated == 0:
            break
        db.commit()
        time.sleep(0.1)  # Be kind to the database
The time.sleep matters. You’re sharing this database with production traffic.
Migration Tooling
Whatever migration tool you use, make sure it supports:
- Reversibility: Every migration should have an up and a down
- Idempotency: Running a migration twice shouldn’t break anything
- Locking awareness: The tool should warn about operations that acquire heavy locks
- Deployment independence: Migrations should be deployable separately from application code
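Reversibility and idempotency together might look like this in a migration file (a sketch in the up/down style many tools use; the IF NOT EXISTS / IF EXISTS guards are what make reruns safe):

```python
def up(db):
    # Idempotent: IF NOT EXISTS means a second run is a no-op.
    db.execute("ALTER TABLE users ADD COLUMN IF NOT EXISTS full_name text")

def down(db):
    # The reverse migration, equally safe to repeat.
    db.execute("ALTER TABLE users DROP COLUMN IF EXISTS full_name")
```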
The Golden Rule
Your application code must be compatible with both the old and new schema during the migration window.
This means deploying code changes in this order:
- Deploy code that works with both schemas
- Run the migration
- Deploy code that only works with the new schema
It’s more work upfront, but it means you can roll back the code without rolling back the migration — and vice versa.
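In code, “compatible with both schemas” is often as small as a fallback read. A sketch for the column rename from earlier (display_name is a hypothetical helper; rows are assumed to be dicts):

```python
def display_name(row):
    # Step-1 code: works before, during, and after the migration.
    # Prefer the new column when it is populated, else fall back
    # to the old one.
    return row.get("full_name") or row.get("name")
```

Once the contract step has run, this can be simplified to read full_name alone.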
Final Thoughts
Zero-downtime migrations require patience and discipline. They’re slower than just taking the system down for maintenance. But for any system where availability matters, they’re non-negotiable.
The best migration is the one your users never notice happened.