Article

Database Migrations Without Downtime

There’s a special kind of anxiety that comes with running ALTER TABLE on a production database with millions of rows. Will it lock the table? How long will it take? What happens to in-flight requests?

Over the years, I’ve developed a set of patterns for evolving database schemas without downtime. Here’s the playbook.

The Expand-and-Contract Pattern

The core idea is simple: never make a breaking change in a single step. Instead, expand the schema to support both old and new formats, migrate the data, then contract by removing the old format.

Example: Renaming a column

You can’t just ALTER TABLE users RENAME COLUMN name TO full_name. That breaks every query referencing name instantly.

Instead:

  1. Expand: Add the new column full_name
  2. Backfill: Copy data from name to full_name
  3. Dual-write: Update application code to write to both columns
  4. Switch reads: Update application to read from full_name
  5. Contract: Drop the old name column

Yes, it’s five steps instead of one. But zero users experience an error.

Safe Operations vs. Dangerous Operations

In PostgreSQL, some DDL operations are safe to run on large tables, and some are not:

Safe (no long locks):

  • CREATE INDEX CONCURRENTLY — builds index without blocking writes
  • ADD COLUMN with no default (PostgreSQL 11+, with default too)
  • DROP COLUMN — only marks column as invisible, doesn’t rewrite
  • CREATE TABLE, DROP TABLE

Dangerous (acquires heavy locks):

  • ADD COLUMN with a volatile default (pre-11)
  • ALTER COLUMN SET NOT NULL — requires full table scan
  • ALTER COLUMN TYPE — rewrites the entire table
  • Any operation that rewrites the table
-- Safe: concurrent index creation
CREATE INDEX CONCURRENTLY idx_users_email ON users (email);

-- Dangerous: this locks the table for the entire rewrite
ALTER TABLE users ALTER COLUMN status TYPE integer USING status::integer;

Adding a NOT NULL Constraint Safely

Adding NOT NULL directly on a large table can lock it for minutes. Here’s the safe approach:

-- Step 1: Add a CHECK constraint as NOT VALID (instant, no scan)
ALTER TABLE users
  ADD CONSTRAINT users_email_not_null
  CHECK (email IS NOT NULL) NOT VALID;

-- Step 2: Validate the constraint (scans table but doesn't lock writes)
ALTER TABLE users
  VALIDATE CONSTRAINT users_email_not_null;

-- Step 3 (optional): Convert to a proper NOT NULL
-- PostgreSQL recognizes the CHECK constraint and skips the scan
ALTER TABLE users ALTER COLUMN email SET NOT NULL;
ALTER TABLE users DROP CONSTRAINT users_email_not_null;

Backfilling Large Tables

When you need to update millions of rows, don’t do it in one transaction. That creates a massive transaction log entry and can cause replication lag.

Instead, batch it:

def backfill_column(table, source_col, target_col, batch_size=1000):
    while True:
        updated = db.execute(f"""
            UPDATE {table}
            SET {target_col} = {source_col}
            WHERE {target_col} IS NULL
            LIMIT {batch_size}
        """)
        if updated == 0:
            break
        db.commit()
        time.sleep(0.1)  # Be kind to the database

The time.sleep matters. You’re sharing this database with production traffic.

Migration Tooling

Whatever migration tool you use, make sure it supports:

  • Reversibility: Every migration should have an up and a down
  • Idempotency: Running a migration twice shouldn’t break anything
  • Locking awareness: The tool should warn about operations that acquire heavy locks
  • Deployment independence: Migrations should be deployable separately from application code

The Golden Rule

Your application code must be compatible with both the old and new schema during the migration window.

This means deploying code changes in this order:

  1. Deploy code that works with both schemas
  2. Run the migration
  3. Deploy code that only works with the new schema

It’s more work upfront, but it means you can roll back the code without rolling back the migration — and vice versa.

Final Thoughts

Zero-downtime migrations require patience and discipline. They’re slower than just taking the system down for maintenance. But for any system where availability matters, they’re non-negotiable.

The best migration is the one your users never notice happened.