Building Resilient Backend Systems
After spending years building and maintaining backend systems at scale, I’ve come to appreciate one truth above all others: everything fails. The question isn’t whether your system will encounter failures, but how gracefully it handles them.
The Fallacy of “Five Nines”
Early in my career, I obsessed over preventing failures. I’d write defensive code, add redundant checks, and convince myself that with enough effort, I could build something unbreakable. I was wrong.
The shift happened when I stopped trying to prevent all failures and started designing systems that expect them. This mental model change is the foundation of resilient architecture.
Circuit Breakers: Your First Line of Defense
The circuit breaker pattern is deceptively simple: if a downstream service starts failing, stop calling it for a while. Let it recover. Serve degraded responses instead of cascading the failure.
import time

class CircuitOpenError(Exception):
    """Raised when the breaker is open and calls are being rejected."""

class CircuitBreaker:
    def __init__(self, failure_threshold=5, recovery_timeout=30):
        self.failure_count = 0
        self.threshold = failure_threshold
        self.timeout = recovery_timeout
        self.state = "closed"
        self.last_failure_time = None

    def call(self, func):
        if self.state == "open":
            if time.time() - self.last_failure_time > self.timeout:
                self.state = "half-open"  # allow one trial call through
            else:
                raise CircuitOpenError()
        try:
            result = func()
            self.on_success()
            return result
        except Exception:
            self.on_failure()
            raise

    def on_success(self):
        self.failure_count = 0
        self.state = "closed"

    def on_failure(self):
        self.failure_count += 1
        self.last_failure_time = time.time()
        if self.failure_count >= self.threshold or self.state == "half-open":
            self.state = "open"
The key insight isn’t the implementation — it’s knowing where to place circuit breakers. Every external dependency should have one. Every. Single. One.
Retry with Exponential Backoff
Retries are essential but dangerous. A naive retry loop can turn a minor blip into a thundering herd that takes down the service you’re trying to call.
The formula I use:
- Base delay: 100ms
- Multiplier: 2x per attempt
- Jitter: Random 0-100% of calculated delay
- Max retries: 3 (rarely more)
- Max delay cap: 10 seconds
Jitter is the most important and most overlooked component. Without it, every client that failed at the same moment retries at the same moment, recreating the very spike you were trying to absorb.
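The formula above can be sketched in a few lines. This is a minimal illustration (the function name and defaults are mine, mapped to the numbers listed); production code would typically catch only retryable exceptions rather than all of them.

```python
import random
import time

def retry_with_backoff(func, base_delay=0.1, multiplier=2,
                       max_retries=3, max_delay=10):
    """Call func, retrying on failure with exponential backoff and full jitter."""
    for attempt in range(max_retries + 1):
        try:
            return func()
        except Exception:
            if attempt == max_retries:
                raise  # retries exhausted; surface the last error
            # Exponential backoff, capped at max_delay
            delay = min(base_delay * (multiplier ** attempt), max_delay)
            # Full jitter: sleep a random 0-100% of the calculated delay
            time.sleep(delay * random.random())
```

Full jitter (sleeping a random fraction of the backoff window) spreads retries across the whole interval, which is what breaks up the synchronized herd.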
Graceful Degradation
The best resilient systems don’t just survive failures — they provide useful partial responses. Some examples:
- A product page can show cached pricing if the pricing service is down
- Search results can fall back to a simpler algorithm if the ML ranking service is unavailable
- A dashboard can display slightly stale data rather than an error page
This requires designing your system with degradation paths from the beginning. It’s much harder to retrofit.
Timeouts: The Unsung Hero
Every network call needs a timeout. Every single one. I’ve seen more outages caused by missing timeouts than almost any other issue.
My rules of thumb:
- Database queries: 5 seconds max (if it takes longer, your query needs optimization)
- Internal service calls: 2-3 seconds
- External API calls: 10 seconds with a circuit breaker
- Health checks: 1 second
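Most clients accept a timeout directly (e.g. a `timeout=` argument on HTTP libraries), and that should be the first choice. When a call has no native timeout support, one workaround is to run it in a worker thread and bound the wait; this is a sketch of that approach, with the caveat noted in the comments.

```python
import concurrent.futures

def call_with_timeout(func, timeout_seconds):
    """Run func in a worker thread; raise TimeoutError if it exceeds the budget.

    Caveat: the worker thread is not killed on timeout -- it keeps running
    in the background. Prefer a client's native timeout when one exists.
    """
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(func)
        return future.result(timeout=timeout_seconds)
```

The budgets in the list above map straight onto `timeout_seconds`: 5 for database queries, 2-3 for internal calls, and so on.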
What I’ve Learned
Building resilient systems is less about clever code and more about honest architecture. Acknowledge that failures will happen, design for them explicitly, and test those failure paths regularly.
The most reliable systems I’ve built weren’t the ones with the most redundancy — they were the ones where every engineer on the team understood exactly what would happen when things went wrong.