Building Resilient Backend Systems
After spending years building and maintaining backend systems at scale, I’ve come to appreciate one truth above all others: everything fails. The question isn’t whether your system will encounter failures, but how gracefully it handles them.
The Fallacy of “Five Nines”
Early in my career, I obsessed over preventing failures. I’d write defensive code, add redundant checks, and convince myself that with enough effort, I could build something unbreakable. I was wrong.
The shift happened when I stopped trying to prevent all failures and started designing systems that expect them. This mental model change is the foundation of resilient architecture.
Circuit Breakers: Your First Line of Defense
The circuit breaker pattern is deceptively simple: if a downstream service starts failing, stop calling it for a while. Let it recover. Serve degraded responses instead of cascading the failure.
import time

class CircuitOpenError(Exception):
    """Raised when the breaker is open and calls are being rejected."""

class CircuitBreaker:
    def __init__(self, failure_threshold=5, recovery_timeout=30):
        self.failure_count = 0
        self.threshold = failure_threshold
        self.timeout = recovery_timeout
        self.state = "closed"
        self.last_failure_time = None

    def call(self, func):
        if self.state == "open":
            if time.time() - self.last_failure_time > self.timeout:
                self.state = "half-open"  # allow one trial call through
            else:
                raise CircuitOpenError()
        try:
            result = func()
            self.on_success()
            return result
        except Exception:
            self.on_failure()
            raise

    def on_success(self):
        self.failure_count = 0
        self.state = "closed"

    def on_failure(self):
        self.failure_count += 1
        self.last_failure_time = time.time()
        if self.failure_count >= self.threshold or self.state == "half-open":
            self.state = "open"
The key insight isn’t the implementation — it’s knowing where to place circuit breakers. Every external dependency should have one. Every. Single. One.
Retry with Exponential Backoff
Retries are essential but dangerous. A naive retry loop can turn a minor blip into a thundering herd that takes down the service you’re trying to call.
The formula I use:
- Base delay: 100ms
- Multiplier: 2x per attempt
- Jitter: Random 0-100% of calculated delay
- Max retries: 3 (rarely more)
- Max delay cap: 10 seconds
Jitter is the most important and most overlooked component. Without it, every client that failed at the same moment retries at the same moment, recreating the very spike you were trying to absorb.
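The formula above can be sketched in a few lines. This is a minimal illustration (the function name and defaults are mine, mapped to the numbers listed); production code would typically catch only retryable exceptions rather than all of them.

```python
import random
import time

def retry_with_backoff(func, base_delay=0.1, multiplier=2,
                       max_retries=3, max_delay=10):
    """Call func, retrying on failure with exponential backoff and full jitter."""
    for attempt in range(max_retries + 1):
        try:
            return func()
        except Exception:
            if attempt == max_retries:
                raise  # retries exhausted; surface the last error
            # Exponential backoff, capped at max_delay
            delay = min(base_delay * (multiplier ** attempt), max_delay)
            # Full jitter: sleep a random 0-100% of the calculated delay
            time.sleep(delay * random.random())
```

Full jitter (sleeping a random fraction of the backoff window) spreads retries across the whole interval, which is what breaks up the synchronized herd.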
Graceful Degradation
The best resilient systems don’t just survive failures — they provide useful partial responses. Some examples:
- A product page can show cached pricing if the pricing service is down
- Search results can fall back to a simpler algorithm if the ML ranking service is unavailable
- A dashboard can display slightly stale data rather than an error page
This requires designing your system with degradation paths from the beginning. It’s much harder to retrofit.
Timeouts: The Unsung Hero
Every network call needs a timeout. Every single one. I’ve seen more outages caused by missing timeouts than almost any other issue.
My rules of thumb:
- Database queries: 5 seconds max (if it takes longer, your query needs optimization)
- Internal service calls: 2-3 seconds
- External API calls: 10 seconds with a circuit breaker
- Health checks: 1 second
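Most clients accept a timeout directly (e.g. a `timeout=` argument on HTTP libraries), and that should be the first choice. When a call has no native timeout support, one workaround is to run it in a worker thread and bound the wait; this is a sketch of that approach, with the caveat noted in the comments.

```python
import concurrent.futures

def call_with_timeout(func, timeout_seconds):
    """Run func in a worker thread; raise TimeoutError if it exceeds the budget.

    Caveat: the worker thread is not killed on timeout -- it keeps running
    in the background. Prefer a client's native timeout when one exists.
    """
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(func)
        return future.result(timeout=timeout_seconds)
```

The budgets in the list above map straight onto `timeout_seconds`: 5 for database queries, 2-3 for internal calls, and so on.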
What I’ve Learned
Building resilient systems is less about clever code and more about honest architecture. Acknowledge that failures will happen, design for them explicitly, and test those failure paths regularly.
The most reliable systems I’ve built weren’t the ones with the most redundancy — they were the ones where every engineer on the team understood exactly what would happen when things went wrong.