Distributed Transactions Are a Systems Problem, Not Just a Tech One

Nicolas Cava
Edited on Jul 14, 2025
Reading time: 2 minutes

Distributed transactions are hard. Anyone who's worked beyond a single database knows this. Once you start coordinating across multiple services, you enter a world where failure is normal, networks are unreliable, and everything costs more to guarantee.

Throughout my career—working across an ecosystem of hundreds of microservices—I experienced firsthand just how complex distributed systems can get. Designing and maintaining services that had to interoperate reliably despite partial failures, asynchronous communication, and evolving schemas taught me something important:

Distributed transactions aren't just a technical problem. They're a systems thinking challenge.

Why Distributed Transactions Are So Difficult

Here's the no-bullshit truth:

  • You're trying to coordinate multiple independent systems.
  • Each system has its own state, its own failure modes, and its own performance characteristics.
  • Network calls are inherently unreliable—they may time out, fail, or return unexpected errors.
  • There is no global clock, so reasoning about "what happened first" is tricky at best.
  • Once any piece fails mid-transaction, you're in damage control: rollbacks, retries, or worse—manual cleanup.

Add asynchronous messaging, retries, and out-of-order execution, and you're managing chaos unless you design explicitly for failure and reconciliation.
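One concrete way to design for retries is to make every event handler idempotent, so a redelivered or duplicated message is a no-op. A minimal sketch (the event shape and in-memory stores are illustrative; in production the processed-ID set would live in a durable store):

```python
# Idempotent event handler sketch: track processed event IDs so that
# redelivery (timeouts, broker retries, duplicates) cannot double-charge.
processed_ids = set()  # illustrative; would be a durable store in production
charges = []           # stand-in for the side effect (charging the customer)

def handle_payment_event(event: dict) -> None:
    """Charge the customer exactly once, however often the event arrives."""
    if event["id"] in processed_ids:
        return  # duplicate delivery: already handled, do nothing
    charges.append(event["amount"])
    processed_ids.add(event["id"])

# The broker times out and redelivers the same event:
handle_payment_event({"id": "evt-1", "amount": 42})
handle_payment_event({"id": "evt-1", "amount": 42})  # safely ignored
```

With this in place, "at-least-once" delivery from the messaging layer effectively becomes "exactly-once" processing at the handler.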

Example: Distributed Order Fulfillment

Imagine an e-commerce checkout flow across multiple microservices:

Order Service:

  • Creates order
  • Emits orderCreated

Payment Service:

  • Charges customer
  • Emits paymentProcessed

Inventory Service:

  • Reserves items
  • Emits inventoryReserved

Shipping Service:

  • Ships order
  • Emits orderShipped

Each step is loosely coupled, and each service only knows how to do its job and emit an event. If, say, payment fails, you can trigger a rollback: release the reserved inventory and cancel the order. No central coordinator, no two-phase commit (2PC)—just event-driven recovery.
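The flow above can be sketched as a choreographed saga: each service reacts to events and emits its own, and a failed payment triggers compensating events instead of a central coordinator. This is a hypothetical sketch, not a real framework; all event names and the tiny in-process event bus are illustrative, and inventory is reserved before charging so a failed payment has something to compensate:

```python
# Choreographed saga sketch: an in-process event bus stands in for a broker.
from collections import defaultdict

handlers = defaultdict(list)
log = []  # records emitted events so the flow is observable

def on(event_type):
    """Register a handler for an event type (illustrative subscription API)."""
    def register(fn):
        handlers[event_type].append(fn)
        return fn
    return register

def emit(event_type, **data):
    log.append(event_type)
    for fn in handlers[event_type]:
        fn(data)

@on("orderCreated")
def reserve_inventory(data):
    emit("inventoryReserved", order=data["order"])

@on("inventoryReserved")
def charge_customer(data):
    if data["order"]["payment_ok"]:
        emit("paymentProcessed", order=data["order"])
    else:
        emit("paymentFailed", order=data["order"])  # start compensation

@on("paymentProcessed")
def ship(data):
    emit("orderShipped", order=data["order"])

# Compensating handlers: undo prior steps when payment fails.
@on("paymentFailed")
def release_inventory(data):
    emit("inventoryReleased", order=data["order"])

@on("inventoryReleased")
def cancel_order(data):
    emit("orderCancelled", order=data["order"])

# A checkout whose payment fails: inventory is released, order cancelled.
emit("orderCreated", order={"payment_ok": False})
```

No service knows about the whole flow; the recovery path is just another chain of events, which is what makes this pattern resilient to partial failure.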

Final Thoughts

Distributed transactions force you to embrace uncertainty and design for failure. Trying to fake strong consistency with fragile coordination doesn't scale. Instead, lean into event-driven architecture, design for eventual consistency, and use patterns like sagas to build resilient, observable, testable systems.

If you're building anything asynchronous and distributed, your job isn't to prevent failure. It's to make failure safe.
