- [[#The Big Idea|The Big Idea]]
- [[#Why Distributed Transactions Are Problematic|Why Distributed Transactions Are Problematic]]
- [[#Why Distributed Transactions Are Problematic#ACID Reminder|ACID Reminder]]
- [[#Why Distributed Transactions Are Problematic#The Problem with Microservices|The Problem with Microservices]]
- [[#Two-Phase Commit (2PC) — Just Say No|Two-Phase Commit (2PC) — Just Say No]]
- [[#Two-Phase Commit (2PC) — Just Say No#How 2PC Works|How 2PC Works]]
- [[#Two-Phase Commit (2PC) — Just Say No#Why It's Problematic|Why It's Problematic]]
- [[#Two-Phase Commit (2PC) — Just Say No#When 2PC Is OK|When 2PC Is OK]]
- [[#Sagas: The Alternative|Sagas: The Alternative]]
- [[#Sagas: The Alternative#Key Insight|Key Insight]]
- [[#Sagas: The Alternative#MusicCorp Order Fulfillment Example (from book)|MusicCorp Order Fulfillment Example (from book)]]
- [[#Saga Failure Modes|Saga Failure Modes]]
- [[#Saga Failure Modes#Backward Recovery (Rollback)|Backward Recovery (Rollback)]]
- [[#Saga Failure Modes#Important: Semantic Rollback|Important: Semantic Rollback]]
- [[#Saga Failure Modes#Forward Recovery (Retry)|Forward Recovery (Retry)]]
- [[#Saga Failure Modes#Mixing Recovery Modes|Mixing Recovery Modes]]
- [[#Reordering Steps to Reduce Rollbacks|Reordering Steps to Reduce Rollbacks]]
- [[#Orchestration vs Choreography|Orchestration vs Choreography]]
- [[#Orchestration vs Choreography#Orchestrated Saga|Orchestrated Saga]]
- [[#Orchestration vs Choreography#Choreographed Saga|Choreographed Saga]]
- [[#Orchestration vs Choreography#Author's Advice|Author's Advice]]
- [[#Business Failures vs Technical Failures|Business Failures vs Technical Failures]]
- [[#How MusicCorp Compares to Chapter 6 Recommendations|How MusicCorp Compares to Chapter 6 Recommendations]]
- [[#Our Current Order Flow Analyzed|Our Current Order Flow Analyzed]]
- [[#Our Current Order Flow Analyzed#What's Working|What's Working]]
- [[#Our Current Order Flow Analyzed#What's Missing|What's Missing]]
- [[#Discussion Questions|Discussion Questions]]
- [[#Key Quotes|Key Quotes]]
- [[#Recommended Reading|Recommended Reading]]
## The Big Idea
When you break a monolith into microservices, you lose **ACID transactions** across service boundaries. This chapter is about how to coordinate multi-service operations without distributed transactions, specifically by using **sagas**.
---
## Why Distributed Transactions Are Problematic
### ACID Reminder
| Property | What It Gives You |
|----------|------------------|
| **Atomicity** | All changes commit or none do |
| **Consistency** | Database always in valid state |
| **Isolation** | Concurrent transactions don't interfere |
| **Durability** | Committed data survives failures |
### The Problem with Microservices
```
MONOLITH (single transaction):
┌─────────────────────────────────────────────┐
│ BEGIN TRANSACTION                           │
│   UPDATE customers SET status = 'VERIFIED'  │
│   DELETE FROM pending_enrollments           │
│ COMMIT                                      │
└─────────────────────────────────────────────┘
          ↓ SPLIT INTO MICROSERVICES ↓
MICROSERVICES (two transactions):
┌──────────────────┐    ┌──────────────────┐
│ Customer Service │    │ Enrollment Svc   │
│ BEGIN            │    │ BEGIN            │
│   UPDATE...      │    │   DELETE...      │
│ COMMIT           │    │ COMMIT           │  ← What if this fails?
└──────────────────┘    └──────────────────┘
```
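We've lost guaranteed atomicity. If the second transaction fails after the first has already committed, the system is left in an inconsistent state: the customer is marked `VERIFIED`, but the pending enrollment row still exists.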
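The partial-failure case can be reproduced in a few lines. This is a minimal sketch, assuming two in-memory SQLite databases stand in for the two services' private stores; the table layout and the `verify_customer` function are illustrative, not taken from the book.

```python
import sqlite3

# Two separate in-memory databases stand in for the two services' own stores.
customer_db = sqlite3.connect(":memory:")
enrollment_db = sqlite3.connect(":memory:")

customer_db.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, status TEXT)")
customer_db.execute("INSERT INTO customers VALUES (1, 'PENDING')")
customer_db.commit()

enrollment_db.execute("CREATE TABLE pending_enrollments (customer_id INTEGER)")
enrollment_db.execute("INSERT INTO pending_enrollments VALUES (1)")
enrollment_db.commit()


def verify_customer(enrollment_fails: bool) -> None:
    # Transaction 1: commits on its own inside the Customer service.
    customer_db.execute("UPDATE customers SET status = 'VERIFIED' WHERE id = 1")
    customer_db.commit()

    # Transaction 2: a separate commit inside the Enrollment service.
    try:
        if enrollment_fails:
            raise RuntimeError("enrollment service unavailable (simulated)")
        enrollment_db.execute("DELETE FROM pending_enrollments WHERE customer_id = 1")
        enrollment_db.commit()
    except RuntimeError:
        # Rolling back here only affects transaction 2; transaction 1 stays committed.
        enrollment_db.rollback()


verify_customer(enrollment_fails=True)
status = customer_db.execute("SELECT status FROM customers WHERE id = 1").fetchone()[0]
pending = enrollment_db.execute("SELECT COUNT(*) FROM pending_enrollments").fetchone()[0]
print(status, pending)  # VERIFIED 1
```

The customer is verified while the enrollment is still pending, which is exactly the state a single ACID transaction would have prevented.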
---
## Two-Phase Commit (2PC) — Just Say No
The book strongly advises against using distributed transactions such as two-phase commit to coordinate state changes across microservices.
### How 2PC Works
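1. **Voting phase**: The coordinator asks every worker "Can you commit?" A worker that votes yes must guarantee it can still commit later.
2. **Commit phase**: If all workers vote yes, the coordinator tells them to commit. If any worker votes no, the coordinator tells all of them to abort.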
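The two phases can be sketched as follows. This is a toy, in-process model under stated assumptions: the `Worker` class and `two_phase_commit` function are hypothetical names, there is no network, and none of the failure handling a real coordinator needs is included.

```python
from dataclasses import dataclass


@dataclass
class Worker:
    name: str
    can_commit: bool = True
    state: str = "idle"  # idle -> prepared -> committed, or aborted

    def prepare(self) -> bool:
        # Voting phase: a yes vote means the worker holds its locks
        # until the coordinator announces the outcome.
        if self.can_commit:
            self.state = "prepared"
        return self.can_commit

    def commit(self) -> None:
        self.state = "committed"

    def abort(self) -> None:
        self.state = "aborted"


def two_phase_commit(workers: list[Worker]) -> bool:
    votes = [w.prepare() for w in workers]  # phase 1: collect every vote
    if all(votes):
        for w in workers:                   # phase 2: commit everywhere
            w.commit()
        return True
    for w in workers:                       # phase 2: abort everywhere
        w.abort()
    return False


happy = [Worker("customer"), Worker("enrollment")]
unhappy = [Worker("customer"), Worker("enrollment", can_commit=False)]
committed = two_phase_commit(happy)
aborted = two_phase_commit(unhappy)
print(committed, [w.state for w in happy])    # True ['committed', 'committed']
print(aborted, [w.state for w in unhappy])    # False ['aborted', 'aborted']
```

Even in this toy version, every worker sits in the `prepared` state until the coordinator finishes collecting votes, which is where the locking and latency concerns come from.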
### Why It's Problematic
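- **Locking**: Workers hold locks from the moment they vote yes until the outcome arrives, which causes contention
- **Latency**: More participants means more round trips and longer lock hold times
- **Failure modes**: What if a worker votes yes, then crashes before it commits?
- **Coordinator is a single point of failure**: If the coordinator dies mid-transaction, workers that voted yes are stuck waiting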
> "The more participants you have, and the more latency you have in the system, the more issues a two-phase commit will have."
### When 2PC Is OK
- Within a single distributed database (Spanner, CockroachDB)
- Very short-lived operations
- Not for business processes that span minutes/hours/days
---
## Sagas: The Alternative
A saga is a **sequence of local transactions** where each step can be independently committed, with **compensating transactions** to undo previous steps if something fails.
### Key Insight
> "A saga does not give us atomicity in ACID terms. What a saga gives us is enough information to reason about which state it's in; it's up to us to handle the implications."
### MusicCorp Order Fulfillment Example (from book)
```
┌─────────┐   ┌─────────┐   ┌─────────┐   ┌─────────┐   ┌─────────┐
│ Verify  │──▶│ Reserve │──▶│  Take   │──▶│  Award  │──▶│ Package │
│  Acct   │   │  Stock  │   │ Payment │   │ Points  │   │ & Ship  │
└─────────┘   └─────────┘   └─────────┘   └─────────┘   └─────────┘
     │             │             │             │             │
     ▼             ▼             ▼             ▼             ▼
 Customer      Warehouse      Payment       Loyalty      Warehouse
  Service       Service       Gateway       Service       Service
```
Each step is its own transaction. If "Package & Ship" fails, we need compensating actions.
---
## Saga Failure Modes
### Backward Recovery (Rollback)
When something fails, trigger **compensating transactions** for all previously committed steps:
```
Order Failed at "Package" Step:
                                           ✗ FAILED
┌─────────┐   ┌─────────┐   ┌─────────┐   ┌─────────┐
│ Reserve │──▶│  Take   │──▶│  Award  │──▶│ Package │
│  Stock  │   │ Payment │   │ Points  │   │         │
└────┬────┘   └────┬────┘   └────┬────┘   └─────────┘
     │             │             │
     ▼             ▼             ▼
┌─────────┐   ┌─────────┐   ┌─────────┐
│ RELEASE │   │ REFUND  │   │ REVOKE  │  ← Compensating transactions
│  STOCK  │   │ PAYMENT │   │ POINTS  │
└─────────┘   └─────────┘   └─────────┘
```
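Backward recovery can be sketched as a small saga runner. This is a minimal illustration, assuming each step is a pair of plain functions (action and compensation) and that compensations run in reverse order of completion, which is a common convention rather than something the diagram prescribes. `run_saga` and the step names are hypothetical.

```python
from typing import Callable

# (name, action, compensation)
Step = tuple[str, Callable[[], None], Callable[[], None]]


def run_saga(steps: list[Step]) -> list[str]:
    """Run steps in order; on failure, compensate the committed ones newest first."""
    log: list[str] = []
    completed: list[Step] = []
    for step in steps:
        name, action, _ = step
        try:
            action()
            log.append(f"done:{name}")
            completed.append(step)
        except Exception:
            log.append(f"failed:{name}")
            for done_name, _, compensate in reversed(completed):
                compensate()
                log.append(f"compensated:{done_name}")
            break
    return log


def ok() -> None:
    pass


def packaging_fails() -> None:
    raise RuntimeError("packaging failed (simulated)")


log = run_saga([
    ("reserve_stock", ok, ok),
    ("take_payment", ok, ok),
    ("award_points", ok, ok),
    ("package", packaging_fails, ok),
])
print(log)
```

The log shows three `done:` entries, the `failed:package` entry, and then the three compensations in reverse order. The failed step itself is not compensated because it never committed.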
### Important: Semantic Rollback
> "We cannot always cleanly revert a transaction... these compensating transactions are semantic rollbacks."
Example: If we have already sent a "Your order is confirmed!" email, we can't unsend it. The compensating action is to send a second email saying "Sorry, your order was cancelled."
### Forward Recovery (Retry)
Some failures don't need a rollback; the saga can retry the step and keep moving forward:
```
Dispatch Failed (delivery truck full):
→ Don't cancel order!
→ Queue for tomorrow
→ Retry until success or human intervention
```
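A forward-recovery loop might look like the sketch below. It assumes a bounded number of attempts before handing off to a human; the function name, the `RuntimeError` used to signal a failed dispatch, and the attempt limit are illustrative choices.

```python
from typing import Callable


def run_with_forward_recovery(step: Callable[[], None], max_attempts: int) -> str:
    for attempt in range(1, max_attempts + 1):
        try:
            step()
            return f"succeeded on attempt {attempt}"
        except RuntimeError:
            # A real system would re-queue the work and wait (e.g. until tomorrow).
            continue
    return "needs human intervention"


attempts = {"count": 0}


def dispatch() -> None:
    # Simulated dispatch: the truck is full for the first two attempts.
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("delivery truck full (simulated)")


result = run_with_forward_recovery(dispatch, max_attempts=5)
print(result)  # succeeded on attempt 3
```

Note that nothing is compensated here: the order stays alive and the saga only gives up by escalating, not by rolling back.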
### Mixing Recovery Modes
Real sagas often have both:
- Early steps (payment declined): Rollback
- Late steps (shipping delay): Retry/forward
---
## Reordering Steps to Reduce Rollbacks
Smart ordering minimizes compensating transactions:
```
BEFORE (Award points early):
Reserve → Pay → Award Points → Package
                                  ↑
                If package fails, must revoke points
AFTER (Award points late):
Reserve → Pay → Package → Award Points
                               ↑
                Points only awarded after success
```
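The effect of the reordering can be checked mechanically. The helper below is a hypothetical sketch that lists which already-committed steps would need compensating if a given step fails; the step names mirror the diagram.

```python
def compensations_needed(steps: list[str], failing_step: str) -> list[str]:
    """Steps that have already committed when `failing_step` fails, newest first."""
    index = steps.index(failing_step)
    return list(reversed(steps[:index]))


before = ["reserve", "pay", "award_points", "package"]
after = ["reserve", "pay", "package", "award_points"]

print(compensations_needed(before, "package"))  # ['award_points', 'pay', 'reserve']
print(compensations_needed(after, "package"))   # ['pay', 'reserve']
```

Moving the points award after packaging removes one compensating transaction from the packaging-failure path.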
> "By pulling forward those steps that are most likely to fail and failing the process earlier, you avoid having to trigger later compensating transactions."
---
## Orchestration vs Choreography
### Orchestrated Saga
Central coordinator tells services what to do:
```
               ┌─────────────────┐
               │ Order Processor │  (Orchestrator)
               │  (knows flow)   │
               └────────┬────────┘
                        │
        ┌───────────────┼───────────────┐
        ▼               ▼               ▼
   ┌─────────┐     ┌─────────┐     ┌─────────┐
   │ Payment │     │ Loyalty │     │Warehouse│
   │ Gateway │     │ Service │     │ Service │
   └─────────┘     └─────────┘     └─────────┘
```
**Pros:**
- Process logic in one place
- Easy to understand the flow
- Clear state tracking
**Cons:**
- Higher domain coupling (orchestrator knows everything)
- Logic tends to migrate into orchestrator (anemic services)
- Single team often owns the orchestrator
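To make the trade-off concrete, here is a minimal orchestrator sketch. The class and method names are assumptions chosen to match the diagram, the services are in-process stubs, and failure handling is left out on purpose.

```python
class PaymentGateway:
    def take_payment(self, order_id: str) -> str:
        return f"payment taken for {order_id}"


class LoyaltyService:
    def award_points(self, order_id: str) -> str:
        return f"points awarded for {order_id}"


class WarehouseService:
    def dispatch(self, order_id: str) -> str:
        return f"dispatched {order_id}"


class OrderProcessor:
    """Orchestrator: the one place that knows the order of the steps."""

    def __init__(self, payment: PaymentGateway, loyalty: LoyaltyService,
                 warehouse: WarehouseService) -> None:
        self.payment = payment
        self.loyalty = loyalty
        self.warehouse = warehouse
        self.state: dict[str, str] = {}  # order_id -> last completed step

    def fulfil(self, order_id: str) -> list[str]:
        results = [self.payment.take_payment(order_id)]
        self.state[order_id] = "PAID"
        results.append(self.loyalty.award_points(order_id))
        self.state[order_id] = "POINTS_AWARDED"
        results.append(self.warehouse.dispatch(order_id))
        self.state[order_id] = "DISPATCHED"
        return results


processor = OrderProcessor(PaymentGateway(), LoyaltyService(), WarehouseService())
results = processor.fulfil("order-1")
print(processor.state["order-1"])  # DISPATCHED
```

The whole flow and its state are readable in `fulfil`, which is the main advantage. The cost is that `OrderProcessor` depends on every service it calls, and new rules tend to accumulate in that one class.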
### Choreographed Saga
Services react to events, no central coordinator:
```
Order ─────▶ order.placed ─────▶ Warehouse (reserves stock)
                                     │
                                     ▼
                              stock.reserved ─────▶ Payment
                                                       │
                                                       ▼
                                                 payment.taken
                                                       │
              ┌────────────────────────────────────────┤
              ▼                                        ▼
   Loyalty (award points)                    Warehouse (dispatch)
```
**Pros:**
- Loose coupling (services don't know about each other)
- Easier to distribute ownership
- More resilient (no single coordinator)
**Cons:**
- Harder to see the big picture
- Tracking saga state requires correlation IDs + event aggregation
- Debugging is harder
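The same flow, choreographed, can be sketched with a tiny in-memory event bus. This is an illustration only: `EventBus` is a hypothetical stand-in for a real broker, delivery is synchronous, and the handler names are invented. The topic names come from the diagram.

```python
from collections import defaultdict
from typing import Callable

Handler = Callable[[dict], None]


class EventBus:
    """Hypothetical in-memory bus; a real system would use a message broker."""

    def __init__(self) -> None:
        self.handlers: dict[str, list[Handler]] = defaultdict(list)
        self.history: list[str] = []

    def subscribe(self, topic: str, handler: Handler) -> None:
        self.handlers[topic].append(handler)

    def publish(self, topic: str, event: dict) -> None:
        self.history.append(topic)
        for handler in list(self.handlers[topic]):
            handler(event)


bus = EventBus()
actions: list[str] = []


def warehouse_reserve(event: dict) -> None:
    actions.append("warehouse:reserve")
    bus.publish("stock.reserved", event)


def payment_take(event: dict) -> None:
    actions.append("payment:take")
    bus.publish("payment.taken", event)


def loyalty_award(event: dict) -> None:
    actions.append("loyalty:award")


def warehouse_dispatch(event: dict) -> None:
    actions.append("warehouse:dispatch")


# Each service only declares which events it reacts to; nobody owns the whole flow.
bus.subscribe("order.placed", warehouse_reserve)
bus.subscribe("stock.reserved", payment_take)
bus.subscribe("payment.taken", loyalty_award)
bus.subscribe("payment.taken", warehouse_dispatch)

bus.publish("order.placed", {"correlation_id": "order-1"})
print(bus.history)  # ['order.placed', 'stock.reserved', 'payment.taken']
```

No single function describes the saga end to end. The flow only becomes visible by reading the subscriptions together or by aggregating events on their `correlation_id`, which is the "harder to see the big picture" cost.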
### Author's Advice
> "I am very relaxed in the use of orchestrated sagas when one team owns implementation of the entire saga. If you have multiple teams involved, I greatly prefer choreographed sagas."
---
## Business Failures vs Technical Failures
**Critical distinction:**
| Type | Example | How Saga Handles It |
|------|---------|---------------------|
| **Business failure** | Insufficient funds | Compensating transaction (cancel order) |
| **Technical failure** | Service timeout, 500 error | Retry, circuit breaker (NOT saga's job) |
> "The saga assumes the underlying components are working properly—that the underlying system is reliable."
Technical resilience (retries, circuit breakers) is covered in Chapter 12.
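One way to keep the two failure types apart in code is to give them separate exception types. The sketch below is an assumption about how that could look, not a pattern prescribed by the book; `BusinessFailure`, `TechnicalFailure`, and `classify_outcome` are hypothetical names.

```python
from typing import Callable


class BusinessFailure(Exception):
    """The system works, but the business rule says no (e.g. insufficient funds)."""


class TechnicalFailure(Exception):
    """An infrastructure problem (e.g. timeout, HTTP 500)."""


def classify_outcome(step: Callable[[], None], max_retries: int = 3) -> str:
    for _ in range(max_retries):
        try:
            step()
            return "ok"
        except BusinessFailure:
            return "compensate"  # saga-level decision: trigger compensating transactions
        except TechnicalFailure:
            continue             # resilience-level concern: retry the call
    return "escalate"            # retries exhausted; not something the saga can fix


def insufficient_funds() -> None:
    raise BusinessFailure("insufficient funds")


def gateway_timeout() -> None:
    raise TechnicalFailure("payment gateway timed out")


outcomes = [
    classify_outcome(lambda: None),
    classify_outcome(insufficient_funds),
    classify_outcome(gateway_timeout),
]
print(outcomes)  # ['ok', 'compensate', 'escalate']
```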
---
## How MusicCorp Compares to Chapter 6 Recommendations
| Book Recommendation | Our Implementation | Status |
|---------------------|-------------------|--------|
| **Avoid 2PC** | We don't use distributed transactions | Good |
| **Use sagas for multi-service ops** | Order flow is saga-like | Done |
| **Choreographed saga** | Events: order.placed → payment.received → shipment.dispatched | Done |
| **Orchestrated elements** | Order Service makes sync calls (Catalog, Inventory) | Hybrid |
| **Compensating transactions** | payment.failed → release stock, cancel order | Done |
| **Saga state tracking** | Correlation IDs exist, no saga view | Partial |
| **Reorder steps for fewer rollbacks** | Not analyzed | Gap |
| **Handle business vs technical failures** | Basic error handling only | Gap |
---
## Our Current Order Flow Analyzed
```
POST /orders (Order Service)
  │
  ├─► GET /albums/{sku} ──────────► Catalog      ← Sync (orchestrated)
  │     (price lookup)
  │
  ├─► POST /stock/{sku}/reserve ──► Inventory    ← Sync (orchestrated)
  │     (reserve stock)
  │
  ├─► Create order record (PLACED)
  │
  └─► Publish: order.placed
        │
        ▼
Payment Service                                  ← Async (choreographed)
  │
  ├─► Create payment record (PENDING)
  ├─► Process payment (simulated)
  └─► Publish: payment.received
        │
        ▼
Order Service: update to PAID                    ← Async (choreographed)
Shipping Service: create shipment
  │
  └─► Publish: shipment.dispatched
        │
        ▼
Order Service: update to SHIPPED → COMPLETED     ← Async (choreographed)
```
### What's Working
- Hybrid approach (sync for queries, async for state transitions)
- Correlation IDs propagated
- State machine (PLACED → PAID → SHIPPED → COMPLETED)
- Compensating transactions for payment failure
### What's Missing
1. **No saga state view**: Can't query "show me all in-flight sagas"
2. **No dead letter queue**: Failed events should go to DLQ after retries
3. **Limited retry logic**: Could use exponential backoff
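Gaps 2 and 3 could be closed together. The sketch below shows one possible shape, assuming an in-memory list as the dead letter queue and an injectable `sleep` so the logic can be exercised without waiting; the function names and delay values are hypothetical.

```python
import time
from typing import Callable


def backoff_delays(base: float, factor: float, attempts: int, cap: float) -> list[float]:
    """Exponentially growing delays in seconds, capped at `cap`."""
    return [min(cap, base * factor ** i) for i in range(attempts)]


def deliver(event: dict, handler: Callable[[dict], None], dead_letters: list[dict],
            delays: list[float], sleep: Callable[[float], None] = time.sleep) -> bool:
    # First attempt is immediate; each retry waits for the next delay.
    for delay in [0.0, *delays]:
        if delay > 0:
            sleep(delay)
        try:
            handler(event)
            return True
        except Exception:
            continue
    dead_letters.append(event)  # retries exhausted: park the event for inspection
    return False


def always_fails(event: dict) -> None:
    raise RuntimeError("handler unavailable (simulated)")


slept: list[float] = []
dead_letters: list[dict] = []
delays = backoff_delays(base=0.5, factor=2.0, attempts=4, cap=5.0)
delivered = deliver({"type": "payment.received"}, always_fails, dead_letters,
                    delays, sleep=slept.append)
print(delays)             # [0.5, 1.0, 2.0, 4.0]
print(len(dead_letters))  # 1
```

Production setups usually add random jitter to the delays so that many failing consumers do not retry in lockstep.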
---
## Discussion Questions
1. **Orchestration vs Choreography**: We have a hybrid (sync calls + events). Should we go fully choreographed? What would Order Service look like if it only reacted to events?
2. **Compensation complexity**: What if the refund fails during compensation? Do we need compensating transactions for our compensating transactions?
3. **Idempotency**: If a `payment.received` event is delivered twice, a handler that awards loyalty points would award them twice. How do we make event handlers idempotent?
4. **Long-running sagas**: What if shipping takes 5 days? How do we handle sagas that span days/weeks?
5. **Saga persistence**: Our saga state lives only in events. Should we persist saga state to a database for querying/recovery?
6. **The Kafka advantage**: Kafka provides message persistence and replay. How does this help with saga recovery compared to Redis pub/sub?
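As a starting point for question 3, one common approach is to record the IDs of events already processed and skip repeats. The sketch below assumes every event carries a unique `event_id`; that field, the `LoyaltyHandler` class, and the `points` payload are hypothetical and not part of our current event schema.

```python
class LoyaltyHandler:
    def __init__(self) -> None:
        self.points: dict[str, int] = {}
        self.processed: set[str] = set()  # IDs of events already applied

    def on_payment_received(self, event: dict) -> bool:
        """Apply the event once; return False if it was a duplicate delivery."""
        event_id = event["event_id"]
        if event_id in self.processed:
            return False
        customer = event["customer_id"]
        self.points[customer] = self.points.get(customer, 0) + event["points"]
        self.processed.add(event_id)
        return True


handler = LoyaltyHandler()
event = {"event_id": "evt-1", "customer_id": "cust-42", "points": 10}
first = handler.on_payment_received(event)
second = handler.on_payment_received(event)  # redelivery of the same event
print(first, second, handler.points["cust-42"])  # True False 10
```

In a real service the processed-ID record and the points update would need to be written in the same local transaction, otherwise a crash between the two writes reintroduces the duplicate.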
---
## Key Quotes
> "I strongly suggest you avoid the use of distributed transactions like the two-phase commit to coordinate changes in state across your microservices."
> "A saga does not give us atomicity in ACID terms... What a saga gives us is enough information to reason about which state it's in."
> "It's really important to note that a saga allows us to recover from business failures, not technical failures."
> "If logic has a place where it can be centralized, it will become centralized!" (warning about orchestrators)
> "I am very relaxed in the use of orchestrated sagas when one team owns implementation of the entire saga."
---
## Recommended Reading
- "Sagas" by Hector Garcia-Molina and Kenneth Salem (original paper)
- *Enterprise Integration Patterns* by Gregor Hohpe and Bobby Woolf
- *Practical Process Automation* by Bernd Ruecker
- "The Limits of the Saga Pattern" by Uwe Friedrichsen