06 - Workflow Sagas - Josh's Notes

- [[#The Big Idea|The Big Idea]]
- [[#Why Distributed Transactions Are Problematic|Why Distributed Transactions Are Problematic]]
    - [[#Why Distributed Transactions Are Problematic#ACID Reminder|ACID Reminder]]
    - [[#Why Distributed Transactions Are Problematic#The Problem with Microservices|The Problem with Microservices]]
- [[#Two-Phase Commit (2PC) — Just Say No|Two-Phase Commit (2PC) — Just Say No]]
    - [[#Two-Phase Commit (2PC) — Just Say No#How 2PC Works|How 2PC Works]]
    - [[#Two-Phase Commit (2PC) — Just Say No#Why It's Problematic|Why It's Problematic]]
    - [[#Two-Phase Commit (2PC) — Just Say No#When 2PC Is OK|When 2PC Is OK]]
- [[#Sagas: The Alternative|Sagas: The Alternative]]
    - [[#Sagas: The Alternative#Key Insight|Key Insight]]
    - [[#Sagas: The Alternative#MusicCorp Order Fulfillment Example (from book)|MusicCorp Order Fulfillment Example (from book)]]
- [[#Saga Failure Modes|Saga Failure Modes]]
    - [[#Saga Failure Modes#Backward Recovery (Rollback)|Backward Recovery (Rollback)]]
    - [[#Saga Failure Modes#Important: Semantic Rollback|Important: Semantic Rollback]]
    - [[#Saga Failure Modes#Forward Recovery (Retry)|Forward Recovery (Retry)]]
    - [[#Saga Failure Modes#Mixing Recovery Modes|Mixing Recovery Modes]]
- [[#Reordering Steps to Reduce Rollbacks|Reordering Steps to Reduce Rollbacks]]
- [[#Orchestration vs Choreography|Orchestration vs Choreography]]
    - [[#Orchestration vs Choreography#Orchestrated Saga|Orchestrated Saga]]
    - [[#Orchestration vs Choreography#Choreographed Saga|Choreographed Saga]]
    - [[#Orchestration vs Choreography#Author's Advice|Author's Advice]]
- [[#Business Failures vs Technical Failures|Business Failures vs Technical Failures]]
- [[#How MusicCorp Compares to Chapter 6 Recommendations|How MusicCorp Compares to Chapter 6 Recommendations]]
- [[#Our Current Order Flow Analyzed|Our Current Order Flow Analyzed]]
    - [[#Our Current Order Flow Analyzed#What's Working|What's Working]]
    - [[#Our Current Order Flow Analyzed#What's Missing|What's Missing]]
- [[#Discussion Questions|Discussion Questions]]
- [[#Key Quotes|Key Quotes]]
- [[#Recommended Reading|Recommended Reading]]

## The Big Idea

When you break a monolith into microservices, you lose **ACID transactions** across service boundaries. This chapter is about how to coordinate multi-service operations without distributed transactions, specifically by using **sagas**.

---

## Why Distributed Transactions Are Problematic

### ACID Reminder

| Property | What It Gives You |
|----------|------------------|
| **Atomicity** | All changes commit or none do |
| **Consistency** | Database always in valid state |
| **Isolation** | Concurrent transactions don't interfere |
| **Durability** | Committed data survives failures |

### The Problem with Microservices

```
MONOLITH (single transaction):
┌─────────────────────────────────────────────┐
│ BEGIN TRANSACTION                           │
│   UPDATE customers SET status = 'VERIFIED'  │
│   DELETE FROM pending_enrollments           │
│ COMMIT                                      │
└─────────────────────────────────────────────┘

        ↓ SPLIT INTO MICROSERVICES ↓

MICROSERVICES (two transactions):
┌──────────────────┐    ┌──────────────────┐
│ Customer Service │    │ Enrollment Svc   │
│   BEGIN          │    │   BEGIN          │
│   UPDATE...      │    │   DELETE...      │
│   COMMIT         │    │   COMMIT         │  ← What if this fails?
└──────────────────┘    └──────────────────┘
```

We've lost guaranteed atomicity. If the second transaction fails, we're in an inconsistent state.

---

## Two-Phase Commit (2PC) — Just Say No

The book strongly advises against distributed transactions:

### How 2PC Works

1. **Voting phase**: Coordinator asks all workers "Can you commit?"
2. **Commit phase**: If all say yes, coordinator says "Do it"

### Why It's Problematic

- **Locking**: Workers must lock resources during voting (contention)
- **Latency**: More participants = more latency
- **Failure modes**: What if worker votes yes, then crashes before commit?
- **Coordinator is SPOF**: If coordinator dies mid-transaction, workers stuck

> "The more participants you have, and the more latency you have in the system, the more issues a two-phase commit will have."

### When 2PC Is OK

- Within a single distributed database (Spanner, CockroachDB)
- Very short-lived operations
- Not for business processes that span minutes/hours/days

---

## Sagas: The Alternative

A saga is a **sequence of local transactions** where each step can be independently committed, with **compensating transactions** to undo previous steps if something fails.

### Key Insight

> "A saga does not give us atomicity in ACID terms. What a saga gives us is enough information to reason about which state it's in; it's up to us to handle the implications."

### MusicCorp Order Fulfillment Example (from book)

```
┌─────────┐   ┌─────────┐   ┌─────────┐   ┌─────────┐   ┌─────────┐
│ Verify  │──▶│ Reserve │──▶│ Take    │──▶│ Award   │──▶│ Package │
│ Acct    │   │ Stock   │   │ Payment │   │ Points  │   │ & Ship  │
└─────────┘   └─────────┘   └─────────┘   └─────────┘   └─────────┘
     │             │             │             │             │
     ▼             ▼             ▼             ▼             ▼
 Customer      Warehouse      Payment       Loyalty      Warehouse
  Service       Service       Gateway       Service       Service
```

Each step is its own transaction. If "Package & Ship" fails, we need compensating actions.

---

## Saga Failure Modes

### Backward Recovery (Rollback)

When something fails, trigger **compensating transactions** for all previously committed steps:

```
Order Failed at "Package" Step:

                                           ✗ FAILED
┌─────────┐   ┌─────────┐   ┌─────────┐   ┌─────────┐
│ Reserve │──▶│ Take    │──▶│ Award   │──▶│ Package │
│ Stock   │   │ Payment │   │ Points  │   │         │
└────┬────┘   └────┬────┘   └────┬────┘   └─────────┘
     │             │             │
     ▼             ▼             ▼
┌─────────┐   ┌─────────┐   ┌─────────┐
│ RELEASE │   │ REFUND  │   │ REVOKE  │  ← Compensating transactions
│ STOCK   │   │ PAYMENT │   │ POINTS  │
└─────────┘   └─────────┘   └─────────┘
```

### Important: Semantic Rollback

> "We cannot always cleanly revert a transaction... these compensating transactions are semantic rollbacks."

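
The backward recovery described above can be sketched in a few lines of Python. This is a minimal illustration, not the book's implementation: `SagaStep`, `run_saga`, `BusinessFailure` and the in-memory `ledger` are hypothetical names, and running compensations in reverse commit order is an assumption of this sketch (a common convention, but not something these notes specify).

```python
from dataclasses import dataclass
from typing import Callable, List


class BusinessFailure(Exception):
    """Raised by a step when the business process cannot continue."""


@dataclass
class SagaStep:
    name: str
    action: Callable[[], None]       # local transaction, committed on its own
    compensate: Callable[[], None]   # semantic rollback for that transaction


def run_saga(steps: List[SagaStep]) -> bool:
    """Run steps in order; on a business failure, compensate committed steps."""
    committed: List[SagaStep] = []
    for step in steps:
        try:
            step.action()
        except BusinessFailure:
            # Assumption of this sketch: undo in reverse order of commit.
            for done in reversed(committed):
                done.compensate()
            return False
        committed.append(step)
    return True


# Hypothetical in-memory ledger standing in for each service's own database.
ledger: List[str] = []


def record(entry: str) -> Callable[[], None]:
    """Build a step function that appends one entry to the ledger."""
    return lambda: ledger.append(entry)


def fail_packaging() -> None:
    raise BusinessFailure("order cannot be packaged")


steps = [
    SagaStep("reserve_stock", record("stock reserved"), record("stock released")),
    SagaStep("take_payment", record("payment taken"), record("payment refunded")),
    SagaStep("award_points", record("points awarded"), record("points revoked")),
    SagaStep("package", fail_packaging, record("package cancelled")),
]

succeeded = run_saga(steps)
```

After the run, `ledger` still holds the original entries followed by the offsetting ones. That is the sense in which the rollback is semantic: history is corrected by new actions, not erased.
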

Example: If we sent "Your order is confirmed!" email, we can't unsend it. Compensating action = send "Sorry, order cancelled" email.

### Forward Recovery (Retry)

Some failures don't need rollback, just retry:

```
Dispatch Failed (delivery truck full):
  → Don't cancel order!
  → Queue for tomorrow
  → Retry until success or human intervention
```

### Mixing Recovery Modes

Real sagas often have both:

- Early steps (payment declined): Rollback
- Late steps (shipping delay): Retry/forward

---

## Reordering Steps to Reduce Rollbacks

Smart ordering minimizes compensating transactions:

```
BEFORE (Award points early):
  Reserve → Pay → Award Points → Package
                       ↑
                       If package fails, must revoke points

AFTER (Award points late):
  Reserve → Pay → Package → Award Points
                                 ↑
                                 Points only awarded after success
```

> "By pulling forward those steps that are most likely to fail and failing the process earlier, you avoid having to trigger later compensating transactions."

---

## Orchestration vs Choreography

### Orchestrated Saga

Central coordinator tells services what to do:

```
                 ┌────────────────┐
                 │ Order Processor│  (Orchestrator)
                 │  (knows flow)  │
                 └───────┬────────┘
                         │
     ┌───────────────────┼───────────────────┐
     ▼                   ▼                   ▼
┌─────────┐         ┌─────────┐         ┌─────────┐
│ Payment │         │ Loyalty │         │Warehouse│
│ Gateway │         │ Service │         │ Service │
└─────────┘         └─────────┘         └─────────┘
```

**Pros:**

- Process logic in one place
- Easy to understand the flow
- Clear state tracking

**Cons:**

- Higher domain coupling (orchestrator knows everything)
- Logic tends to migrate into orchestrator (anemic services)
- Single team often owns the orchestrator

### Choreographed Saga

Services react to events, no central coordinator:

```
Order ─────▶ order.placed ─────▶ Warehouse (reserves stock)
│                                    │
│                                    ▼
│                             stock.reserved ─────▶ Payment
│                                                      │
│                                                      ▼
│                                                payment.taken
│                                                      │
├──────────────────────────────────────────────────────┘
▼
Loyalty (award points)
Warehouse (dispatch)
```

**Pros:**

- Loose coupling (services don't know about each other)
- Easier to distribute ownership
- More resilient (no single coordinator)

**Cons:**

- Harder to see the big picture
- Tracking saga state requires correlation IDs + event aggregation
- Debugging is harder

### Author's Advice

> "I am very relaxed in the use of orchestrated sagas when one team owns implementation of the entire saga. If you have multiple teams involved, I greatly prefer choreographed sagas."

---

## Business Failures vs Technical Failures

**Critical distinction:**

| Type | Example | How Saga Handles It |
|------|---------|---------------------|
| **Business failure** | Insufficient funds | Compensating transaction (cancel order) |
| **Technical failure** | Service timeout, 500 error | Retry, circuit breaker (NOT saga's job) |

> "The saga assumes the underlying components are working properly—that the underlying system is reliable."

Technical resilience (retries, circuit breakers) is covered in Chapter 12.

---

## How MusicCorp Compares to Chapter 6 Recommendations

| Book Recommendation | Our Implementation | Status |
|---------------------|-------------------|--------|
| **Avoid 2PC** | We don't use distributed transactions | Good |
| **Use sagas for multi-service ops** | Order flow is saga-like | Done |
| **Choreographed saga** | Events: order.placed → payment.received → shipment.dispatched | Done |
| **Orchestrated elements** | Order Service makes sync calls (Catalog, Inventory) | Hybrid |
| **Compensating transactions** | payment.failed → release stock, cancel order | Done |
| **Saga state tracking** | Correlation IDs exist, no saga view | Partial |
| **Reorder steps for fewer rollbacks** | Not analyzed | Gap |
| **Handle business vs technical failures** | Basic error handling only | Gap |

---

## Our Current Order Flow Analyzed

```
POST /orders (Order Service)
 │
 ├─► GET /albums/{sku} ───► Catalog              ← Sync (orchestrated)
 │      (price lookup)
 │
 ├─► POST /stock/{sku}/reserve ───► Inventory    ← Sync (orchestrated)
 │      (reserve stock)
 │
 ├─► Create order record (PLACED)
 │
 └─► Publish: order.placed ───────────────────┐
                                              │
┌─────────────────────────────────────────────┘
▼
Payment Service                                  ← Async (choreographed)
 │
 ├─► Create payment record (PENDING)
 ├─► Process payment (simulated)
 └─► Publish: payment.received ───────────────┐
                                              │
┌─────────────────────────────────────────────┘
▼
Order Service: update to PAID                    ← Async (choreographed)
Shipping Service: create shipment
 │
 └─► Publish: shipment.dispatched ────────────┐
                                              │
┌─────────────────────────────────────────────┘
▼
Order Service: update to SHIPPED → COMPLETED     ← Async (choreographed)
```

### What's Working

- Hybrid approach (sync for queries, async for state transitions)
- Correlation IDs propagated
- State machine (PLACED → PAID → SHIPPED → COMPLETED)
- Compensating transactions for payment failure

### What's Missing

1. **No saga state view**: Can't query "show me all in-flight sagas"
2. **No dead letter queue**: Failed events should go to DLQ after retries
3. **Limited retry logic**: Could use exponential backoff

---

## Discussion Questions

1. **Orchestration vs Choreography**: We have a hybrid (sync calls + events). Should we go fully choreographed? What would Order Service look like if it only reacted to events?
2. **Compensation complexity**: What if the refund fails during compensation? Do we need compensating transactions for our compensating transactions?
3. **Idempotency**: If `payment.received` event is delivered twice, we might award points twice. How do we make event handlers idempotent?
4. **Long-running sagas**: What if shipping takes 5 days? How do we handle sagas that span days/weeks?
5. **Saga persistence**: Our saga state lives only in events. Should we persist saga state to a database for querying/recovery?
6. **The Kafka advantage**: Kafka provides message persistence and replay. How does this help with saga recovery compared to Redis pub/sub?

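
As a starting point for question 3, one common approach is to remember which events have already been handled and skip redeliveries. The sketch below is only an illustration under stated assumptions: `PaymentReceived`, `LoyaltyHandler` and the `event_id` field are hypothetical and not part of our current services, and the in-memory set stands in for a durable store that a real handler would update in the same local transaction as the side effect.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass(frozen=True)
class PaymentReceived:
    event_id: str      # hypothetical unique ID assigned by the publisher
    customer_id: str
    points: int


@dataclass
class LoyaltyHandler:
    balances: Dict[str, int] = field(default_factory=dict)
    processed: Set[str] = field(default_factory=set)  # IDs of handled events

    def handle(self, event: PaymentReceived) -> bool:
        """Award points once per event_id; return False for a duplicate delivery."""
        if event.event_id in self.processed:
            return False
        current = self.balances.get(event.customer_id, 0)
        self.balances[event.customer_id] = current + event.points
        self.processed.add(event.event_id)
        return True


handler = LoyaltyHandler()
event = PaymentReceived(event_id="evt-1", customer_id="cust-42", points=10)
first = handler.handle(event)    # first delivery awards the points
second = handler.handle(event)   # redelivery of the same event is ignored
```

The key design choice is deduplicating on an ID carried by the event itself, so the handler gives the same result however many times the broker redelivers.
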

---

## Key Quotes

> "I strongly suggest you avoid the use of distributed transactions like the two-phase commit to coordinate changes in state across your microservices."

> "A saga does not give us atomicity in ACID terms... What a saga gives us is enough information to reason about which state it's in."

> "It's really important to note that a saga allows us to recover from business failures, not technical failures."

> "If logic has a place where it can be centralized, it will become centralized!" (warning about orchestrators)

> "I am very relaxed in the use of orchestrated sagas when one team owns implementation of the entire saga."

---

## Recommended Reading

- "Sagas" by Hector Garcia-Molina and Kenneth Salem (original paper)
- *Enterprise Integration Patterns* by Gregor Hohpe and Bobby Woolf
- *Practical Process Automation* by Bernd Ruecker
- "The Limits of the Saga Pattern" by Uwe Friedrichsen