12 - Resiliency - Josh's Notes

- [[#Four Concepts of Resiliency (David Woods)|Four Concepts of Resiliency (David Woods)]]
    - [[#Four Concepts of Resiliency (David Woods)#1. Robustness|1. Robustness]]
    - [[#Four Concepts of Resiliency (David Woods)#2. Rebound|2. Rebound]]
    - [[#Four Concepts of Resiliency (David Woods)#3. Graceful Extensibility|3. Graceful Extensibility]]
    - [[#Four Concepts of Resiliency (David Woods)#4. Sustained Adaptability|4. Sustained Adaptability]]
- [[#Failure Is Everywhere|Failure Is Everywhere]]
- [[#Degrading Functionality|Degrading Functionality]]
- [[#Stability Patterns|Stability Patterns]]
    - [[#Stability Patterns#Timeouts|Timeouts]]
    - [[#Stability Patterns#Retries|Retries]]
    - [[#Stability Patterns#Bulkheads|Bulkheads]]
    - [[#Stability Patterns#Circuit Breakers|Circuit Breakers]]
- [[#Idempotency|Idempotency]]
- [[#CAP Theorem|CAP Theorem]]
- [[#Chaos Engineering|Chaos Engineering]]
    - [[#Chaos Engineering#Not Just "Break Things"|Not Just "Break Things"]]
    - [[#Chaos Engineering#Techniques|Techniques]]
    - [[#Chaos Engineering#Game Days|Game Days]]
    - [[#Chaos Engineering#Tools|Tools]]
- [[#Blame-Free Culture|Blame-Free Culture]]
    - [[#Blame-Free Culture#Why Blame Prevents Learning|Why Blame Prevents Learning]]
    - [[#Blame-Free Culture#Blameless Postmortems|Blameless Postmortems]]
- [[#How MusicCorp Compares to Chapter 12 Recommendations|How MusicCorp Compares to Chapter 12 Recommendations]]
- [[#Action Items for MusicCorp|Action Items for MusicCorp]]
    - [[#Action Items for MusicCorp#High Priority|High Priority]]
    - [[#Action Items for MusicCorp#Medium Priority|Medium Priority]]
    - [[#Action Items for MusicCorp#Lower Priority|Lower Priority]]
- [[#Discussion Questions|Discussion Questions]]
- [[#Key Quotes|Key Quotes]]
- [[#Recommended Reading|Recommended Reading]]

## Four Concepts of Resiliency (David Woods)

### 1. Robustness

**Definition:** Absorb expected perturbations

**Patterns:**
- Circuit breakers
- Retries with backoff
- Timeouts
- Bulkheads

**Limitation:** Requires prior knowledge of failure modes

### 2. Rebound

**Definition:** Recover from traumatic events

**Practices:**
- Backups
- Runbooks
- Incident response plans
- Disaster recovery testing

**Key insight:** Practice recovery before you need it.

### 3. Graceful Extensibility

**Definition:** Handle the unexpected

**Enablers:**
- Flat organizations respond better to surprise
- Human judgment for novel situations
- Automation balanced with adaptability

### 4. Sustained Adaptability

**Definition:** Continuously adapt over time

**Practices:**
- Learning culture
- Blameless postmortems
- Chaos engineering
- Avoiding complacency

---

## Failure Is Everywhere

> At scale, failure is a statistical certainty.

**Examples:**
- Hard drives fail (~2% annual failure rate)
- Network partitions happen
- Services crash
- Dependencies become unavailable

**Mindset shift:** Plan for failure rather than just preventing it.

---

## Degrading Functionality

Not all features are equally critical.
Design graceful degradation:

| Feature | Criticality | Degradation |
|---------|-------------|-------------|
| Order placement | Critical | Cannot degrade |
| Recommendations | Low | Hide if unavailable |
| Payment | Critical | Queue if provider slow |
| Email notifications | Medium | Retry later |

**In MusicCorp:**
- If Catalog is down, Order fails (critical dependency)
- If Payment is slow, orders queue (async via Kafka)
- If Shipping is down, payments still process

---

## Stability Patterns

### Timeouts

**Always set timeouts on external calls.**

```python
# Python example
import requests

response = requests.get(
    f"{CATALOG_URL}/albums/{sku}",
    timeout=5.0  # 5 second timeout
)
```

**Considerations:**
- Individual call timeout
- Overall operation timeout
- Log timeouts for tuning

### Retries

**Appropriate for transient failures.**

```python
# Exponential backoff with jitter
import random
import time


class TransientError(Exception):
    """Placeholder for errors worth retrying (e.g. timeouts, HTTP 503)."""


def retry_with_backoff(func, max_retries=3):
    for attempt in range(max_retries):
        try:
            return func()
        except TransientError:
            if attempt == max_retries - 1:
                raise  # Out of attempts, surface the error
            # 1s, 2s, 4s, ... plus up to 1s of random jitter
            delay = (2 ** attempt) + random.uniform(0, 1)
            time.sleep(delay)
```

**Key considerations:**
- Use exponential backoff
- Add jitter to prevent thundering herd
- Set retry budget (max retries)
- Respect overall timeout

### Bulkheads

**Isolate failures to prevent cascade.**

```
┌─────────────────────────────────────────────────┐
│                  Order Service                  │
│  ┌─────────────────┐  ┌─────────────────┐       │
│  │  Catalog Pool   │  │ Inventory Pool  │       │
│  │ (10 connections)│  │ (10 connections)│       │
│  └────────┬────────┘  └────────┬────────┘       │
└───────────┼────────────────────┼────────────────┘
            │                    │
            ▼                    ▼
         Catalog             Inventory
```

**If Catalog is slow:**
- Catalog pool exhausted
- Inventory pool unaffected
- Order can still check stock (partial functionality)

### Circuit Breakers

**Fail fast when downstream is unhealthy.**

```
┌─────────┐    request    ┌─────────────────┐
│  Order  │───────────────│ Circuit Breaker │
│ Service │◀──────────────│                 │
└─────────┘    response   │  ┌───────────┐  │
                          │  │  CLOSED   │  │ ← Normal operation
                          │  └─────┬─────┘  │
                          │        │failures│
                          │  ┌─────▼─────┐  │
                          │  │   OPEN    │  │ ← Fail immediately
                          │  └─────┬─────┘  │
                          │        │timeout │
                          │  ┌─────▼─────┐  │
                          │  │ HALF-OPEN │  │ ← Test if recovered
                          │  └───────────┘  │
                          └─────────────────┘
```

**States:**
- **CLOSED**: Requests flow normally
- **OPEN**: Requests fail immediately (no downstream call)
- **HALF-OPEN**: Allow one request to test recovery; success closes the breaker, failure reopens it

---

## Idempotency

**Performing the same operation multiple times produces the same result as performing it once.**

**Why it matters:**
- Network failures may cause duplicate requests
- Retries may send the same message twice
- Event handlers may process the same event twice

**Techniques:**

| Technique | Example |
|-----------|---------|
| **Idempotency key** | Include request ID, deduplicate on server |
| **Database constraints** | Unique constraint on order ID |
| **Check before write** | Verify order doesn't exist before creating |

```python
# Idempotent event handler
def handle_payment_received(event):
    order_id = event["order_id"]
    # Load the order first; `order_repository` is a hypothetical data-access helper
    order = order_repository.get(order_id)

    # Check if already processed
    if order.status == "PAID":
        return  # Already handled, skip

    order.transition_to("PAID")
```

---

## CAP Theorem

**In a distributed system, you can only guarantee two of three:**

| Property | Description |
|----------|-------------|
| **Consistency** | All nodes see the same data at the same time |
| **Availability** | Every request receives a response |
| **Partition Tolerance** | System works despite network splits |

**Reality:** Network partitions happen, so you choose between:

| Choice | Trade-off | Example |
|--------|-----------|---------|
| **CP** | Sacrifice availability for consistency | Banking transactions |
| **AP** | Sacrifice consistency for availability | Shopping cart |

**MusicCorp:** Generally AP (eventual consistency via Kafka events).

---

## Chaos Engineering

### Not Just "Break Things"

Chaos engineering is structured experimentation:

1. Define steady state (normal behavior)
2. Hypothesize about failures
3. Introduce controlled failures
4. Measure deviation from steady state
5. Fix weaknesses discovered

### Techniques

| Technique | What It Tests |
|-----------|---------------|
| **Kill pods** | Pod restart, rescheduling |
| **Network latency** | Timeout handling |
| **Dependency failure** | Circuit breakers, fallbacks |
| **Resource exhaustion** | Memory limits, CPU throttling |

### Game Days

Planned exercises to test incident response:

- Schedule a failure scenario
- Involve all relevant teams
- Practice incident response
- Document lessons learned

### Tools

- **Chaos Monkey** (Netflix): Random instance termination
- **Gremlin**: Commercial chaos platform
- **Litmus**: Kubernetes-native chaos
- **Chaos Mesh**: CNCF chaos engineering platform

---

## Blame-Free Culture

### Why Blame Prevents Learning

- People hide mistakes to avoid punishment
- Root causes go unaddressed
- Same failures repeat

### Blameless Postmortems

Focus on:
- What happened (timeline)
- Why it happened (contributing factors)
- How to prevent it (action items)

**NOT:**
- Who caused it
- Who should be punished
- Who was on call

---

## How MusicCorp Compares to Chapter 12 Recommendations

| Book Recommendation | Our Implementation | Status |
|---------------------|-------------------|--------|
| **Timeouts** | Not set on HTTP calls | Gap |
| **Retries** | Basic Kafka retry | Partial |
| **Circuit breakers** | Not implemented | Gap |
| **Bulkheads** | Separate pods (isolation) | Partial |
| **Idempotency** | Order ID uniqueness | Partial |
| **Health checks** | Readiness/liveness probes | Done |
| **Graceful shutdown** | terminationGracePeriod | Done |
| **Chaos testing** | Not implemented | Gap |

---

## Action Items for MusicCorp

### High Priority

1. **Add timeouts to all HTTP calls**
   ```python
   response = requests.get(url, timeout=(3.0, 10.0))
   # (connect_timeout, read_timeout)
   ```

2. **Add circuit breaker to critical paths**
   - Order → Catalog (price lookup)
   - Order → Inventory (stock check)

### Medium Priority

1. **Implement retry with backoff**
   - HTTP calls to other services
   - Kafka message processing

2. **Make event handlers idempotent**
   - Check order status before transition
   - Deduplicate based on event ID

### Lower Priority

1. **Add chaos testing**
   - Kill pods during order flow
   - Introduce network latency
   - Test circuit breaker behavior

---

## Discussion Questions

1. **Timeout values**: How do we choose appropriate timeout values? What's too short vs. too long?
2. **Circuit breaker tuning**: How many failures before opening? How long to stay open?
3. **Retry storms**: How do we prevent retries from overwhelming a recovering service?
4. **Idempotency in events**: If `order.placed` is delivered twice, what happens? Is this handled?
5. **Chaos engineering scope**: Should we run chaos experiments in our Kind cluster? What would we learn?

---

## Key Quotes

> "At scale, failure is a statistical certainty. Plan for failure rather than just preventing it."

> "Circuit breakers fail fast, saving resources when downstream is unhealthy."

> "Blame prevents learning. Focus on what happened, not who caused it."

> "Chaos engineering is not about breaking things—it's about building confidence through controlled experiments."

---

## Recommended Reading

- "Release It!" by Michael Nygard (circuit breakers, bulkheads)
- "Chaos Engineering" by Casey Rosenthal and Nora Jones
- Netflix Tech Blog: Chaos Engineering articles
- "The Field Guide to Understanding Human Error" by Sidney Dekker
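
As a companion to the Circuit Breakers section and the matching high-priority action item, here is a minimal sketch of the CLOSED / OPEN / HALF-OPEN state machine. It is not MusicCorp's implementation (the notes list circuit breakers as a gap). The names `CircuitBreaker`, `CircuitOpenError`, `failure_threshold`, `reset_timeout`, and the injectable `clock` are all assumptions made for illustration, and the sketch is single-threaded only.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without contacting downstream."""


class CircuitBreaker:
    """Hypothetical breaker: opens after N consecutive failures, retries after a timeout."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_timeout = reset_timeout          # seconds to stay OPEN before a trial call
        self.clock = clock                          # injectable so tests can control time
        self.state = "CLOSED"
        self.failures = 0
        self.opened_at = None

    def call(self, func):
        if self.state == "OPEN":
            if self.clock() - self.opened_at < self.reset_timeout:
                # Fail fast: no downstream call is made
                raise CircuitOpenError("circuit is open")
            # Timeout elapsed: let one trial request through
            self.state = "HALF-OPEN"

        try:
            result = func()
        except Exception:
            self._record_failure()
            raise

        # Any success closes the breaker and clears the failure count
        self.state = "CLOSED"
        self.failures = 0
        return result

    def _record_failure(self):
        self.failures += 1
        # A failed trial request, or too many consecutive failures, opens the breaker
        if self.state == "HALF-OPEN" or self.failures >= self.failure_threshold:
            self.state = "OPEN"
            self.opened_at = self.clock()
```

A caller would wrap each downstream request, for example `breaker.call(lambda: requests.get(url, timeout=5.0))`, and treat `CircuitOpenError` as the signal to fall back or degrade. The threshold and timeout values are exactly the tuning knobs raised in the discussion questions; the defaults here are placeholders, not recommendations.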