- [[#Four Concepts of Resiliency (David Woods)|Four Concepts of Resiliency (David Woods)]]
- [[#Four Concepts of Resiliency (David Woods)#1. Robustness|1. Robustness]]
- [[#Four Concepts of Resiliency (David Woods)#2. Rebound|2. Rebound]]
- [[#Four Concepts of Resiliency (David Woods)#3. Graceful Extensibility|3. Graceful Extensibility]]
- [[#Four Concepts of Resiliency (David Woods)#4. Sustained Adaptability|4. Sustained Adaptability]]
- [[#Failure Is Everywhere|Failure Is Everywhere]]
- [[#Degrading Functionality|Degrading Functionality]]
- [[#Stability Patterns|Stability Patterns]]
- [[#Stability Patterns#Timeouts|Timeouts]]
- [[#Stability Patterns#Retries|Retries]]
- [[#Stability Patterns#Bulkheads|Bulkheads]]
- [[#Stability Patterns#Circuit Breakers|Circuit Breakers]]
- [[#Idempotency|Idempotency]]
- [[#CAP Theorem|CAP Theorem]]
- [[#Chaos Engineering|Chaos Engineering]]
- [[#Chaos Engineering#Not Just "Break Things"|Not Just "Break Things"]]
- [[#Chaos Engineering#Techniques|Techniques]]
- [[#Chaos Engineering#Game Days|Game Days]]
- [[#Chaos Engineering#Tools|Tools]]
- [[#Blame-Free Culture|Blame-Free Culture]]
- [[#Blame-Free Culture#Why Blame Prevents Learning|Why Blame Prevents Learning]]
- [[#Blame-Free Culture#Blameless Postmortems|Blameless Postmortems]]
- [[#How MusicCorp Compares to Chapter 12 Recommendations|How MusicCorp Compares to Chapter 12 Recommendations]]
- [[#Action Items for MusicCorp|Action Items for MusicCorp]]
- [[#Action Items for MusicCorp#High Priority|High Priority]]
- [[#Action Items for MusicCorp#Medium Priority|Medium Priority]]
- [[#Action Items for MusicCorp#Lower Priority|Lower Priority]]
- [[#Discussion Questions|Discussion Questions]]
- [[#Key Quotes|Key Quotes]]
- [[#Recommended Reading|Recommended Reading]]
## Four Concepts of Resiliency (David Woods)
### 1. Robustness
**Definition:** Absorb expected perturbations
**Patterns:**
- Circuit breakers
- Retries with backoff
- Timeouts
- Bulkheads
**Limitation:** Requires prior knowledge of failure modes
### 2. Rebound
**Definition:** Recover from traumatic events
**Practices:**
- Backups
- Runbooks
- Incident response plans
- Disaster recovery testing
**Key insight:** Practice recovery before you need it.
### 3. Graceful Extensibility
**Definition:** Handle the unexpected
**Enablers:**
- Flat organizations respond better to surprise
- Human judgment for novel situations
- Automation balanced with adaptability
### 4. Sustained Adaptability
**Definition:** Continuously adapt over time
**Practices:**
- Learning culture
- Blameless postmortems
- Chaos engineering
- Avoiding complacency
---
## Failure Is Everywhere
> At scale, failure is a statistical certainty.
**Examples:**
- Hard drives fail (~2% annual failure rate)
- Network partitions happen
- Services crash
- Dependencies become unavailable
**Mindset shift:** Plan for failure rather than just preventing it.
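As a rough worked example of why failure becomes a near certainty at scale, take the ~2% annual failure rate above and assume, as a simplification, that drives fail independently of one another. The fleet sizes ($N = 100$ and $N = 1000$) are illustrative, not MusicCorp figures.

$$
P(\text{at least one failure in a year}) = 1 - (1 - p)^{N}, \qquad p = 0.02
$$

$$
N = 100:\quad 1 - 0.98^{100} \approx 0.87
\qquad\qquad
N = 1000:\quad \mathbb{E}[\text{failures}] = Np = 20 \text{ per year}
$$

Under these assumptions, a fleet of 100 drives has roughly an 87% chance of losing at least one drive in a year, and a fleet of 1000 should expect about 20 failures.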
---
## Degrading Functionality
Not all features are equally critical. Design graceful degradation:
| Feature | Criticality | Degradation |
|---------|-------------|-------------|
| Order placement | Critical | Cannot degrade |
| Recommendations | Low | Hide if unavailable |
| Payment | Critical | Queue if provider slow |
| Email notifications | Medium | Retry later |
**In MusicCorp:**
- If Catalog is down, Order fails (critical dependency)
- If Payment is slow, orders queue (async via Kafka)
- If Shipping is down, payments still process
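A minimal sketch of the "hide if unavailable" row in the table above. The function name `build_album_page`, the page dictionary shape, and the `fetch_recommendations` callable are hypothetical and not taken from the MusicCorp code.

```python
from typing import Callable


def build_album_page(album: dict, fetch_recommendations: Callable[[], list]) -> dict:
    """Assemble an album page, treating recommendations as optional."""
    try:
        recommendations = fetch_recommendations()
    except Exception:
        # Low-criticality feature: degrade by hiding the section
        # instead of failing the whole page.
        recommendations = []
    return {
        "album": album,
        "recommendations": recommendations,
        "show_recommendations": bool(recommendations),
    }
```

A critical dependency (such as the Catalog lookup during order placement) would not be wrapped this way; its failure should propagate to the caller.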
---
## Stability Patterns
### Timeouts
**Always set timeouts on external calls.**
```python
# Python example
import requests

response = requests.get(
    f"{CATALOG_URL}/albums/{sku}",
    timeout=5.0,  # seconds; applies to the connect and to each read
)
```
**Considerations:**
- Individual call timeout
- Overall operation timeout
- Log timeouts for tuning
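One way to combine the per-call and overall timeouts listed above is to share a deadline across all calls in an operation. The sketch below is an illustration only; the `Deadline` class and its method names are hypothetical, and the injectable `clock` exists purely to make the behavior easy to test.

```python
import time


class Deadline:
    """Tracks a time budget shared by several calls in one operation."""

    def __init__(self, budget_seconds: float, clock=time.monotonic):
        self._clock = clock
        self._expires_at = clock() + budget_seconds

    def remaining(self) -> float:
        """Seconds left in the overall budget (never negative)."""
        return max(0.0, self._expires_at - self._clock())

    def timeout_for_call(self, per_call_cap: float) -> float:
        """Timeout for the next call: the per-call cap or the time left,
        whichever is smaller. Raises once the budget is spent."""
        remaining = self.remaining()
        if remaining <= 0:
            raise TimeoutError("overall operation deadline exceeded")
        return min(per_call_cap, remaining)
```

Each outbound call would then pass something like `timeout=deadline.timeout_for_call(5.0)`, so a slow first call leaves less time for later ones rather than letting the whole operation run long.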
### Retries
**Appropriate for transient failures.**
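```python
# Exponential backoff with jitter
import random
import time


class TransientError(Exception):
    """Placeholder for errors worth retrying (e.g. timeouts, HTTP 503)."""


def retry_with_backoff(func, max_retries=3):
    for attempt in range(max_retries):
        try:
            return func()
        except TransientError:
            if attempt == max_retries - 1:
                raise  # attempts exhausted: surface the error
            # 1s, 2s, 4s, ... plus up to 1s of random jitter
            delay = (2 ** attempt) + random.uniform(0, 1)
            time.sleep(delay)
```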
**Key considerations:**
- Use exponential backoff
- Add jitter to prevent thundering herd
- Set retry budget (max retries)
- Respect overall timeout
### Bulkheads
**Isolate failures to prevent cascade.**
```
┌─────────────────────────────────────────────────┐
│ Order Service │
│ ┌─────────────────┐ ┌─────────────────┐ │
│ │ Catalog Pool │ │ Inventory Pool │ │
│ │ (10 connections)│ │ (10 connections)│ │
│ └────────┬────────┘ └────────┬────────┘ │
└───────────┼────────────────────┼────────────────┘
│ │
▼ ▼
Catalog Inventory
```
**If Catalog is slow:**
- Catalog pool exhausted
- Inventory pool unaffected
- Order can still check stock (partial functionality)
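The pool isolation in the diagram can be approximated in application code with one semaphore per dependency. This is a minimal sketch, not the MusicCorp implementation; the `Bulkhead` class, the `BulkheadFullError` exception, and the limit of 10 (mirroring the diagram) are assumptions.

```python
import threading


class BulkheadFullError(Exception):
    """Raised when a bulkhead has no free slot."""


class Bulkhead:
    """Caps the number of concurrent calls to one dependency."""

    def __init__(self, name: str, max_concurrent: int):
        self.name = name
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, func, *args, **kwargs):
        # Reject immediately instead of waiting, so a slow dependency
        # cannot tie up every caller thread.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError(f"{self.name} bulkhead is full")
        try:
            return func(*args, **kwargs)
        finally:
            self._slots.release()


# One bulkhead per downstream dependency
catalog_bulkhead = Bulkhead("catalog", max_concurrent=10)
inventory_bulkhead = Bulkhead("inventory", max_concurrent=10)
```

If every Catalog slot is held by slow calls, further Catalog calls fail fast while `inventory_bulkhead` keeps serving stock checks.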
### Circuit Breakers
**Fail fast when downstream is unhealthy.**
```
┌─────────┐ request ┌─────────────────┐
│ Order │───────────────│ Circuit Breaker │
│ Service │◀──────────────│ │
└─────────┘ response │ ┌───────────┐ │
│ │ CLOSED │ │ ← Normal operation
│ └─────┬─────┘ │
│ │ failures│
│ ┌─────▼─────┐ │
│ │ OPEN │ │ ← Fail immediately
│ └─────┬─────┘ │
│ │ timeout │
│ ┌─────▼─────┐ │
│ │ HALF-OPEN │ │ ← Test if recovered
│ └───────────┘ │
└─────────────────┘
```
**States:**
- **CLOSED**: Requests flow normally
- **OPEN**: Requests fail immediately (no downstream call)
- **HALF-OPEN**: Allow one request to test recovery
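The three states can be sketched as a small class. This is an illustrative, single-threaded version rather than a production library; the class name, the default threshold of 5 failures, and the 30 second reset timeout are placeholder assumptions, and the `clock` parameter exists only to make the state transitions testable.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying downstream."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at = None
        self.state = "CLOSED"

    def call(self, func, *args, **kwargs):
        if self.state == "OPEN":
            if self._clock() - self._opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit is open; failing fast")
            # Reset timeout elapsed: let one trial request through.
            self.state = "HALF-OPEN"
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._record_failure()
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self.state = "CLOSED"
        return result

    def _record_failure(self):
        self._failures += 1
        # A failed trial request, or too many consecutive failures, opens the circuit.
        if self.state == "HALF-OPEN" or self._failures >= self.failure_threshold:
            self.state = "OPEN"
            self._opened_at = self._clock()
```

This version counts consecutive failures; real implementations often use a failure rate over a sliding window and need locking if shared between threads.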
---
## Idempotency
**Applying the same operation multiple times has the same effect as applying it once.**
**Why it matters:**
- Network failures may cause duplicate requests
- Retries may send same message twice
- Event handlers may process same event twice
**Techniques:**
| Technique | Example |
|-----------|---------|
| **Idempotency key** | Include request ID, deduplicate on server |
| **Database constraints** | Unique constraint on order ID |
| **Check before write** | Verify order doesn't exist before creating |
```python
# Idempotent event handler
def handle_payment_received(event):
    order_id = event["order_id"]
    # order_repository is a placeholder for however orders are loaded
    order = order_repository.get(order_id)

    # Check if already processed
    if order.status == "PAID":
        return  # Already handled, skip

    order.transition_to("PAID")
```
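The idempotency-key technique from the table can be sketched the same way, deduplicating on an event ID instead of on order status. The `PaymentEventHandler` class, the `event_id` field, and the in-memory set are assumptions for illustration; a real handler would need a durable store, ideally updated in the same transaction as the state change.

```python
class PaymentEventHandler:
    """Processes each payment event at most once, keyed by event ID."""

    def __init__(self):
        # In-memory stand-ins for durable storage
        self.processed_event_ids = set()
        self.order_status = {}

    def handle(self, event: dict) -> bool:
        """Return True if the event was applied, False if it was a duplicate."""
        event_id = event["event_id"]
        if event_id in self.processed_event_ids:
            return False  # duplicate delivery: do nothing
        self.order_status[event["order_id"]] = "PAID"
        self.processed_event_ids.add(event_id)
        return True
```

Delivering the same event twice then leaves the order in the same state as delivering it once.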
---
## CAP Theorem
**In a distributed system, you can only guarantee two of three:**
| Property | Description |
|----------|-------------|
| **Consistency** | All nodes see same data at same time |
| **Availability** | Every request to a non-failing node receives a non-error response |
| **Partition Tolerance** | System works despite network splits |
**Reality:** Network partitions happen, so you choose between:
| Choice | Trade-off | Example |
|--------|-----------|---------|
| **CP** | Sacrifice availability for consistency | Banking transactions |
| **AP** | Sacrifice consistency for availability | Shopping cart |
**MusicCorp:** Generally AP (eventual consistency via Kafka events).
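The CP versus AP choice only shows up during a partition. The toy model below illustrates that difference for a single read; the `Replica` class and its `partitioned` flag are hypothetical and ignore writes, quorums, and reconciliation entirely.

```python
class UnavailableError(Exception):
    """Raised by a CP replica that cannot confirm it has the latest data."""


class Replica:
    def __init__(self, mode: str):
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.data = {}
        self.partitioned = False  # True when cut off from the other replicas

    def read(self, key):
        if self.partitioned and self.mode == "CP":
            # Newer writes may exist elsewhere: refuse rather than risk stale data.
            raise UnavailableError("partitioned from peers")
        # AP (or a healthy CP replica): answer from local state,
        # which may be stale while partitioned.
        return self.data.get(key)
```

An AP shopping cart keeps answering with possibly stale contents, while a CP balance lookup returns an error until the partition heals.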
---
## Chaos Engineering
### Not Just "Break Things"
Chaos engineering is structured experimentation:
1. Define steady state (normal behavior)
2. Hypothesize about failures
3. Introduce controlled failures
4. Measure deviation from steady state
5. Fix weaknesses discovered
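The steps above can be sketched as a small in-process experiment: measure the success rate normally, inject failures into one dependency, and check whether the success rate stays within a tolerance. All names here (`run_experiment`, `inject_failures`, the 1% tolerance) are hypothetical; tools such as those listed under Tools inject faults at the infrastructure level instead.

```python
import random


def success_rate(operation, requests: int) -> float:
    """Fraction of calls to operation() that complete without raising."""
    ok = 0
    for _ in range(requests):
        try:
            operation()
            ok += 1
        except Exception:
            pass
    return ok / requests


def inject_failures(dependency, failure_probability: float, rng: random.Random):
    """Wrap a dependency so that it fails with the given probability."""
    def faulty():
        if rng.random() < failure_probability:
            raise ConnectionError("injected failure")
        return dependency()
    return faulty


def run_experiment(make_operation, dependency, failure_probability,
                   tolerance=0.01, requests=1000, seed=0):
    """Hypothesis: the operation's success rate survives dependency failures."""
    rng = random.Random(seed)
    # Step 1: steady state with a healthy dependency
    steady = success_rate(make_operation(dependency), requests)
    # Step 3: controlled failure in the dependency only
    faulty = inject_failures(dependency, failure_probability, rng)
    under_fault = success_rate(make_operation(faulty), requests)
    # Step 4: deviation from steady state
    return {
        "steady_state": steady,
        "under_fault": under_fault,
        "hypothesis_holds": (steady - under_fault) <= tolerance,
    }
```

Here `make_operation` takes the dependency and returns the callable under test, so an operation with a fallback should pass while one that calls the dependency directly should not.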
### Techniques
| Technique | What It Tests |
|-----------|---------------|
| **Kill pods** | Pod restart, rescheduling |
| **Network latency** | Timeout handling |
| **Dependency failure** | Circuit breakers, fallbacks |
| **Resource exhaustion** | Memory limits, CPU throttling |
### Game Days
Planned exercises to test incident response:
- Schedule a failure scenario
- Involve all relevant teams
- Practice incident response
- Document lessons learned
### Tools
- **Chaos Monkey** (Netflix): Random instance termination
- **Gremlin**: Commercial chaos platform
- **Litmus**: Kubernetes-native chaos
- **Chaos Mesh**: CNCF chaos engineering platform
---
## Blame-Free Culture
### Why Blame Prevents Learning
- People hide mistakes to avoid punishment
- Root causes go unaddressed
- Same failures repeat
### Blameless Postmortems
Focus on:
- What happened (timeline)
- Why it happened (contributing factors)
- How to prevent it (action items)
**NOT:**
- Who caused it
- Who should be punished
- Who was on call
---
## How MusicCorp Compares to Chapter 12 Recommendations
| Book Recommendation | Our Implementation | Status |
|---------------------|-------------------|--------|
| **Timeouts** | Not set on HTTP calls | Gap |
| **Retries** | Basic Kafka retry | Partial |
| **Circuit breakers** | Not implemented | Gap |
| **Bulkheads** | Separate pods (isolation) | Partial |
| **Idempotency** | Order ID uniqueness | Partial |
| **Health checks** | Readiness/liveness probes | Done |
| **Graceful shutdown** | `terminationGracePeriodSeconds` | Done |
| **Chaos testing** | Not implemented | Gap |
---
## Action Items for MusicCorp
### High Priority
1. **Add timeouts to all HTTP calls**
   ```python
   response = requests.get(url, timeout=(3.0, 10.0))
   # (connect_timeout, read_timeout)
   ```
2. **Add circuit breaker to critical paths**
   - Order → Catalog (price lookup)
   - Order → Inventory (stock check)
### Medium Priority
1. **Implement retry with backoff**
   - HTTP calls to other services
   - Kafka message processing
2. **Make event handlers idempotent**
   - Check order status before transition
   - Deduplicate based on event ID
### Lower Priority
1. **Add chaos testing**
   - Kill pods during order flow
   - Introduce network latency
   - Test circuit breaker behavior
---
## Discussion Questions
1. **Timeout values**: How do we choose appropriate timeout values? What's too short vs too long?
2. **Circuit breaker tuning**: How many failures before opening? How long to stay open?
3. **Retry storms**: How do we prevent retries from overwhelming a recovering service?
4. **Idempotency in events**: If `order.placed` is delivered twice, what happens? Is this handled?
5. **Chaos engineering scope**: Should we run chaos experiments in our Kind cluster? What would we learn?
---
## Key Quotes
> "At scale, failure is a statistical certainty. Plan for failure rather than just preventing it."
> "Circuit breakers fail fast, saving resources when downstream is unhealthy."
> "Blame prevents learning. Focus on what happened, not who caused it."
> "Chaos engineering is not about breaking things—it's about building confidence through controlled experiments."
---
## Recommended Reading
- "Release It!" by Michael Nygard (circuit breakers, bulkheads)
- "Chaos Engineering" by Casey Rosenthal and Nora Jones
- Netflix Tech Blog: Chaos Engineering articles
- "The Field Guide to Understanding Human Error" by Sidney Dekker