- [[#Chapter Overview|Chapter Overview]]
- [[#The Problem Statement|The Problem Statement]]
- [[#The Problem Statement#Why Microservices Make Monitoring Harder|Why Microservices Make Monitoring Harder]]
- [[#The Problem Statement#The Friday Afternoon Scenario|The Friday Afternoon Scenario]]
- [[#Observability vs Monitoring: The Conceptual Shift|Observability vs Monitoring: The Conceptual Shift]]
- [[#Observability vs Monitoring: The Conceptual Shift#Definitions|Definitions]]
- [[#Observability vs Monitoring: The Conceptual Shift#Why This Distinction Matters|Why This Distinction Matters]]
- [[#Observability vs Monitoring: The Conceptual Shift#Critique of the "Three Pillars"|Critique of the "Three Pillars"]]
- [[#Building Blocks for Observability|Building Blocks for Observability]]
- [[#Building Blocks for Observability#1. Log Aggregation|1. Log Aggregation]]
- [[#1. Log Aggregation#How It Works|How It Works]]
- [[#1. Log Aggregation#Best Practices|Best Practices]]
- [[#1. Log Aggregation#Correlation IDs|Correlation IDs]]
- [[#1. Log Aggregation#Timing Caveat|Timing Caveat]]
- [[#1. Log Aggregation#Tool Options|Tool Options]]
- [[#1. Log Aggregation#Shortcomings of Logs|Shortcomings of Logs]]
- [[#Building Blocks for Observability#2. Metrics Aggregation|2. Metrics Aggregation]]
- [[#2. Metrics Aggregation#Purpose|Purpose]]
- [[#2. Metrics Aggregation#Key Considerations|Key Considerations]]
- [[#2. Metrics Aggregation#Low vs High Cardinality: A Critical Distinction|Low vs High Cardinality: A Critical Distinction]]
- [[#2. Metrics Aggregation#Charity Majors on Metrics Limitations|Charity Majors on Metrics Limitations]]
- [[#Building Blocks for Observability#3. Distributed Tracing|3. Distributed Tracing]]
- [[#3. Distributed Tracing#Concepts|Concepts]]
- [[#3. Distributed Tracing#How It Works|How It Works]]
- [[#3. Distributed Tracing#Sampling|Sampling]]
- [[#3. Distributed Tracing#Implementation Requirements|Implementation Requirements]]
- [[#3. Distributed Tracing#Tool Options|Tool Options]]
- [[#Building Blocks for Observability#4. SLAs, SLOs, SLIs, and Error Budgets|4. SLAs, SLOs, SLIs, and Error Budgets]]
- [[#4. SLAs, SLOs, SLIs, and Error Budgets#Definitions|Definitions]]
- [[#4. SLAs, SLOs, SLIs, and Error Budgets#Error Budgets as Decision Tools|Error Budgets as Decision Tools]]
- [[#Building Blocks for Observability#5. Alerting|5. Alerting]]
- [[#5. Alerting#The Danger of Too Many Alerts|The Danger of Too Many Alerts]]
- [[#5. Alerting#Alert Fatigue|Alert Fatigue]]
- [[#5. Alerting#EEMUA Guidelines for Good Alerts|EEMUA Guidelines for Good Alerts]]
- [[#5. Alerting#Key Question|Key Question]]
- [[#Building Blocks for Observability#6. Semantic Monitoring|6. Semantic Monitoring]]
- [[#6. Semantic Monitoring#The Concept|The Concept]]
- [[#6. Semantic Monitoring#Implementation Approaches|Implementation Approaches]]
- [[#Building Blocks for Observability#7. Testing in Production|7. Testing in Production]]
- [[#7. Testing in Production#Types of Production Testing|Types of Production Testing]]
- [[#7. Testing in Production#Synthetic Transaction Implementation|Synthetic Transaction Implementation]]
- [[#Standardization|Standardization]]
- [[#Selecting Tools: Evaluation Criteria|Selecting Tools: Evaluation Criteria]]
- [[#Selecting Tools: Evaluation Criteria#Scale Considerations|Scale Considerations]]
- [[#The Expert in the Machine (AI/ML Skepticism)|The Expert in the Machine (AI/ML Skepticism)]]
- [[#Getting Started: Practical Recommendations|Getting Started: Practical Recommendations]]
- [[#Key Takeaways|Key Takeaways]]
- [[#Recommended Further Reading|Recommended Further Reading]]
- [[#Connection to Other Chapters|Connection to Other Chapters]]
## Chapter Overview
This chapter addresses one of the most critical operational challenges in microservice architectures: understanding what's happening in production. Sam Newman argues for a fundamental shift from traditional monitoring approaches to building truly observable systems—a change that affects not just tooling but organizational mindset.
The central tension the chapter explores: as we decompose monoliths into microservices, we gain architectural benefits but dramatically increase the complexity of production troubleshooting. Every outage becomes, as one infamous tweet put it, "more like a murder mystery."
---
## The Problem Statement
### Why Microservices Make Monitoring Harder
In a monolithic application, troubleshooting has an obvious starting point—the monolith itself. With microservices:
- Multiple servers to monitor
- Multiple log files to sift through
- Multiple points where network latency can cause problems
- Increased surface area of failure
- Complex call chains that obscure root causes
The chapter walks through a progression of complexity:
1. **Single microservice, single server**: Simple—just look at host metrics and local logs
2. **Single microservice, multiple servers**: Need aggregation to distinguish systemic patterns from isolated single-host issues
3. **Multiple services, multiple servers**: Requires holistic thinking, correlation, and new mental models
### The Friday Afternoon Scenario
Newman paints a vivid picture: alerts firing, Twitter ablaze, the boss calling—and you have no idea which of your 50 microservices is the culprit. This scenario motivates the entire chapter's recommendations.
---
## Observability vs Monitoring: The Conceptual Shift
### Definitions
| Concept | Definition |
|---------|------------|
| **Monitoring** | An activity—something we *do* (watching dashboards, checking metrics) |
| **Observability** | A property of the system—the extent to which internal state can be understood from external outputs |
### Why This Distinction Matters
Traditional monitoring asks: "What metrics should we watch for known failure modes?"
Observability asks: "Can we ask questions about our system that we didn't anticipate needing to ask?"
With distributed systems, you'll encounter issues you never imagined. An observable system provides the raw material to investigate novel problems, not just detect predetermined ones.
### Critique of the "Three Pillars"
Newman pushes back on the industry's "three pillars of observability" framing (metrics, logs, traces):
1. **It's backward**: Reducing a system property to implementation details puts tools before outcomes
2. **The boundaries are fuzzy**: Metrics can go in logs; traces can be constructed from logs
3. **It's vendor-driven**: Makes it easier to sell three separate tools rather than integrated solutions
**Better framing**: Think in terms of *events*. All telemetry is fundamentally events with varying richness. Logs, metrics, and traces are projections of an event stream.
---
## Building Blocks for Observability
### 1. Log Aggregation
**Newman's strongest recommendation**: Treat log aggregation as a *prerequisite* for microservices, not an afterthought.
> "If your organization is unable to successfully implement a simple log aggregation solution, it will likely find the other aspects of a microservice architecture too much to handle."
#### How It Works
```
[Microservice Instance] → [Local Filesystem] → [Forwarding Agent] → [Central Store] → [Query Interface]
```
#### Best Practices
- **Standardize log format**: Date, time, service name, log level in consistent positions
- **Consider structured logging (JSON)**: Easier to query, harder to read raw
- **Avoid reformatting in forwarding agents**: CPU-intensive; change at the source instead
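These practices can be sketched with Python's standard `logging` module; the JSON field names (`service`, `timestamp`, etc.) are illustrative conventions, not a standard:

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line, with fields in fixed positions."""
    def format(self, record):
        return json.dumps({
            "timestamp": self.formatTime(record),
            "service": getattr(record, "service", "unknown"),
            "level": record.levelname,
            "message": record.getMessage(),
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("catalog")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Attach structure at the source, so forwarding agents never need to reformat.
logger.info("Item 42 added to basket", extra={"service": "Catalog"})
```

Emitting JSON from the service itself keeps the CPU cost of structuring off the forwarding agents, at the price of raw logs being harder to eyeball.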
#### Correlation IDs
Critical for tracing requests across services:
```
15-02-2020 16:01:01 Gateway INFO [abc-123] Signup for streaming
15-02-2020 16:01:02 Streaming INFO [abc-123] Cust 773 signs up ...
15-02-2020 16:01:03 Customer INFO [abc-123] Streaming package added ...
15-02-2020 16:01:03 Email INFO [abc-123] Send streaming welcome ...
15-02-2020 16:01:03 Payment ERROR [abc-123] ValidatePayment ...
```
The `[abc-123]` correlation ID links all related activity. **Add these early**—retrofitting is painful.
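Propagation is the whole trick: reuse an inbound ID if a caller supplied one, mint one otherwise, and forward it on every downstream call. A minimal sketch, assuming the common (but not standardized) `X-Correlation-ID` header name; the helper functions are illustrative:

```python
import uuid
from contextvars import ContextVar

# Holds the correlation ID for the current request, surviving async/thread hops.
correlation_id: ContextVar[str] = ContextVar("correlation_id", default="-")

def handle_request(headers: dict) -> str:
    """Reuse the caller's ID if present; otherwise mint a fresh one."""
    cid = headers.get("X-Correlation-ID") or str(uuid.uuid4())
    correlation_id.set(cid)
    return cid

def log_line(service: str, message: str) -> str:
    # Every log line carries the correlation ID, as in the example above.
    return f"{service} INFO [{correlation_id.get()}] {message}"

def outbound_headers() -> dict:
    """Forward the same ID on every downstream call."""
    return {"X-Correlation-ID": correlation_id.get()}
```

In practice this lives in shared middleware or an HTTP client wrapper, which is why retrofitting it across dozens of services is so painful.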
#### Timing Caveat
Log timestamps come from individual machines. Clock skew means you can't fully trust the apparent order of events. NTP helps but doesn't eliminate skew. For accurate timing and causality, use distributed tracing.
#### Tool Options
| Tool | Notes |
|------|-------|
| **Fluentd + Elasticsearch + Kibana** | Popular open-source stack; Elasticsearch management can be challenging |
| **Splunk** | Powerful but expensive |
| **Humio** | Designed for high-volume ingestion without expensive indexing |
| **Datadog** | Commercial, feature-rich |
| **CloudWatch / Application Insights** | Cloud-native options (AWS/Azure) |
**Note on Elasticsearch**: Newman expresses caution about treating it as a database (it's fundamentally a search index) and concerns about Elastic's license changes from Apache 2.0 to SSPL.
#### Shortcomings of Logs
- Can't provide accurate timing due to clock skew
- Generate massive data volumes at scale
- Storage and query costs can explode
- May contain sensitive data requiring access controls
---
### 2. Metrics Aggregation
#### Purpose
- Understand system behavior over time
- Detect anomalies through pattern recognition
- Enable capacity planning
- Drive auto-scaling decisions
#### Key Considerations
- Aggregate across hosts while maintaining drill-down capability
- Associate metadata with metrics for flexible querying
- Consider storage at different resolutions (high-res recent, lower-res historical)
#### Low vs High Cardinality: A Critical Distinction
**Cardinality** = the number of fields you can usefully query on in a data point, and how many distinct values those fields can take. Customer IDs and request IDs are high cardinality because their value sets are effectively unbounded.
| Low Cardinality | High Cardinality |
|-----------------|------------------|
| CPU rate per host | Customer ID + Order ID + Request ID + Build # + Host + ... |
| Response time per service | All the above, across time |
| Simple, traditional metrics | Rich, queryable telemetry |
**The problem**: Tools like Prometheus were built for low-cardinality data. From Prometheus docs:
> "Do not use labels to store dimensions with high cardinality (many different label values), such as user IDs, email addresses, or other unbounded sets of values."
**High-cardinality systems** (Honeycomb, Lightstep) allow you to ask ad-hoc questions: "Show me all requests from customer X that touched service Y and had latency > 500ms last Tuesday."
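The difference is easiest to see with a toy event store: because each event retains its full context, that ad-hoc question becomes a simple filter rather than a metric you had to declare in advance. Field names here are illustrative:

```python
# Toy event store: every event keeps full context, so any field is
# queryable after the fact -- the essence of high-cardinality telemetry.
events = [
    {"customer": "X", "service": "Y", "latency_ms": 720, "status": 200},
    {"customer": "X", "service": "Z", "latency_ms": 90,  "status": 200},
    {"customer": "W", "service": "Y", "latency_ms": 650, "status": 500},
]

def query(events, **criteria):
    """Ad-hoc filter on arbitrary dimensions, with no pre-declared metric."""
    return [e for e in events
            if all(e.get(k) == v for k, v in criteria.items())]

# "All requests from customer X that touched service Y with latency > 500ms"
slow = [e for e in query(events, customer="X", service="Y")
        if e["latency_ms"] > 500]
```

A Prometheus-style system would need a pre-aggregated metric for each question; an event store answers questions you didn't anticipate, which is the observability property the chapter is after.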
#### Charity Majors on Metrics Limitations
> "The metric is a dot of data, a single number with a name and some identifying tags. All of the context you can get has to be stuffed into those tags. But the write explosion of writing all those tags is expensive because of how metrics are stored on disk."
---
### 3. Distributed Tracing
#### Concepts
- **Span**: Local activity within a thread (has start time, end time, logs, key-value tags)
- **Trace**: Collection of correlated spans showing the full request path
#### How It Works
1. Spans are captured locally with unique identifiers
2. Forwarding agents send spans to a central collector
3. Collector assembles spans into traces
4. UI shows waterfall views of request flow
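The span/trace model can be sketched with plain data types; real systems use OpenTelemetry SDKs rather than hand-rolled classes, and the field names below are illustrative:

```python
import time
import uuid
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Span:
    """Local unit of work: start/end times plus key-value tags."""
    name: str
    trace_id: str                        # shared by every span in one request
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex[:8])
    parent_id: Optional[str] = None      # links the span into the call tree
    start: float = field(default_factory=time.perf_counter)
    end: Optional[float] = None
    tags: dict = field(default_factory=dict)

    def finish(self):
        self.end = time.perf_counter()

def assemble_trace(spans, trace_id):
    """What the central collector does: group spans by trace ID and order
    them by start time, ready for a waterfall view."""
    return sorted((s for s in spans if s.trace_id == trace_id),
                  key=lambda s: s.start)
```

The trace ID plays the same role as a correlation ID in logs, but the parent links and timings let the collector reconstruct causality, not just membership.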
#### Sampling
Capturing everything is often impractical. Strategies:
- **Random sampling**: Capture 1 in N requests (Jaeger default: 1 in 1,000)
- **Dynamic sampling**: Capture more of interesting events (errors, slow requests)
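Combining the two strategies is straightforward: always keep interesting events and fall back to random 1-in-N for the rest. A sketch with illustrative thresholds and field names:

```python
import random

def should_sample(span: dict, rate: float = 1 / 1000, rng=random.random) -> bool:
    """Dynamic sampling: errors and slow requests are always captured;
    everything else is randomly sampled at the given rate."""
    if span.get("error"):
        return True
    if span.get("latency_ms", 0) > 500:
        return True
    return rng() < rate
```

The trade-off is bias: dynamically sampled data over-represents failures, which is exactly what you want for debugging but misleading for computing overall error rates.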
#### Implementation Requirements
1. Instrument your code (or use frameworks with built-in support)
2. Deploy local forwarding agents
3. Run a central collector
4. Use standard APIs: **OpenTelemetry** (successor to OpenTracing + OpenCensus)
#### Tool Options
- **Jaeger**: Popular open-source option
- **Honeycomb**: Commercial, high-cardinality focused
- **Lightstep**: Commercial, high-cardinality focused
- **Zipkin**: Open-source alternative
---
### 4. SLAs, SLOs, SLIs, and Error Budgets
#### Definitions
| Term | Meaning | Example |
|------|---------|---------|
| **SLA** (Service-Level Agreement) | Contractual commitment to customers | "99.9% monthly uptime or we refund you" |
| **SLO** (Service-Level Objective) | Internal team targets (often exceed SLA) | "Our service will respond in < 200ms p99" |
| **SLI** (Service-Level Indicator) | Actual measured data | "P99 latency was 187ms last hour" |
| **Error Budget** | Acceptable failure rate | "We can be down 2h 11m per quarter" |
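The error-budget row follows from simple arithmetic; a minimal sketch, assuming a 91-day quarter:

```python
def allowed_downtime_minutes(slo_percent: float, period_days: float) -> float:
    """Error budget implied by an availability SLO over a given period."""
    return period_days * 24 * 60 * (1 - slo_percent / 100)

# 99.9% over a 91-day quarter -> about 131 minutes, i.e. roughly 2h 11m.
quarterly = allowed_downtime_minutes(99.9, 91)
```

Every "nine" you add divides the budget by ten, which is why teams should justify SLO targets against real user needs rather than reflexively chasing more nines.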
#### Error Budgets as Decision Tools
If you're well under budget: take risks, ship that experimental feature.
If you've exceeded budget: focus on reliability, defer risky changes.
Error budgets give teams permission to innovate while maintaining accountability.
---
### 5. Alerting
#### The Danger of Too Many Alerts
Newman draws on sobering examples:
**Three Mile Island (1979)**: Operators were so overwhelmed by alarms they couldn't identify the core problem. One operator said: "I would have liked to have thrown away the alarm panel. It wasn't giving us any useful information."
**Boeing 737 Max**: NTSB report cited confusing multiple alerts as a contributing factor in crashes that killed 346 people.
#### Alert Fatigue
When everything alerts, nothing is prioritized. Operators become desensitized.
#### EEMUA Guidelines for Good Alerts
| Criterion | Meaning |
|-----------|---------|
| **Relevant** | The alert is of value |
| **Unique** | Not duplicating another alert |
| **Timely** | Arrives quickly enough to act on |
| **Prioritized** | Operator knows what to address first |
| **Understandable** | Clear and readable |
| **Diagnostic** | Clear what is wrong |
| **Advisory** | Suggests what action to take |
| **Focusing** | Draws attention to most important issues |
#### Key Question
> "Should this problem cause someone to be woken up at 3 a.m.?"
Not all problems are equal. A hard drive failure in a fault-tolerant system can wait until morning.
---
### 6. Semantic Monitoring
#### The Concept
Instead of asking "are there errors?" ask "is the system behaving as we expect?"
Define a model of correct behavior at the *business level*:
- "New customers can register"
- "We're selling at least $20K/hour during peak"
- "Orders are shipping at normal rate"
If these hold true, the system is "healthy enough" even if individual components show issues.
#### Implementation Approaches
**Real User Monitoring (RUM)**: Observe actual production behavior against your model.
- Pro: Real data from real users
- Con: Noisy, after-the-fact (you find out after a customer is impacted)
**Synthetic Transactions**: Inject fake user behavior with known expected outcomes.
- Pro: Proactive—catch issues before users do
- Pro: Cleaner signal
- Con: May not cover all real-world scenarios
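A semantic health check built on either source of data reduces to evaluating the business-level model. A sketch, with illustrative thresholds and field names drawn from the examples above:

```python
def semantic_failures(stats: dict) -> list:
    """Evaluate the business-level model of 'working'.
    An empty list means the system is healthy enough."""
    failures = []
    if not stats.get("registration_ok"):
        failures.append("new customers cannot register")
    if stats.get("sales_per_hour", 0) < 20_000:
        failures.append("sales below expected peak rate")
    expected = stats.get("orders_expected_per_hour", 0)
    if stats.get("orders_shipped_per_hour", 0) < expected * 0.8:
        failures.append("order shipping rate abnormal")
    return failures
```

Note that this can report "healthy" while individual components are throwing errors, and "unhealthy" while every component check is green; that decoupling from component status is the point.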
---
### 7. Testing in Production
Newman normalizes this practice—many organizations already do it without realizing it.
#### Types of Production Testing
| Technique | Description |
|-----------|-------------|
| **Synthetic Transactions** | Fake user actions with expected outcomes; run continuously |
| **A/B Testing** | Two versions of functionality; measure which performs better |
| **Canary Releases** | Small % of users see new version; expand if healthy |
| **Parallel Runs** | Both old and new implementations execute; compare results |
| **Smoke Tests** | Post-deploy, pre-release verification |
| **Chaos Engineering** | Intentionally inject failures (Netflix Chaos Monkey) |
#### Synthetic Transaction Implementation
Reuse your end-to-end tests! They already have the right structure. Just:
- Ensure test data doesn't pollute production (use dedicated test accounts)
- Avoid triggering real side effects (don't ship washing machines to the office)
---
## Standardization
With microservices, standardization in observability is essential:
- **Log format**: Same structure across all services
- **Metric names**: `ResponseTime` everywhere, not `RspTimeSecs` in one service
- **Tagging conventions**: Consistent metadata across telemetry
A platform team often owns this, providing pre-configured infrastructure that makes "doing the right thing" easy.
---
## Selecting Tools: Evaluation Criteria
| Criterion | Meaning |
|-----------|---------|
| **Democratic** | Usable by everyone, not just experienced operators; affordable enough for dev/test |
| **Easy to Integrate** | Supports OpenTelemetry; minimal code changes needed |
| **Provides Context** | Temporal (how does this compare to the past?), relative (how has it changed against other metrics?), relational (what depends on it?), proportional (how bad is it, and how widespread?) |
| **Real-Time** | Seconds, not minutes or hours |
| **Suitable for Your Scale** | Don't over-engineer; Google-scale solutions sacrifice features for scale you don't need |
### Scale Considerations
Ben Sigelman (Lightstep founder, co-creator of Google's Dapper):
> "Google's microservices generate about 5 billion RPCs per second; building observability tools that scale to 5B RPCs/sec therefore boils down to building observability tools that are profoundly feature poor. If your organization is doing more like 5 million RPCs/sec...you can afford much more powerful features."
---
## The Expert in the Machine (AI/ML Skepticism)
Newman is skeptical of "automated anomaly detection" promises:
- Vendors have promised magic AI solutions for decades
- Current ML can identify patterns but can't explain *meaning*
- A clustering algorithm can say "these patients are similar" but needs a clinician to say "they're all critically ill"
- Human expertise remains essential for interpretation and decision-making
> "The expert in the system is, and will remain for some time, a human."
Tools should augment human operators, not replace them.
---
## Getting Started: Practical Recommendations
For a new or simple microservice architecture:
1. **Capture host metrics** (CPU, I/O, memory) and map microservice instances to hosts
2. **Record response times** for every service interface
3. **Log all downstream calls**
4. **Implement correlation IDs** from day one
5. **Set up basic log and metric aggregation**
6. **Create synthetic transactions** for critical user journeys
7. **Defer distributed tracing** until complexity warrants it (or use a managed service)
8. **Define SLOs** and alert based on them, not individual low-level metrics
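The final recommendation—alert on SLOs rather than low-level metrics—can be sketched as a simple budget-burn check; the parameter names and threshold are illustrative:

```python
def should_alert(good_requests: int, total_requests: int,
                 slo: float = 0.999, burn_threshold: float = 1.0) -> bool:
    """Page only when the observed error rate is consuming the error
    budget faster than the SLO allows, rather than on any single
    low-level metric crossing a line."""
    if total_requests == 0:
        return False
    error_rate = 1 - good_requests / total_requests
    budget = 1 - slo                      # allowed error rate
    return error_rate > budget * burn_threshold
```

This keeps alert volume tied to user-visible harm: a host losing a disk in a fault-tolerant cluster never pages anyone unless it actually dents the SLI.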
---
## Key Takeaways
1. **Observability is a property, not a product**: Focus on outcomes, not tools
2. **Log aggregation is non-negotiable**: Implement before microservices, or suffer
3. **Correlation IDs pay dividends**: Easy now, painful later
4. **High-cardinality matters**: Traditional metrics tools will limit your ability to investigate
5. **Alert fatigue kills**: Fewer, better alerts beat many noisy ones
6. **Semantic monitoring shifts the question**: From "are there errors?" to "is the system working?"
7. **Testing in production is normal**: Embrace canaries, synthetics, and chaos engineering
8. **Tools should be democratic**: If only experts can use them, you've limited your team
9. **Human expertise isn't going away**: AI augments; it doesn't replace
---
## Recommended Further Reading
- *Observability Engineering* by Charity Majors, Liz Fong-Jones, and George Miranda (O'Reilly)
- *Site Reliability Engineering* edited by Betsy Beyer et al. (O'Reilly) — with caveat that it's Google-centric
- *The Site Reliability Workbook* edited by Betsy Beyer et al. (O'Reilly)
---
## Connection to Other Chapters
- **Chapter 6 (Workflow)**: Correlation IDs first introduced in saga discussion
- **Chapter 8 (Deployment)**: Canary releases and progressive delivery
- **Chapter 9 (Testing)**: Synthetic transactions reuse end-to-end test infrastructure
- **Chapter 11 (Security)**: Log aggregation creates security considerations (data sensitivity, access controls)
- **Chapter 12 (Resiliency)**: Chaos engineering and fault injection covered in depth
- **Chapter 15 (Organizational Structures)**: Platform teams often own observability infrastructure