05 - Implementing Microservice Communication

- [[#The Big Idea|The Big Idea]]
- [[#Looking for the Ideal Technology|Looking for the Ideal Technology]]
- [[#Technology Options Overview|Technology Options Overview]]
- [[#Remote Procedure Calls (RPC)|Remote Procedure Calls (RPC)]]
    - [[#Remote Procedure Calls (RPC)#Advantages|Advantages]]
    - [[#Remote Procedure Calls (RPC)#Challenges|Challenges]]
    - [[#Remote Procedure Calls (RPC)#Author's Recommendation|Author's Recommendation]]
- [[#REST|REST]]
    - [[#REST#HTTP Gives You for Free|HTTP Gives You for Free]]
    - [[#REST#HATEOAS (Hypermedia)|HATEOAS (Hypermedia)]]
    - [[#REST#Where REST Excels|Where REST Excels]]
- [[#GraphQL|GraphQL]]
    - [[#GraphQL#The Problem It Solves|The Problem It Solves]]
    - [[#GraphQL#Challenges|Challenges]]
    - [[#GraphQL#Where to Use It|Where to Use It]]
- [[#Message Brokers|Message Brokers]]
    - [[#Message Brokers#Queues vs Topics|Queues vs Topics]]
    - [[#Message Brokers#Guaranteed Delivery|Guaranteed Delivery]]
    - [[#Message Brokers#Kafka Special Features|Kafka Special Features]]
- [[#Schemas: The Author's Strong Opinion|Schemas: The Author's Strong Opinion]]
    - [[#Schemas: The Author's Strong Opinion#Schema Types by Technology|Schema Types by Technology]]
    - [[#Schemas: The Author's Strong Opinion#Structural vs Semantic Breakages|Structural vs Semantic Breakages]]
    - [[#Schemas: The Author's Strong Opinion#Schema Comparison Tools|Schema Comparison Tools]]
- [[#Handling Breaking Changes|Handling Breaking Changes]]
    - [[#Handling Breaking Changes#The Goal: Independent Deployability|The Goal: Independent Deployability]]
    - [[#Handling Breaking Changes#Five Strategies to Avoid Breaking Changes|Five Strategies to Avoid Breaking Changes]]
    - [[#Handling Breaking Changes#When Breaking Changes Are Unavoidable|When Breaking Changes Are Unavoidable]]
    - [[#Handling Breaking Changes#Expand and Contract Pattern|Expand and Contract Pattern]]
    - [[#Handling Breaking Changes#Semantic Versioning|Semantic Versioning]]
- [[#Client Libraries: A Double-Edged Sword|Client Libraries: A Double-Edged Sword]]
    - [[#Client Libraries: A Double-Edged Sword#The Problem|The Problem]]
    - [[#Client Libraries: A Double-Edged Sword#The AWS Model (Recommended)|The AWS Model (Recommended)]]
    - [[#Client Libraries: A Double-Edged Sword#Netflix's Approach|Netflix's Approach]]
- [[#Service Discovery|Service Discovery]]
    - [[#Service Discovery#DNS (Simple but Limited)|DNS (Simple but Limited)]]
    - [[#Service Discovery#Dynamic Registries|Dynamic Registries]]
    - [[#Service Discovery#Kubernetes Service Discovery|Kubernetes Service Discovery]]
- [[#API Gateways vs Service Meshes|API Gateways vs Service Meshes]]
    - [[#API Gateways vs Service Meshes#API Gateway: Do's and Don'ts|API Gateway: Do's and Don'ts]]
    - [[#API Gateways vs Service Meshes#Service Mesh Features|Service Mesh Features]]
    - [[#API Gateways vs Service Meshes#How Service Meshes Work|How Service Meshes Work]]
    - [[#API Gateways vs Service Meshes#Do You Need a Service Mesh?|Do You Need a Service Mesh?]]
- [[#Documenting Services|Documenting Services]]
    - [[#Documenting Services#Explicit Schemas Help, But Aren't Enough|Explicit Schemas Help, But Aren't Enough]]
    - [[#Documenting Services#Tools|Tools]]
    - [[#Documenting Services#The "Humane Registry"|The "Humane Registry"]]
- [[#How MusicCorp Compares to Chapter 5 Recommendations|How MusicCorp Compares to Chapter 5 Recommendations]]
- [[#Discussion Questions|Discussion Questions]]
- [[#Key Quotes|Key Quotes]]
- [[#Recommended Reading|Recommended Reading]]

## The Big Idea

This chapter is about the **practical technology choices** for implementing the communication styles from Chapter 4. The key principle: let your communication style guide technology selection, not the other way around.

## Looking for the Ideal Technology

Five criteria for evaluating communication technology:

| Criterion | Why It Matters |
|-----------|---------------|
| **Backward Compatibility** | Adding fields shouldn't break clients |
| **Explicit Interface** | Clear contract between service and consumers |
| **Technology Agnostic** | Don't lock yourself into one stack |
| **Simple for Consumers** | Easy adoption without tight coupling |
| **Hide Implementation** | Internal changes shouldn't break clients |

---

## Technology Options Overview

```
┌──────────────────────────────────────────────────────────────────┐
│                     Communication Technology                     │
├─────────────────┬─────────────────┬─────────────────┬────────────┤
│      RPC        │      REST       │    GraphQL      │  Brokers   │
│  (gRPC, SOAP)   │  (HTTP + JSON)  │   (Queries)     │  (Kafka,   │
│                 │                 │                 │  RabbitMQ) │
├─────────────────┼─────────────────┼─────────────────┼────────────┤
│ Sync req-resp   │ Sync req-resp   │ Sync req-resp   │ Async      │
│ Binary protocol │ Text protocol   │ Query language  │ Events     │
│ Schema required │ Schema optional │ Schema required │ Pub/Sub    │
└─────────────────┴─────────────────┴─────────────────┴────────────┘
```

---

## Remote Procedure Calls (RPC)

Makes remote calls look like local calls. Examples: gRPC, SOAP, Thrift.

### Advantages

- Automatic client stub generation from schema
- Binary protocols = smaller payloads, faster serialization
- Strong typing and IDE support

### Challenges

- **Technology coupling**: Some (like Java RMI) lock you into a platform
- **Local ≠ Remote**: Network failures, latency, and marshaling costs are hidden
- **Brittleness**: Adding/removing fields can break client stubs (especially Java RMI)

### Author's Recommendation

> "If I was looking at options in this space, **gRPC would be at the top of my list**."

gRPC excels when you control both client and server. For wide interoperability, prefer REST.

---

## REST

Architectural style built on resources, representations, and HTTP verbs.

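
The resource-plus-verbs idea can be sketched without any web framework. The example below is only an illustration: the `/albums/<id>` path, the in-memory `ALBUMS` store, the `handle` function, and the particular status codes chosen are all assumptions made for this sketch, not something taken from the book or from the MusicCorp code.

```python
from typing import Optional, Tuple

# Hypothetical in-memory store standing in for an album service's database.
ALBUMS: dict = {}


def handle(method: str, path: str, body: Optional[dict] = None) -> Tuple[int, Optional[dict]]:
    """Apply an HTTP verb to an /albums/<id> resource.

    Returns (status_code, representation). No real network is involved.
    """
    prefix = "/albums/"
    if not path.startswith(prefix) or len(path) == len(prefix):
        return 404, None  # unknown resource
    album_id = path[len(prefix):]

    if method == "GET":  # read the current representation
        if album_id in ALBUMS:
            return 200, ALBUMS[album_id]
        return 404, None
    if method == "PUT":  # create or fully replace the resource
        if body is None:
            return 400, None
        created = album_id not in ALBUMS
        ALBUMS[album_id] = body
        return (201 if created else 200), body
    if method == "DELETE":  # remove the resource
        if ALBUMS.pop(album_id, None) is None:
            return 404, None
        return 204, None
    return 405, None  # verb not supported on this resource
```

The same URL identifies the album throughout; only the verb changes what happens to it, and the status code tells the client how the request went.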
### HTTP Gives You for Free

- Caching (Varnish, CDNs)
- Load balancing (nginx, HAProxy)
- Security (TLS, auth mechanisms)
- Well-understood error codes (4xx, 5xx)

### HATEOAS (Hypermedia)

The theory: clients discover endpoints via links, not hardcoded URLs.

```xml
<album>
  <name>Give Blood</name>
  <link rel="/artist" href="/artist/theBrakes" />
  <link rel="/instantpurchase" href="/instantPurchase/1234" />
</album>
```

**Reality check**: Author admits HATEOAS is "rarely practiced" and hasn't seen evidence it delivers enough value for the effort.

### Where REST Excels

- External APIs (wide client compatibility)
- Caching-heavy workloads
- When you need maximum interoperability

---

## GraphQL

Client-defined queries that aggregate data from multiple services.

### The Problem It Solves

Mobile app needs customer info + last 5 orders.

Without GraphQL:

- 2 API calls (Customer + Orders)
- Over-fetching (gets all fields, only needs a few)
- Wastes bandwidth and battery

With GraphQL: One query, exactly the fields needed.

### Challenges

- Expensive queries can hammer the server (no query planner like SQL)
- Caching is complex (can't use HTTP caching easily)
- Works better for reads than writes
- Can reinforce "microservices as database wrappers" mindset

### Where to Use It

- Mobile clients (constrained bandwidth)
- External APIs that need flexibility (e.g., GitHub)
- **NOT** for general microservice-to-microservice communication

---

## Message Brokers

Middleware for asynchronous communication. Examples: RabbitMQ, Kafka, AWS SQS/SNS.

### Queues vs Topics

| Queues | Topics |
|--------|--------|
| Point-to-point | Pub/sub |
| One consumer group | Multiple consumer groups |
| Load distribution | Event broadcast |
| Sender knows destination | Sender doesn't know who's listening |

### Guaranteed Delivery

The killer feature: broker holds messages until delivered, even if downstream is unavailable.

> **Warning**: "Guaranteed delivery" means different things to different brokers.
> Read the docs carefully!

### Kafka Special Features

- **Message permanence**: Messages can be retained for a configurable period, including indefinitely (not just until consumed)
- **Massive scale**: 50,000+ producers/consumers on one cluster (Netflix)
- **Stream processing**: KSQL for real-time transformations
- **Ordering**: Guaranteed within a partition (not across partitions)

---

## Schemas: The Author's Strong Opinion

> "I think that having an explicit schema more than offsets any perceived benefit of having schemaless communication."

### Schema Types by Technology

| Technology | Schema Format |
|------------|---------------|
| REST (JSON) | JSON Schema, OpenAPI |
| REST (XML) | XSD |
| gRPC | Protocol Buffers |
| SOAP | WSDL |
| Kafka | Avro (often), Protocol Buffers |
| Events | CloudEvents, AsyncAPI |

### Structural vs Semantic Breakages

| Type | Example | How to Catch |
|------|---------|--------------|
| **Structural** | Remove a field | Schema comparison tools |
| **Semantic** | `calculate(a,b)` changes from add to multiply | Testing only |

### Schema Comparison Tools

- **Protolock**: Protocol Buffers
- **json-schema-diff-validator**: JSON Schema
- **openapi-diff**: OpenAPI
- **Confluent Schema Registry**: JSON Schema, Avro, Protocol Buffers

---

## Handling Breaking Changes

### The Goal: Independent Deployability

Never force consumers to upgrade in lockstep with you.

### Five Strategies to Avoid Breaking Changes

1. **Expansion changes**: Only add, never remove
2. **Tolerant reader**: Consumers ignore unknown fields
3. **Right technology**: gRPC's field numbers handle additions gracefully
4. **Explicit interface**: Clear schema = clear boundaries
5. **Catch breaks early**: Schema comparison in CI

### When Breaking Changes Are Unavoidable

| Option | Description | Author's Take |
|--------|-------------|---------------|
| **Lockstep deployment** | Everyone upgrades together | "Flies in the face of independent deployability" |
| **Coexist versions** | Run V1 and V2 simultaneously | Problematic (branched code, shared state) |
| **Emulate old interface** | V2 service exposes both V1 and V2 endpoints | **Preferred approach** |

### Expand and Contract Pattern

```
Phase 1: Expand (add V2 endpoint, keep V1)
         └── Consumers migrate at their own pace

Phase 2: Contract (remove V1 when no longer used)
         └── Track usage to know when safe
```

### Semantic Versioning

`MAJOR.MINOR.PATCH`

- MAJOR: Breaking changes
- MINOR: New backward-compatible features
- PATCH: Bug fixes

---

## Client Libraries: A Double-Edged Sword

### The Problem

If the same team writes server AND client library, logic leaks into the client.

### The AWS Model (Recommended)

- AWS exposes raw SOAP/REST APIs
- SDKs are written by **different teams** (or community)
- Clients control when to upgrade

### Netflix's Approach

Client libraries handle:

- Service discovery
- Failure modes
- Logging
- Retry logic

But even Netflix admits this has led to "problematic coupling."

---

## Service Discovery

How do microservices find each other?

### DNS (Simple but Limited)

```
accounts.musiccorp.net     → 192.168.1.10
accounts-uat.musiccorp.net → 192.168.2.10
```

**Problem**: TTL caching means stale entries. Solution: Point DNS to a load balancer.

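
To see why TTL caching leads to stale entries, here is a minimal sketch of a client-side resolver cache. Everything in it is hypothetical: the `TtlDnsCache` class, the injected `resolve` and `clock` callables, the 300-second TTL, and the replacement address `192.168.1.20` are assumptions made for illustration, not how any particular resolver library works.

```python
import time
from typing import Callable, Dict, Tuple


class TtlDnsCache:
    """Caches host -> address answers until their TTL expires."""

    def __init__(
        self,
        resolve: Callable[[str], str],
        ttl_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._resolve = resolve  # the "real" lookup (assumed, injected)
        self._ttl = ttl_seconds
        self._clock = clock      # injectable so the sketch is testable
        self._entries: Dict[str, Tuple[str, float]] = {}

    def lookup(self, host: str) -> str:
        now = self._clock()
        cached = self._entries.get(host)
        if cached is not None and now < cached[1]:
            # Still within TTL: return the cached answer, even if the
            # authoritative record has changed in the meantime.
            return cached[0]
        address = self._resolve(host)
        self._entries[host] = (address, now + self._ttl)
        return address


# Usage: the record changes, but the client keeps the old address
# until the TTL runs out.
records = {"accounts.musiccorp.net": "192.168.1.10"}
now = [0.0]
cache = TtlDnsCache(records.__getitem__, ttl_seconds=300, clock=lambda: now[0])

first = cache.lookup("accounts.musiccorp.net")
records["accounts.musiccorp.net"] = "192.168.1.20"  # instance moved
now[0] = 299.0
stale = cache.lookup("accounts.musiccorp.net")      # old address, still cached
now[0] = 300.0
fresh = cache.lookup("accounts.musiccorp.net")      # TTL expired, re-resolved
```

Pointing the DNS name at a load balancer sidesteps this: the cached address stays valid while the instances behind it come and go.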
### Dynamic Registries

| Tool | Key Features |
|------|--------------|
| **Consul** | HTTP API, built-in DNS, health checks |
| **etcd** | Bundled with Kubernetes |
| **ZooKeeper** | "Better solutions exist nowadays" |

### Kubernetes Service Discovery

- Pods register with metadata
- Services pattern-match to find pods
- Built-in, no extra tools needed

---

## API Gateways vs Service Meshes

```
                 ┌──────────────────────┐
External         │                      │
Clients ───────▶ │     API Gateway      │───── North-South
                 │     (perimeter)      │
                 └──────────────────────┘
                            │
                   ┌────────▼────────┐
                   │                 │
            ┌──────┴──────┐  ┌───────┴───────┐
            │   Service   │  │    Service    │
            │    Mesh     │  │     Mesh      │──── East-West
            │   (proxy)   │  │    (proxy)    │
            └──────┬──────┘  └───────┬───────┘
                   │                 │
            ┌──────▼──────┐  ┌───────▼───────┐
            │ Microservice│  │ Microservice  │
            │      A      │  │       B       │
            └─────────────┘  └───────────────┘
```

### API Gateway: Do's and Don'ts

**Do:**

- Route external requests to internal services
- Handle API keys, rate limiting, logging
- Expose developer portal

**Don't:**

- Aggregate calls (use GraphQL or BFF pattern instead)
- Rewrite protocols ("turn SOAP into REST")
- Put business logic in the gateway

> "Keeping smarts in our microservices helps [independent deployability]. If we now also have to make changes in intermediate layers, things become more problematic."

### Service Mesh Features

- Mutual TLS (mTLS)
- Correlation IDs
- Service discovery
- Load balancing
- Consistent behavior across languages

### How Service Meshes Work

Local proxy (often Envoy) runs alongside each microservice instance:

```
Order Processor → Local Proxy → Network → Local Proxy → Payment
```

The proxy handles retries, TLS, and tracing; the microservice doesn't know it's there.

### Do You Need a Service Mesh?

Author's advice for years was: "If you can wait 6 months, wait 6 months."

Now (2024): Space has matured.
Consider if:

- Running on Kubernetes
- Have many microservices (not just 5)
- Multiple programming languages
- Need consistent cross-cutting behavior

---

## Documenting Services

### Explicit Schemas Help, But Aren't Enough

Schemas show structure. Documentation explains behavior.

### Tools

| Type | Options |
|------|---------|
| REST APIs | OpenAPI + portal (Ambassador, SwaggerUI) |
| Events | AsyncAPI, CloudEvents |
| Service Catalog | Spotify Backstage, Ambassador Service Catalog |

### The "Humane Registry"

More than a wiki: pull in live data.

- Service discovery info
- Health status
- API documentation
- Team ownership

**Example**: Financial Times' Biz Ops calculates a "System Operability Score" based on completeness of metadata, health checks, etc.

---

## How MusicCorp Compares to Chapter 5 Recommendations

| Book Recommendation | Our Implementation | Status |
|---------------------|-------------------|--------|
| **REST for sync calls** | Flask REST APIs (JSON) | Done |
| **Message broker for events** | Kafka (confluent-kafka) | Done |
| **Explicit schemas** | No formal schema (JSON by convention) | Gap |
| **OpenAPI documentation** | Swagger UI at /docs | Done |
| **Schema comparison in CI** | Not implemented | Gap |
| **Kubernetes service discovery** | K8s Services with DNS | Done |
| **Correlation IDs** | X-Correlation-ID header | Done |
| **Tolerant reader pattern** | Implicit (dict.get()) | Partial |
| **Service mesh** | Not implemented | Not yet |
| **API Gateway** | nginx Ingress | Done |
| **Service catalog/registry** | Not implemented | Gap |

---

## Discussion Questions

1. **Schema first or code first?** We have working REST APIs with OpenAPI specs generated from code. Should we write specs first (more work upfront) or continue generating from code?
2. **Kafka benefits**: We migrated from Redis pub/sub to Kafka. This gives us message persistence, replay capability, and better scalability.
   Future improvements could include schema registry integration and dead letter queues.
3. **Breaking change handling**: If we need to add a required field to `order.placed` event, how do we handle it? What's our strategy for versioning events?
4. **The API gateway question**: We now have nginx Ingress routing external traffic. What additional gateway features might we need as we scale?
5. **Service mesh ROI**: The book says 5 microservices don't justify a service mesh. We have 5. What would need to change for it to make sense?
6. **Consumer-driven contracts**: The book mentions Pact for testing contracts. Should we implement this? How would it work with our event-driven architecture?
7. **The GraphQL question**: We don't have a BFF or GraphQL. If we built a mobile app that needed data from multiple services, would we add GraphQL or create a BFF?

---

## Key Quotes

> "I think that having an explicit schema more than offsets any perceived benefit of having schemaless communication."

> "Keep middleware dumb, smarts in endpoints."

> "Lockstep deployment flies in the face of independent deployability."

> "If you're having to support a wide variety of other applications that might need to talk to your microservices, [REST] would likely be a better fit [than gRPC]."

> "Do you need a service mesh? ...If you have five microservices, I don't think you can easily justify a service mesh."

---

## Recommended Reading

- *REST in Practice* by Jim Webber et al.
- *Designing Event-Driven Systems* by Ben Stopford (Kafka deep dive)
- *Kafka: The Definitive Guide* by Neha Narkhede et al.