It is 2 AM. Your order service starts responding slowly. Within seconds, the payment service backs up because it is waiting on the order service. Then the inventory service stalls because it depends on payments. Then the notification service. Then the API gateway. Within 3 minutes, your entire platform is down. All because one service got slow.

This is a cascading failure. And the circuit breaker pattern exists to stop it.

Table of Contents

  1. What Is the Circuit Breaker Pattern?
  2. Why Cascading Failures Are So Dangerous
  3. How the Circuit Breaker State Machine Works
  4. Implementing a Circuit Breaker From Scratch
  5. Production Libraries You Should Use
  6. Circuit Breaker vs Other Resilience Patterns
  7. Building a Complete Fault Tolerance Stack
  8. How Netflix Stops One Service From Killing Everything
  9. Circuit Breakers in Service Meshes
  10. Monitoring and Alerting
  11. Common Mistakes
  12. Lessons Learned

What Is the Circuit Breaker Pattern?

The circuit breaker pattern is borrowed from electrical engineering. In your house, a circuit breaker trips when it detects excessive current. It cuts the circuit to prevent a fire. Once the problem is fixed, you flip the breaker back on.

In software, the idea is the same. A circuit breaker wraps calls to an external service and monitors for failures. When failures cross a threshold, the breaker “trips” and stops making calls to that service. Instead of waiting for timeouts and piling up threads, your service fails fast and returns an error or a fallback response immediately.

Michael Nygard popularized this pattern in his book Release It!. Martin Fowler later wrote about it in a widely referenced blog post. Netflix then made it mainstream by building it into their Hystrix library, which they used to protect every inter-service call in their microservices architecture.

The core mechanic is straightforward:

flowchart LR
    A[Your Service] --> B{Circuit Breaker}
    B -->|Closed| C[Downstream Service]
    B -->|Open| D[Fail Fast / Fallback]
    C -->|Success| E[Return Response]
    C -->|Failure| F[Track Failure Count]
    F --> B

    style B fill:#fef3c7,stroke:#d97706,stroke-width:3px
    style D fill:#fee2e2,stroke:#dc2626,stroke-width:2px
    style E fill:#dcfce7,stroke:#16a34a,stroke-width:2px

When the breaker is closed, requests flow through normally. When it is open, requests fail immediately without even reaching the downstream service. This is the key insight: failing fast is better than failing slow.

Why Cascading Failures Are So Dangerous

To understand why circuit breakers matter, you need to understand what happens without them.

Say you have four microservices that depend on each other:

graph LR
    A[API Gateway] --> B[Order Service]
    B --> C[Payment Service]
    C --> D[Fraud Detection]
    B --> E[Inventory Service]

    style A fill:#dbeafe,stroke:#3b82f6,stroke-width:2px
    style B fill:#dbeafe,stroke:#3b82f6,stroke-width:2px
    style C fill:#dbeafe,stroke:#3b82f6,stroke-width:2px
    style D fill:#fee2e2,stroke:#dc2626,stroke-width:2px
    style E fill:#dbeafe,stroke:#3b82f6,stroke-width:2px

The Fraud Detection service hits a bug and starts responding in 30 seconds instead of 200ms. Here is what happens next:

| Time | What Happens |
|------|--------------|
| T+0s | Fraud Detection slows to 30s response time |
| T+5s | Payment Service threads pile up waiting for Fraud Detection |
| T+15s | Payment Service exhausts its thread pool (200 threads, all waiting) |
| T+20s | Order Service calls to Payment start timing out |
| T+30s | Order Service exhausts its own thread pool |
| T+45s | API Gateway starts returning 503 to all users |
| T+60s | Entire platform is down |

One slow service took down four services in 60 seconds. The Fraud Detection service did not even crash. It just got slow. And that is actually worse than crashing, because a crashed service returns errors immediately. A slow service holds onto resources.

This is what engineers mean by cascading failure. The failure cascades upstream through the dependency chain, like dominoes falling. Each service consumes its resources waiting for the one below it.
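The speed of the collapse follows from simple arithmetic. As a rough sketch (the 20 requests per second is an assumed figure, not one taken from the scenario above), the time a fixed thread pool survives once every call blocks can be estimated like this:

```python
from typing import Optional


def seconds_until_pool_exhausted(pool_size: int, requests_per_second: float,
                                 downstream_latency_s: float) -> Optional[float]:
    """Estimate how long a thread pool survives when every call blocks.

    Returns None when the pool never fills, i.e. the steady-state number of
    in-flight calls (rate x latency, Little's law) stays below the pool size.
    """
    in_flight_at_steady_state = requests_per_second * downstream_latency_s
    if in_flight_at_steady_state < pool_size:
        return None
    # Until the first slow call returns, every new request takes a thread
    # and none are released, so the pool drains at the arrival rate.
    return pool_size / requests_per_second


# Healthy: 200 ms latency keeps about 4 threads busy, so the pool never fills.
print(seconds_until_pool_exhausted(200, 20, 0.2))   # None
# Degraded: 30 s latency would need 600 threads, so all 200 are gone in 10 s.
print(seconds_until_pool_exhausted(200, 20, 30.0))  # 10.0
```

The same traffic that was harmless at 200ms becomes fatal at 30s, without a single extra request arriving.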

sequenceDiagram
    participant GW as API Gateway
    participant OS as Order Service
    participant PS as Payment Service
    participant FD as Fraud Detection

    GW->>OS: Create Order
    OS->>PS: Process Payment
    PS->>FD: Check Fraud
    
    Note over FD: Responding slowly (30s)
    
    Note over PS: Thread waiting...<br/>Thread pool filling up
    Note over OS: Thread waiting...<br/>Thread pool filling up
    Note over GW: Thread waiting...<br/>Thread pool filling up

    FD-->>PS: Response (after 30s)
    
    Note over GW,FD: By now, hundreds of threads<br/>are blocked across all services

Without a circuit breaker, your services are too polite. They keep waiting, keep retrying, keep hoping the downstream service will respond. With a circuit breaker, your service recognizes the problem early and stops making things worse.

If you have dealt with thundering herd problems before, you will recognize a similar pattern here. A surge of requests overwhelms a system, and the system’s own retry behavior amplifies the problem. Circuit breakers cut that feedback loop.

How the Circuit Breaker State Machine Works

A circuit breaker is a state machine with three states: Closed, Open, and Half-Open.

stateDiagram-v2
    [*] --> Closed
    Closed --> Open : Failure threshold exceeded
    Open --> HalfOpen : Timeout expires
    HalfOpen --> Closed : Test requests succeed
    HalfOpen --> Open : Test requests fail

Closed State (Normal Operation)

This is the default state. All requests pass through to the downstream service. The circuit breaker monitors every call and tracks the outcome in a sliding window.

Two types of sliding windows are commonly used:

  • Count-based: Tracks the last N calls (e.g., the last 10 requests)
  • Time-based: Tracks all calls in the last N seconds (e.g., the last 60 seconds)

When the failure rate in the window crosses a configured threshold (say 50%), the breaker trips and transitions to Open.

Important: The breaker waits for a minimum number of calls before evaluating. If you set the threshold to 50% and only 2 calls have happened, 1 failure should not trip the breaker. You need enough data to make a meaningful decision.
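A count-based window with a minimum-calls guard can be sketched in a few lines. This is an illustration only; the class and parameter names are made up for this example and do not come from any particular library:

```python
from collections import deque


class CountBasedWindow:
    """Tracks the outcome of the last `size` calls."""

    def __init__(self, size: int = 10, minimum_calls: int = 5,
                 failure_rate_threshold: float = 50.0):
        self.outcomes = deque(maxlen=size)  # True = failure, False = success
        self.minimum_calls = minimum_calls
        self.failure_rate_threshold = failure_rate_threshold

    def record(self, failed: bool) -> None:
        self.outcomes.append(failed)  # the oldest outcome drops off automatically

    def failure_rate(self) -> float:
        if not self.outcomes:
            return 0.0
        return 100.0 * sum(self.outcomes) / len(self.outcomes)

    def should_trip(self) -> bool:
        # Too few samples: do not evaluate the rate at all.
        if len(self.outcomes) < self.minimum_calls:
            return False
        return self.failure_rate() >= self.failure_rate_threshold


window = CountBasedWindow()
window.record(failed=True)
window.record(failed=False)
print(window.failure_rate(), window.should_trip())  # 50.0 False (only 2 calls)

for failed in (True, False, True):
    window.record(failed)
print(window.failure_rate(), window.should_trip())  # 60.0 True (3 of 5 failed)
```

A time-based window works the same way, except that outcomes are evicted by age instead of by position.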

Open State (Failing Fast)

The breaker is tripped. Every incoming request fails immediately with an error or returns a fallback response. No calls are made to the downstream service.

This does two things:

  1. Protects your service: Your threads and connections are not wasted waiting for a service that is down
  2. Protects the failing service: You stop piling on requests, giving it room to recover

The breaker stays open for a configured timeout (typically 30 to 60 seconds). After the timeout, it moves to Half-Open.

Half-Open State (Testing Recovery)

This is the probe state. The breaker lets a small number of test requests through to the downstream service.

  • If the test requests succeed, the breaker closes and normal traffic resumes
  • If any test request fails, the breaker opens again and resets the timeout

This is the circuit breaker’s self-healing mechanism. You do not need to manually reset it. It tests recovery automatically on a schedule.
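The three states and four transitions fit in a small lookup table. The sketch below is one possible encoding; the event names are invented for this example:

```python
from enum import Enum


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class Event(Enum):
    THRESHOLD_EXCEEDED = "threshold_exceeded"
    TIMEOUT_EXPIRED = "timeout_expired"
    PROBES_SUCCEEDED = "probes_succeeded"
    PROBE_FAILED = "probe_failed"


# Every legal transition of the breaker; anything else leaves the state unchanged.
TRANSITIONS = {
    (State.CLOSED, Event.THRESHOLD_EXCEEDED): State.OPEN,
    (State.OPEN, Event.TIMEOUT_EXPIRED): State.HALF_OPEN,
    (State.HALF_OPEN, Event.PROBES_SUCCEEDED): State.CLOSED,
    (State.HALF_OPEN, Event.PROBE_FAILED): State.OPEN,
}


def next_state(state: State, event: Event) -> State:
    return TRANSITIONS.get((state, event), state)


# Trip, probe and fail, wait again, probe and recover.
state = State.CLOSED
for event in (Event.THRESHOLD_EXCEEDED, Event.TIMEOUT_EXPIRED,
              Event.PROBE_FAILED, Event.TIMEOUT_EXPIRED, Event.PROBES_SUCCEEDED):
    state = next_state(state, event)
    print(event.name, "->", state.name)
```

Keeping the transitions in data rather than in nested conditionals makes it easy to check that no path skips the Half-Open probe.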

Here is the complete flow with all the decision points:

flowchart TD
    A[Request Arrives] --> B{Breaker State?}
    
    B -->|Closed| C[Forward to Service]
    C --> D{Response?}
    D -->|Success| E[Record Success]
    D -->|Failure| F[Record Failure]
    E --> G[Return Response]
    F --> H{Failure Rate<br/>Above Threshold?}
    H -->|No| I[Return Error]
    H -->|Yes| J[Trip Breaker to Open]
    J --> I
    
    B -->|Open| K{Timeout<br/>Expired?}
    K -->|No| L[Fail Fast]
    K -->|Yes| M[Move to Half-Open]
    M --> N[Allow Test Request]
    
    B -->|Half-Open| N
    N --> O{Test Result?}
    O -->|Success| P[Close Breaker]
    O -->|Failure| Q[Reopen Breaker]
    P --> G
    Q --> L

    style J fill:#fee2e2,stroke:#dc2626,stroke-width:2px
    style L fill:#fee2e2,stroke:#dc2626,stroke-width:2px
    style P fill:#dcfce7,stroke:#16a34a,stroke-width:2px
    style G fill:#dcfce7,stroke:#16a34a,stroke-width:2px
    style M fill:#fef3c7,stroke:#d97706,stroke-width:2px
    style N fill:#fef3c7,stroke:#d97706,stroke-width:2px

Configuration Parameters

Every circuit breaker has these knobs:

| Parameter | What It Controls | Typical Value |
|-----------|------------------|---------------|
| failureRateThreshold | Percentage of failures to trip the breaker | 50% |
| slidingWindowSize | Number of calls (or seconds) to evaluate | 10-20 calls |
| minimumNumberOfCalls | Minimum calls before evaluating failure rate | 5 |
| waitDurationInOpenState | How long to stay open before testing | 30-60 seconds |
| permittedCallsInHalfOpen | Number of test calls in half-open state | 3-5 |
| slowCallRateThreshold | Percentage of slow calls to trip the breaker | 80% |
| slowCallDurationThreshold | What counts as a “slow” call | 2-5 seconds |

The slow call thresholds are often overlooked. A service that responds in 10 seconds is worse than one that returns errors in 50ms. Slow responses tie up your threads. Fast errors free them immediately. Your circuit breaker should trip on slow calls too, not just errors.
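To make the slow-call idea concrete, here is a small sketch that computes a slow-call rate over a window of observed durations. The function name and the sample durations are invented for illustration:

```python
def slow_call_rate(durations_s: list, slow_threshold_s: float) -> float:
    """Percentage of calls in the window that took longer than slow_threshold_s."""
    if not durations_s:
        return 0.0
    slow = sum(1 for d in durations_s if d > slow_threshold_s)
    return 100.0 * slow / len(durations_s)


# Ten calls, none of which raised an error, but nine of them were slow.
window = [0.1, 2.5, 3.0, 4.2, 2.1, 6.0, 2.9, 3.3, 5.1, 2.2]
rate = slow_call_rate(window, slow_threshold_s=2.0)
print(rate)          # 90.0
print(rate >= 80.0)  # True: trips on slowness even with a 0% error rate
```

A breaker that only looked at errors would report this window as perfectly healthy.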

Implementing a Circuit Breaker From Scratch

Before reaching for a library, it helps to understand how a circuit breaker works internally. Here is a minimal implementation in Python:

import time
from collections import deque
from enum import Enum
from threading import Lock


class CircuitOpenError(Exception):
    """Raised when a call is rejected without reaching the downstream service."""


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class CircuitBreaker:
    def __init__(self, failure_threshold=5, recovery_timeout=30,
                 half_open_max_calls=3, window_size=10):
        self.failure_threshold = failure_threshold      # failures in the window that trip the breaker
        self.recovery_timeout = recovery_timeout        # seconds to stay open before probing
        self.half_open_max_calls = half_open_max_calls  # probe calls allowed in half-open
        self.window_size = window_size

        self.state = State.CLOSED
        # Count-based sliding window of outcomes: True = success, False = failure.
        self.results = deque(maxlen=window_size)
        self.opened_at = None
        self.half_open_calls = 0      # probes admitted since entering half-open
        self.half_open_successes = 0  # probes that have succeeded so far
        self.lock = Lock()

    def call(self, func, *args, **kwargs):
        with self.lock:
            if self.state == State.OPEN:
                if self._timeout_expired():
                    self.state = State.HALF_OPEN
                    self.half_open_calls = 0
                    self.half_open_successes = 0
                else:
                    raise CircuitOpenError(
                        f"Circuit is open. Retry after "
                        f"{self._seconds_until_retry():.0f}s"
                    )

            if self.state == State.HALF_OPEN:
                if self.half_open_calls >= self.half_open_max_calls:
                    raise CircuitOpenError("Half-open call limit reached")
                self.half_open_calls += 1

        # The downstream call runs outside the lock so calls do not serialize.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_success(self):
        with self.lock:
            if self.state == State.HALF_OPEN:
                self.half_open_successes += 1
                # Close only once every permitted probe has succeeded.
                if self.half_open_successes >= self.half_open_max_calls:
                    self.state = State.CLOSED
                    self.results.clear()
                return
            self.results.append(True)

    def _on_failure(self):
        with self.lock:
            if self.state == State.HALF_OPEN:
                # A single failed probe reopens the breaker and restarts the timeout.
                self._open()
                return
            self.results.append(False)
            if (self.state == State.CLOSED
                    and self.results.count(False) >= self.failure_threshold):
                self._open()

    def _open(self):
        self.state = State.OPEN
        # monotonic() is immune to wall-clock adjustments.
        self.opened_at = time.monotonic()

    def _timeout_expired(self):
        return time.monotonic() - self.opened_at >= self.recovery_timeout

    def _seconds_until_retry(self):
        elapsed = time.monotonic() - self.opened_at
        return max(0, self.recovery_timeout - elapsed)

Using it:

breaker = CircuitBreaker(failure_threshold=3, recovery_timeout=30)

def call_payment_service(order_id):
    try:
        return breaker.call(payment_client.charge, order_id)
    except CircuitOpenError:
        return {"status": "pending", "message": "Payment processing delayed"}
    except Exception as e:
        return {"status": "error", "message": str(e)}

This is a simplified version. Production circuit breakers add failure-rate and time-based sliding windows, slow-call detection, metrics emission, event listeners, and in some cases state shared across distributed instances. That is why you should use a library.

Production Libraries You Should Use

Do not build your own circuit breaker for production. Use a battle-tested library.

Resilience4j (Java / Spring Boot)

Resilience4j replaced Netflix Hystrix as the standard circuit breaker library for Java. Hystrix entered maintenance mode in 2018 and is no longer actively developed.

Configuration in application.yml:

resilience4j:
  circuitbreaker:
    instances:
      paymentService:
        slidingWindowType: COUNT_BASED
        slidingWindowSize: 10
        minimumNumberOfCalls: 5
        failureRateThreshold: 50
        waitDurationInOpenState: 30s
        permittedNumberOfCallsInHalfOpenState: 3
        slowCallRateThreshold: 80
        slowCallDurationThreshold: 2s

Usage:

@Service
public class PaymentService {

    @CircuitBreaker(name = "paymentService", fallbackMethod = "fallback")
    public PaymentResponse charge(Order order) {
        return paymentClient.charge(order);
    }

    private PaymentResponse fallback(Order order, Throwable t) {
        return new PaymentResponse("pending", "Payment queued for retry");
    }
}

Resilience4j also provides Retry, Bulkhead, RateLimiter, and TimeLimiter modules that compose well with the circuit breaker.

sony/gobreaker (Go)

The most popular circuit breaker library for Go, maintained by Sony.

package main

import (
    "fmt"
    "io"
    "net/http"
    "time"

    "github.com/sony/gobreaker/v2"
)

var cb *gobreaker.CircuitBreaker[[]byte]

func init() {
    cb = gobreaker.NewCircuitBreaker[[]byte](gobreaker.Settings{
        Name:        "payment-service",
        MaxRequests: 3,
        Interval:    60 * time.Second,
        Timeout:     30 * time.Second,
        ReadyToTrip: func(counts gobreaker.Counts) bool {
            failureRatio := float64(counts.TotalFailures) / 
                            float64(counts.Requests)
            return counts.Requests >= 5 && failureRatio >= 0.5
        },
        OnStateChange: func(name string, from, to gobreaker.State) {
            fmt.Printf("Circuit breaker %s: %s -> %s\n", name, from, to)
        },
    })
}

func CallPaymentService(orderID string) ([]byte, error) {
    body, err := cb.Execute(func() ([]byte, error) {
        resp, err := http.Get(
            fmt.Sprintf("http://payment-service/charge/%s", orderID),
        )
        if err != nil {
            return nil, err
        }
        defer resp.Body.Close()
        
        if resp.StatusCode >= 500 {
            return nil, fmt.Errorf("server error: %d", resp.StatusCode)
        }
        
        return io.ReadAll(resp.Body)
    })
    return body, err
}

pybreaker (Python)

A clean, decorator-based circuit breaker for Python.

import pybreaker
import requests

payment_breaker = pybreaker.CircuitBreaker(
    fail_max=5,
    reset_timeout=30,
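    # Do not count 4xx client errors as failures; 5xx responses still count
    # toward tripping the breaker.
    exclude=[
        lambda e: isinstance(e, requests.exceptions.HTTPError)
        and e.response is not None
        and e.response.status_code < 500
    ],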
)

@payment_breaker
def charge_payment(order_id, amount):
    response = requests.post(
        "https://payment-service/charge",
        json={"order_id": order_id, "amount": amount},
        timeout=5,
    )
    response.raise_for_status()
    return response.json()

# Usage with fallback
def process_payment(order_id, amount):
    try:
        return charge_payment(order_id, amount)
    except pybreaker.CircuitBreakerError:
        return {"status": "queued", "message": "Payment will be retried"}

Quick Comparison

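| Feature | Resilience4j | gobreaker | pybreaker |
|---------|--------------|-----------|-----------|
| Language | Java | Go | Python |
| Failure Tracking | Count-based or time-based sliding window | Counts reset every Interval | Consecutive failure count |
| Slow Call Detection | Yes | Manual (report slow calls as errors) | No |
| Metrics | Micrometer, Prometheus | Callbacks | Listeners |
| Distributed State | No (use Redis) | No (use Redis) | Redis support built-in |
| Active Maintenance | Yes | Yes | Yes |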

Circuit Breaker vs Other Resilience Patterns

The circuit breaker is one of several resilience patterns. Understanding when to use which is critical.

Circuit Breaker vs Retry

flowchart LR
    subgraph Retry["Retry Pattern"]
        R1[Request Fails] --> R2[Wait] --> R3[Try Again]
        R3 --> R4{Success?}
        R4 -->|No| R2
        R4 -->|Yes| R5[Done]
    end
    
    subgraph CB["Circuit Breaker Pattern"]
        C1[Multiple Failures] --> C2[Stop All Requests]
        C2 --> C3[Wait for Timeout]
        C3 --> C4[Test with Probe]
        C4 -->|Fail| C2
        C4 -->|Pass| C5[Resume Traffic]
    end

    style R1 fill:#fef3c7,stroke:#d97706,stroke-width:2px
    style R5 fill:#dcfce7,stroke:#16a34a,stroke-width:2px
    style C2 fill:#fee2e2,stroke:#dc2626,stroke-width:2px
    style C5 fill:#dcfce7,stroke:#16a34a,stroke-width:2px

| Aspect | Retry | Circuit Breaker |
|--------|-------|-----------------|
| Handles | Transient failures (brief glitches) | Sustained failures (service is down) |
| Behavior | Keeps trying the same call | Stops trying, fails fast |
| Scope | Single request | All requests to that service |
| Risk | Can overwhelm a struggling service | Can reject calls to a healthy service if thresholds are too sensitive |
| Use when | Service occasionally hiccups | Service is consistently failing |

Use them together: Retry handles individual failures. When retries keep failing, the circuit breaker trips and stops all traffic. This prevents retries from becoming a thundering herd of repeated requests hammering a dying service.
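One way to combine the two is a retry loop that backs off between attempts but gives up immediately once the breaker is open. The sketch below assumes the retried function is already wrapped by a breaker that raises a CircuitOpenError; both that exception class and the helper name are hypothetical:

```python
import random
import time


class CircuitOpenError(Exception):
    """Raised by the (hypothetical) breaker when it is open."""


def retry_with_backoff(func, max_attempts=3, base_delay_s=0.1, max_delay_s=2.0,
                       sleep=time.sleep):
    """Retry func on failure, but never retry when the breaker is open."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except CircuitOpenError:
            raise  # the breaker has given up; retrying would only add load
        except Exception:
            if attempt == max_attempts:
                raise
            # Exponential backoff with full jitter: wait a random time up to the cap.
            cap = min(max_delay_s, base_delay_s * 2 ** (attempt - 1))
            sleep(random.uniform(0, cap))


attempts = []


def flaky():
    attempts.append(1)
    if len(attempts) < 3:
        raise ConnectionError("transient glitch")
    return "ok"


# The sleep function is stubbed out so the example finishes instantly.
print(retry_with_backoff(flaky, sleep=lambda s: None), len(attempts))  # ok 3
```

The jitter matters: without it, every client that failed at the same moment would retry at the same moment too.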

Circuit Breaker vs Bulkhead

The bulkhead pattern isolates resources (thread pools, connection pools) per downstream service. If your payment service is slow and consuming all 200 threads, a bulkhead ensures your inventory service still has its own dedicated 50 threads.

  • Circuit breaker detects the problem (failure rate is high) and reacts (stop calling)
  • Bulkhead contains the problem (one slow service cannot starve others)

They solve different parts of the same problem. Use both.
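A bulkhead does not need a dedicated thread pool; a semaphore that caps concurrent calls per dependency gives the same isolation. A minimal sketch, with illustrative class names and limits:

```python
import threading


class BulkheadFullError(Exception):
    pass


class Bulkhead:
    """Caps the number of concurrent calls to one downstream service."""

    def __init__(self, max_concurrent_calls: int):
        self._slots = threading.BoundedSemaphore(max_concurrent_calls)

    def call(self, func, *args, **kwargs):
        # Do not queue: if every slot is taken, reject immediately.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError("no free slot for this dependency")
        try:
            return func(*args, **kwargs)
        finally:
            self._slots.release()


# One bulkhead per dependency, so a slow payment service can only use its own slots.
payment_bulkhead = Bulkhead(max_concurrent_calls=2)
inventory_bulkhead = Bulkhead(max_concurrent_calls=2)


# Simulate two in-flight payment calls by nesting; the third attempt is rejected.
def third_call():
    return payment_bulkhead.call(lambda: "never runs")


def second_call():
    return payment_bulkhead.call(third_call)


try:
    payment_bulkhead.call(second_call)
except BulkheadFullError as exc:
    print("payment:", exc)

# The inventory bulkhead is unaffected by the saturated payment bulkhead.
print("inventory:", inventory_bulkhead.call(lambda: "in stock"))
```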

Circuit Breaker vs Rate Limiter

A rate limiter controls how many requests a client can make. A circuit breaker controls whether requests should be made at all based on the downstream service’s health.

  • Rate limiter protects your service from too many incoming requests
  • Circuit breaker protects your service from calling unhealthy downstream services
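For contrast with the breaker, here is a sketch of a token-bucket rate limiter, one common way to limit incoming requests. It is not thread-safe and the names are illustrative; the clock is injectable so the example does not have to sleep:

```python
import time


class TokenBucket:
    """Allows bursts up to `capacity` and a sustained `refill_rate` per second."""

    def __init__(self, capacity: float, refill_rate: float, clock=time.monotonic):
        self.capacity = capacity
        self.refill_rate = refill_rate
        self.clock = clock
        self.tokens = capacity
        self.last_refill = clock()

    def allow(self) -> bool:
        now = self.clock()
        elapsed = now - self.last_refill
        # Refill in proportion to elapsed time, never beyond capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # a caller would typically answer 429 Too Many Requests


fake_now = [0.0]
bucket = TokenBucket(capacity=2, refill_rate=1, clock=lambda: fake_now[0])
print([bucket.allow() for _ in range(3)])  # [True, True, False]
fake_now[0] += 1.0                         # one second later, one token is back
print(bucket.allow())                      # True
```

The limiter decides based on the caller's request rate alone; it knows nothing about whether the downstream service is healthy, which is the breaker's job.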

Building a Complete Fault Tolerance Stack

Individual patterns are useful. Combined, they form a robust defense. Here is how they layer together:

flowchart TD
    A[Incoming Request] --> B[Rate Limiter]
    B -->|Allowed| C[Bulkhead]
    B -->|Rejected| D[429 Too Many Requests]
    C -->|Thread Available| E[Circuit Breaker]
    C -->|Pool Full| F[503 Service Unavailable]
    E -->|Closed| G[Timeout + Retry]
    E -->|Open| H[Fallback Response]
    G -->|Success| I[Return Response]
    G -->|All Retries Failed| J[Record Failure in Breaker]
    J --> H

    style B fill:#dbeafe,stroke:#3b82f6,stroke-width:2px
    style C fill:#fef3c7,stroke:#d97706,stroke-width:2px
    style E fill:#fef3c7,stroke:#d97706,stroke-width:2px
    style H fill:#fee2e2,stroke:#dc2626,stroke-width:2px
    style I fill:#dcfce7,stroke:#16a34a,stroke-width:2px

The order matters:

  1. Rate Limiter goes first. Reject excessive traffic before it consumes any resources.
  2. Bulkhead goes second. Isolate the remaining traffic into per-service thread pools.
  3. Circuit Breaker goes third. Check if the downstream service is healthy before calling.
  4. Timeout wraps the actual call. Do not wait forever.
  5. Retry handles transient failures within the timeout budget.
  6. Fallback returns a degraded but functional response when everything else fails.

Here is what this looks like in Resilience4j:

@Service
public class OrderService {

    @Bulkhead(name = "paymentService")
    @CircuitBreaker(name = "paymentService", fallbackMethod = "paymentFallback")
    @Retry(name = "paymentService")
    @TimeLimiter(name = "paymentService")
    public CompletableFuture<PaymentResponse> processPayment(Order order) {
        return CompletableFuture.supplyAsync(
            () -> paymentClient.charge(order)
        );
    }

    private CompletableFuture<PaymentResponse> paymentFallback(
            Order order, Throwable t) {
        // Queue the payment for async processing
        paymentQueue.send(order);
        return CompletableFuture.completedFuture(
            new PaymentResponse("queued", "Payment processing delayed")
        );
    }
}

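Note that the order of the annotations in the source does not determine the order in which they are applied. Resilience4j's Spring Boot integration uses a fixed default aspect order: Retry ( CircuitBreaker ( RateLimiter ( TimeLimiter ( Bulkhead ( method ) ) ) ) ). Retry is therefore the outermost decorator and Bulkhead the innermost, which differs from the layering described above. If you want the circuit breaker to wrap the retries, change the order with the aspect-order configuration properties (such as circuitBreakerAspectOrder and retryAspectOrder) or compose the decorators programmatically.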

How Netflix Stops One Service From Killing Everything

Netflix runs over 1,000 microservices. Any service can fail at any time. Their approach to fault tolerance is a textbook example of defense in depth.

Netflix built Hystrix (now in maintenance mode, succeeded by Resilience4j) to wrap every inter-service call. Here is what their architecture looks like:

flowchart TD
    A[User Request] --> B[API Gateway / Zuul]
    B --> C{Hystrix Command}
    
    C --> D[Thread Pool Isolation]
    D --> E{Circuit Breaker}
    
    E -->|Closed| F[Call Downstream Service]
    E -->|Open| G[Execute Fallback]
    
    F --> H{Response?}
    H -->|Success| I[Return to User]
    H -->|Timeout / Error| J[Increment Failure Count]
    J --> K{Threshold Exceeded?}
    K -->|Yes| L[Trip Circuit Open]
    K -->|No| G
    L --> G
    
    G --> M{Fallback Type}
    M --> N[Cached Response]
    M --> O[Default Value]
    M --> P[Different Service]

    style C fill:#fef3c7,stroke:#d97706,stroke-width:3px
    style L fill:#fee2e2,stroke:#dc2626,stroke-width:2px
    style I fill:#dcfce7,stroke:#16a34a,stroke-width:2px

Key design decisions Netflix made:

Thread pool isolation (bulkhead): Each downstream service gets its own thread pool. If the recommendation service is slow and uses all 20 of its threads, it does not affect the 20 threads allocated to the user profile service.

Fallback everything: Every Hystrix command has a fallback. If the personalized recommendation service is down, show generic trending movies. If the user’s viewing history is unavailable, show the default homepage. The user gets a degraded experience, not a broken one.

Request collapsing: When multiple threads request the same data within a short window, Hystrix batches them into a single backend call. This prevents a thundering herd of duplicate requests.

Monitoring with Hystrix Dashboard: Netflix monitors circuit breaker state in real time. When a breaker trips, they know immediately which service is failing and how many requests are being short-circuited.

The lesson from Netflix is not to copy their exact implementation. It is to adopt the philosophy: assume every dependency will fail, and design for it.

Circuit Breakers in Service Meshes

If you are running a service mesh like Istio, you get circuit breaking at the infrastructure level without changing your application code.

Istio’s Envoy sidecar proxy sits between your services and handles circuit breaking transparently:

flowchart LR
    subgraph PodA["Pod A"]
        A1[Your Service] --> A2[Envoy Proxy]
    end
    
    subgraph PodB["Pod B"]
        B2[Envoy Proxy] --> B1[Payment Service]
    end
    
    A2 -->|Circuit Breaker<br/>at Proxy Level| B2

    style A2 fill:#dbeafe,stroke:#3b82f6,stroke-width:2px
    style B2 fill:#dbeafe,stroke:#3b82f6,stroke-width:2px

Istio DestinationRule for circuit breaking:

apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: payment-service
spec:
  host: payment-service
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 100
      http:
        h2UpgradePolicy: DEFAULT
        http1MaxPendingRequests: 50
        http2MaxRequests: 100
    outlierDetection:
      consecutive5xxErrors: 5
      interval: 30s
      baseEjectionTime: 30s
      maxEjectionPercent: 50

The outlierDetection section is Istio’s circuit breaker. When a service instance returns 5 consecutive 5xx errors, Istio ejects it from the load balancing pool for 30 seconds. After that, it is allowed back in. If it fails again, it is ejected for longer.

This is powerful because:

  • No code changes: Your application does not know the circuit breaker exists
  • Per-instance granularity: If 1 out of 10 pods is unhealthy, only that pod gets ejected
  • Consistent behavior: Every service in the mesh gets the same protection

The tradeoff: you have less control over fallback behavior. Application-level circuit breakers let you return cached data or degraded responses. Mesh-level circuit breakers can only reject or reroute traffic.

For most teams, the answer is both. Use the service mesh for basic outlier detection and connection management. Use application-level circuit breakers for smart fallbacks.
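As a footnote to the "ejected for longer" behavior of outlier detection: Envoy's documentation describes the ejection duration as the base ejection time multiplied by the number of times the host has been ejected, up to a maximum. The sketch below shows that arithmetic only; the 300-second cap is an assumption here, so check the value for your Envoy and Istio versions:

```python
def ejection_seconds(times_ejected: int, base_ejection_s: float = 30.0,
                     max_ejection_s: float = 300.0) -> float:
    """Ejection duration grows linearly with the number of ejections, up to a cap."""
    return min(base_ejection_s * times_ejected, max_ejection_s)


# First, second, third and twelfth ejection with baseEjectionTime: 30s.
print([ejection_seconds(n) for n in (1, 2, 3, 12)])  # [30.0, 60.0, 90.0, 300.0]
```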

Monitoring and Alerting

A circuit breaker is only useful if you know when it trips. Here are the metrics you should track:

Key Metrics

| Metric | What to Watch | Alert Threshold |
|--------|---------------|-----------------|
| Circuit breaker state | State changes (Closed, Open, Half-Open) | Alert on every Open transition |
| Failure rate | Percentage of failed calls | > 10% over 5 minutes |
| Rejected calls | Requests rejected by open breaker | Any sustained rejections |
| Call duration (p99) | Latency of calls through the breaker | > 2x normal p99 |
| State transition frequency | How often the breaker flips | > 3 transitions in 5 minutes |

What Your Dashboard Should Show

Track these for each circuit breaker instance:

# Prometheus metrics (Resilience4j)
resilience4j_circuitbreaker_state{name="paymentService"}
resilience4j_circuitbreaker_calls_seconds_count{name="paymentService", kind="successful"}
resilience4j_circuitbreaker_calls_seconds_count{name="paymentService", kind="failed"}
resilience4j_circuitbreaker_not_permitted_calls_total{name="paymentService"}
resilience4j_circuitbreaker_failure_rate{name="paymentService"}

A circuit breaker that keeps flipping between Open and Closed is a sign of an unstable downstream service. It might be partially recovering, then failing again. This pattern often indicates the recovery timeout is too short or the half-open test is not representative.
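Flapping can be detected with a small sliding window over state-change timestamps. The sketch below uses the "more than 3 transitions in 5 minutes" rule from the metrics table; the class name is invented, and in practice this logic would usually live in an alerting rule rather than in application code:

```python
from collections import deque


class FlapDetector:
    """Flags a breaker that changes state too often within a time window."""

    def __init__(self, max_transitions: int = 3, window_s: float = 300.0):
        self.max_transitions = max_transitions
        self.window_s = window_s
        self.transition_times = deque()

    def record_transition(self, now_s: float) -> bool:
        """Record a state change; return True if the breaker is flapping."""
        self.transition_times.append(now_s)
        # Drop transitions that have fallen out of the window.
        while self.transition_times and now_s - self.transition_times[0] > self.window_s:
            self.transition_times.popleft()
        return len(self.transition_times) > self.max_transitions


detector = FlapDetector()
# Closed -> Open -> Half-Open -> Open -> Half-Open within about a minute.
print([detector.record_transition(t) for t in (0, 30, 35, 65)])  # [False, False, False, True]
print(detector.record_transition(1000))  # False: the old transitions have expired
```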

If you use distributed tracing, make sure your circuit breaker events show up in your traces. When a request fails because a breaker was open, the trace should clearly show that the failure was a circuit break, not a downstream error. This saves hours of debugging.

Common Mistakes

1. Sharing a Circuit Breaker Across Multiple Endpoints

If your payment service has /charge, /refund, and /status endpoints, do not use one circuit breaker for all three. The /status endpoint might be working fine while /charge is failing. One breaker for all endpoints means a failure in charging blocks status checks too.

Use separate circuit breakers per endpoint, or at minimum, per distinct failure domain.
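A registry keyed by service and endpoint is one simple way to keep breakers separate. In this sketch SimpleBreaker is a deliberately tiny stand-in with no recovery logic, not a real library class; in practice the registry would hold whatever breaker type your library provides:

```python
from collections import defaultdict


class SimpleBreaker:
    """Stand-in for a real breaker: opens after `fail_max` consecutive failures."""

    def __init__(self, fail_max: int = 3):
        self.fail_max = fail_max
        self.consecutive_failures = 0

    @property
    def is_open(self) -> bool:
        return self.consecutive_failures >= self.fail_max

    def record(self, success: bool) -> None:
        self.consecutive_failures = 0 if success else self.consecutive_failures + 1


# One breaker per (service, endpoint) pair, created lazily on first use.
breakers = defaultdict(SimpleBreaker)

for _ in range(3):
    breakers[("payment", "/charge")].record(success=False)
breakers[("payment", "/status")].record(success=True)

print(breakers[("payment", "/charge")].is_open)  # True
print(breakers[("payment", "/status")].is_open)  # False
```

Charging is blocked while status checks continue to flow, which is exactly the isolation a shared breaker cannot give you.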

2. Not Implementing Fallbacks

A circuit breaker without a fallback just converts slow errors into fast errors. That is better than nothing, but you can do more. Think about what your service can return when the downstream is unavailable:

  • Cached data: Return the last known good response
  • Default values: Show a generic recommendation instead of a personalized one
  • Queued processing: Accept the request and process it later via a message queue
  • Degraded functionality: Show the page without the component that requires the failing service

3. Setting Thresholds Too Low

If you set the failure threshold to 2 out of 10 requests, your circuit breaker will trip on normal jitter. Network calls fail sometimes. Set your threshold high enough to ignore noise but low enough to catch real problems. 50% failure rate with a minimum of 5 calls is a reasonable starting point.
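A quick binomial calculation shows how much the threshold matters. The sketch assumes a healthy service with a 2% background failure rate and treats each 10-call window as independent, which is a simplification of how a sliding window behaves:

```python
from math import comb


def probability_of_tripping(window: int, failures_to_trip: int,
                            baseline_failure_rate: float) -> float:
    """P(at least `failures_to_trip` failures in `window` independent calls)."""
    p = baseline_failure_rate
    return sum(comb(window, k) * p**k * (1 - p) ** (window - k)
               for k in range(failures_to_trip, window + 1))


# Trip at 2 of 10: about 0.016, so roughly 1.6% of healthy windows trip the breaker.
print(f"{probability_of_tripping(10, 2, 0.02):.4f}")
# Trip at 5 of 10 (50%): below one in a million.
print(f"{probability_of_tripping(10, 5, 0.02):.8f}")
```

At a few requests per second, a 1.6% chance per window means a spurious trip every few minutes, on a service that is working normally.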

4. Not Counting Slow Calls as Failures

A service that responds in 15 seconds is worse than one that returns a 500 error in 100ms. The 500 error frees your thread immediately. The slow response holds it for 15 seconds. Configure your circuit breaker to treat slow calls as failures.

5. Ignoring the Half-Open State

Some teams configure the circuit breaker and forget about the half-open state. They set permittedCallsInHalfOpen to 1, so the breaker sends a single test request. If that one request happens to hit a still-recovering instance, the breaker reopens and the service stays degraded longer than necessary. Allow 3 to 5 probe requests for a more reliable recovery signal.

Lessons Learned

1. Slow Is Worse Than Down

A crashed service returns errors in milliseconds. A slow service holds threads for seconds. That thread-holding behavior is what causes cascading failures. Always set timeouts on your outgoing calls, and configure your circuit breaker to trip on slow responses, not just errors.

2. Defense in Depth Is Not Optional

Netflix does not rely on circuit breakers alone. They layer circuit breakers with thread pool isolation (bulkhead), request collapsing, rate limiting, timeouts, and fallbacks. No single pattern covers all failure modes. A circuit breaker will not help if a single slow service consumes your entire thread pool before the failure rate threshold is reached. You need a bulkhead for that.

3. Monitor the Breaker, Not Just the Service

Tracking the downstream service’s health is not enough. You need to know the state of every circuit breaker in your system. A breaker that is open means users are seeing degraded service. A breaker that keeps flipping means your recovery timeout needs tuning. Make circuit breaker state a first-class metric in your dashboards.

4. Tune Your Thresholds Per Service

A payment service and a recommendation service have very different failure tolerances. Payments need aggressive protection (low threshold, short timeout) because failures cost you money. Recommendations can tolerate higher error rates because a failed recommendation does not block the user’s workflow. Use different configurations for different downstream services.

5. Fallbacks Are a Product Decision

What to show when a service is down is not an engineering decision alone. It is a product decision. Should you show cached prices that might be stale? Should you hide the recommendation section entirely? Should you show a “try again later” message? Work with your product team to define fallback behavior before the outage happens.

6. Test Circuit Breaker Behavior Before Production

Do not wait for a real outage to discover how your circuit breakers behave. Test them:

  • Inject failures and verify the breaker trips
  • Verify fallback responses are correct and useful
  • Test recovery: does the breaker close properly when the downstream service comes back?
  • Load test with circuit breakers open to confirm your fallback path handles the traffic

If you use chaos engineering tools, circuit breakers are one of the first things to validate. Netflix built Chaos Monkey to randomly terminate instances in production, which forces services to prove they can tolerate exactly this kind of failure.


Conclusion

The circuit breaker pattern is one of the most important patterns for building reliable distributed systems. Without it, one slow service can take down your entire platform in minutes. With it, failures are contained, resources are protected, and your users get a degraded experience instead of a broken one.

Start simple. Pick a library for your language (Resilience4j for Java, gobreaker for Go, pybreaker for Python). Wrap your most critical downstream calls. Set reasonable defaults: 50% failure threshold, 10-call sliding window, 30-second recovery timeout, 3 probe calls in half-open state. Add a fallback that returns something useful.

Then layer your defenses. Add timeouts to every outgoing call. Add retries with exponential backoff and jitter. Add bulkheads to isolate your thread pools. Add monitoring so you know when breakers trip.

These are not patterns you should adopt because they sound impressive in a system design interview. These are patterns you adopt because at 2 AM, when a service goes down, you want your system to handle it gracefully without waking you up. Netflix, Amazon, and Google all learned this the hard way. You do not have to.