The circuit breaker pattern prevents cascading failures in distributed systems. When a service depends on another service that is failing, the circuit breaker detects the failures and stops sending requests to the failing service, allowing it time to recover. This article covers the circuit breaker state machine, implementation with Resilience4j, monitoring, and recovery strategies.


The Circuit Breaker State Machine


A circuit breaker has three states: Closed, Open, and Half-Open.



                     +-----------+
          +--------->|  CLOSED   |
          |          | (normal)  |
          |          +-----+-----+
          |                |
          |       [Failure threshold
          |            exceeded]
          |                |
          |          +-----v-----+
          |          |   OPEN    |<---------+
          |          | (tripped) |          |
          |          +-----+-----+          |
          |                |                |
    [Trial calls    [Wait duration     [Trial call
      succeed]         elapsed]          fails]
          |                |                |
          |          +-----v-----+          |
          +----------| HALF-OPEN |----------+
                     | (probing) |
                     +-----------+


Closed State


In the closed state, the circuit breaker allows requests to pass through to the remote service. It tracks the failure rate. When the failure rate exceeds a threshold, the circuit breaker trips to the open state.


  • All calls pass through.
  • Metrics are tracked (failure count, failure rate, slow call rate).
  • When the failure threshold is exceeded, the breaker opens.
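
The bookkeeping behind these three points can be sketched as a count-based sliding window. This is a minimal illustration rather than Resilience4j's actual implementation; the class name `SlidingWindow` and its parameter names are assumptions made for the example.

```python
from collections import deque


class SlidingWindow:
    """Count-based window over the most recent call outcomes (hypothetical helper)."""

    def __init__(self, size=20, minimum_calls=10, failure_rate_threshold=50.0):
        self.outcomes = deque(maxlen=size)  # True = failure, False = success
        self.minimum_calls = minimum_calls
        self.failure_rate_threshold = failure_rate_threshold

    def record(self, failed):
        self.outcomes.append(failed)

    def failure_rate(self):
        """Failure percentage over the window, or None if too few calls were seen."""
        if len(self.outcomes) < self.minimum_calls:
            return None
        return 100.0 * sum(self.outcomes) / len(self.outcomes)

    def should_open(self):
        rate = self.failure_rate()
        return rate is not None and rate >= self.failure_rate_threshold


window = SlidingWindow(size=20, minimum_calls=10, failure_rate_threshold=50.0)
for failed in [True] * 4 + [False] * 5:
    window.record(failed)
too_few = window.should_open()   # only 9 calls recorded, below the minimum of 10
window.record(True)
tripped = window.should_open()   # 5 failures out of 10 calls
```

The minimum-call guard keeps a single early failure from tripping the breaker before the window holds a meaningful sample.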

Open State


In the open state, the circuit breaker immediately rejects requests without calling the remote service. This prevents overwhelming a failing service and allows it time to recover.


  • All calls fail fast with an exception instead of reaching the remote service (in Resilience4j this is `CallNotPermittedException`).
  • A timer starts. When it expires, the breaker transitions to half-open.
  • The open duration should be long enough for the service to recover but short enough to minimize downtime.
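
The timer logic can be sketched as follows. The helper name `OpenStateTimer` and the injectable `clock` parameter are assumptions for illustration; injecting the clock simply makes the behaviour observable without sleeping.

```python
import time


class OpenStateTimer:
    """Tracks how long a breaker has been open (hypothetical helper)."""

    def __init__(self, wait_duration, clock=time.monotonic):
        self.wait_duration = wait_duration
        self.clock = clock  # monotonic by default, so wall-clock changes do not matter
        self.opened_at = None

    def trip(self):
        """Record the moment the breaker opened."""
        self.opened_at = self.clock()

    def may_probe(self):
        """True once the wait duration has elapsed since the breaker opened."""
        if self.opened_at is None:
            return False
        return self.clock() - self.opened_at >= self.wait_duration


# A fake clock stands in for real time in this example.
now = [0.0]
timer = OpenStateTimer(wait_duration=30, clock=lambda: now[0])
timer.trip()
now[0] = 29.9
before = timer.may_probe()
now[0] = 30.0
after = timer.may_probe()
```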

Half-Open State


In the half-open state, the circuit breaker allows a limited number of trial requests. If these requests succeed, the breaker closes. If they fail, the breaker reopens.


  • A configurable number of trial calls are allowed.
  • Success threshold reached: transition to closed.
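  • Failure: transition back to open. Simple breakers reopen on any failed trial call; Resilience4j reopens when the failure rate across the trial calls is at or above the threshold.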
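
As a rough sketch of the rate-based variant, the decision after the trial calls might look like this. The function name and its parameters are hypothetical, and the outcomes list uses `True` for a failed call.

```python
def evaluate_trials(outcomes, permitted_calls=3, failure_rate_threshold=50.0):
    """Decide the next state from half-open trial outcomes (True = failure).

    Returns "half_open" while trials are still outstanding, otherwise
    "open" or "closed" depending on the observed failure rate.
    """
    if len(outcomes) < permitted_calls:
        return "half_open"
    trials = outcomes[:permitted_calls]
    failure_rate = 100.0 * sum(trials) / permitted_calls
    return "open" if failure_rate >= failure_rate_threshold else "closed"


pending = evaluate_trials([False, False])
recovered = evaluate_trials([False, True, False])   # 1 of 3 failed
still_failing = evaluate_trials([True, True, False])  # 2 of 3 failed
```

Evaluating a rate over several trial calls tolerates one unlucky request during recovery, at the cost of sending a few more calls to a service that may still be unhealthy.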

Implementation with Resilience4j


Resilience4j is a lightweight, easy-to-use fault tolerance library for Java. It provides circuit breaker, rate limiter, retry, bulkhead, and time limiter modules.


Basic Circuit Breaker Configuration


    
    import io.github.resilience4j.circuitbreaker.CircuitBreaker;
    import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;
    import io.github.resilience4j.circuitbreaker.CircuitBreakerRegistry;
    import java.io.IOException;
    import java.time.Duration;
    import java.util.concurrent.TimeoutException;

    // Create a circuit breaker configuration
    CircuitBreakerConfig config = CircuitBreakerConfig.custom()
        .failureRateThreshold(50)                              // Open when >= 50% of calls fail
        .slowCallRateThreshold(50)                             // Open when >= 50% of calls are slow
        .slowCallDurationThreshold(Duration.ofSeconds(2))      // A call longer than 2s counts as slow
        .waitDurationInOpenState(Duration.ofSeconds(30))       // Time in open before half-open
        .permittedNumberOfCallsInHalfOpenState(3)              // Trial calls in half-open
        .minimumNumberOfCalls(10)                              // Min calls before rates are evaluated
        .slidingWindowSize(20)                                 // The last 20 calls form the window
        .slidingWindowType(CircuitBreakerConfig.SlidingWindowType.COUNT_BASED)
        .recordExceptions(IOException.class, TimeoutException.class)
        .ignoreExceptions(BusinessException.class)             // Application-defined type; not counted as failure
        .build();

    // Create the circuit breaker
    CircuitBreakerRegistry registry = CircuitBreakerRegistry.of(config);
    CircuitBreaker circuitBreaker = registry.circuitBreaker("paymentService");
    
    

Decorating Calls


    import io.github.resilience4j.circuitbreaker.CallNotPermittedException;
    import java.util.function.Supplier;

    // Decorate a function with circuit breaker
    Supplier<String> decoratedSupplier = CircuitBreaker
        .decorateSupplier(circuitBreaker, () -> paymentService.processPayment(request));

    // Call with circuit breaker protection
    String result;
    try {
        result = decoratedSupplier.get();
    } catch (CallNotPermittedException e) {
        // Circuit is open: the call was rejected without reaching the service
        result = fallbackResponse();
    }
    
    

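Spring Boot Integration


    import io.github.resilience4j.circuitbreaker.annotation.CircuitBreaker;
    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;
    import org.springframework.stereotype.Service;

    @Service
    public class PaymentService {

        private static final Logger log = LoggerFactory.getLogger(PaymentService.class);

        // PaymentGateway is assumed to be the application's own client for the remote payment API
        private final PaymentGateway paymentGateway;

        public PaymentService(PaymentGateway paymentGateway) {
            this.paymentGateway = paymentGateway;
        }

        @CircuitBreaker(name = "paymentService", fallbackMethod = "paymentFallback")
        public PaymentResponse processPayment(PaymentRequest request) {
            return paymentGateway.charge(request);
        }

        // Fallback: same parameters as the protected method plus a trailing Throwable
        public PaymentResponse paymentFallback(PaymentRequest request, Throwable t) {
            log.warn("Payment service unavailable, using fallback. Error: {}", t.getMessage());
            return PaymentResponse.failed("Payment temporarily unavailable, please retry");
        }
    }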
    
    

    
    # application.yml
    resilience4j.circuitbreaker:
      configs:
        default:
          failureRateThreshold: 50
          waitDurationInOpenState: 30s
          permittedNumberOfCallsInHalfOpenState: 3
          slidingWindowSize: 20
          minimumNumberOfCalls: 10
      instances:
        paymentService:
          baseConfig: default
        inventoryService:
          failureRateThreshold: 30
          waitDurationInOpenState: 60s
    
    

Python Implementation (Simple)


    import threading
    import time
    from enum import Enum


    class CircuitState(Enum):
        CLOSED = "closed"
        OPEN = "open"
        HALF_OPEN = "half_open"


    class CircuitBreakerOpenException(Exception):
        """Raised when a call is rejected because the circuit is open."""


    class CircuitBreaker:
        def __init__(self, failure_threshold=5, recovery_timeout=30, half_open_max_calls=3):
            self.failure_threshold = failure_threshold
            self.recovery_timeout = recovery_timeout
            self.half_open_max_calls = half_open_max_calls

            self.state = CircuitState.CLOSED
            self.failure_count = 0
            self.opened_at = None
            self.half_open_calls = 0
            self.lock = threading.Lock()

        def call(self, func, *args, fallback_func=None, **kwargs):
            # fallback_func is keyword-only so positional args always reach func
            with self.lock:
                permitted = self._permit_call()
            if not permitted:
                return self._handle_open(fallback_func)

            try:
                result = func(*args, **kwargs)
            except Exception:
                self._on_failure()
                if fallback_func:
                    return fallback_func()
                raise  # Propagate the real error instead of masking it
            self._on_success()
            return result

        def _permit_call(self):
            # Caller must hold self.lock
            if self.state == CircuitState.OPEN:
                # time.monotonic() is unaffected by wall-clock adjustments
                if time.monotonic() - self.opened_at < self.recovery_timeout:
                    return False
                self.state = CircuitState.HALF_OPEN
                self.half_open_calls = 0

            if self.state == CircuitState.HALF_OPEN:
                if self.half_open_calls >= self.half_open_max_calls:
                    return False
                self.half_open_calls += 1
            return True

        def _on_success(self):
            with self.lock:
                if self.state == CircuitState.HALF_OPEN:
                    # Simplification: a single successful trial call closes the circuit
                    self.state = CircuitState.CLOSED
                    self.failure_count = 0
                elif self.state == CircuitState.CLOSED:
                    self.failure_count = max(0, self.failure_count - 1)

        def _on_failure(self):
            with self.lock:
                self.failure_count += 1
                # Any failed trial call reopens the circuit immediately
                if (self.state == CircuitState.HALF_OPEN
                        or self.failure_count >= self.failure_threshold):
                    self.state = CircuitState.OPEN
                    self.opened_at = time.monotonic()

        def _handle_open(self, fallback_func):
            if fallback_func:
                return fallback_func()
            raise CircuitBreakerOpenException("Circuit breaker is open")
    
    

Monitoring Circuit Breakers


Exposing Metrics


Resilience4j integrates with Micrometer for metrics export:


    import io.github.resilience4j.micrometer.tagged.TaggedCircuitBreakerMetrics;
    import io.micrometer.core.instrument.MeterRegistry;
    import io.micrometer.core.instrument.simple.SimpleMeterRegistry;

    // Register metrics for every circuit breaker in the registry with Micrometer
    MeterRegistry meterRegistry = new SimpleMeterRegistry();
    TaggedCircuitBreakerMetrics
        .ofCircuitBreakerRegistry(registry)
        .bindTo(meterRegistry);

    // Record state changes
    circuitBreaker.getEventPublisher()
        .onStateTransition(event -> {
            log.info("Circuit breaker state changed: {} -> {}",
                event.getStateTransition().getFromState(),
                event.getStateTransition().getToState());
        })
        .onFailureRateExceeded(event -> {
            log.warn("Failure rate exceeded threshold: {}",
                event.getFailureRate());
        });
    
    

Key Metrics to Monitor


  • **State**: Is the circuit closed, open, or half-open?
  • **Failure rate**: Current failure rate within the sliding window.
  • **Call count**: Total calls, successful calls, failed calls, and ignored calls.
  • **Not permitted calls**: Number of calls rejected while the breaker is open.
  • **Buffered calls**: Number of buffered calls in the current sliding window.

    # Prometheus exposition format (illustrative; exact metric names and suffixes vary by exporter and version)
    resilience4j_circuitbreaker_state{name="paymentService",state="closed"} 1
    resilience4j_circuitbreaker_calls{name="paymentService",kind="successful"} 142
    resilience4j_circuitbreaker_calls{name="paymentService",kind="failed"} 8
    resilience4j_circuitbreaker_not_permitted_calls{name="paymentService"} 34
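
To show how these counters relate to the failure rate, the sketch below parses sample lines of this shape and derives the ratio of failed to completed calls. The parser is deliberately simplistic and handles only `name{labels} value` lines; the function names are hypothetical, and a real setup would query Prometheus rather than parse text.

```python
import re

SAMPLE = '''
resilience4j_circuitbreaker_calls{name="paymentService",kind="successful"} 142
resilience4j_circuitbreaker_calls{name="paymentService",kind="failed"} 8
resilience4j_circuitbreaker_not_permitted_calls{name="paymentService"} 34
'''

LINE = re.compile(r'^(\w+)\{(.*)\}\s+([0-9.]+)$')


def parse(text):
    """Parse simple 'name{labels} value' lines into (name, labels, value) tuples."""
    samples = []
    for line in text.strip().splitlines():
        match = LINE.match(line.strip())
        if not match:
            continue
        name, raw_labels, value = match.groups()
        labels = dict(re.findall(r'(\w+)="([^"]*)"', raw_labels))
        samples.append((name, labels, float(value)))
    return samples


def failure_ratio(samples, breaker):
    """Failed calls divided by completed calls for one breaker."""
    calls = {
        labels["kind"]: value
        for name, labels, value in samples
        if name == "resilience4j_circuitbreaker_calls" and labels.get("name") == breaker
    }
    total = calls.get("successful", 0) + calls.get("failed", 0)
    return calls.get("failed", 0) / total if total else 0.0


ratio = failure_ratio(parse(SAMPLE), "paymentService")  # 8 failed out of 150 completed
```

Rejected calls are excluded from the denominator here because they never reached the service; they are tracked separately as not-permitted calls.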
    
    

Grafana Alert Rules


    - alert: CircuitBreakerOpen
      expr: resilience4j_circuitbreaker_state{state="open"} == 1
      for: 5m
      annotations:
        summary: "Circuit breaker {{ $labels.name }} is open"

    - alert: HighFailureRate
      expr: resilience4j_circuitbreaker_failure_rate > 40
      for: 2m
      annotations:
        summary: "Circuit breaker {{ $labels.name }} failure rate is {{ $value }}%"
    
    

Recovery Strategies


Gradual Recovery


When a circuit breaker closes after recovery, the previously failing service may be overwhelmed by the sudden return of all traffic. One way to soften this with Resilience4j is to admit more trial calls in half-open, so the breaker closes only after a larger sample has succeeded:


    // Probe with 10 trial calls in half-open before closing.
    // This limits traffic during recovery; it is not a percentage-based ramp.
    CircuitBreakerConfig config = CircuitBreakerConfig.custom()
        .waitDurationInOpenState(Duration.ofSeconds(30))
        .permittedNumberOfCallsInHalfOpenState(10)  // Allow 10 trial calls
        .slidingWindowSize(20)
        .build();
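
The configuration above only bounds the number of trial calls. A percentage-based ramp after the breaker closes, if wanted, would sit in application code. The sketch below assumes a linear ramp from 50% to 100% of traffic over 60 seconds; the function names and defaults are hypothetical.

```python
import random


def admitted_fraction(seconds_since_close, ramp_seconds=60.0, initial_fraction=0.5):
    """Fraction of traffic to admit, rising linearly from initial_fraction to 1.0."""
    if seconds_since_close >= ramp_seconds:
        return 1.0
    progress = max(0.0, seconds_since_close) / ramp_seconds
    return initial_fraction + (1.0 - initial_fraction) * progress


def admit(seconds_since_close, rng=random.random):
    """Randomly admit a request according to the current ramp fraction."""
    return rng() < admitted_fraction(seconds_since_close)


at_start = admitted_fraction(0)
halfway = admitted_fraction(30)
done = admitted_fraction(120)
```

Requests that are not admitted during the ramp would take the same fallback path as calls rejected by an open breaker.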
    
    

Fallback Strategies


  • **Cache**: Return cached response from the last successful call.
  • **Default value**: Return a safe default (empty list, zero).
  • **Degraded functionality**: Return partial results (e.g., show product page without recommendations).
  • **Queue for retry**: Store the request and retry later.

    // Requires: import io.github.resilience4j.circuitbreaker.CallNotPermittedException;
    @CircuitBreaker(name = "recommendationService", fallbackMethod = "recommendationFallback")
    public List<Product> getRecommendations(String userId) {
        return recommendationClient.fetch(userId);
    }

    public List<Product> recommendationFallback(String userId, Throwable t) {
        if (t instanceof CallNotPermittedException) {
            // Circuit is open: serve popular products instead
            log.info("Recommendations unavailable, returning popular products instead");
            return popularProductsCache.get("global");
        }
        // The call itself failed: degrade to an empty list
        return Collections.emptyList();
    }
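
The cache strategy from the list above can be sketched independently of any library. The wrapper below remembers the last successful result per key and serves it when the wrapped call raises; the class name `LastKnownGood` and the stand-in `fetch` function are assumptions for illustration.

```python
class LastKnownGood:
    """Wraps a function and serves its last successful result when it fails."""

    _MISSING = object()

    def __init__(self, func, default=None):
        self.func = func
        self.default = default
        self.cache = {}

    def __call__(self, key):
        try:
            value = self.func(key)
        except Exception:
            # Fall back to the cached value, or to the default if nothing is cached
            cached = self.cache.get(key, self._MISSING)
            return self.default if cached is self._MISSING else cached
        self.cache[key] = value
        return value


service = {"down": False}


def fetch(user_id):
    """Stand-in for a remote recommendation call."""
    if service["down"]:
        raise ConnectionError("recommendation service down")
    return [f"item-for-{user_id}"]


recommendations = LastKnownGood(fetch, default=[])
first = recommendations("u1")
service["down"] = True
cached = recommendations("u1")    # served from the cache
unknown = recommendations("u2")   # nothing cached, so the default is returned
```

A production cache would also need an expiry policy so that stale results are not served indefinitely.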
    
    

Bulkhead Pattern


Combine circuit breakers with bulkheads to isolate failure domains:


    import io.github.resilience4j.bulkhead.Bulkhead;
    import io.github.resilience4j.bulkhead.BulkheadConfig;
    import io.github.resilience4j.decorators.Decorators;
    import java.time.Duration;
    import java.util.function.Supplier;

    BulkheadConfig bulkheadConfig = BulkheadConfig.custom()
        .maxConcurrentCalls(10)
        .maxWaitDuration(Duration.ofMillis(500))
        .build();

    Bulkhead bulkhead = Bulkhead.of("paymentService", bulkheadConfig);

    // Combined: circuit breaker wraps bulkhead wraps function.
    // Decorators are applied in call order, so the last one added is outermost.
    Supplier<String> decorated = Decorators
        .ofSupplier(() -> paymentService.processPayment(request))
        .withBulkhead(bulkhead)
        .withCircuitBreaker(circuitBreaker)
        .decorate();
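
To make the mechanism concrete, a semaphore-based bulkhead can be sketched in a few lines. This is an illustration of the idea, not Resilience4j's implementation; `Bulkhead` and `BulkheadFullError` here are hypothetical names for the example.

```python
import threading


class BulkheadFullError(Exception):
    """Raised when no slot frees up within the wait time (hypothetical name)."""


class Bulkhead:
    """Caps the number of calls that may run concurrently."""

    def __init__(self, max_concurrent_calls=10, max_wait_seconds=0.5):
        self._slots = threading.BoundedSemaphore(max_concurrent_calls)
        self._max_wait = max_wait_seconds

    def call(self, func, *args, **kwargs):
        # Wait up to max_wait_seconds for a free slot, then give up
        if not self._slots.acquire(timeout=self._max_wait):
            raise BulkheadFullError("bulkhead is full")
        try:
            return func(*args, **kwargs)
        finally:
            self._slots.release()


bulkhead = Bulkhead(max_concurrent_calls=1, max_wait_seconds=0.01)
ok = bulkhead.call(lambda: "done")


def nested():
    # Runs while the outer call still holds the only slot
    return bulkhead.call(lambda: "inner")


try:
    bulkhead.call(nested)
    rejected = False
except BulkheadFullError:
    rejected = True
```

Because each dependency gets its own pool of slots, a slow dependency can exhaust only its own pool rather than every thread in the caller.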
    
    

Common Pitfalls


  • **Slow-call threshold too low**: If the slow-call duration threshold sits below normal latency spikes, the circuit opens too frequently. Set it based on observed p99 latency.
  • **Too long open duration**: Service recovers quickly but the circuit stays open. Monitor recovery patterns and tune accordingly.
  • **No fallback**: An open circuit breaker still needs to return something useful to the caller.
  • **All-or-nothing thinking**: Not all dependencies need circuit breakers. Use them for external APIs, databases, and critical services.
  • **Ignoring half-open**: The half-open state is critical for recovery. Do not skip it.
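
For the first pitfall, a starting value for the slow-call threshold can be derived from latency samples. The sketch below uses a nearest-rank p99 multiplied by a headroom factor; the 1.5 factor and the sample data are assumptions, not recommendations.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def suggest_slow_call_threshold(latencies_ms, headroom=1.5):
    """Suggest a slow-call threshold as p99 latency times a headroom factor."""
    return percentile(latencies_ms, 99) * headroom


latencies_ms = [100] * 98 + [400, 2000]  # hypothetical samples: mostly fast, two outliers
p99 = percentile(latencies_ms, 99)
threshold_ms = suggest_slow_call_threshold(latencies_ms)
```

The headroom keeps ordinary tail latency from being counted as slow, while a genuine slowdown still pushes calls over the threshold.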

Conclusion


The circuit breaker pattern prevents cascading failures by detecting when a remote service is failing and stopping calls to it. Implement the three-state state machine (closed, open, half-open) with libraries like Resilience4j. Monitor circuit breaker state in production dashboards. Combine with fallbacks, bulkheads, and gradual recovery for a comprehensive resilience strategy. Circuit breakers are not a silver bullet, but they are an essential tool in building systems that degrade gracefully instead of failing catastrophically.