The circuit breaker pattern prevents cascading failures in distributed systems. When a service depends on another service that is failing, the circuit breaker detects the failures and stops sending requests to the failing service, allowing it time to recover. This article covers the circuit breaker state machine, implementation with Resilience4j, monitoring, and recovery strategies.
The Circuit Breaker State Machine
A circuit breaker has three states: Closed, Open, and Half-Open.
          +-----------+
          |  CLOSED   |<----------------------------+
          | (normal)  |                             |
          +-----+-----+                             |
                |                                   |
    [Failure threshold exceeded]          [Trial calls succeed]
                |                                   |
          +-----v-----+       [Timeout]       +-----+-----+
          |   OPEN    |---------------------->| HALF-OPEN |
          | (tripped) |<----------------------|  (trial)  |
          +-----------+   [Trial call fails]  +-----------+
Closed State
In the closed state, the circuit breaker allows requests to pass through to the remote service. It tracks the failure rate. When the failure rate exceeds a threshold, the circuit breaker trips to the open state.
Open State
In the open state, the circuit breaker immediately rejects requests without calling the remote service. This prevents overwhelming a failing service and allows it time to recover.
Half-Open State
In the half-open state, the circuit breaker allows a limited number of trial requests. If these requests succeed, the breaker closes. If they fail, the breaker reopens.
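The transition rules above can be captured as a small lookup table. A minimal sketch in Python (the event names here are illustrative labels, not from any particular library):

```python
from enum import Enum

class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"

# (current state, event) -> next state; any event not listed leaves the state unchanged
TRANSITIONS = {
    (State.CLOSED, "failure_threshold_exceeded"): State.OPEN,
    (State.OPEN, "recovery_timeout_elapsed"): State.HALF_OPEN,
    (State.HALF_OPEN, "trial_calls_succeeded"): State.CLOSED,
    (State.HALF_OPEN, "trial_call_failed"): State.OPEN,
}

def next_state(state, event):
    return TRANSITIONS.get((state, event), state)
```

Encoding the state machine as data like this makes the legal transitions easy to audit; the imperative implementations later in the article enforce the same four transitions.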
Implementation with Resilience4j
Resilience4j is a lightweight, easy-to-use fault tolerance library for Java. It provides circuit breaker, rate limiter, retry, bulkhead, and time limiter modules.
Basic Circuit Breaker Configuration
import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;
import io.github.resilience4j.circuitbreaker.CircuitBreakerRegistry;
import java.io.IOException;
import java.time.Duration;
import java.util.concurrent.TimeoutException;
// Create a circuit breaker configuration
CircuitBreakerConfig config = CircuitBreakerConfig.custom()
.failureRateThreshold(50) // Open when 50% of calls fail
.slowCallRateThreshold(50) // Open when 50% are slow
.slowCallDurationThreshold(Duration.ofSeconds(2)) // Call > 2s is slow
.waitDurationInOpenState(Duration.ofSeconds(30)) // Time in open before half-open
.permittedNumberOfCallsInHalfOpenState(3) // Trial calls in half-open
.minimumNumberOfCalls(10) // Min calls before evaluating
.slidingWindowSize(20) // Window for metrics
.slidingWindowType(CircuitBreakerConfig.SlidingWindowType.COUNT_BASED)
.recordExceptions(IOException.class, TimeoutException.class)
.ignoreExceptions(BusinessException.class) // Don't count business errors
.build();
// Create the circuit breaker
CircuitBreakerRegistry registry = CircuitBreakerRegistry.of(config);
CircuitBreaker circuitBreaker = registry.circuitBreaker("paymentService");
Decorating Calls
// Decorate a function with circuit breaker
Supplier<String> decoratedSupplier = CircuitBreaker
    .decorateSupplier(circuitBreaker, () -> paymentService.processPayment(request));
// Call with circuit breaker protection
try {
    String result = decoratedSupplier.get();
} catch (CallNotPermittedException e) {
    // Circuit is open, handle gracefully
    return fallbackResponse();
}
Spring Boot Integration
import io.github.resilience4j.circuitbreaker.annotation.CircuitBreaker;
import org.springframework.stereotype.Service;
@Service
public class PaymentService {
@CircuitBreaker(name = "paymentService", fallbackMethod = "paymentFallback")
public PaymentResponse processPayment(PaymentRequest request) {
return paymentGateway.charge(request);
}
public PaymentResponse paymentFallback(PaymentRequest request, Throwable t) {
log.warn("Payment service unavailable, using fallback. Error: {}", t.getMessage());
return PaymentResponse.failed("Payment temporarily unavailable, please retry");
}
}
# application.yml
resilience4j.circuitbreaker:
configs:
default:
failureRateThreshold: 50
waitDurationInOpenState: 30s
permittedNumberOfCallsInHalfOpenState: 3
slidingWindowSize: 20
minimumNumberOfCalls: 10
instances:
paymentService:
baseConfig: default
inventoryService:
failureRateThreshold: 30
waitDurationInOpenState: 60s
Python Implementation (Simple)
import time
import threading
from enum import Enum

class CircuitBreakerOpenException(Exception):
    """Raised when a call is rejected because the circuit is open."""

class CircuitState(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"

class CircuitBreaker:
    def __init__(self, failure_threshold=5, recovery_timeout=30, half_open_max_calls=3):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.half_open_max_calls = half_open_max_calls
        self.state = CircuitState.CLOSED
        self.failure_count = 0
        self.last_failure_time = None
        self.half_open_calls = 0
        self.lock = threading.Lock()

    def call(self, func, fallback_func=None, *args, **kwargs):
        with self.lock:
            if self.state == CircuitState.OPEN:
                if time.time() - self.last_failure_time >= self.recovery_timeout:
                    self.state = CircuitState.HALF_OPEN
                    self.half_open_calls = 0
                else:
                    return self._reject(fallback_func)
            if self.state == CircuitState.HALF_OPEN:
                if self.half_open_calls >= self.half_open_max_calls:
                    return self._reject(fallback_func)
                self.half_open_calls += 1
        # Invoke the protected function outside the lock: the lock is not
        # reentrant, and a slow remote call must not block other threads
        # that only need to check the breaker state.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._on_failure()
            if fallback_func:
                return fallback_func()
            raise
        self._on_success()
        return result

    def _on_success(self):
        with self.lock:
            if self.state == CircuitState.HALF_OPEN:
                self.state = CircuitState.CLOSED
                self.failure_count = 0
            elif self.state == CircuitState.CLOSED:
                self.failure_count = max(0, self.failure_count - 1)

    def _on_failure(self):
        with self.lock:
            self.failure_count += 1
            self.last_failure_time = time.time()
            # A failed trial call in half-open reopens the circuit immediately
            if (self.state == CircuitState.HALF_OPEN
                    or self.failure_count >= self.failure_threshold):
                self.state = CircuitState.OPEN

    def _reject(self, fallback_func):
        if fallback_func:
            return fallback_func()
        raise CircuitBreakerOpenException("Circuit breaker is open")
Monitoring Circuit Breakers
Exposing Metrics
Resilience4j integrates with Micrometer for metrics export:
// Bind circuit breaker metrics to a Micrometer registry
MeterRegistry meterRegistry = new SimpleMeterRegistry();
TaggedCircuitBreakerMetrics
    .ofCircuitBreakerRegistry(registry)
    .bindTo(meterRegistry);
// Record state changes
circuitBreaker.getEventPublisher()
.onStateTransition(event -> {
log.info("Circuit breaker state changed: {} -> {}",
event.getStateTransition().getFromState(),
event.getStateTransition().getToState());
})
.onFailureRateExceeded(event -> {
log.warn("Failure rate exceeded threshold: {}",
event.getFailureRate());
});
Key Metrics to Monitor
# Prometheus metric format (from Resilience4j exporter)
resilience4j_circuitbreaker_state{name="paymentService",state="closed"} 1
resilience4j_circuitbreaker_calls{name="paymentService",kind="successful"} 142
resilience4j_circuitbreaker_calls{name="paymentService",kind="failed"} 8
resilience4j_circuitbreaker_not_permitted_calls{name="paymentService"} 34
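The failure rate that drives the breaker's open decision is derived from call counters like the ones above. A quick sketch of the computation, using the sample counter values:

```python
def failure_rate_percent(successful, failed):
    """Failure rate as a percentage of all completed calls."""
    total = successful + failed
    if total == 0:
        return 0.0
    return 100.0 * failed / total

# With the sample counters (142 successful, 8 failed) the rate is
# about 5.33%, well below the 50% threshold configured earlier.
rate = failure_rate_percent(142, 8)
```

Note that rejected (not-permitted) calls are excluded from the denominator: only calls that actually reached the remote service count toward the failure rate.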
Prometheus Alert Rules
- alert: CircuitBreakerOpen
expr: resilience4j_circuitbreaker_state{state="open"} == 1
for: 5m
annotations:
summary: "Circuit breaker {{ $labels.name }} is open"
- alert: HighFailureRate
expr: resilience4j_circuitbreaker_failure_rate > 40
for: 2m
annotations:
summary: "Circuit breaker {{ $labels.name }} failure rate is {{ $value }}%"
Recovery Strategies
Gradual Recovery
When a circuit breaker closes after recovery, the previously failing service may be overwhelmed by the sudden return of all traffic. Use gradual recovery:
// Use a larger half-open trial window so recovery is verified against more calls
CircuitBreakerConfig config = CircuitBreakerConfig.custom()
.waitDurationInOpenState(Duration.ofSeconds(30))
.permittedNumberOfCallsInHalfOpenState(10) // Allow 10 trial calls
.slidingWindowSize(20)
.build();
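Resilience4j's half-open window verifies recovery, but once the breaker closes it restores full traffic at once. A true percentage ramp has to be built on top. A hypothetical sketch of such an admitter (the class name and ramp schedule are assumptions for illustration, not library features):

```python
class RampUpAdmitter:
    """After a breaker closes, admit a growing fraction of requests per step."""

    def __init__(self, steps=(0.25, 0.5, 0.75, 1.0), requests_per_step=100):
        self.steps = steps
        self.requests_per_step = requests_per_step
        self.seen = 0

    def admit(self):
        # Pick the admission fraction for the current ramp step
        step = min(self.seen // self.requests_per_step, len(self.steps) - 1)
        fraction = self.steps[step]
        self.seen += 1
        # Deterministic admission, evenly spaced within the step:
        # e.g. fraction 0.25 admits every 4th request
        return (self.seen * fraction) % 1.0 < fraction
```

Requests that are not admitted would go straight to the fallback path, shielding the recovering service from the full load until the ramp completes.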
Fallback Strategies
@CircuitBreaker(name = "recommendationService", fallbackMethod = "recommendationFallback")
public List<Product> getRecommendations(String userId) {
return recommendationClient.fetch(userId);
}
public List<Product> recommendationFallback(String userId, Throwable t) {
if (t instanceof CallNotPermittedException) {
log.info("Recommendations unavailable, returning popular products instead");
return popularProductsCache.get("global");
}
return Collections.emptyList();
}
Bulkhead Pattern
Combine circuit breakers with bulkheads to isolate failure domains:
import io.github.resilience4j.bulkhead.Bulkhead;
import io.github.resilience4j.bulkhead.BulkheadConfig;
import io.github.resilience4j.decorators.Decorators;
BulkheadConfig bulkheadConfig = BulkheadConfig.custom()
.maxConcurrentCalls(10)
.maxWaitDuration(Duration.ofMillis(500))
.build();
Bulkhead bulkhead = Bulkhead.of("paymentService", bulkheadConfig);
// Combined: the bulkhead limits concurrency around the circuit-breaker-protected call
Supplier<String> decorated = Decorators
    .ofSupplier(() -> paymentService.processPayment(request))
    .withCircuitBreaker(circuitBreaker)
    .withBulkhead(bulkhead)
    .decorate();
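Conceptually, a bulkhead is just a bounded concurrency limit with fail-fast rejection. A minimal Python sketch using a semaphore (the class name and API are illustrative, not Resilience4j's):

```python
import threading

class BulkheadFullException(Exception):
    """Raised when no concurrency permit is available."""

class SimpleBulkhead:
    """Limit concurrent calls; reject immediately when the limit is reached."""

    def __init__(self, max_concurrent_calls=10):
        self._sem = threading.Semaphore(max_concurrent_calls)

    def call(self, func, *args, **kwargs):
        # Non-blocking acquire: fail fast instead of queueing behind slow calls
        if not self._sem.acquire(blocking=False):
            raise BulkheadFullException("Bulkhead has no free permits")
        try:
            return func(*args, **kwargs)
        finally:
            self._sem.release()
```

Giving each downstream dependency its own bulkhead ensures that one slow service can only exhaust its own permit pool, never the threads serving other dependencies.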
Conclusion
The circuit breaker pattern prevents cascading failures by detecting when a remote service is failing and stopping calls to it. Implement the three-state state machine (closed, open, half-open) with libraries like Resilience4j. Monitor circuit breaker state in production dashboards. Combine with fallbacks, bulkheads, and gradual recovery for a comprehensive resilience strategy. Circuit breakers are not a silver bullet, but they are an essential tool in building systems that degrade gracefully instead of failing catastrophically.