Circuit breakers, bulkheads, and retries in Spring Boot. What they do, how to wire them, and why retry without backoff is a DDoS against yourself.
The Night Everything Caught Fire
It's 2:47 AM. Your phone is vibrating itself off the nightstand. Slack is a wall of red. The on-call engineer's message reads: "Payment service is down. Orders are backing up. Everything is slow. I think the recommendation service is also dead? Maybe the whole thing?"
Here's what happened: the recommendation service (the one that suggests "customers also bought" products that nobody asked for) started responding slowly. Not down. Just slow. A 200ms response became 5 seconds, then 10, then 30.
Your order service calls the recommendation service on every checkout. It waits. And waits. Every request holds a thread open. Your Tomcat thread pool has 200 threads. Within three minutes, 200 threads are hanging, waiting for recommendations. The order service can no longer accept any requests, including the ones that don't need recommendations at all. A customer trying to buy a $4,000 laptop is being told "service unavailable" because a service that was going to suggest they also buy a laptop sleeve is taking too long to suggest the laptop sleeve.
The payment service, which calls the order service, starts timing out. Its threads fill up. The notification service, which calls the payment service to check transaction status, starts timing out. Its threads fill up. In under five minutes, your entire platform is down because a non-critical service got a little bit slow.
This is a cascading failure. And if you don't have resilience patterns in place, this will happen to you. Probably on a Friday.
Anatomy of a Cascade
Cascading failures are the most dangerous failure mode in distributed systems because they're counterintuitive. You'd expect a failure to stay proportional to its cause: a small service gets slow, so a small part of the system is affected. Right?
Wrong.
In a microservice architecture, a single slow service can take down your entire platform faster than a complete outage would. Here's the paradox: a dead service is less dangerous than a slow one. If a service is completely dead, the connection fails immediately. Your thread is freed in milliseconds. Life goes on.
But a slow service is a thread vampire. It holds connections open, drains your thread pool one hanging request at a time, and by the time you notice, your capacity is gone. Your healthy services are collateral damage: perfectly functional code that can't serve anyone because some other service is hogging all the resources.
This isn't theoretical. On October 20, 2025, AWS had a cascading failure in us-east-1. A DNS resolution bug made DynamoDB endpoints unreachable. That triggered failures in EC2, Lambda, IAM, and CloudWatch, all of which depended on DynamoDB internally. When DNS was eventually fixed, millions of clients simultaneously retried their connections, creating what the post-mortems called a "retry storm" that overwhelmed the recovering systems and extended a potential 30-minute outage to over 15 hours. Fortnite, Snapchat, Robinhood, Ring, Alexa: all went dark. Their code was fine. The infrastructure underneath just couldn't handle the thundering herd of retries.
The pattern is always the same: one failure → resource exhaustion → cascade → total outage. Circuit breakers, bulkheads, and retries exist specifically to break this chain. Watertight compartments for your services. Fuses for your call chains. Controlled retreat instead of uncontrolled collapse.
Circuit Breakers: The Fuse Box for Your Microservices
The Concept
Think of an electrical fuse. Too much current flows through, the fuse blows, the circuit opens, your house doesn't burn down. You lose the toaster, not the entire kitchen.
A software circuit breaker does the same thing. It wraps calls to an external service and monitors for failures. When failures exceed a threshold, the circuit opens: all subsequent calls fail immediately without even attempting the operation. No thread is wasted. No connection is held. The caller gets an instant failure and can execute a fallback.
After a cooldown period, the circuit enters a half-open state: a small number of test requests are allowed through. If they succeed, the circuit closes and normal traffic resumes. If they fail, the circuit opens again.
Three states, dead simple, and it prevents your entire platform from going dark because one downstream service is having a bad day.
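The three-state lifecycle is small enough to sketch by hand. Here's a deliberately minimal, single-threaded illustration (class and method names are mine; use Resilience4j in real code):

```java
import java.time.Duration;
import java.time.Instant;

// Minimal three-state breaker: count-based, not thread-safe, illustration only.
public class MiniCircuitBreaker {
    public enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold;   // consecutive failures before tripping
    private final Duration openDuration;  // cooldown before probing again
    private State state = State.CLOSED;
    private int consecutiveFailures = 0;
    private Instant openedAt;

    public MiniCircuitBreaker(int failureThreshold, Duration openDuration) {
        this.failureThreshold = failureThreshold;
        this.openDuration = openDuration;
    }

    /** May a call proceed right now? */
    public boolean allowRequest(Instant now) {
        if (state == State.OPEN
                && Duration.between(openedAt, now).compareTo(openDuration) >= 0) {
            state = State.HALF_OPEN;      // cooldown elapsed: let a probe through
        }
        return state != State.OPEN;
    }

    public void onSuccess() {
        consecutiveFailures = 0;
        state = State.CLOSED;             // probe succeeded, resume normal traffic
    }

    public void onFailure(Instant now) {
        consecutiveFailures++;
        if (state == State.HALF_OPEN || consecutiveFailures >= failureThreshold) {
            state = State.OPEN;           // trip, or re-trip after a failed probe
            openedAt = now;
        }
    }

    public State state() { return state; }
}
```

A real breaker tracks sliding windows, slow-call rates, and concurrency; this only shows the state transitions.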
CLOSED ──[failures exceed threshold]──→ OPEN
   ↑                                      │
   │                               [timeout expires]
   │                                      ↓
   └──────[test calls succeed]────── HALF-OPEN
                                          │
                               [test calls fail] ──→ OPEN

Implementation in Spring Boot
Resilience4j is the standard library here. Netflix's Hystrix is in maintenance mode; it served its purpose, but Resilience4j is lighter, more modular, and designed for modern Spring Boot.
First, the dependencies:
<dependency>
<groupId>io.github.resilience4j</groupId>
<artifactId>resilience4j-spring-boot3</artifactId>
</dependency>
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-actuator</artifactId>
</dependency>
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-aop</artifactId>
</dependency>

Configuration in application.yml:
resilience4j:
circuitbreaker:
instances:
paymentService:
registerHealthIndicator: true
slidingWindowType: COUNT_BASED
slidingWindowSize: 10
minimumNumberOfCalls: 5
failureRateThreshold: 50
slowCallRateThreshold: 80
slowCallDurationThreshold: 3s
waitDurationInOpenState: 30s
permittedNumberOfCallsInHalfOpenState: 3
automaticTransitionFromOpenToHalfOpenEnabled: true

These numbers matter. Get them wrong and the breaker either trips constantly during normal traffic or never trips when it should:
- slidingWindowSize: 10. The breaker evaluates the last 10 calls. Too small and you'll trip on normal variance. Too large and you'll react too slowly to actual failures.
- failureRateThreshold: 50. If 5 out of 10 calls fail, the circuit opens. This is the "how bad is bad enough" knob.
- slowCallDurationThreshold: 3s. Any call taking longer than 3 seconds counts as slow. Because a call that takes 30 seconds to fail is worse than one that fails in 30 milliseconds.
- slowCallRateThreshold: 80. If 80% of calls are slow, that's effectively a failure even if they technically "succeed."
- waitDurationInOpenState: 30s. How long to wait before letting test requests through again. This is the cooldown before the breaker starts probing again.
- permittedNumberOfCallsInHalfOpenState: 3. Only 3 test requests during half-open. You're probing, not flooding.
Now the service:
@Service
@Slf4j
@RequiredArgsConstructor
public class OrderService {
private final PaymentClient paymentClient;
@CircuitBreaker(name = "paymentService", fallbackMethod = "paymentFallback")
public PaymentResponse processPayment(PaymentRequest request) {
log.info("Calling payment service for order {}", request.getOrderId());
return paymentClient.charge(request);
}
private PaymentResponse paymentFallback(PaymentRequest request, Throwable ex) {
log.warn("Payment circuit open for order {}. Reason: {}",
request.getOrderId(), ex.getMessage());
// Don't just return an error. Give the user something useful.
return PaymentResponse.builder()
.orderId(request.getOrderId())
.status(PaymentStatus.PENDING)
.message("Payment queued for processing. You will be charged shortly.")
.build();
}
}

The fallback method is the part most teams phone in, and it's the part that actually matters. A good fallback isn't just a fancy error message; it's a working alternative. Queue the payment for async processing. Return cached data. Disable non-critical features. The user shouldn't know the backend is melting.
What Your Circuit Breaker Should NOT Do
Don't wrap everything. Circuit breakers add overhead. Use them on calls to external services, third-party APIs, and other microservices. Don't use them on local method calls or database queries (those need timeouts and connection pool limits, not breakers).
Don't set thresholds too low. A circuit breaker that trips on 2 out of 5 failures will spend half its life open during normal traffic. Network blips happen. 404s happen. Your circuit breaker should react to patterns, not noise.
Don't forget to monitor. A circuit breaker silently eating errors in the open state is worse than no circuit breaker at all, because now you don't even know there's a problem. Expose breaker state through Actuator and alert on state transitions.
Bulkheads: Watertight Compartments for Your Services
The Concept
The Titanic had bulkheads. They were supposed to contain flooding to individual compartments. The problem was the bulkheads didn't extend high enough β water poured over the top of one compartment into the next, and the next, and the next. The ship sank because the isolation wasn't complete.
In software, a bulkhead isolates resources so that one failing dependency can't consume everything and drag the whole system down. The most common implementation is thread pool isolation: instead of sharing a single thread pool across all outbound calls, you assign separate thread pools (or concurrency limits) to each dependency.
Without bulkheads:
Order Service (200 threads shared)
├── → Payment Service (slow) ... 150 threads stuck waiting
├── → Inventory Service ... 40 threads stuck waiting for their turn
├── → Recommendation Service ... 10 threads stuck
└── ✗ No threads left for ANY requests

With bulkheads:
Order Service (200 threads total)
├── [Bulkhead: 50 threads] → Payment Service (slow) ... 50 stuck, that's the max
├── [Bulkhead: 50 threads] → Inventory Service ... running fine
├── [Bulkhead: 20 threads] → Recommendation Service ... running fine
└── → 80 threads still free for other work ✓
The payment service is slow? Fine. It gets its 50 threads and not one more. The rest of the system keeps serving customers.
Implementation in Spring Boot
Resilience4j offers two types of bulkheads: semaphore (limits concurrent calls) and thread pool (isolates into a separate execution context). Semaphore is simpler and lower overhead. Thread pool gives true isolation but adds complexity.
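Conceptually, a semaphore bulkhead is just a counted permit wrapped around the call. This plain-Java sketch (my own names, not Resilience4j's internals) shows the idea:

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Conceptual semaphore bulkhead: limit concurrent calls, reject the overflow.
public class MiniBulkhead {
    private final Semaphore permits;
    private final long maxWaitMillis;

    public MiniBulkhead(int maxConcurrentCalls, long maxWaitMillis) {
        this.permits = new Semaphore(maxConcurrentCalls);
        this.maxWaitMillis = maxWaitMillis;
    }

    /** Runs the call if a slot frees up within maxWaitMillis, else rejects. */
    public <T> T execute(Supplier<T> call, Supplier<T> rejectedFallback) {
        boolean acquired = false;
        try {
            acquired = permits.tryAcquire(maxWaitMillis, TimeUnit.MILLISECONDS);
            if (!acquired) {
                return rejectedFallback.get();   // bulkhead full: fail fast
            }
            return call.get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return rejectedFallback.get();
        } finally {
            if (acquired) {
                permits.release();               // always free the slot
            }
        }
    }

    public int availableSlots() { return permits.availablePermits(); }
}
```

Resilience4j layers metrics, events, and configuration on top of this same primitive.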
Configuration:
resilience4j:
bulkhead:
instances:
paymentService:
maxConcurrentCalls: 50
maxWaitDuration: 500ms
recommendationService:
maxConcurrentCalls: 20
maxWaitDuration: 100ms
inventoryService:
maxConcurrentCalls: 50
maxWaitDuration: 200ms
thread-pool-bulkhead:
instances:
paymentServiceAsync:
maxThreadPoolSize: 25
coreThreadPoolSize: 15
queueCapacity: 50
keepAliveDuration: 60s

The service:
@Service
@Slf4j
@RequiredArgsConstructor
public class CheckoutService {
private final PaymentClient paymentClient;
private final RecommendationClient recoClient;
private final PopularProductsCache popularProductsCache; // cache type name assumed
@Bulkhead(name = "paymentService", fallbackMethod = "paymentBulkheadFallback")
@CircuitBreaker(name = "paymentService", fallbackMethod = "paymentCircuitFallback")
public PaymentResponse processPayment(PaymentRequest request) {
return paymentClient.charge(request);
}
@Bulkhead(name = "recommendationService", fallbackMethod = "recoFallback")
public List<Product> getRecommendations(String customerId) {
return recoClient.getRecommendations(customerId);
}
private List<Product> recoFallback(String customerId, Throwable ex) {
log.warn("Recommendation bulkhead full or circuit open: {}", ex.getMessage());
// Non-critical service: return popular products from cache
return popularProductsCache.getTopProducts(10);
}
private PaymentResponse paymentBulkheadFallback(PaymentRequest req, Throwable ex) {
log.error("Payment bulkhead full: all {} slots occupied", 50);
return PaymentResponse.pending(req.getOrderId(),
    "High traffic: your payment is queued for processing.");
}
}

Sizing Your Bulkheads
This is where most teams get it wrong:
Start with your thread pool math. If your Tomcat has 200 threads and you have 4 downstream dependencies, don't give each one 50 threads (200 / 4 = 50). You need headroom. A dependency at full bulkhead capacity should leave enough threads for the rest of the system to function.
Rule of thumb:
Bulkhead size = (Expected peak concurrent calls) x 1.3
Total allocated <= 70% of your application's thread pool

If you're allocating more than 70% of your threads to bulkheads, you don't have enough capacity. Either scale up or re-evaluate your dependencies.
Size by criticality, not equality. Your payment service deserves more capacity than your recommendation service. Your recommendation service is "nice to have." Your payment service is "the business literally stops without this." Allocate accordingly.
Monitor and tune. Start with generous limits and tighten over time based on actual traffic patterns. Bulkhead rejections are your signal that either the limit is too tight or the dependency is too slow. Both are useful information.
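As a worked example of the rule of thumb above (all traffic numbers are hypothetical): a dependency peaking at 30 concurrent calls gets a bulkhead of about 39, and bulkheads of 39 + 39 + 13 fit within 70% of a 200-thread pool.

```java
// Sizing arithmetic from the rule of thumb; substitute your own measurements.
public class BulkheadSizing {
    /** Bulkhead size = expected peak concurrent calls x 1.3 (rounded). */
    public static int bulkheadSize(int peakConcurrentCalls) {
        return (int) Math.round(peakConcurrentCalls * 1.3);
    }

    /** True if combined bulkhead capacity stays within 70% of the app's thread pool. */
    public static boolean withinBudget(int[] bulkheadSizes, int appThreads) {
        int total = 0;
        for (int size : bulkheadSizes) {
            total += size;
        }
        return total <= appThreads * 0.7;
    }
}
```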
Retries: The Pattern That Will DDoS You If You Get It Wrong
The Concept
Retries are the simplest resilience pattern. A call fails? Try again. Networks are unreliable, services have momentary hiccups, and a retry often succeeds where the first attempt failed.
Simple, right?
Also the fastest way to take down your own infrastructure if you do it wrong.
Why Naive Retries Kill Recovering Services
Imagine your payment service handles 10,000 requests per second at peak. Something hiccups (maybe a deployment, maybe a brief network partition) and 1% of requests start failing. That's 100 failed requests per second.
Now, every caller retries those 100 failed requests immediately. That's 10,100 requests per second. The service is already struggling with 10,000. Now it has 10,100. More failures. More retries.
Each retry adds to the load. The added load causes more failures, and more failures mean more retries. The math is exponential:
Second 1: 10,000 requests → 100 failures → 100 retries
Second 2: 10,100 requests → 200 failures → 200 retries
Second 3: 10,300 requests → 400 failures → 400 retries
Second 4: 10,700 requests → 800 failures → 800 retries
Second 5: 11,500 requests → service collapses entirely

Five seconds. That's how fast aggressive retries can turn a minor hiccup into a complete meltdown. And if you have multiple services retrying independently? Multiply that by every caller in your mesh. Congratulations, you've taken yourself offline.
This is exactly what happened during the October 2025 AWS outage. When DNS resolution was restored, millions of EC2 instances and Lambda functions simultaneously retried their connections. The connection flood overwhelmed the recovering DynamoDB control plane, DNS failed again, and the cycle repeated. What should have been a quick DNS fix dragged on for hours because every client on the internet retried at the same time.
The Three Laws of Safe Retries
1. Exponential Backoff
Never retry immediately. Wait. And wait longer each time.
Attempt 1: wait 1 second
Attempt 2: wait 2 seconds
Attempt 3: wait 4 seconds
Attempt 4: wait 8 seconds

This gives the failing service breathing room. Each successive retry is less aggressive than the last.
2. Jitter
Exponential backoff alone has a problem: if 1,000 clients all fail at the same instant, they'll all retry at the same instants (1s, 2s, 4s, 8s), creating synchronized thundering herds. Jitter adds randomness to the delay:
Actual delay = baseDelay x 2^attempt x random(0.5, 1.5)

Now 1,000 clients spread their retries across a window instead of all hitting at the exact same millisecond. The load becomes a wave instead of a spike.
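The formula is a couple of lines of code. A sketch, with illustrative constants matching the formula above:

```java
import java.util.concurrent.ThreadLocalRandom;

// Exponential backoff scaled by a random jitter factor in [0.5, 1.5).
public class RetryDelays {
    /** Delay before the given attempt (0-based), without jitter: base * 2^attempt. */
    public static long backoffMillis(long baseMillis, int attempt) {
        return baseMillis * (1L << attempt);
    }

    /** Backoff with jitter: base * 2^attempt * random(0.5, 1.5). */
    public static long jitteredMillis(long baseMillis, int attempt) {
        double jitter = ThreadLocalRandom.current().nextDouble(0.5, 1.5);
        return (long) (backoffMillis(baseMillis, attempt) * jitter);
    }
}
```

With a 1-second base, attempt 2 lands somewhere between 2 and 6 seconds rather than exactly at 4, which is the whole point.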
3. Retry Budgets
Even with backoff and jitter, unlimited retries can overwhelm a system. A retry budget limits the total retry traffic as a percentage of overall traffic. Google's SRE book recommends: if your retry rate exceeds 10% of your total request volume, stop retrying entirely and fail fast. The service needs time to recover, and more retries are making it worse, not better.
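A retry budget can be as simple as two counters and a ratio check. A sketch (a production version would track a sliding window rather than process-lifetime totals; the 10% cap is the figure cited above):

```java
// Stop retrying once retries exceed a fixed fraction of total request volume.
public class RetryBudget {
    private final double maxRetryRatio;   // e.g. 0.10 for 10%
    private long requests = 0;
    private long retries = 0;

    public RetryBudget(double maxRetryRatio) {
        this.maxRetryRatio = maxRetryRatio;
    }

    public synchronized void recordRequest() {
        requests++;
    }

    /** Returns true (and spends budget) only if retries stay under the cap. */
    public synchronized boolean tryAcquireRetry() {
        if (requests == 0 || (retries + 1) > requests * maxRetryRatio) {
            return false;                 // budget exhausted: fail fast instead
        }
        retries++;
        return true;
    }
}
```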
Implementation in Spring Boot
resilience4j:
retry:
instances:
paymentService:
maxAttempts: 3
waitDuration: 1s
enableExponentialBackoff: true
exponentialBackoffMultiplier: 2
randomizedWaitFactor: 0.5
retryExceptions:
- java.io.IOException
- java.util.concurrent.TimeoutException
- org.springframework.web.client.HttpServerErrorException
ignoreExceptions:
- com.codyssey.exceptions.BusinessValidationException
- org.springframework.web.client.HttpClientErrorException

Pay attention to that last part. Only retry on transient errors. An IOException (network blip) is worth retrying. A 400 Bad Request is not: sending the same bad request three times will produce three bad responses and waste everyone's time. A 404 Not Found isn't going to find itself on the third try.
The service:
@Service
@Slf4j
@RequiredArgsConstructor
public class PaymentService {
    private final PaymentGateway paymentGateway;   // gateway/queue type names assumed
    private final PaymentQueue paymentQueue;
@Retry(name = "paymentService", fallbackMethod = "paymentRetryFallback")
@CircuitBreaker(name = "paymentService", fallbackMethod = "paymentCircuitFallback")
@Bulkhead(name = "paymentService")
public PaymentResponse charge(PaymentRequest request) {
log.info("Attempting payment for order {}", request.getOrderId());
return paymentGateway.processPayment(request);
}
private PaymentResponse paymentRetryFallback(PaymentRequest request, Throwable ex) {
log.error("Payment failed after 3 attempts for order {}: {}",
request.getOrderId(), ex.getMessage());
// All retries exhausted: queue for async processing
paymentQueue.enqueue(request);
return PaymentResponse.queued(request.getOrderId());
}
}

Don't Forget: Retries Create Duplicates
Retries have a side effect that most tutorials gloss over: they can cause duplicate operations. If your first request actually succeeded but the response was lost due to a network partition, your retry will execute the operation again. Your customer gets charged twice. Your inventory gets decremented twice. Your notification gets sent twice.
Every operation that can be retried must be idempotent. Use idempotency keys:
public PaymentResponse processPayment(PaymentRequest request) {
// Check if this payment was already processed
Optional<PaymentResponse> existing = paymentRepository
.findByIdempotencyKey(request.getIdempotencyKey());
if (existing.isPresent()) {
log.info("Duplicate payment detected for key {}. Returning existing result.",
request.getIdempotencyKey());
return existing.get();
}
// Process the payment
PaymentResponse response = gateway.charge(request);
// Store the result keyed by idempotency key
paymentRepository.saveWithIdempotencyKey(
request.getIdempotencyKey(), response);
return response;
}

No idempotency key? No retry. Period.
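One subtlety: the lookup-then-charge sequence above can itself race if two retries arrive concurrently and both miss the lookup. An atomic variant, sketched with an in-memory map (a real store would be a database table with a unique constraint on the key; names are mine):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Runs an operation at most once per idempotency key, even under concurrency.
public class IdempotentExecutor<R> {
    private final Map<String, R> results = new ConcurrentHashMap<>();

    public R execute(String idempotencyKey, Function<String, R> operation) {
        // computeIfAbsent guarantees a single execution per key:
        // concurrent callers with the same key get the first caller's result
        return results.computeIfAbsent(idempotencyKey, operation);
    }
}
```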
Combining the Three: The Resilience Stack
These patterns aren't alternatives. They're layers. In production, you stack them:
Request → Bulkhead → Circuit Breaker → Retry → Actual Call

Resilience4j applies decorators from the outside in: the outermost decorator runs first, and the innermost sits closest to your actual call. One caveat: the default aspect order in Resilience4j's Spring Boot module is Retry(CircuitBreaker(RateLimiter(TimeLimiter(Bulkhead(call))))), with Retry outermost, so the stack above requires overriding the aspect order. Once you do:
- The Bulkhead is checked first (outermost). If all slots are occupied, the request is rejected immediately. No resources wasted.
- The Circuit Breaker is checked next. If the circuit is open, it fails fast without attempting the call.
- The Retry wraps the actual call (innermost). If the call fails with a retryable exception, it tries again with backoff, but only while the circuit breaker is still closed and the bulkhead still has capacity.
Here's the aspect-order configuration that produces it:
resilience4j:
retry:
retryAspectOrder: 2
circuitbreaker:
circuitBreakerAspectOrder: 1
bulkhead:
bulkheadAspectOrder: 0

Lower number = higher priority = evaluated first.
Combined configuration for a payment dependency:
resilience4j:
circuitbreaker:
instances:
paymentGateway:
slidingWindowSize: 20
minimumNumberOfCalls: 10
failureRateThreshold: 50
slowCallRateThreshold: 80
slowCallDurationThreshold: 2s
waitDurationInOpenState: 30s
permittedNumberOfCallsInHalfOpenState: 5
automaticTransitionFromOpenToHalfOpenEnabled: true
retry:
instances:
paymentGateway:
maxAttempts: 3
waitDuration: 1s
enableExponentialBackoff: true
exponentialBackoffMultiplier: 2
randomizedWaitFactor: 0.5
retryExceptions:
- java.io.IOException
- java.util.concurrent.TimeoutException
bulkhead:
instances:
paymentGateway:
maxConcurrentCalls: 40
maxWaitDuration: 500ms

Here's the service with all three stacked:
@Service
@Slf4j
@RequiredArgsConstructor
public class PaymentOrchestrator {
private final PaymentGatewayClient gateway;
@Bulkhead(name = "paymentGateway")
@CircuitBreaker(name = "paymentGateway", fallbackMethod = "fallback")
@Retry(name = "paymentGateway")
public PaymentResult processPayment(PaymentRequest request) {
return gateway.charge(request);
}
private PaymentResult fallback(PaymentRequest request, Throwable ex) {
if (ex instanceof CallNotPermittedException) {
log.warn("Circuit OPEN for payment gateway");
return PaymentResult.circuitOpen(request.getOrderId());
}
if (ex instanceof BulkheadFullException) {
log.warn("Bulkhead FULL β payment gateway at capacity");
return PaymentResult.overloaded(request.getOrderId());
}
log.error("Payment failed after retries: {}", ex.getMessage());
return PaymentResult.queuedForRetry(request.getOrderId());
}
}

The fallback checks the exception type because different failure modes need different responses. A full bulkhead means "we're busy, try again in a moment." An open circuit means "the service is down, don't bother." Retry exhaustion means "we tried three times and it's not working."
The Anti-Patterns (How Teams Get This Wrong)
The Naive Retry
// This code will end careers
for (int i = 0; i < 10; i++) {
try {
return httpClient.call(url);
} catch (Exception e) {
// Retry immediately, no backoff, no jitter,
// no mercy for the downstream service
}
}

No backoff. No jitter. No exception filtering. This turns every transient glitch into a retry storm. You're not retrying; you're hammering a service that's already on its knees.
The Silent Circuit Breaker
A circuit breaker that opens and nobody knows about it. No metrics. No alerts. No dashboards. Your service is silently returning fallback responses for 45 minutes and nobody realizes the payment gateway has been down the entire time.
Fix: Expose circuit breaker state through Actuator. Alert on every state transition. Dashboard everything.
management:
endpoints:
web:
exposure:
include: health,circuitbreakers,circuitbreakerevents
health:
circuitbreakers:
enabled: true

Fallback Theatre
A fallback that just returns a different error message isn't a fallback; it's a costume change for the same failure. "Service unavailable" in a slightly nicer font is still "service unavailable."
A real fallback does something useful:
- Returns cached data (even if it's slightly stale)
- Queues the operation for async processing
- Turns off the non-critical feature without the user noticing
- Serves a degraded but functional response
The Shared Thread Pool
Having a circuit breaker but no bulkhead is like having a fire alarm but no sprinkler system. The breaker will eventually trip, but in the 10-30 seconds it takes to accumulate enough failures to cross the threshold, a slow dependency can drain your entire thread pool. Bulkhead first. Then circuit breaker. Then retry. That's the order for a reason.
The Missing Timeout
None of these patterns help if your HTTP client is configured to wait 60 seconds for a response. A bulkhead with 50 slots and a 60-second timeout means you can have 50 threads hanging for a full minute before the bulkhead even starts rejecting. Set aggressive timeouts:
@Bean
public RestClient paymentRestClient() {
    // Connect timeout: how long to wait to establish the connection
    HttpClient httpClient = HttpClient.newBuilder()
            .connectTimeout(Duration.ofSeconds(2))
            .build();
    JdkClientHttpRequestFactory factory = new JdkClientHttpRequestFactory(httpClient);
    // Read timeout: how long to wait for the response itself
    factory.setReadTimeout(Duration.ofSeconds(2));
    return RestClient.builder()
            .baseUrl("https://payment.internal")
            .requestFactory(factory)
            .build();
}

Your timeout should be based on the dependency's expected response time plus a reasonable buffer. If the payment gateway normally responds in 200ms, a 2-second timeout is generous. A 60-second timeout is a foot gun.
The Decision Framework
When you're wiring up a new service dependency, ask these questions:
1. What happens if this dependency dies completely? If the answer is "our service also dies," you need a circuit breaker with a meaningful fallback.
2. What happens if this dependency gets slow? If the answer is "our threads fill up," you need a bulkhead and aggressive timeouts.
3. Is this a transient or persistent failure mode? Transient (network blip, brief overload) → retry with backoff and jitter. Persistent (service is down, bad deployment) → circuit breaker.
4. Is the operation idempotent? Yes → safe to retry. No → do NOT retry; use a circuit breaker with a queue-based fallback instead.
5. How critical is this dependency? Critical (payment, auth) → large bulkhead, aggressive circuit breaker, retries, full fallback strategy. Non-critical (recommendations, analytics) → small bulkhead, circuit breaker that opens fast, simple cached fallback.
6. Can the user tolerate degradation? Yes → return cached/partial data, queue the operation. No → fail fast with a clear message and retry guidance.
Monitoring: The Part Everyone Skips
Resilience patterns without monitoring are just decorations. You've added the annotations, you've configured the YAML, and now nobody's watching what the patterns are actually doing. You need to track:
| Metric | What It Tells You | Alert When |
|---|---|---|
| Circuit breaker state | Which services are healthy | Any breaker enters OPEN state |
| Bulkhead active calls | Current concurrency per dependency | > 80% of max capacity |
| Bulkhead rejected calls | When you're hitting limits | Any rejections in production |
| Retry count | How often transient failures occur | Retry rate > 10% of total requests |
| Fallback invocations | How often degraded mode is active | Any sustained fallback usage |
| Response time percentiles | Latency trends before they become outages | p99 > 2x normal |
Resilience4j publishes all of these through Micrometer, which integrates with Prometheus, Grafana, Datadog, or whatever your team already uses. Set it up once and forget about it.
What Actually Matters
Resilience patterns exist because distributed systems fail in ways that monoliths don't. In a monolith, if the recommendation module throws an exception, the catch block handles it and the order goes through. In microservices, that same failure involves network timeouts, thread exhaustion, connection pool drainage, and cascading outages that turn a $0.02/month recommendation service into a platform-wide incident.
The patterns themselves are simple enough to explain over lunch: circuit breakers stop hopeless calls, bulkheads contain the blast radius, and retries handle the transient stuff (but only with backoff and jitter; without those, retries ARE the outage).
Stack all three. Monitor everything. Test under failure conditions β not just happy paths. If you've never injected a 10-second delay into a dependency during a load test, you don't know how your system behaves under stress. You only think you know.
The goal isn't to prevent failures. Failures in distributed systems aren't edge cases. They're Tuesday. The goal is to build systems that lose the right things: systems that drop the recommendation carousel instead of the entire checkout flow, that queue payments instead of dropping them, that return cached data instead of a 500 error.
The best incident response is the one that never pages anyone. And the best resilience pattern is the one that turns a 2 AM catastrophe into a Grafana blip that your team reviews over coffee the next morning.
Now go check your services. How many of your HTTP clients have default timeouts? How many of your retries have backoff? How many of your dependencies share a single thread pool? If the answer to any of those is "I'm not sure," you have work to do. And I'd suggest doing it before 2:47 AM makes the decision for you.