📚 Series Navigation:
← Previous: Part 9 - 10,000 Threads and a Dream
👉 You are here: Part 10 - Can You See Me Now?
Next: Part 11 - Trust, But Verify →
📋 Introduction
It's 2 AM. Your phone buzzes with a PagerDuty alert. The weather service is returning 500 errors. You SSH into the server, tail the logs, and see... nothing useful. Just a wall of INFO messages that tell you the application started successfully three weeks ago. Somewhere between "application started" and "everything is on fire," something went wrong. But what? And when? And why?
You check the metrics dashboard. It doesn't exist. You try to trace a failing request through the system. There's no trace. You're debugging a production outage the same way you'd debug a cave: with a flickering candle and a lot of hope.
If this sounds like your last Tuesday, you're not alone. This is what running a microservice without observability looks like.
The hard truth: a microservice you can't observe is a microservice you can't operate. It might work perfectly today. But the moment something goes wrong — and it will — you need three things: logs that tell you what happened, metrics that tell you how much is happening, and traces that tell you where it happened across your distributed system.
These aren't nice-to-haves. They're the difference between a 5-minute fix and a 5-hour war room.
In this article, we'll explore how our Weather Microservice implements all three pillars of observability. We'll build structured logging with correlation IDs that follow requests across service boundaries, custom business metrics that tell us what our application is actually doing, and distributed tracing that shows us exactly where time is being spent. And we'll do it with a framework that makes sure you never forget a pillar.
Grab your binoculars. It's time to make the invisible visible. ☕
🔭 The TRACE Framework
First, our observability blueprint — the TRACE framework:
| Letter | Principle | Description |
|---|---|---|
| T | Three Pillars | Implement all three: logs, metrics, and traces — they complement each other |
| R | Request Correlation | Every request gets a unique ID that flows through every log line and trace span |
| A | Automatic Instrumentation | Let the framework instrument HTTP, JPA, and cache operations automatically |
| C | Custom Metrics | Track business-specific metrics beyond generic HTTP stats |
| E | Environment-aware Config | Different verbosity levels for dev, test, and production environments |
🔥 Critical Insight: The three pillars work together. Logs tell you what happened, metrics tell you how often, and traces tell you where the time went. You need all three. Two out of three leaves a blind spot that will bite you during the next outage.
🔬 Pillar 1: Structured Logging
Logging seems simple. You sprinkle log.info() calls around your code, and you're done, right? Not quite. In a microservices world, logging without structure is like filing paperwork by throwing it into a pile on the floor. Sure, all the information is there, but good luck finding what you need during an outage.
The Correlation ID Filter
The single most important thing you can do for microservice logging is to assign every request a unique correlation ID that appears in every log line. This lets you filter an entire request's journey through your system with a single search query.
// CorrelationIdFilter.java
@Component
@Order(Ordered.HIGHEST_PRECEDENCE)
public class CorrelationIdFilter implements Filter {
private static final String CORRELATION_ID_HEADER = "X-Correlation-ID";
private static final String CORRELATION_ID_MDC_KEY = "correlationId";
@Override
public void doFilter(ServletRequest request, ServletResponse response, FilterChain chain)
throws IOException, ServletException {
if (!(request instanceof HttpServletRequest httpRequest)
|| !(response instanceof HttpServletResponse httpResponse)) {
chain.doFilter(request, response);
return;
}
try {
// Get or generate correlation ID
String correlationId = httpRequest.getHeader(CORRELATION_ID_HEADER);
if (correlationId == null || correlationId.isBlank()) {
correlationId = UUID.randomUUID().toString();
}
// Add to MDC for logging
MDC.put(CORRELATION_ID_MDC_KEY, correlationId);
// Add to response headers
httpResponse.setHeader(CORRELATION_ID_HEADER, correlationId);
chain.doFilter(request, response);
} finally {
// Clean up MDC to prevent memory leaks in thread pools
MDC.remove(CORRELATION_ID_MDC_KEY);
}
}
}
Several design decisions make this filter production-ready:
@Order(Ordered.HIGHEST_PRECEDENCE) — This filter runs first, before security, logging, or any other filter. Why? Because we need the correlation ID in the MDC before any other filter generates a log line. If the security filter rejects a request, we still want to know which correlation ID it was.
Accept or Generate — The filter checks for an incoming X-Correlation-ID header first. If an upstream service (like an API gateway) already generated one, we reuse it. This is how correlation flows across service boundaries. If there's no header, we generate a fresh UUID.
MDC (Mapped Diagnostic Context) — SLF4J's MDC is a thread-local map that gets automatically included in every log statement. By putting the correlation ID in MDC, every log line from every class during that request will include the correlation ID — without modifying a single logger call.
The finally Block — This is critical. MDC uses ThreadLocal storage. If you don't clean up, the correlation ID from one request could leak into the next request on the same thread. With virtual threads this matters even more — virtual threads get recycled frequently. Always clean up your MDC.
Response Header — We echo the correlation ID back in the response header. This means the caller (whether it's a frontend, another service, or a developer using curl) can see exactly which correlation ID was used. If something goes wrong, they can hand you that ID and you can trace the entire request.
✅ Pro tip: In a real microservices architecture, pass the correlation ID to downstream service calls too. Add it as a header when making RestClient calls. This creates a single thread you can pull to unravel any request across any number of services.
HTTP Request/Response Logging
Knowing what your service received and what it returned is invaluable for debugging. The LoggingFilter captures this:
// LoggingFilter.java
@Component
@Slf4j
public class LoggingFilter implements Filter {
private static final int MAX_PAYLOAD_LENGTH = 1000;
@Override
public void doFilter(ServletRequest request, ServletResponse response, FilterChain chain)
throws IOException, ServletException {
if (!(request instanceof HttpServletRequest httpRequest)
|| !(response instanceof HttpServletResponse httpResponse)) {
chain.doFilter(request, response);
return;
}
// Wrap request and response to cache content
ContentCachingRequestWrapper requestWrapper = new ContentCachingRequestWrapper(httpRequest);
ContentCachingResponseWrapper responseWrapper = new ContentCachingResponseWrapper(httpResponse);
Instant startTime = Instant.now();
try {
logRequest(requestWrapper);
chain.doFilter(requestWrapper, responseWrapper);
} finally {
Duration duration = Duration.between(startTime, Instant.now());
logResponse(requestWrapper, responseWrapper, duration);
// IMPORTANT: Copy cached response content to actual response
responseWrapper.copyBodyToResponse();
}
}
}
A few clever patterns worth calling out:
ContentCachingRequestWrapper / ContentCachingResponseWrapper — In the servlet API, request and response bodies are streams that can only be read once. If you read the request body in the filter, your controller gets an empty body. These wrappers cache the content so it can be read multiple times — once for logging, once for the actual handler.
copyBodyToResponse() — This is the line most people forget. The ContentCachingResponseWrapper intercepts the response body so you can read it for logging. But if you don't call copyBodyToResponse() at the end, the actual response sent to the client will be empty. This is one of those bugs that works perfectly in tests (where you're checking status codes) and silently breaks in production (where clients get empty responses).
Duration Tracking — The filter measures how long each request takes from start to finish. This gives you per-request timing right in your logs, without needing a metrics system:
HTTP Response: GET /api/weather/current - Status: 200 - Duration: 145ms
Payload Truncation — Request and response bodies are truncated to 1000 characters. You want enough to debug, but you don't want a 10MB JSON response filling up your log storage. The truncate method ensures you get the head of the payload with a ... (truncated) suffix.
Conditional Body Logging — Bodies are only logged at DEBUG level, and only for specific methods (POST/PUT/PATCH for requests, 4xx/5xx for responses). This prevents verbose logging in production while still being available when you need it.
The Log Pattern: Making Correlation IDs Visible
Having a correlation ID in MDC is useless if your log pattern doesn't display it:
<!-- logback-spring.xml -->
<property name="LOG_PATTERN"
value="%d{yyyy-MM-dd HH:mm:ss.SSS} [%X{correlationId:-}] [%thread] %-5level %logger{36} - %msg%n"/>
The %X{correlationId:-} syntax pulls the correlation ID from MDC. The :- provides an empty default if no correlation ID exists (for application startup logs, scheduled tasks, etc.). Every log line now includes the correlation ID:
2025-01-15 14:30:22.456 [a1b2c3d4-e5f6-7890-abcd-ef1234567890] [virtual-1] INFO c.w.controller.WeatherController - Fetching weather for London
2025-01-15 14:30:22.512 [a1b2c3d4-e5f6-7890-abcd-ef1234567890] [virtual-1] INFO c.w.client.WeatherApiClient - Calling external weather API
2025-01-15 14:30:22.745 [a1b2c3d4-e5f6-7890-abcd-ef1234567890] [virtual-1] INFO c.w.service.WeatherService - Weather data retrieved successfully
One correlation ID. Three log lines. Three different classes. Perfect traceability.
Structured JSON Logging for Production
Human-readable logs are great for development. But in production, you're shipping logs to ELK, Datadog, or Splunk — systems that need machine-parseable data. That's where Logstash JSON encoding comes in:
<!-- logback-spring.xml -->
<appender name="JSON_FILE" class="ch.qos.logback.core.rolling.RollingFileAppender">
<file>${LOG_PATH}/application.json</file>
<encoder class="net.logstash.logback.encoder.LogstashEncoder">
<customFields>{"service":"weather-service","environment":"${SPRING_PROFILES_ACTIVE:-default}"}</customFields>
<includeCallerData>true</includeCallerData>
<includeMdc>true</includeMdc>
<includeContext>true</includeContext>
<includeTags>true</includeTags>
<includeStructuredArguments>true</includeStructuredArguments>
<includeKeyValuePairs>true</includeKeyValuePairs>
<shortenedLoggerNameLength>36</shortenedLoggerNameLength>
<timestampPattern>yyyy-MM-dd'T'HH:mm:ss.SSSXXX</timestampPattern>
</encoder>
<rollingPolicy class="ch.qos.logback.core.rolling.SizeAndTimeBasedRollingPolicy">
<fileNamePattern>${LOG_PATH}/application.%d{yyyy-MM-dd}.%i.json</fileNamePattern>
<maxFileSize>10MB</maxFileSize>
<maxHistory>30</maxHistory>
<totalSizeCap>1GB</totalSizeCap>
</rollingPolicy>
</appender>
Each log line becomes a JSON object with structured fields. The LogstashEncoder automatically includes:
@timestamp— ISO 8601 format for reliable time parsinglevel— Log level as a searchable fieldlogger_name— The class that generated the logmessage— The log messagecorrelationId— From MDC, automatically included becauseincludeMdcis trueserviceandenvironment— Custom fields identifying the source
A single log line in production looks like:
{
"@timestamp": "2025-01-15T14:30:22.456+00:00",
"level": "INFO",
"logger_name": "c.w.controller.WeatherController",
"message": "Fetching weather for London",
"correlationId": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
"service": "weather-service",
"environment": "prod",
"thread_name": "virtual-1"
}
Now you can query correlationId = "a1b2c3d4..." in your log aggregation tool and see every log line from that request, across every service, in chronological order.
Error-Specific Log Files
Not all log data is equally urgent. The Weather Microservice separates error logs into dedicated files with longer retention:
<appender name="ERROR_FILE" class="ch.qos.logback.core.rolling.RollingFileAppender">
<file>${LOG_PATH}/error.log</file>
<filter class="ch.qos.logback.classic.filter.ThresholdFilter">
<level>ERROR</level>
</filter>
<!-- ... -->
<rollingPolicy class="ch.qos.logback.core.rolling.SizeAndTimeBasedRollingPolicy">
<maxHistory>90</maxHistory> <!-- 90 days vs 30 for regular logs -->
<totalSizeCap>500MB</totalSizeCap>
</rollingPolicy>
</appender>
Error logs get 90 days of retention (vs 30 for general logs) because you often don't discover production issues until weeks later. A customer reports a problem from last month, and you need those error logs to investigate.
Async Appenders: Don't Let Logging Slow You Down
Writing to disk is I/O. In a high-throughput service, synchronous log writing can become a bottleneck. Async appenders solve this by buffering log events in a queue and writing them from a background thread:
<appender name="ASYNC_FILE" class="ch.qos.logback.classic.AsyncAppender">
<appender-ref ref="FILE"/>
<queueSize>512</queueSize>
<discardingThreshold>0</discardingThreshold>
</appender>
The discardingThreshold: 0 is important — by default, Logback's async appender drops DEBUG and TRACE events when the queue reaches 80% capacity. Setting it to 0 means we never discard any log events. In a weather service, we'd rather have the queue back up briefly than lose log data.
Profile-Specific Log Configuration
Different environments need different logging verbosity:
<!-- logback-spring.xml -->
<springProfile name="dev">
<logger name="com.weatherspring" level="DEBUG"/>
<logger name="org.springframework.web" level="DEBUG"/>
<logger name="org.hibernate.SQL" level="DEBUG"/>
<logger name="org.hibernate.type.descriptor.sql.BasicBinder" level="TRACE"/>
<root level="INFO">
<appender-ref ref="CONSOLE"/>
<appender-ref ref="ASYNC_FILE"/>
</root>
</springProfile>
<springProfile name="prod">
<logger name="com.weatherspring" level="INFO"/>
<logger name="org.springframework" level="WARN"/>
<logger name="org.hibernate" level="WARN"/>
<root level="INFO">
<appender-ref ref="JSON_CONSOLE"/>
<appender-ref ref="ASYNC_JSON_FILE"/>
<appender-ref ref="ASYNC_JSON_ERROR_FILE"/>
</root>
</springProfile>
In dev: Human-readable console output, DEBUG level for application code, SQL queries logged with parameter values (BasicBinder at TRACE shows the actual bound values like binding parameter [1] as [VARCHAR] - [London]).
In prod: JSON console output (for container log collectors), INFO level for application code, WARN for framework noise, separate error log files with longer retention.
In test: Minimal logging. The WeatherApiClient and GlobalExceptionHandler are set to OFF to keep test output clean — you don't need to see expected exceptions in your test results.
🤔 Design decision: Why use <springProfile> in logback-spring.xml instead of Logback's native <if> conditional? Because Spring Boot processes <springProfile> during application startup, giving you access to the active Spring profile. Native Logback conditionals don't have access to Spring's profile system. The tradeoff is that the file must be named logback-spring.xml (not logback.xml) for Spring to process these tags.
📊 Pillar 2: Custom Business Metrics
Spring Boot Actuator with Micrometer gives you dozens of metrics out of the box — HTTP request counts, JVM memory usage, garbage collection stats, connection pool sizes. These are essential, but they tell you about the infrastructure, not the business.
Do you know how many weather API calls you're making per minute? What's your cache hit rate? How many locations have users created this week? These are business metrics — the numbers that tell you whether your application is actually working.
The MetricsService
The Weather Microservice centralizes all custom business metrics in a single service:
// MetricsService.java
@Service
@Slf4j
public class MetricsService {
private final Counter weatherApiCallsTotal;
private final Counter forecastApiCallsTotal;
private final Counter cacheHitsTotal;
private final Counter cacheMissesTotal;
private final Counter locationsCreatedTotal;
private final Counter weatherRecordsSavedTotal;
private final Counter forecastRecordsSavedTotal;
private final Timer externalApiResponseTime;
public MetricsService(MeterRegistry meterRegistry) {
this.weatherApiCallsTotal =
Counter.builder("weather.api.calls.total")
.description("Total number of weather API calls")
.tag("api", "weatherapi")
.register(meterRegistry);
this.forecastApiCallsTotal =
Counter.builder("forecast.api.calls.total")
.description("Total number of forecast API calls")
.tag("api", "weatherapi")
.register(meterRegistry);
this.cacheHitsTotal =
Counter.builder("cache.hits.total")
.description("Total number of cache hits")
.register(meterRegistry);
this.cacheMissesTotal =
Counter.builder("cache.misses.total")
.description("Total number of cache misses")
.register(meterRegistry);
this.locationsCreatedTotal =
Counter.builder("locations.created.total")
.description("Total number of locations created")
.register(meterRegistry);
this.weatherRecordsSavedTotal =
Counter.builder("weather.records.saved.total")
.description("Total number of weather records saved to database")
.register(meterRegistry);
this.forecastRecordsSavedTotal =
Counter.builder("forecast.records.saved.total")
.description("Total number of forecast records saved to database")
.register(meterRegistry);
this.externalApiResponseTime =
Timer.builder("external.api.response.time")
.description("External API response time")
.tag("api", "weatherapi")
.register(meterRegistry);
}
}
Here's what's going on with these design choices:
Counters vs. Timers
The service uses two types of Micrometer meters:
Counters — Monotonically increasing values. They only go up. Perfect for counting events: API calls made, cache hits, records saved. You use counters with rate() functions in your dashboards to see "how many per second."
Timers — Record the duration of operations. The externalApiResponseTime timer tracks how long external API calls take. Timers automatically calculate percentiles, mean, max, and count — giving you both "how many" and "how fast" from a single meter.
The Tag System
Notice the tag("api", "weatherapi") on the API call counters. Tags (also called labels in Prometheus) let you add dimensions to your metrics without creating separate metric names. If you later add a second weather provider, you'd use tag("api", "openweathermap") and the same metric name, then filter or group by tag in your dashboards.
Builder Pattern in Constructor
All metrics are created in the constructor and stored as fields. This is a deliberate pattern:
- Metrics are registered once — No risk of accidentally registering the same metric twice (which would throw an error)
- Fail-fast — If Micrometer configuration is wrong, you'll know at startup, not at the first API call
- No overhead on hot paths — The
increment()calls are simple field accesses, no registry lookups
Recording Methods: Clean API for Business Code
The service exposes clean recording methods that other services call:
public void recordWeatherApiCall() {
weatherApiCallsTotal.increment();
}
public void recordCacheHit() {
cacheHitsTotal.increment();
}
public void recordWeatherRecordsSaved(int count) {
weatherRecordsSavedTotal.increment(count);
}
public void recordExternalApiResponseTime(long durationMillis) {
externalApiResponseTime.record(durationMillis, TimeUnit.MILLISECONDS);
}
public <T> T recordExternalApiCall(java.util.function.Supplier<T> task) {
return externalApiResponseTime.record(task);
}
The recordExternalApiCall(Supplier<T>) method is the one I like most — it wraps any operation and automatically measures its duration:
// Usage in WeatherApiClient:
WeatherApiResponse response = metricsService.recordExternalApiCall(
() -> restClient.get()
.uri("/current.json?key={key}&q={location}", apiKey, location)
.retrieve()
.body(WeatherApiResponse.class)
);
One line, automatic timing, no manual stopwatch needed.
Exposing Metrics to Prometheus
All these custom metrics are automatically available at the Prometheus endpoint thanks to the actuator configuration:
# application.yml
management:
endpoints:
web:
exposure:
include: health,info,metrics,prometheus,circuitbreakers,circuitbreakerevents,shutdown
enabled-by-default: false
endpoint:
health:
enabled: true
show-details: when-authorized
roles: ACTUATOR_ADMIN
prometheus:
enabled: true
metrics:
export:
prometheus:
enabled: true
tags:
application: ${spring.application.name}
environment: ${spring.profiles.active}
distribution:
percentiles-histogram:
http.server.requests: true
Key configuration decisions:
enabled-by-default: false — All endpoints are disabled by default, then we explicitly enable the ones we need. This follows the principle of least privilege — only expose what you need.
show-details: when-authorized — Health check details (database status, disk space, etc.) are only shown to authenticated users with the ACTUATOR_ADMIN role. Unauthenticated requests just see {"status": "UP"}.
percentiles-histogram: http.server.requests: true — This enables percentile histograms for HTTP server requests, allowing Prometheus to calculate p50, p95, p99 latencies. Without this, you only get count and sum — useful for averages, but averages hide tail latency problems.
Global tags — application and environment are added to every metric automatically. When you have multiple services reporting to the same Prometheus instance, these tags let you filter dashboards by service and environment.
The Dashboard You'd Build
With these metrics exposed, here's what a useful Grafana dashboard would include:
| Panel | Metric | Purpose |
|---|---|---|
| API Call Rate |
rate(weather_api_calls_total[5m]) |
How many API calls per second |
| Cache Hit Ratio |
cache_hits_total /(cache_hits_total +cache_misses_total) |
Is the cache working? |
| External API Latency (p99) |
histogram_quantile(0.99,external_api_response_time) |
How fast is the weather API? |
| Error Rate | rate(http_server_requests_seconds_count{status=~"5.."}[5m]) |
How many 5xx errors per second |
| Records Saved | rate(weather_records_saved_total[1h]) |
Data collection throughput |
| Locations Created |
increase(locations_created_total[24h]) |
Daily user activity |
🔥 Critical Insight: The cache hit ratio is the most important business metric in this service. If it drops below 70%, something is wrong — either the TTLs are too short, the cache keys are wrong, or the cache was evicted. Set an alert on it.
🌐 Pillar 3: Distributed Tracing
Logs tell you what happened. Metrics tell you how much. But in a distributed system, where a request spent its time is the hardest question to answer. That's where distributed tracing comes in.
The Weather Microservice uses Micrometer Tracing with Zipkin as the trace collector.
Tracing Configuration
# application.yml
management:
tracing:
sampling:
probability: ${TRACING_SAMPLE_RATE:0.1}
zipkin:
tracing:
endpoint: ${ZIPKIN_URL:http://localhost:9411/api/v2/spans}
Sampling rate: 0.1 (10%) — In production, tracing every request is expensive. Each trace generates data that needs to be sent to Zipkin, stored, and indexed. At 10%, you trace one in ten requests — enough to identify patterns without overwhelming the tracing infrastructure. For debugging specific issues, you can temporarily set TRACING_SAMPLE_RATE=1.0 via environment variable.
Zipkin endpoint via environment variable — The Zipkin URL is configurable because it differs between environments. In Docker Compose, it's http://zipkin:9411/api/v2/spans. In Kubernetes, it might be http://zipkin.observability.svc.cluster.local:9411/api/v2/spans.
Zipkin in Docker Compose
The docker-compose.yml includes Zipkin as a service, with the weather service depending on it:
# docker-compose.yml
services:
weather-service:
# ...
environment:
- TRACING_SAMPLE_RATE=${TRACING_SAMPLE_RATE:-1.0}
- ZIPKIN_URL=http://zipkin:9411/api/v2/spans
depends_on:
zipkin:
condition: service_healthy
zipkin:
image: openzipkin/zipkin:latest
container_name: zipkin
ports:
- "9411:9411"
restart: unless-stopped
healthcheck:
test: ["CMD", "wget", "-q", "-O", "-", "http://localhost:9411/health"]
interval: 10s
timeout: 3s
retries: 3
start_period: 10s
Notice the TRACING_SAMPLE_RATE defaults to 1.0 in Docker Compose (trace everything) vs 0.1 in production. During development, you want to see every trace to verify your instrumentation is working correctly.
The depends_on with condition: service_healthy ensures the weather service doesn't start until Zipkin is ready to receive traces. Without this, the first few traces after startup would be lost.
What Gets Traced Automatically
With Micrometer Tracing on the classpath, Spring Boot automatically instruments:
- HTTP Server requests — Every incoming request gets a trace span
- RestClient calls — Outgoing HTTP calls to the Weather API get child spans
- JPA/JDBC operations — Database queries get their own spans
- Cache operations — Cache lookups and puts are tracked
This means a single request to /api/weather/current?location=London generates a trace showing:
[HTTP GET /api/weather/current] 245ms
+-- [Cache lookup: currentWeather] 1ms (MISS)
+-- [HTTP GET weatherapi.com/v1/current.json] 180ms
+-- [JPA: INSERT weather_record] 12ms
+-- [Cache put: currentWeather] 1ms
You can see instantly that the external API call takes 180ms out of 245ms total — that's where you'd focus optimization efforts.
Correlation Between Pillars
Here's the payoff. The trace ID generated by Micrometer Tracing is the same format as the correlation ID we put in MDC. This means:
- You see an error in your logs with correlation ID
a1b2c3d4... - You search for that ID in Zipkin and see the full trace
- You check your metrics dashboard and see a spike in error rate at the same timestamp
Three pillars, one story. That's observability.
⚙️ Putting It All Together: The Observability Pipeline
Let's trace a complete request through all three pillars to see how they work together:
1. Request Arrives
The CorrelationIdFilter runs first (highest precedence):
- Generates correlation ID:
a1b2c3d4-e5f6-7890-abcd-ef1234567890 - Adds it to MDC
- Adds it to the response header
2. Request Logged
The LoggingFilter runs next:
- Logs:
HTTP Request: GET /api/weather/current?location=London - The correlation ID appears in this log line automatically (from MDC)
3. Trace Span Created
Micrometer Tracing creates a root span for the HTTP request, with the trace ID visible in Zipkin.
4. Business Logic Executes
The WeatherService calls the external API:
metricsService.recordWeatherApiCall()— Counter incrementedmetricsService.recordExternalApiCall(() -> ...)— Timer started- A child trace span is created for the RestClient call
- Log lines from
WeatherApiClientinclude the correlation ID
5. Cache Updated
The @Cacheable annotation checks the cache:
- Cache miss:
metricsService.recordCacheMiss() - After API call, result is cached
metricsService.recordCacheHit()on subsequent requests
6. Response Logged
Back in the LoggingFilter:
- Logs:
HTTP Response: GET /api/weather/current - Status: 200 - Duration: 245ms - Duration is measured
- Correlation ID still in the log line
7. Cleanup
The CorrelationIdFilter's finally block removes the correlation ID from MDC, preventing leaks to the next request.
📝 The Observability Checklist
Use this checklist for every microservice you build:
Logging
- [ ] Correlation ID filter at highest precedence
- [ ] MDC cleanup in
finallyblocks - [ ] Structured JSON logging for production
- [ ] Separate error log files with longer retention
- [ ] Async appenders to prevent I/O bottlenecks
- [ ] Profile-specific log levels (DEBUG for dev, INFO for prod)
- [ ] Request/response logging with body truncation
- [ ] Correlation ID passed to downstream services
Metrics
- [ ] Custom business metrics beyond HTTP stats
- [ ] Cache hit/miss ratios tracked
- [ ] External API call counts and latencies
- [ ] Percentile histograms enabled for HTTP requests
- [ ] Global tags for application and environment
- [ ] Prometheus endpoint exposed and secured
- [ ] Actuator endpoints selectively enabled
Tracing
- [ ] Distributed tracing enabled (Zipkin/Jaeger)
- [ ] Sample rate configured per environment
- [ ] Trace IDs correlated with log correlation IDs
- [ ] Automatic instrumentation for HTTP, JPA, cache
- [ ] Zipkin health-checked before service starts
🎯 Common Pitfalls and How to Avoid Them
Pitfall 1: Logging Sensitive Data
// ❌ Never do this
log.info("User authenticated with password: {}", password);
log.debug("API response: {}", fullResponseWithApiKey);
// ✅ Do this instead
log.info("User authenticated successfully for: {}", username);
log.debug("API response status: {}, size: {} bytes", status, responseSize);
Log the fact that something happened, not the data. Correlation IDs let you match logs to requests without including sensitive payloads.
Pitfall 2: Metrics Cardinality Explosion
// ❌ High cardinality — creates a new time series for every user
Counter.builder("api.calls")
.tag("userId", userId) // Millions of unique values!
.register(registry);
// ✅ Low cardinality — bounded set of values
Counter.builder("api.calls")
.tag("endpoint", "/weather/current") // Fixed set
.tag("status", "success") // Only success/failure
.register(registry);
Every unique tag combination creates a new time series in Prometheus. If you tag by user ID, you'll create millions of time series and your Prometheus instance will run out of memory.
Pitfall 3: Forgetting MDC Cleanup
// ❌ MDC leak if exception occurs between put and remove
MDC.put("correlationId", id);
chain.doFilter(request, response);
MDC.remove("correlationId"); // Never reached if doFilter throws!
// ✅ Always use try/finally
MDC.put("correlationId", id);
try {
chain.doFilter(request, response);
} finally {
MDC.remove("correlationId");
}
Pitfall 4: Sampling Rate Confusion
A 10% sample rate means 90% of requests have NO trace data. If you're debugging a specific request, you need the correlation ID from logs, not a trace. Set alerts based on metrics (which capture 100% of requests), investigate with logs (which capture 100% of requests), and use traces for understanding timing distribution.
🎓 Conclusion: Making the Invisible Visible
Remember that 2 AM PagerDuty alert from the intro? With the observability setup we've built, here's how that scenario plays out differently:
-
The three pillars are complementary — Logs tell you what, metrics tell you how much, traces tell you where. You need all three.
-
Correlation IDs are the glue — A single UUID connecting logs, traces, and support tickets. Implement the filter first, at highest precedence.
-
MDC makes correlation automatic — Put the ID in MDC once, and every log line includes it. But always clean up in
finallyblocks. -
Structured JSON logging for machines, text for humans — Use
<springProfile>to switch between console output (dev) and JSON output (prod). -
Custom metrics tell the business story — HTTP request counts are infrastructure metrics. Cache hit rates, API call counts, and records saved are business metrics.
-
Centralize metrics in a service — One class, all counters and timers, clean recording methods. No Micrometer code scattered across business logic.
-
Sample traces, log everything — Tracing has overhead. Set production sampling to 10-20%. But always log 100% — storage is cheap, lost debug data is expensive.
-
Secure your observability endpoints — Actuator endpoints expose sensitive data. Enable only what you need, require authentication for details.
-
Async appenders prevent I/O bottlenecks — Don't let logging slow down your request processing. Use async appenders with
discardingThreshold: 0. -
Environment-aware configuration — Dev needs DEBUG SQL queries. Prod needs JSON with Zipkin. Test needs silence. Use profiles to configure each independently.
Coming Next Week:
Part 11: Trust, But Verify - A Testing Strategy That Actually Works 🧪
📚 Series Progress
✅ Part 1: The Blueprint Before the Build
✅ Part 2: Spring Boot Alchemy
✅ Part 3: REST Assured
✅ Part 4: The Data Foundation
✅ Part 5: When the World Breaks
✅ Part 6: Cache Me If You Can
✅ Part 7: Guarding the Gates
✅ Part 8: Fail Gracefully
✅ Part 9: 10,000 Threads and a Dream
✅ Part 10: Can You See Me Now? ← You just finished this!
⬜ Part 11: Trust, But Verify
⬜ Part 12: Ship It
⬜ Part 13: To Production and Beyond
Happy observing, and may your correlation IDs always lead you to the root cause. ☕