
Why Logs Alone Aren’t Enough: Monitoring, Metrics, and Tracing


If you’ve ever debugged a messy production issue, you know the first instinct: “Check the logs.”
And sure, logs are useful. They tell you what happened and sometimes even why. But here’s the uncomfortable truth: logs alone won’t save you when systems grow bigger, more distributed, and more complex.

Let’s talk about why that’s the case, and why you also need metrics and tracing in your toolbox.

The Limits of Logs

Logs capture events. They’re textual breadcrumbs we leave behind to understand the flow of the application.

Example:

import lombok.extern.slf4j.Slf4j;

@Slf4j
public class PaymentService {
    public void processPayment(String userId, double amount) {
        log.info("Processing payment for user: {} with amount: {}", userId, amount);

        if (amount <= 0) {
            log.error("Invalid payment amount for user: {}", userId);
            throw new IllegalArgumentException("Amount must be positive");
        }

        // do processing
        log.info("Payment processed successfully for user: {}", userId);
    }
}

Looks neat, right? But when you have 50 microservices, each dumping thousands of lines of logs every minute, you’re in trouble.

  • Searching across services becomes painful.
  • You can’t easily answer: “How many payments failed in the last hour?”
  • You don’t know the latency between services.
  • Logs don’t show dependencies across systems.

You can also read my other post, A Developer’s Guide to Logs: Structured vs Unstructured, to understand the different types of logging and how structured logs make debugging easier.

That’s where metrics and tracing come in.

Metrics: Numbers That Tell a Story

Metrics answer questions logs can’t. They’re quantitative measurements: counts, latencies, error rates, memory usage, etc.

For example, instead of grepping through logs to find failed payments, you could expose a metric:

import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.Metrics;

public class PaymentService {
    private static final Counter failedPayments = 
        Counter.builder("payments_failed_total")
               .description("Number of failed payment attempts")
               .register(Metrics.globalRegistry);

    public void processPayment(String userId, double amount) {
        try {
            if (amount <= 0) throw new IllegalArgumentException("Invalid amount");
            // processing logic
        } catch (Exception e) {
            failedPayments.increment();
            throw e;
        }
    }
}

Now you can visualize this in Grafana and instantly see if payment failures spike. Much faster than scrolling through a wall of logs.

Metrics are perfect for:

  • Error rates (5xx per second)
  • Latency (response time of an endpoint)
  • Resource usage (heap memory, CPU)

They answer: “Is the system healthy?”
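
Latency, for instance, can be captured with a Micrometer Timer. Here’s a minimal sketch; the metric name payment_processing_time is illustrative, not something from the service above:

import io.micrometer.core.instrument.Metrics;
import io.micrometer.core.instrument.Timer;

public class PaymentService {
    private static final Timer paymentTimer =
        Timer.builder("payment_processing_time")
             .description("Time taken to process a payment")
             .register(Metrics.globalRegistry);

    public void processPayment(String userId, double amount) {
        // record() times the wrapped block and publishes it as a latency metric
        paymentTimer.record(() -> {
            // processing logic
        });
    }
}

Point that at Prometheus and Grafana and you can watch how long payments take over time, without touching a single log line.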

Tracing: Following the Journey

Okay, but what about distributed systems? Let’s say a payment request goes like this:

Frontend → API Gateway → Payment Service → Bank Adapter → Database

If the user complains it’s “slow,” where do you even look? Logs might say each service “did something,” but they don’t connect the dots.

Distributed tracing solves this. You attach a trace ID to a request, and every service passes it along. That way, you can reconstruct the full journey.
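
In practice a tracing library handles the propagation for you, but conceptually it’s nothing more than a header forwarded on every outgoing call. A rough sketch of the idea; the X-B3-TraceId header and the bank-adapter URL here are just illustrative:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class TracePropagationSketch {
    public void callDownstream(String traceId) throws Exception {
        // forward the same trace ID so the next service can join the trace
        HttpRequest request = HttpRequest.newBuilder()
            .uri(URI.create("https://bank-adapter.internal/charge"))
            .header("X-B3-TraceId", traceId)
            .build();

        HttpClient.newHttpClient()
                  .send(request, HttpResponse.BodyHandlers.ofString());
    }
}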

Example with Spring + Sleuth:

import org.springframework.cloud.sleuth.Span;
import org.springframework.cloud.sleuth.Tracer;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RestController;

@RestController
public class PaymentController {
    private final Tracer tracer;

    public PaymentController(Tracer tracer) {
        this.tracer = tracer;
    }

    @GetMapping("/pay")
    public String pay() {
        // start a new span for this unit of work
        Span span = tracer.nextSpan().name("payment-flow").start();
        try (Tracer.SpanInScope ws = tracer.withSpan(span)) {
            // do payment
            return "Payment success!";
        } finally {
            span.end();
        }
    }
}

When visualized in Jaeger or Zipkin, you’ll see exactly:

  • Request entered API Gateway at t=0ms
  • Payment Service took 120ms
  • Bank Adapter waited 300ms
  • DB query executed in 15ms

Now you know the bottleneck: the bank adapter.

Why All Three Together

  • Logs: Show details of events (errors, info messages).
  • Metrics: Give you trends and numbers (failure counts, latencies).
  • Tracing: Connects the dots across services to show the bigger picture.

Relying on just logs is like trying to navigate a city with only street signs. You’ll get somewhere, but slowly. Metrics give you a map. Tracing shows you the route taken.

Practical Real-World Flow

Imagine production alerts you at 2AM: “Payment system degraded.”

  • Metrics: You open Grafana, see error rate spiked from 0.1% to 8% in the last 10 minutes.
  • Tracing: You open Jaeger, find most requests spend 90% of their time waiting on the Bank Adapter.
  • Logs: You dig into that service’s logs and see:
ERROR: Timeout while calling external bank API

Boom. You’ve identified the issue in under 5 minutes. Logs alone would’ve had you searching across 5 services for hours.

Final Thoughts

Logs are a starting point, not the finish line. If you’re serious about running reliable systems, you need logs + metrics + tracing working together. That’s the holy trinity of observability.

So the next time you find yourself sprinkling log.info everywhere, pause and ask: “Would a metric or a trace help me debug this faster?” Chances are, the answer is yes.