
The rise of distributed tracing has fundamentally changed how we understand application behaviour in complex systems. With platforms like Datadog offering sophisticated tracing capabilities, a question increasingly emerges in architecture discussions: do we still need traditional logging when we have comprehensive tracing? Having implemented observability solutions across various network infrastructure projects, I’ve developed a clear perspective on this question.

The Seductive Promise of Tracing

Distributed tracing offers compelling capabilities that can make logging seem redundant at first glance. When properly implemented, tracing provides:

End-to-end request visibility: Traces show the complete journey of requests as they traverse microservices, APIs, databases and other components—revealing latencies, dependencies and bottlenecks with remarkable clarity.

Hierarchical context: Spans and sub-spans form a nested structure that mirrors the logical operations within your system, giving a natural representation of how work flows through your architecture.

Performance insights: The timing data inherent in traces provides immediate visibility into performance outliers without additional instrumentation.

Correlation without effort: Once your tracing infrastructure is established, correlation between components happens automatically—no more manually piecing together logs from different services.

When I first implemented comprehensive tracing across a large-scale network management platform at Juniper, the immediate insights were revelatory. Performance bottlenecks that had remained hidden despite extensive logging suddenly became obvious. Intermittent issues that had evaded detection through log analysis were quickly diagnosed when we could see the full request context.

The Silent Gaps in Tracing Coverage

However, as our reliance on tracing grew, we began to encounter significant blind spots where tracing alone proved insufficient:

System boundaries: Tracing excels at following requests through instrumented code, but frequently fails to capture interactions with external systems that don’t support your tracing protocol. When our applications interacted with legacy network equipment using proprietary protocols, these interactions became tracing blind spots.

Background processing: Many critical operations occur outside the context of request handling. Scheduled jobs, queue processors, and maintenance tasks often execute without an initiating trace context. In networking environments, this includes critical functions like route table optimisation, configuration synchronisation, and topology discovery processes.

Application lifecycle events: Startup sequences, configuration loading, and graceful shutdown processes typically happen outside trace contexts but contain crucial diagnostic information. When our controller application experienced intermittent startup failures, logs, not traces, led us to the root cause.

Errors that prevent trace propagation: Perhaps most crucially, when errors prevent trace context propagation, you lose visibility precisely when you need it most. Library initialisation failures, connectivity issues, and resource exhaustion scenarios often prevent traces from being created or propagated.

The Complementary Nature of Logs and Traces

Rather than viewing logs and traces as competing approaches, we’ve found they serve different and complementary purposes in a comprehensive observability strategy:

Logs excel at:

  • Capturing point-in-time events with rich contextual details
  • Recording system state, configuration changes, and operational activities
  • Providing visibility into operations that exist outside request flows
  • Capturing the details of failures, including stack traces and error contexts
  • Serving as the resilient “last line” of observability that works even when other systems fail

Traces excel at:

  • Visualising request flows across distributed systems
  • Identifying performance bottlenecks and latency issues
  • Establishing causal relationships between operations
  • Understanding dependencies and service interactions
  • Providing the “big picture” of system behaviour

In our production environments, the most valuable insights often emerge from the correlation between these data types. A trace might identify a performance degradation when traffic passes through a particular service, while the corresponding logs reveal that the service was operating with a degraded cache following a recent deployment.

Implementation Strategy for the Real World

Based on our experience implementing observability across complex network infrastructure, I recommend a pragmatic approach that leverages both logging and tracing:

Correlation through context: Ensure that logs include trace IDs when they’re generated within a traced context. This creates bidirectional navigation between logs and traces in tools like Datadog.
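
As a concrete illustration, here is a minimal sketch using the OpenTelemetry Python API (an assumption about tooling; Datadog's own libraries provide similar log-injection features, and their exact field conventions differ): a logging filter that stamps each record with the active trace and span IDs.

```python
import logging

from opentelemetry import trace


class TraceContextFilter(logging.Filter):
    """Stamp each log record with the active trace/span IDs, when a span is active."""

    def filter(self, record: logging.LogRecord) -> bool:
        ctx = trace.get_current_span().get_span_context()
        if ctx.is_valid:
            # Hex-encode the IDs so they match what the tracing backend displays.
            record.trace_id = format(ctx.trace_id, "032x")
            record.span_id = format(ctx.span_id, "016x")
        else:
            record.trace_id = record.span_id = "-"
        return True  # never drop the record, only enrich it


handler = logging.StreamHandler()
handler.addFilter(TraceContextFilter())
handler.setFormatter(logging.Formatter(
    "%(asctime)s %(levelname)s trace_id=%(trace_id)s span_id=%(span_id)s %(message)s"
))
logging.getLogger().addHandler(handler)
```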

Structured logging: Use structured logging formats that are machine-parseable, making it easier to correlate and analyse log data alongside traces. JSON logging with consistent field names significantly improves analysis capabilities.
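
For example, a minimal JSON formatter built on Python's standard library; the field names here are illustrative rather than a prescribed schema:

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render every record as one JSON object with consistent field names."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # Correlation fields set by a filter like the one above, if present.
            "trace_id": getattr(record, "trace_id", None),
            "span_id": getattr(record, "span_id", None),
        }
        if record.exc_info:
            payload["exception"] = self.formatException(record.exc_info)
        return json.dumps(payload)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logging.getLogger().addHandler(handler)
```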

Log level discipline: Be judicious with log levels, using ERROR for exceptional conditions, WARN for potential issues, INFO for significant state changes, and DEBUG for detailed troubleshooting information. This prevents log noise while ensuring important events are captured.
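
In code, the discipline is mundane, which is rather the point; a short sketch with hypothetical messages from a configuration-sync task:

```python
import logging

log = logging.getLogger("config-sync")

log.debug("Diffing candidate config against running config")  # detailed troubleshooting
log.info("Configuration push to device completed")            # significant state change
log.warning("Device responded slowly; retrying once")         # potential issue
log.error("Configuration push failed after final retry")      # exceptional condition
```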

Strategic instrumentation: Instrument critical paths with both logging and tracing, while using logging alone for areas where tracing provides less value (such as background jobs or system maintenance tasks).
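
A sketch of that split, again assuming an OpenTelemetry-style tracer and illustrative function names: the request path gets a span plus logs, while the background task relies on logs alone.

```python
import logging

from opentelemetry import trace

log = logging.getLogger("provisioning")
tracer = trace.get_tracer("provisioning")


def provision_tenant(tenant_id: str) -> None:
    # Critical request path: traced for latency and causality, logged for state and errors.
    with tracer.start_as_current_span("provision_tenant") as span:
        span.set_attribute("tenant.id", tenant_id)
        log.info("Provisioning started for tenant %s", tenant_id)
        # ... provisioning work ...
        log.info("Provisioning completed for tenant %s", tenant_id)


def nightly_route_table_optimisation() -> None:
    # Background maintenance: no request context, so logging alone carries the record.
    log.info("Route table optimisation started")
    # ... optimisation work ...
    log.info("Route table optimisation finished")
```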

Contextual enrichment: Enrich both logs and traces with consistent metadata such as service names, environment information, deployment versions, and customer/tenant identifiers to enable powerful filtering and correlation.
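
One way to keep that metadata consistent is to define it once and apply it to both signals. The sketch below assumes the OpenTelemetry API and the standard logging module; service names, versions, and tenant values are placeholders:

```python
import logging

from opentelemetry import trace

# Shared metadata applied to both signals; the values here are placeholders.
SERVICE_METADATA = {
    "service": "network-controller",
    "env": "production",
    "version": "2.4.1",
    "tenant": "acme",
}

# Logs: a LoggerAdapter attaches the metadata to every record it emits.
log = logging.LoggerAdapter(logging.getLogger("network-controller"), SERVICE_METADATA)

tracer = trace.get_tracer("network-controller")

with tracer.start_as_current_span("configuration_sync") as span:
    # Traces: the same keys become span attributes, so filters line up across tools.
    for key, value in SERVICE_METADATA.items():
        span.set_attribute(key, value)
    log.info("Configuration synchronisation started")
```

The adapter simply sets these keys as attributes on each record, so a structured formatter such as the JSON one above can emit them alongside the message.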

The Cost Consideration

A common argument against maintaining both logging and tracing is the cost of data ingestion and storage, particularly when using commercial observability platforms. While this concern is valid, it often stems from inefficient implementation rather than inherent limitations.

By implementing strategic sampling for high-volume traces, appropriate log levels, and careful filtering of data before transmission to Datadog, we’ve managed to achieve comprehensive observability while keeping costs predictable. The operational benefits and reduced mean time to recovery (MTTR) have more than justified the investment.
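
As one example of what this can look like (assuming the OpenTelemetry SDK; the Datadog Agent exposes its own sampling and filtering controls, and the logger name below is hypothetical), head-based trace sampling combined with a log-level floor for noisy components:

```python
import logging

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Keep roughly 10% of new traces, but always honour the parent's sampling decision
# so distributed traces arrive complete rather than fragmented.
trace.set_tracer_provider(TracerProvider(sampler=ParentBased(TraceIdRatioBased(0.1))))

# Raise the log-level floor for chatty dependencies so only meaningful events ship.
logging.getLogger().setLevel(logging.INFO)
logging.getLogger("noisy.thirdparty.client").setLevel(logging.WARNING)
```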

Looking Forward

As observability platforms continue to evolve, we’re seeing increasing convergence between logging, tracing, and metrics data. Datadog and similar platforms now offer unified views that blend these data types, recognising that each provides unique and valuable perspectives on system behaviour.

The future isn’t a choice between logs and traces, but rather implementing both in a coordinated, efficient manner that maximises insight while minimising overhead. The most sophisticated observability implementations leverage the strengths of each approach while providing seamless correlation between them.

In conclusion, while tracing has revolutionised our understanding of distributed systems, logging remains an essential component of a comprehensive observability strategy. Rather than abandoning logs in favour of traces, focus on implementing each technology appropriately and creating connections between them. Your future self, inevitably debugging a production issue at an inconvenient hour, will thank you for maintaining these complementary perspectives on system behaviour.