Introduction
Distributed tracing provides end-to-end visibility into requests as they traverse multiple services. Unlike logs (which are service-local) and metrics (which are aggregates), traces capture the causal relationships between operations in a distributed system. OpenTelemetry has become the industry standard for instrumentation, offering a unified API for traces, metrics, and logs. This article covers implementing distributed tracing with OpenTelemetry in production.
Core Concepts: Traces, Spans, and Context
A trace represents a complete request flow. Each unit of work within a trace is a span, carrying metadata about timing, status, and parent-child relationships:
import { trace, SpanStatusCode } from "@opentelemetry/api";

const tracer = trace.getTracer("payment-service");

async function processPayment(orderId: string, amount: number) {
  // Create a new span as the root of a sub-operation
  const span = tracer.startSpan("process-payment", {
    attributes: {
      "payment.order_id": orderId,
      "payment.amount": amount,
      "payment.currency": "USD",
    },
  });

  try {
    const result = await chargePaymentGateway(orderId, amount);
    span.setAttribute("payment.transaction_id", result.transactionId);
    span.setStatus({ code: SpanStatusCode.OK });
    return result;
  } catch (error) {
    // Caught values are `unknown` in strict TypeScript; normalize first
    const err = error instanceof Error ? error : new Error(String(error));
    span.recordException(err);
    span.setStatus({ code: SpanStatusCode.ERROR, message: err.message });
    throw error;
  } finally {
    span.end();
  }
}
Context Propagation
Propagation carries trace context across service boundaries. For HTTP services, the `W3C TraceContext` format is standard:
// Instrument outgoing HTTP requests
import { context, propagation } from "@opentelemetry/api";
import * as http from "http";

function makeRequest(url: string, headers: Record<string, string>) {
  // Inject the current trace context into the outgoing headers
  const activeContext = context.active();
  const carrier: Record<string, string> = {};
  propagation.inject(activeContext, carrier);

  const allHeaders = { ...headers, ...carrier };
  return http.get(url, { headers: allHeaders });
}
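For reference, the `traceparent` header injected above has a fixed shape: version, 32-hex-digit trace id, 16-hex-digit parent span id, and 2-hex-digit flags. A minimal parser for version 00, shown only to make the format concrete (in practice the SDK's built-in W3C propagator handles this):

```typescript
// Parse/format helpers for a version-00 W3C `traceparent` header:
//   00-<32 hex trace-id>-<16 hex span-id>-<2 hex flags>
interface TraceParent {
  traceId: string;
  spanId: string;
  sampled: boolean;
}

function parseTraceParent(header: string): TraceParent | null {
  const m = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(header);
  if (!m) return null;
  // All-zero trace or span ids are invalid per the spec
  if (m[1] === "0".repeat(32) || m[2] === "0".repeat(16)) return null;
  return {
    traceId: m[1],
    spanId: m[2],
    // Bit 0 of the flags byte is the "sampled" flag
    sampled: (parseInt(m[3], 16) & 0x01) === 1,
  };
}

function formatTraceParent(tp: TraceParent): string {
  return `00-${tp.traceId}-${tp.spanId}-${tp.sampled ? "01" : "00"}`;
}
```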
For message queues, propagate context through message headers:
// Producer: inject context into message headers
import { context, propagation } from "@opentelemetry/api";

function publishMessage(topic: string, payload: any) {
  const carrier: Record<string, string> = {};
  propagation.inject(context.active(), carrier);

  const message = {
    value: JSON.stringify(payload),
    headers: {
      ...carrier,
      "content-type": "application/json",
    },
  };
  return kafkaProducer.send({ topic, messages: [message] });
}
// Consumer: extract context from message headers
import { context, propagation } from "@opentelemetry/api";

kafkaConsumer.on("message", (message) => {
  // Note: some clients (e.g. kafkajs) deliver header values as Buffers;
  // decode them to strings before extraction.
  const extractedContext = propagation.extract(
    context.active(),
    message.headers
  );
  context.with(extractedContext, () => {
    // Spans started here become children of the producer's span
    const span = tracer.startSpan("process-order");
    // Process message...
    span.end();
  });
});
Sampling Strategies
Sampling controls the volume of traces collected. Head-based sampling decides when a trace starts and is cheap; tail-based sampling buffers spans and decides after the trace completes, which lets the collector keep every error or slow trace. A trace is kept if any policy matches:
# OpenTelemetry Collector: tail-based sampling
processors:
  tail_sampling:
    decision_wait: 30s
    num_traces: 10000
    expected_new_traces_per_sec: 100
    policies:
      # Keep every trace that contains an error
      - name: error-sampling
        type: status_code
        status_code:
          status_codes: [ERROR]
      # Keep traces slower than 500 ms
      - name: latency-sampling
        type: latency
        latency:
          threshold_ms: 500
      # Keep 5% of everything else
      - name: probabilistic
        type: probabilistic
        probabilistic:
          sampling_percentage: 5
For head-based sampling in application code:
import { Attributes, Context, Link, SpanKind } from "@opentelemetry/api";
import {
  Sampler,
  SamplingDecision,
  SamplingResult,
  TraceIdRatioBasedSampler,
} from "@opentelemetry/sdk-trace-base";

class CustomSampler implements Sampler {
  // Fall back to 10% trace-ID-ratio sampling for everything else
  private fallback = new TraceIdRatioBasedSampler(0.1);

  shouldSample(
    context: Context,
    traceId: string,
    spanName: string,
    spanKind: SpanKind,
    attributes: Attributes,
    links: Link[]
  ): SamplingResult {
    // Always sample error-prone operations
    if (spanName.startsWith("payment.")) {
      return { decision: SamplingDecision.RECORD_AND_SAMPLED };
    }
    // Never sample noisy health checks
    if (spanName === "health-check") {
      return { decision: SamplingDecision.NOT_RECORD };
    }
    // Probabilistic sampling for the rest
    return this.fallback.shouldSample(
      context, traceId, spanName, spanKind, attributes, links
    );
  }

  toString(): string {
    return "CustomSampler";
  }
}
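Trace-ID-ratio sampling has a useful property for head-based setups: the decision is a pure function of the trace id, so every service configured with the same ratio makes the same choice and traces are kept or dropped whole. A simplified sketch of the idea (the SDK's TraceIdRatioBasedSampler uses a similar bounded comparison, though not this exact formula):

```typescript
// Derive a deterministic sampling decision from the trace id alone.
function ratioSample(traceId: string, ratio: number): boolean {
  // Interpret the first 8 hex digits of the 32-digit trace id as an
  // unsigned 32-bit integer and compare it against the ratio's share
  // of the 32-bit space.
  const upper = parseInt(traceId.slice(0, 8), 16);
  return upper < ratio * 0x100000000;
}
```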
Visualization with Jaeger and Zipkin
Jaeger provides rich trace visualization and analysis capabilities:
# docker-compose.yml for Jaeger
services:
  jaeger:
    image: jaegertracing/all-in-one:latest
    environment:
      COLLECTOR_OTLP_ENABLED: "true"
    ports:
      - "16686:16686" # UI
      - "4318:4318"   # OTLP HTTP
Configure the OpenTelemetry Collector to forward traces to Jaeger:
receivers:
  otlp:
    protocols:
      http:
        endpoint: "0.0.0.0:4318"

exporters:
  # Recent collector releases removed the dedicated `jaeger` exporter;
  # Jaeger ingests OTLP natively, so export over OTLP gRPC instead.
  otlp/jaeger:
    endpoint: "jaeger:4317"
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlp/jaeger]
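The section heading also names Zipkin: the collector ships a `zipkin` exporter, so the same spans can be fanned out to a Zipkin backend alongside Jaeger (the endpoint below is Zipkin's default span-ingest URL; the host name is illustrative):

```yaml
exporters:
  zipkin:
    endpoint: "http://zipkin:9411/api/v2/spans"
```

Add `zipkin` to the trace pipeline's `exporters` list to ship to both backends at once.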
Baggage Propagation
Baggage carries arbitrary application-level key-value pairs across service boundaries alongside the trace context. Entries are propagated automatically but are not attached to spans unless you copy them explicitly:
import { context, propagation } from "@opentelemetry/api";

// Set baggage in the entry service. setBaggage returns a NEW context,
// so the result must be made active for the entries to propagate.
const ctxWithBaggage = propagation.setBaggage(
  context.active(),
  propagation.createBaggage({
    "user.id": { value: userId },
    "session.region": { value: region },
    "request.source": { value: source },
  })
);

context.with(ctxWithBaggage, () => {
  // Outgoing calls made here carry the baggage entries
});
Access baggage in downstream services without modifying API contracts:
import { context, propagation } from "@opentelemetry/api";

function getCurrentUserId(): string | undefined {
  const baggage = propagation.getBaggage(context.active());
  return baggage?.getEntry("user.id")?.value;
}
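On the wire, baggage travels in the W3C `baggage` header: comma-separated key=value pairs with percent-encoded values and optional semicolon-delimited properties. A minimal parser sketch to make the format concrete (illustrative, not the SDK's implementation):

```typescript
// Parse a W3C `baggage` header, e.g. "user.id=alice,region=us%2Deast;prop=1",
// into a map of decoded key-value pairs. Properties after ";" are dropped.
function parseBaggageHeader(header: string): Map<string, string> {
  const entries = new Map<string, string>();
  for (const part of header.split(",")) {
    const [key, ...rest] = part.trim().split("=");
    if (!key || rest.length === 0) continue;
    // Re-join in case the value itself contained "=", then strip properties
    const value = rest.join("=").split(";")[0].trim();
    entries.set(key.trim(), decodeURIComponent(value));
  }
  return entries;
}
```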
Correlation with Logs and Metrics
Link traces to logs using `trace_id` and `span_id`:
import { trace } from "@opentelemetry/api";

// `Logger` here stands for any logger with pino-style child bindings
function enrichLogger(logger: Logger): Logger {
  const spanContext = trace.getActiveSpan()?.spanContext();
  return logger.child({
    trace_id: spanContext?.traceId,
    span_id: spanContext?.spanId,
    trace_flags: spanContext?.traceFlags,
  });
}
Metrics can be correlated with traces as well, but metric attributes must stay low-cardinality. A `trace_id` is unique per request, so recording it as an attribute would create one time series per trace; exemplars are the mechanism designed to link individual measurements back to traces:

import { metrics } from "@opentelemetry/api";

const meter = metrics.getMeter("payment-service");
const requestCounter = meter.createCounter("payment.requests", {
  description: "Count of payment requests",
});

function trackPayment(status: string) {
  // `status` has a small, bounded set of values, so it is safe as an
  // attribute. When add() runs inside an active span, SDKs with exemplar
  // support can attach the trace/span ids to the recorded data point.
  requestCounter.add(1, { status });
}
Production Configuration
Deploy the OpenTelemetry Collector as a sidecar or DaemonSet for centralized configuration:
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: otel-collector
spec:
  selector:
    matchLabels:
      app: otel-collector
  template:
    metadata:
      labels:
        app: otel-collector
    spec:
      containers:
        - name: otel-collector
          image: otel/opentelemetry-collector-contrib:latest
          args: ["--config=/etc/otel/config.yaml"]
          ports:
            - containerPort: 4318 # OTLP HTTP
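In production pipelines, the collector is usually configured with a memory limiter and batching in front of exporters to protect itself under load; a typical processor chain (values are illustrative starting points, not recommendations):

```yaml
processors:
  memory_limiter:
    check_interval: 1s
    limit_percentage: 80
    spike_limit_percentage: 25
  batch:
    send_batch_size: 8192
    timeout: 5s

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp]
```

The `memory_limiter` should be first in the chain so backpressure is applied before any other processing happens.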
Instrumentation should be additive and never break business logic. Start with critical paths (payment, auth, order creation) and expand coverage iteratively. A well-instrumented system reduces mean time to diagnosis from hours to minutes.