Adding Observability with Claude Code

Observability is the first thing teams skip when moving fast, and the first thing they regret when something breaks in production and nobody can work out what happened.

Claude Code makes adding it fast enough to do right from the start.

Structured logging

Add structured logging to this service. Requirements:
- JSON format (not plain text)
- Every log entry includes: timestamp, level, service name, request ID
- Log levels: error, warn, info, debug
- Error logs include: error message, stack trace, context
- No sensitive data in logs (passwords, tokens, PII)

Use the existing logger pattern from lib/logger.ts if it exists,
or create one that follows this pattern.

Structured logs can be parsed by log aggregation tools. Plain-text logs are searchable but not queryable: you can find lines containing "error", but you can't ask "show me all errors from service X in the last hour" without first parsing every line.
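
As a rough illustration of the requirements above, here is a minimal, dependency-free logger sketch. The names createLogger, redact, and SENSITIVE_KEYS are hypothetical (they are not the contents of any existing lib/logger.ts), and the deny-list is an assumption you would adapt to your own data.

```typescript
// Sketch only: createLogger, redact, and SENSITIVE_KEYS are hypothetical names,
// not the contents of an existing lib/logger.ts.
type Level = "error" | "warn" | "info" | "debug";
type Context = Record<string, unknown>;

// Assumed deny-list; extend it to cover the fields your service handles.
const SENSITIVE_KEYS = ["password", "token", "authorization", "email"];

// Replace sensitive top-level values. Nested objects are not inspected.
function redact(context: Context): Context {
  const out: Context = {};
  Object.keys(context).forEach((key) => {
    const sensitive = SENSITIVE_KEYS.indexOf(key.toLowerCase()) !== -1;
    out[key] = sensitive ? "[REDACTED]" : context[key];
  });
  return out;
}

function createLogger(
  service: string,
  requestId: string,
  sink: (line: string) => void = (line) => console.log(line)
) {
  const write = (level: Level, message: string, context: Context = {}): void => {
    // Required fields come last so caller context cannot overwrite them.
    const entry = {
      ...redact(context),
      timestamp: new Date().toISOString(),
      level,
      service,
      requestId,
      message,
    };
    // One JSON object per line, so aggregation tools can parse each entry.
    sink(JSON.stringify(entry));
  };
  return {
    debug: (message: string, context?: Context) => write("debug", message, context),
    info: (message: string, context?: Context) => write("info", message, context),
    warn: (message: string, context?: Context) => write("warn", message, context),
    // Error logs carry the error message and stack trace alongside the context.
    error: (message: string, err: Error, context?: Context) =>
      write("error", message, { ...context, errorMessage: err.message, stack: err.stack }),
  };
}

// Example usage: the token value is replaced before the entry is written.
const log = createLogger("checkout", "req-123");
log.info("request received", { method: "POST", path: "/orders", token: "secret" });
```

Because every entry is a JSON object with level, service, and timestamp fields, a query like "all errors from service X in the last hour" becomes a filter on three fields rather than a text search.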

What to log

Add logging to this API handler. Log:
- Request received: method, path, auth status (not the token)
- Validation result: pass/fail, which fields failed
- Business logic decision points: what was decided and why
- External calls: service called, latency, success/failure
- Response sent: status code, response time

Don't log request/response bodies by default — flag where
that might be needed for debugging with a TODO.

Logging decision points is the most useful of these and the one most often skipped. When you're debugging a production issue, you want to know "why did the service make this choice?" not just "what happened."
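
To make "decision point" concrete, here is a small sketch. The refund rule, its limits, and every field name are invented purely for illustration; the point is only that the log entry records the outcome, the reason, and the inputs that drove it.

```typescript
// Hypothetical example: the refund rule, limits, and field names are invented.
interface RefundRequest {
  orderId: string;
  amountCents: number;
  daysSincePurchase: number;
}

interface Decision {
  outcome: "auto_approve" | "manual_review" | "reject";
  reason: string;
}

type LogFn = (entry: Record<string, unknown>) => void;

// Pure decision logic: returns both what was decided and why.
function decideRefund(req: RefundRequest): Decision {
  if (req.daysSincePurchase > 30) {
    return { outcome: "reject", reason: "outside_30_day_window" };
  }
  if (req.amountCents > 10000) {
    return { outcome: "manual_review", reason: "amount_over_auto_limit" };
  }
  return { outcome: "auto_approve", reason: "within_window_and_limit" };
}

function handleRefund(req: RefundRequest, log: LogFn): Decision {
  const decision = decideRefund(req);
  // Log the decision together with the inputs that produced it.
  log({
    event: "refund_decision",
    orderId: req.orderId,
    outcome: decision.outcome,
    reason: decision.reason,
    amountCents: req.amountCents,
    daysSincePurchase: req.daysSincePurchase,
  });
  return decision;
}
```

Returning a machine-readable reason from the decision function keeps the "why" next to the "what", so the log line cannot drift out of sync with the branch that was taken.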

Metrics

Add application metrics to this service. Capture:
- Request count by endpoint, method, status code
- Request latency (p50, p95, p99) by endpoint
- Error rate by endpoint
- Active requests (gauge)
- [Any business-specific metrics that matter here]

Use Prometheus-compatible counters and histograms.
Export via /metrics endpoint.

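The Prometheus exposition format is a de facto standard in modern observability stacks. Starting with it means a Prometheus server can scrape your /metrics endpoint directly, Grafana can chart the results, and agents such as Datadog's can ingest the same endpoint without a custom exporter.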
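
For a sense of what a /metrics endpoint actually returns, the sketch below hand-rolls a counter and a histogram and renders them in the Prometheus text format. It is an illustration of the format only: the Counter and Histogram classes here are hypothetical, and a real service would more likely use an established client library such as prom-client.

```typescript
// Hand-rolled sketch of the Prometheus text format. These classes are
// hypothetical and exist only to show the output shape.
type Labels = Record<string, string>;

// Label values are not escaped here; a real exporter must escape
// backslashes, double quotes, and newlines.
function formatLabels(labels: Labels): string {
  const pairs = Object.keys(labels).sort().map((k) => `${k}="${labels[k]}"`);
  return pairs.length ? `{${pairs.join(",")}}` : "";
}

class Counter {
  private values: Record<string, number> = {};
  constructor(readonly name: string, readonly help: string) {}

  inc(labels: Labels = {}): void {
    const key = formatLabels(labels);
    this.values[key] = (this.values[key] ?? 0) + 1;
  }

  render(): string {
    const lines = [`# HELP ${this.name} ${this.help}`, `# TYPE ${this.name} counter`];
    Object.keys(this.values).forEach((key) => {
      lines.push(`${this.name}${key} ${this.values[key]}`);
    });
    return lines.join("\n");
  }
}

class Histogram {
  private counts: number[];
  private sum = 0;
  private count = 0;
  constructor(readonly name: string, readonly help: string, readonly buckets: number[]) {
    this.counts = buckets.map(() => 0);
  }

  observe(value: number): void {
    this.sum += value;
    this.count += 1;
    // Buckets are cumulative: count the value in every bucket it fits under.
    this.buckets.forEach((upperBound, i) => {
      if (value <= upperBound) this.counts[i] += 1;
    });
  }

  render(): string {
    const lines = [`# HELP ${this.name} ${this.help}`, `# TYPE ${this.name} histogram`];
    this.buckets.forEach((upperBound, i) => {
      lines.push(`${this.name}_bucket{le="${upperBound}"} ${this.counts[i]}`);
    });
    lines.push(`${this.name}_bucket{le="+Inf"} ${this.count}`);
    lines.push(`${this.name}_sum ${this.sum}`);
    lines.push(`${this.name}_count ${this.count}`);
    return lines.join("\n");
  }
}

// Body a /metrics handler would return, served as plain text.
function renderMetrics(metrics: Array<{ render(): string }>): string {
  return metrics.map((m) => m.render()).join("\n") + "\n";
}
```

Note that the service exports only bucket counts. Percentiles such as p50, p95, and p99 are estimated at query time from those buckets (in PromQL, with histogram_quantile), so the bucket boundaries you choose determine how precise those estimates can be.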

Tracing

Add distributed tracing to this request path. Requirements:
- Generate a trace ID on request entry if none exists
- Propagate trace ID through all downstream calls
- Log trace ID on every log line in the request path
- Add timing spans for: database queries, external API calls, 
  business logic phases

Use OpenTelemetry format for compatibility.

Trace IDs are the most underrated debugging tool. With them, you can find every log line from a single request across multiple services. Without them, debugging multi-service issues means matching timestamps and hoping.
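
The sketch below shows one way the "reuse or generate, then propagate" rule could look, using the W3C traceparent header that OpenTelemetry propagates by default. The function names are hypothetical, only header version 00 is handled, and the ID generator uses Math.random to stay dependency-free; in practice the OpenTelemetry SDK would normally do all of this for you.

```typescript
// Sketch of trace ID propagation using the W3C traceparent header layout:
//   version "-" trace-id (32 hex) "-" parent-id (16 hex) "-" flags (2 hex)
// Function names are hypothetical; only version 00 is handled.
interface TraceContext {
  traceId: string;
  spanId: string;
  sampled: boolean;
}

const TRACEPARENT = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

// Math.random keeps the sketch dependency-free. Real code should use a
// cryptographic source or the OpenTelemetry SDK's ID generator.
function randomHex(length: number): string {
  let out = "";
  for (let i = 0; i < length; i++) {
    out += Math.floor(Math.random() * 16).toString(16);
  }
  return out;
}

function parseTraceparent(header: string | undefined): TraceContext | null {
  if (!header) return null;
  const match = TRACEPARENT.exec(header.trim());
  if (!match) return null;
  const traceId = match[1];
  const spanId = match[2];
  // All-zero IDs are invalid and treated as if no header was sent.
  if (/^0+$/.test(traceId) || /^0+$/.test(spanId)) return null;
  return { traceId, spanId, sampled: (parseInt(match[3], 16) & 1) === 1 };
}

// On request entry: keep the caller's trace ID if there is one, otherwise
// start a new trace. This service always gets its own span ID.
function startRequestTrace(incomingHeader: string | undefined): TraceContext {
  const parent = parseTraceparent(incomingHeader);
  return {
    traceId: parent ? parent.traceId : randomHex(32),
    spanId: randomHex(16),
    sampled: parent ? parent.sampled : true,
  };
}

// Header value to attach to every downstream call.
function toTraceparent(ctx: TraceContext): string {
  return `00-${ctx.traceId}-${ctx.spanId}-${ctx.sampled ? "01" : "00"}`;
}
```

A handler would call startRequestTrace with the incoming header, include ctx.traceId in every log entry, and send toTraceparent(ctx) on each outbound request so the next service joins the same trace.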

Alerting setup

Write alerting rules for this service. Alert when:
- Error rate exceeds [X]% over 5 minutes
- p95 latency exceeds [X]ms
- Service is unavailable (health check fails 3 times)
- [Business-specific: e.g., payment failure rate exceeds X%]

For each alert include: threshold, what it means, 
how to investigate, how to mitigate.

The "how to investigate, how to mitigate" part is a runbook stub. Claude drafts it from the code and context it can see. Reviewing and filling in these stubs is what separates alerts that wake someone up at 3am with a clear action from alerts that wake someone up at 3am with no idea what to do.
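
One way to keep the runbook stub attached to the alert is to model both in a single structure. The sketch below is an assumption about how that might look: the AlertRule shape, the evaluate function, and every threshold and step in the usage are hypothetical placeholders, and real alerting rules would usually live in your monitoring system rather than in application code.

```typescript
// Hypothetical shape: an alert rule that carries its own runbook stub.
interface AlertRule {
  name: string;
  meaning: string;          // what the alert means
  thresholdPercent: number; // placeholder value, not a recommendation
  windowMinutes: number;
  investigate: string[];    // how to investigate
  mitigate: string[];       // how to mitigate
}

interface WindowStats {
  total: number;
  errors: number;
}

function errorRatePercent(stats: WindowStats): number {
  // No traffic means no measurable error rate; treat it as zero.
  return stats.total === 0 ? 0 : (stats.errors / stats.total) * 100;
}

// Returns the alert text with the runbook attached, or null if the rule
// does not fire for this window.
function evaluate(rule: AlertRule, stats: WindowStats): string | null {
  const rate = errorRatePercent(stats);
  if (rate <= rule.thresholdPercent) return null;
  const header =
    `ALERT ${rule.name}: error rate ${rate.toFixed(1)}% exceeds ` +
    `${rule.thresholdPercent}% over ${rule.windowMinutes}m`;
  const lines = [header, `Meaning: ${rule.meaning}`, "Investigate:"]
    .concat(rule.investigate.map((step) => `  - ${step}`))
    .concat(["Mitigate:"])
    .concat(rule.mitigate.map((step) => `  - ${step}`));
  return lines.join("\n");
}
```

Because the investigate and mitigate steps travel with the rule, whoever receives the alert sees the next action in the same message as the threshold breach.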

The observability audit

Audit the observability of this codebase. Find:
- Code paths with no logging
- Errors that are silently swallowed
- External calls with no latency tracking
- Business logic decisions with no audit trail

Don't fix yet — just find them.
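
For reference, this is the kind of "silently swallowed" error such an audit should surface, next to a version that leaves a trail. Both functions are hypothetical and written only to illustrate the pattern.

```typescript
// Hypothetical config parsers, written only to illustrate the pattern.
type LogFn = (entry: Record<string, unknown>) => void;

// Silently swallowed: a malformed config looks identical to an empty one,
// and nothing records that parsing failed.
function parseConfigSwallowed(raw: string): Record<string, unknown> {
  try {
    return JSON.parse(raw);
  } catch {
    return {};
  }
}

// Same fallback behavior, but the failure is logged before returning.
function parseConfigLogged(raw: string, log: LogFn): Record<string, unknown> {
  try {
    return JSON.parse(raw);
  } catch (err) {
    log({
      level: "error",
      event: "config_parse_failed",
      errorMessage: err instanceof Error ? err.message : String(err),
      // Log the size rather than the content, which may hold secrets.
      rawLength: raw.length,
    });
    return {};
  }
}
```

Both versions return the same value to the caller. The only difference is whether anyone can later find out that the fallback was used.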
