Observability That People Actually Use

SLOs, burn-rate alerts, and why your dashboard graveyard is a product problem, not a tooling one.

8 min

Observability, SRE

Most teams have too many dashboards and not enough signal. The problem isn't Grafana; it's that observability is treated as a tooling exercise, not a product.

SLOs first

Pick 3–5 user-visible SLOs per service. Examples: "99.9% of API requests complete in under 500ms", "99.95% of orders are processed within 5 seconds". That's it. Don't SLO on CPU: users never experience CPU utilization, only the latency and errors it may cause.
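An SLO target becomes easier to reason about once it is turned into an error budget. Here is a minimal sketch of that arithmetic; the `SLO` class and the request counts are made up for illustration, and a real setup would pull the counts from your metrics store.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SLO:
    """A user-visible objective; `target` is the required fraction of good events."""
    name: str
    target: float  # e.g. 0.999 for "99.9% of API requests under 500ms"

    def error_budget(self) -> float:
        """Fraction of events that may be bad over the SLO window."""
        return 1.0 - self.target

    def allowed_bad(self, total_events: int) -> float:
        """Number of bad events the budget permits out of `total_events`."""
        return total_events * self.error_budget()

    def budget_remaining(self, total_events: int, bad_events: int) -> float:
        """Share of the budget still unspent; negative means the SLO is missed."""
        if total_events == 0:
            return 1.0  # no traffic, nothing spent
        return 1.0 - bad_events / self.allowed_bad(total_events)


api_latency = SLO("API requests under 500ms", target=0.999)

# Hypothetical window: 10M requests, 4,000 of them slower than 500ms.
print(round(api_latency.allowed_bad(10_000_000)))                # 10000
print(round(api_latency.budget_remaining(10_000_000, 4_000), 2))  # 0.6
```

With these made-up numbers the service has spent 40% of its budget. That number, not CPU, is what tells you whether to keep shipping or slow down.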

Burn-rate alerts, not thresholds

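Threshold alerts (for example, error rate > 1% for 5 minutes) know nothing about the SLO: they fire on brief spikes that cost almost no error budget, stay silent during slow leaks that eventually exhaust it, and teach on-call to ignore pages. Burn-rate alerts fire when you're consuming error budget fast enough to actually miss the SLO, and the same mechanism covers both fast burns (a short window with a high burn-rate threshold) and slow burns (a long window with a lower one).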
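A rough sketch of the burn-rate idea, assuming a 30-day SLO window. The function names are illustrative, and the thresholds (14.4 for a fast burn, 6 for a slow burn) are the commonly cited starting points from Google's SRE Workbook rather than anything specific to this post. Tune them to your own window and paging tolerance.

```python
def burn_rate(error_rate: float, slo_target: float) -> float:
    """How many times faster than sustainable the error budget is being spent.

    1.0 means the budget lasts exactly the SLO window; 14.4 means a
    30-day budget would be gone in roughly two days.
    """
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_rate / budget


def should_page(long_window_error_rate: float,
                short_window_error_rate: float,
                slo_target: float,
                threshold: float) -> bool:
    """Page only when both windows burn at or above the threshold.

    The long window shows the burn is sustained; the short window shows
    it is still happening, so the page clears soon after recovery.
    """
    return (burn_rate(long_window_error_rate, slo_target) >= threshold
            and burn_rate(short_window_error_rate, slo_target) >= threshold)


SLO_TARGET = 0.999  # 99.9%, so the error budget is 0.1%

# Fast burn: 2% errors over both the 1h and 5m windows (assumed values).
print(should_page(0.02, 0.02, SLO_TARGET, threshold=14.4))    # True

# Spike already over: the 1h window is still high, the 5m window has recovered.
print(should_page(0.02, 0.0005, SLO_TARGET, threshold=14.4))  # False

# Slow burn: 0.7% errors over 6h and 30m. A "> 1%" threshold alert never fires,
# but at this rate the 30-day budget is gone in under a week.
print(should_page(0.007, 0.007, SLO_TARGET, threshold=6.0))   # True
```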

The point of an alert is to interrupt someone. If it shouldn't wake you up, it shouldn't page you.

Tie alerts to runbooks

Every alert links to a runbook. Every runbook has a current owner. If an alert has no runbook, or its runbook has no owner, delete the alert.
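This rule is easy to enforce mechanically, for example as a CI check over your alert definitions. The sketch below assumes alerts are already loaded into a simple `Alert` record; the field names, alert names, and URLs are hypothetical and do not come from any particular alerting tool.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Alert:
    name: str
    runbook_url: Optional[str] = None    # link the page should carry
    runbook_owner: Optional[str] = None  # team or person who maintains the runbook


def alerts_to_delete(alerts: List[Alert]) -> List[str]:
    """Names of alerts that have no runbook or whose runbook has no owner."""
    return [a.name for a in alerts if not a.runbook_url or not a.runbook_owner]


# Hypothetical alert inventory.
alerts = [
    Alert("CheckoutFastBurn", "https://example.com/runbooks/checkout-burn", "payments-oncall"),
    Alert("DiskAlmostFull"),  # no runbook at all
    Alert("QueueDepthHigh", "https://example.com/runbooks/queue-depth"),  # runbook, no owner
]

print(alerts_to_delete(alerts))  # ['DiskAlmostFull', 'QueueDepthHigh']
```

Failing the build when this list is non-empty keeps orphaned alerts from accumulating quietly.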

Traces, not logs

Structured logs are fine. Traces are better. With OpenTelemetry you get traces and more: distributed traces show end-to-end latency, spans carry enough context to replace most log statements, and metrics come out of the same instrumentation.

What I pick

OpenTelemetry for traces and metrics, Prometheus for metric storage and alerting, Grafana for visualization, and Loki or your cloud vendor for logs. Plus a lightweight service catalog that maps each service to its SLOs, on-call owner, and runbook. The point isn't the tools; it's the discipline.