# Monitoring and Observability in Production
## The three pillars of observability
Observability relies on three complementary types of data:
```mermaid
graph TB
    subgraph "The 3 Pillars"
        M[Metrics<br/>Aggregated numerical data]
        L[Logs<br/>Timestamped text events]
        T[Traces<br/>Request paths]
    end

    subgraph "Tools"
        M --> PROM[Prometheus / Grafana]
        L --> ELK[ELK Stack / Loki]
        T --> JAE[Jaeger / Tempo]
    end

    subgraph "Outcome"
        PROM --> A[Alerting]
        ELK --> D[Debugging]
        JAE --> P[Performance]
    end
```
## Prometheus: metrics collection

### Application instrumentation
```typescript
import express from 'express';
import { Counter, Histogram, Registry } from 'prom-client';

const app = express();
const register = new Registry();

// HTTP request counter
const httpRequestsTotal = new Counter({
  name: 'http_requests_total',
  help: 'Total number of HTTP requests',
  labelNames: ['method', 'route', 'status'],
  registers: [register],
});

// Latency histogram
const httpRequestDuration = new Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request duration in seconds',
  labelNames: ['method', 'route'],
  buckets: [0.01, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10],
  registers: [register],
});

// Express middleware
app.use((req, res, next) => {
  const end = httpRequestDuration.startTimer({
    method: req.method,
    route: req.route?.path || req.path,
  });
  res.on('finish', () => {
    httpRequestsTotal.inc({
      method: req.method,
      route: req.route?.path || req.path,
      status: res.statusCode.toString(),
    });
    end();
  });
  next();
});

// /metrics endpoint
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', register.contentType);
  res.end(await register.metrics());
});
```
### Prometheus configuration
```yaml
# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - "alerts/*.yml"

scrape_configs:
  - job_name: 'api'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      # Join the pod IP with the annotated port to form the scrape address
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
```
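The relabel rules above only keep pods that opt in through annotations. A pod template would typically declare them like this (a sketch; the annotation keys match the relabel rules, and the port value is illustrative):

```yaml
# Deployment pod template (excerpt)
metadata:
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "3000"   # must match the container's metrics port
```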
### Alerting rules
```yaml
# alerts/api.yml
groups:
  - name: api-alerts
    rules:
      - alert: HighErrorRate
        expr: |
          sum by (instance) (rate(http_requests_total{status=~"5.."}[5m]))
            / sum by (instance) (rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "HTTP error rate > 5%"
          description: "{{ $labels.instance }} has an error rate of {{ $value | humanizePercentage }}"

      - alert: HighLatency
        expr: |
          histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) > 2
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "P95 latency > 2s"

      - alert: PodCrashLooping
        expr: |
          rate(kube_pod_container_status_restarts_total[15m]) > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Pod {{ $labels.pod }} is crash looping"
```
## Grafana: visualization and dashboards

### JSON dashboard (provisioned via code)
```json
{
  "dashboard": {
    "title": "API Overview",
    "panels": [
      {
        "title": "Requests per second",
        "type": "timeseries",
        "targets": [
          {
            "expr": "sum(rate(http_requests_total[5m])) by (status)",
            "legendFormat": "HTTP {{ status }}"
          }
        ]
      },
      {
        "title": "Latency P50 / P95 / P99",
        "type": "timeseries",
        "targets": [
          {
            "expr": "histogram_quantile(0.50, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))",
            "legendFormat": "P50"
          },
          {
            "expr": "histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))",
            "legendFormat": "P95"
          },
          {
            "expr": "histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))",
            "legendFormat": "P99"
          }
        ]
      }
    ]
  }
}
```
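To actually ship this dashboard as code, Grafana can load it from disk via a provisioning file. A sketch (file paths and provider name are illustrative):

```yaml
# /etc/grafana/provisioning/dashboards/api.yml
apiVersion: 1
providers:
  - name: 'api-dashboards'
    folder: 'Services'
    type: file
    options:
      path: /var/lib/grafana/dashboards   # directory containing the JSON above
```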
## Structured logging

### JSON logs with context
```typescript
import { createLogger, format, transports } from 'winston';

const logger = createLogger({
  level: 'info',
  format: format.combine(
    format.timestamp(),
    format.json(),
  ),
  defaultMeta: {
    service: 'api',
    version: process.env.APP_VERSION,
    environment: process.env.NODE_ENV,
  },
  transports: [
    new transports.Console(),
  ],
});

// Usage with context
logger.info('Order processed', {
  orderId: order.id,
  userId: user.id,
  amount: order.total,
  duration: elapsed,
  traceId: req.headers['x-trace-id'],
});
```
### Aggregation with Loki + Grafana
```yaml
# Loki - promtail config
scrape_configs:
  - job_name: kubernetes
    kubernetes_sd_configs:
      - role: pod
    pipeline_stages:
      - json:
          expressions:
            level: level
            service: service
            traceId: traceId
      - labels:
          level:
          service:
      - timestamp:
          source: timestamp
          format: RFC3339
```
LogQL queries in Grafana:
```logql
# All API errors
{service="api"} | json | level="error"

# Slow requests (> 2s)
{service="api"} | json | duration > 2000

# Errors grouped by message
sum by (message) (count_over_time({service="api"} | json | level="error" [1h]))
```
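The `json` pipeline stage and the queries above assume one JSON object per line carrying `level`, `service`, and `traceId` fields. A dependency-free sketch of emitting such a line (field names match the promtail `expressions` above):

```typescript
// Minimal structured log emitter: one JSON object per line,
// with the fields the promtail `json` stage extracts.
function logLine(level: string, message: string, meta: Record<string, unknown>): string {
  const entry = {
    timestamp: new Date().toISOString(),
    level,
    message,
    ...meta,
  };
  const line = JSON.stringify(entry);
  console.log(line);
  return line;
}

const line = logLine('error', 'Order failed', {
  service: 'api',
  traceId: 'abc123',
  duration: 2450,
});
```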
## Distributed Tracing

### OpenTelemetry
```typescript
import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { HttpInstrumentation } from '@opentelemetry/instrumentation-http';
import { ExpressInstrumentation } from '@opentelemetry/instrumentation-express';
import { PgInstrumentation } from '@opentelemetry/instrumentation-pg';

const sdk = new NodeSDK({
  traceExporter: new OTLPTraceExporter({
    url: 'http://tempo:4318/v1/traces',
  }),
  instrumentations: [
    new HttpInstrumentation(),
    new ExpressInstrumentation(),
    new PgInstrumentation(),
  ],
});

sdk.start();
```
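Under the hood, `HttpInstrumentation` propagates trace context between services via the W3C `traceparent` header. A dependency-free sketch of that header's format (version, 16-byte trace id, 8-byte span id, sampling flags):

```typescript
import { randomBytes } from 'crypto';

// Build a W3C traceparent header value: 00-<trace-id>-<span-id>-<flags>.
function makeTraceparent(): string {
  const traceId = randomBytes(16).toString('hex'); // 32 hex chars
  const spanId = randomBytes(8).toString('hex');   // 16 hex chars
  return `00-${traceId}-${spanId}-01`;             // 01 = sampled
}

const traceparent = makeTraceparent();
```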
## SLOs and Error Budgets

### Defining SLOs
| Service | SLI | SLO | Error Budget (30d) |
|---|---|---|---|
| API | Availability | 99.9% | 43.2 min |
| API | P95 Latency | < 500ms | 5% slow requests |
| Payment | Success rate | 99.95% | 21.6 min |
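The budget column follows directly from the SLO: allowed downtime = window × (1 − SLO). A quick check of the table's numbers:

```typescript
// Error budget in minutes for an availability SLO over a given window.
function errorBudgetMinutes(slo: number, days: number): number {
  return days * 24 * 60 * (1 - slo);
}

const apiBudget = errorBudgetMinutes(0.999, 30);      // ≈ 43.2 min
const paymentBudget = errorBudgetMinutes(0.9995, 30); // ≈ 21.6 min
```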
### Prometheus for SLOs
```yaml
# 99.9% availability SLO
- record: slo:api:availability:ratio
  expr: |
    1 - (
      sum(rate(http_requests_total{status=~"5.."}[30d]))
      / sum(rate(http_requests_total[30d]))
    )

- alert: SLOBudgetBurning
  expr: slo:api:availability:ratio < 0.999
  for: 1h
  labels:
    severity: critical
  annotations:
    summary: "Error budget exhausted for the API"
```
## Best practices
- Alert on symptoms, not on causes
- Use SLOs to prioritize actions
- Structured logs in JSON with a `traceId` for correlation
- Dashboards per service with the four golden signals: latency, traffic, errors, saturation
- Runbooks associated with each alert
- On-call rotation with clear escalation paths