Monitoring and Observability in Production

The three pillars of observability

Observability relies on three complementary types of data:

graph TB
    subgraph "The 3 Pillars"
        M[Metrics<br/>Aggregated numerical data]
        L[Logs<br/>Timestamped text events]
        T[Traces<br/>Request paths]
    end

    subgraph "Tools"
        M --> PROM[Prometheus / Grafana]
        L --> ELK[ELK Stack / Loki]
        T --> JAE[Jaeger / Tempo]
    end

    subgraph "Outcome"
        PROM --> A[Alerting]
        ELK --> D[Debugging]
        JAE --> P[Performance]
    end

Prometheus: metrics collection

Application instrumentation

import express from 'express';
import { Counter, Histogram, Registry } from 'prom-client';

const app = express();
const register = new Registry();

// HTTP request counter
const httpRequestsTotal = new Counter({
  name: 'http_requests_total',
  help: 'Total number of HTTP requests',
  labelNames: ['method', 'route', 'status'],
  registers: [register],
});

// Latency histogram
const httpRequestDuration = new Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request duration in seconds',
  labelNames: ['method', 'route'],
  buckets: [0.01, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10],
  registers: [register],
});

// Express middleware (req.route is not yet set when the request enters the
// middleware, so the timer falls back to req.path; watch label cardinality)
app.use((req, res, next) => {
  const end = httpRequestDuration.startTimer({
    method: req.method,
    route: req.route?.path || req.path,
  });

  res.on('finish', () => {
    httpRequestsTotal.inc({
      method: req.method,
      route: req.route?.path || req.path,
      status: res.statusCode.toString(),
    });
    end();
  });

  next();
});

// /metrics endpoint
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', register.contentType);
  res.end(await register.metrics());
});
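To make the `buckets` configuration above concrete, here is a self-contained toy model (not prom-client internals) of how a Prometheus histogram fills its cumulative buckets: each observation increments every bucket whose upper bound `le` is greater than or equal to the observed value, plus the implicit `+Inf` bucket.

```typescript
// Toy model of cumulative histogram buckets: an observation of v increments
// every bucket with upper bound (le) >= v, plus +Inf.
function bucketCounts(observations: number[], bounds: number[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const b of bounds) counts.set(b.toString(), 0);
  counts.set("+Inf", 0);
  for (const v of observations) {
    for (const b of bounds) {
      if (v <= b) counts.set(b.toString(), counts.get(b.toString())! + 1);
    }
    counts.set("+Inf", counts.get("+Inf")! + 1);
  }
  return counts;
}

// Three requests (30ms, 200ms, 3s) against the buckets used above:
const counts = bucketCounts([0.03, 0.2, 3], [0.01, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10]);
// counts.get("0.05") === 1, counts.get("0.25") === 2, counts.get("+Inf") === 3
```

The cumulative structure is what lets `histogram_quantile()` estimate percentiles later from the `_bucket` series alone.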

Prometheus configuration

# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - "alerts/*.yml"

scrape_configs:
  - job_name: 'api'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        target_label: __address__
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2

Alerting rules

# alerts/api.yml
groups:
  - name: api-alerts
    rules:
      - alert: HighErrorRate
        expr: |
          sum by (instance) (rate(http_requests_total{status=~"5.."}[5m]))
          / sum by (instance) (rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "HTTP error rate > 5%"
          description: "{{ $labels.instance }} has an error rate of {{ $value | humanizePercentage }}"

      - alert: HighLatency
        expr: |
          histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) > 2
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "P95 latency > 2s"

      - alert: PodCrashLooping
        expr: |
          rate(kube_pod_container_status_restarts_total[15m]) > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Pod {{ $labels.pod }} is crash looping"
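The `for:` clause above keeps an alert in a pending state until its expression has been continuously true for the configured duration, which filters out transient spikes. A toy model of that behavior (simplified; real Prometheus evaluates at `evaluation_interval` and tracks per-series state):

```typescript
// Toy model of Prometheus's `for:` clause: the alert only fires once its
// expression has been continuously true for `forSeconds`.
type Sample = { t: number; firing: boolean }; // t in seconds

function alertFires(samples: Sample[], forSeconds: number): boolean {
  let pendingSince: number | null = null;
  for (const s of samples) {
    if (!s.firing) { pendingSince = null; continue; } // condition cleared: reset
    if (pendingSince === null) pendingSince = s.t;    // enter pending state
    if (s.t - pendingSince >= forSeconds) return true;
  }
  return false;
}

// Condition held for only 4 minutes -> still pending, not firing.
alertFires([{ t: 0, firing: true }, { t: 240, firing: true }], 300); // false
// Condition held for a full 5 minutes -> fires.
alertFires([{ t: 0, firing: true }, { t: 300, firing: true }], 300); // true
```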

Grafana: visualization and dashboards

JSON dashboard (provisioned via code)

{
  "dashboard": {
    "title": "API Overview",
    "panels": [
      {
        "title": "Requests per second",
        "type": "timeseries",
        "targets": [
          {
            "expr": "sum(rate(http_requests_total[5m])) by (status)",
            "legendFormat": "HTTP {{ status }}"
          }
        ]
      },
      {
        "title": "Latency P50 / P95 / P99",
        "type": "timeseries",
        "targets": [
          {
            "expr": "histogram_quantile(0.50, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))",
            "legendFormat": "P50"
          },
          {
            "expr": "histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))",
            "legendFormat": "P95"
          },
          {
            "expr": "histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))",
            "legendFormat": "P99"
          }
        ]
      }
    ]
  }
}
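The quantile panels rely on `histogram_quantile()`, which estimates a percentile from cumulative bucket counts by locating the bucket where the target rank falls and interpolating linearly inside it. A simplified sketch of that arithmetic (the real implementation also handles `+Inf`, NaN, and native histograms):

```typescript
// Simplified histogram_quantile(): given sorted bucket upper bounds and their
// cumulative counts, find the bucket containing rank q * total, then
// interpolate linearly within that bucket.
function histogramQuantile(q: number, bounds: number[], cumulative: number[]): number {
  const total = cumulative[cumulative.length - 1];
  const rank = q * total;
  let prevBound = 0;
  let prevCount = 0;
  for (let i = 0; i < bounds.length; i++) {
    if (cumulative[i] >= rank) {
      const bucketCount = cumulative[i] - prevCount;
      return prevBound + (bounds[i] - prevBound) * ((rank - prevCount) / bucketCount);
    }
    prevBound = bounds[i];
    prevCount = cumulative[i];
  }
  return bounds[bounds.length - 1];
}

// 100 requests: 90 completed under 0.5s, all under 1s. The P95 lands inside
// the (0.5, 1] bucket, halfway through its 10 observations.
histogramQuantile(0.95, [0.5, 1], [90, 100]); // 0.75
```

This is also why bucket boundaries matter: the estimate is only as precise as the bucket containing the quantile.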

Structured logging

JSON logs with context

import { createLogger, format, transports } from 'winston';

const logger = createLogger({
  level: 'info',
  format: format.combine(
    format.timestamp(),
    format.json(),
  ),
  defaultMeta: {
    service: 'api',
    version: process.env.APP_VERSION,
    environment: process.env.NODE_ENV,
  },
  transports: [
    new transports.Console(),
  ],
});

// Usage with context
logger.info('Order processed', {
  orderId: order.id,
  userId: user.id,
  amount: order.total,
  duration: elapsed,
  traceId: req.headers['x-trace-id'],
});
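The key property of the setup above is that `defaultMeta` and the per-call context merge into one flat JSON object per line. A toy illustration of that merge (not winston internals; field names follow the example above):

```typescript
// Toy illustration of structured logging: static service metadata and
// per-call context merge into a single JSON log line.
const defaultMeta = { service: "api", environment: "production" };

function logLine(level: string, message: string, context: Record<string, unknown>): string {
  return JSON.stringify({
    timestamp: new Date().toISOString(),
    level,
    message,
    ...defaultMeta,
    ...context, // per-call fields win over defaults on key collisions
  });
}

const line = logLine("info", "Order processed", { orderId: "o-42", traceId: "abc123" });
// JSON.parse(line).service === "api", .traceId === "abc123"
```

Because every line is one flat JSON object, downstream tools like Loki can extract any field with a single `| json` stage.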

Aggregation with Loki + Grafana

# Loki - promtail config
scrape_configs:
  - job_name: kubernetes
    kubernetes_sd_configs:
      - role: pod
    pipeline_stages:
      - json:
          expressions:
            level: level
            service: service
            traceId: traceId
      - labels:
          level:
          service:
      - timestamp:
          source: timestamp
          format: RFC3339

LogQL queries in Grafana:

# All API errors
{service="api"} | json | level="error"

# Slow requests (> 2s)
{service="api"} | json | duration > 2000

# Errors grouped by message
sum by (message) (count_over_time({service="api"} | json | level="error" [1h]))

Distributed tracing

OpenTelemetry

import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { HttpInstrumentation } from '@opentelemetry/instrumentation-http';
import { ExpressInstrumentation } from '@opentelemetry/instrumentation-express';
import { PgInstrumentation } from '@opentelemetry/instrumentation-pg';

const sdk = new NodeSDK({
  traceExporter: new OTLPTraceExporter({
    url: 'http://tempo:4318/v1/traces',
  }),
  instrumentations: [
    new HttpInstrumentation(),
    new ExpressInstrumentation(),
    new PgInstrumentation(),
  ],
});

sdk.start();
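Under the hood, the HTTP instrumentation propagates trace context between services via the W3C `traceparent` header: `version(2 hex)-traceId(32 hex)-spanId(16 hex)-flags(2 hex)`. A minimal parser sketch (the example header is the one from the W3C Trace Context spec):

```typescript
// Minimal W3C traceparent parser: version-traceId-spanId-flags,
// all lowercase hex. Bit 0 of the flags byte is the "sampled" flag.
interface TraceContext {
  traceId: string;
  spanId: string;
  sampled: boolean;
}

function parseTraceparent(header: string): TraceContext | null {
  const m = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(header);
  if (!m) return null;
  return { traceId: m[2], spanId: m[3], sampled: (parseInt(m[4], 16) & 1) === 1 };
}

const ctx = parseTraceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01");
// ctx.traceId === "4bf92f3577b34da6a3ce929d0e0e4736", ctx.sampled === true
```

Logging this `traceId` (as in the winston example earlier) is what lets you jump from a log line straight to the corresponding trace.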

SLOs and error budgets

Defining SLOs

Service   SLI            SLO        Error budget (30d)
API       Availability   99.9%      43.2 min
API       P95 latency    < 500ms    5% slow requests
Payment   Success rate   99.95%     21.6 min
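The error budget column is simple arithmetic: the allowed unavailability is (1 - SLO target) times the window length. A quick sketch:

```typescript
// Error budget = (1 - SLO target) x window length.
// For a 30-day window: 30 * 24 * 60 = 43,200 minutes.
function errorBudgetMinutes(sloTarget: number, windowDays = 30): number {
  return (1 - sloTarget) * windowDays * 24 * 60;
}

errorBudgetMinutes(0.999);  // ≈ 43.2 min of allowed downtime per 30 days
errorBudgetMinutes(0.9995); // ≈ 21.6 min
```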

Prometheus for SLOs

# 99.9% availability SLO
- record: slo:api:availability:ratio
  expr: |
    1 - (
      sum(rate(http_requests_total{status=~"5.."}[30d]))
      / sum(rate(http_requests_total[30d]))
    )

- alert: SLOBudgetBurning
  expr: slo:api:availability:ratio < 0.999
  for: 1h
  labels:
    severity: critical
  annotations:
    summary: "API availability is below its 99.9% SLO target"
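A more reactive pattern than alerting on the 30-day ratio is burn-rate alerting (popularized by the Google SRE workbook): compare the current error rate to the rate that would consume the budget exactly over the window. The arithmetic, as a sketch:

```typescript
// Burn rate = observed error rate / allowed error rate (1 - SLO target).
// At burn rate 1, the budget lasts exactly the full window; at 14.4, a
// 30-day budget is gone in about 2 days.
function burnRate(errorRate: number, sloTarget: number): number {
  return errorRate / (1 - sloTarget);
}

function budgetExhaustedInDays(rate: number, windowDays = 30): number {
  return windowDays / rate;
}

burnRate(0.0144, 0.999);     // ≈ 14.4
budgetExhaustedInDays(14.4); // ≈ 2.08 days
```

Multi-window alerts (e.g. a fast 1h window and a slower 6h window at different burn-rate thresholds) catch both sudden outages and slow leaks.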

Best practices

  1. Alert on symptoms, not causes
  2. Use SLOs and error budgets to prioritize engineering work
  3. Emit structured JSON logs with a traceId for cross-signal correlation
  4. Build one dashboard per service around the four golden signals: latency, traffic, errors, saturation
  5. Attach a runbook to every alert
  6. Run an on-call rotation with clear escalation paths