Operations

Day-2 concerns: running swarm, observing it, backing it up, and recovering from incidents.

  • Observability — Prometheus metrics catalogue, OpenTelemetry spans, log routing
  • Retention — TTLs per artefact, retention daemon, override policies
  • Incident response — P0-P3 definitions, on-call playbook, RCA template
  • Backup & restore — Postgres + object storage + JSONL journals

Ops maturity ladder

swarm supports four levels of operational maturity:

  • Level 1 (bare minimum): healthchecks + Prometheus + per-agent logs → you know when it's broken
  • Level 2 (production): alerting on drift / cost / permission-denial spikes + retention daemon → you know why it's broken
  • Level 3 (enterprise): OTel distributed tracing + audit-log SIEM integration + incident playbook → you meet SLAs
  • Level 4 (regulated): compliance-profile enforcement + regulator-format incident reports + forensic preservation → you pass audits

Start at Level 1 on day one. Move up a level per quarter.
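
To make the Level 2 idea of alerting on permission-denial spikes concrete, here is a minimal sketch of the comparison such an alert performs. It is not swarm's implementation: the function name `is_spike`, the `factor` and `min_events` thresholds, and the 5-minute window are all assumptions chosen for illustration. In a real deployment this logic would more likely live in an alerting rule in your monitoring stack than in application code.

```python
from statistics import mean


def is_spike(counts, factor=3.0, min_events=5):
    """Return True when the newest window is a spike relative to earlier ones.

    counts: per-window event counts, oldest first; the last entry is the
    window being evaluated. factor and min_events are illustrative
    defaults (assumptions), not values taken from swarm.
    """
    if len(counts) < 2:
        return False  # no baseline to compare against
    *baseline, latest = counts
    if latest < min_events:
        return False  # ignore noise when volumes are very low
    return latest > factor * mean(baseline)


# Hypothetical permission-denial counts per 5-minute window.
denials_per_window = [2, 1, 3, 2, 14]
print(is_spike(denials_per_window))
```

The `min_events` floor keeps a jump from 0 to 1 denials from paging anyone, while the multiplicative `factor` adapts the threshold to each deployment's normal volume. The same shape of check applies to the drift and cost signals mentioned above.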