Operations
Day-2 concerns: run swarm, observe it, back it up, and recover from incidents.
- Observability — Prometheus metrics catalogue, OpenTelemetry spans, log routing
- Retention — TTLs per artefact, retention daemon, override policies
- Incident response — P0-P3 definitions, on-call playbook, RCA template
- Backup & restore — Postgres + object storage + JSONL journals
Ops maturity ladder
swarm supports four levels of operational maturity:
- Level 1 (bare minimum): healthchecks + Prometheus + per-agent logs → you know when it's broken
- Level 2 (production): alerting on drift / cost / permission-denial spikes + retention daemon → you know why it's broken
- Level 3 (enterprise): OTel distributed tracing + audit-log SIEM integration + incident playbook → you meet SLAs
- Level 4 (regulated): compliance-profile-enforced + regulator-format incident reports + forensic preservation → you pass audits
Start at Level 1 on day one, then move up one level per quarter.
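To make Level 1 concrete, here is a minimal Python sketch of its two machine-readable pieces: a healthcheck verdict and a metrics payload in the Prometheus text exposition format. It is an illustration under stated assumptions, not swarm's implementation. The `AgentStats` class, the `render_metrics` and `healthcheck` functions, and the metric names `swarm_agent_up` and `swarm_agent_tasks_total` are all hypothetical; the real metric names are in the Observability metrics catalogue.

```python
from dataclasses import dataclass


@dataclass
class AgentStats:
    """Hypothetical per-agent counters; not part of swarm's API."""
    healthy: bool = True
    tasks_total: int = 0


def render_metrics(agents: dict[str, AgentStats]) -> str:
    """Render agent stats in the Prometheus text exposition format."""
    names = sorted(agents)
    lines = [
        "# HELP swarm_agent_up Whether the agent passed its last healthcheck.",
        "# TYPE swarm_agent_up gauge",
    ]
    for name in names:
        # Gauge value is 1 for healthy, 0 for unhealthy.
        lines.append(f'swarm_agent_up{{agent="{name}"}} {int(agents[name].healthy)}')
    lines += [
        "# HELP swarm_agent_tasks_total Tasks processed by the agent.",
        "# TYPE swarm_agent_tasks_total counter",
    ]
    for name in names:
        lines.append(
            f'swarm_agent_tasks_total{{agent="{name}"}} {agents[name].tasks_total}'
        )
    return "\n".join(lines) + "\n"


def healthcheck(agents: dict[str, AgentStats]) -> tuple[int, list[str]]:
    """Return an HTTP-style status code and the names of unhealthy agents."""
    unhealthy = sorted(name for name, stats in agents.items() if not stats.healthy)
    return (503 if unhealthy else 200, unhealthy)


if __name__ == "__main__":
    demo = {
        "planner": AgentStats(healthy=True, tasks_total=3),
        "coder": AgentStats(healthy=False, tasks_total=7),
    }
    print(healthcheck(demo))
    print(render_metrics(demo), end="")
```

In a real deployment the output of `render_metrics` would be served on a scrape endpoint and `healthcheck` would back a liveness probe. The sketch keeps both as pure functions so they can be tested without a running server.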