Observability#

Three pillars: metrics (Prometheus), traces (OpenTelemetry), logs (structured JSON).

Metrics (Prometheus)#

swarm exposes /metrics on the API (no auth required; scope via network policy).
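A minimal Prometheus scrape config for that endpoint might look like this (the `swarm-api:8000` target is an assumption, inferred from the probe port later on this page):

```yaml
scrape_configs:
  - job_name: swarm
    metrics_path: /metrics
    scrape_interval: 30s
    static_configs:
      - targets: ["swarm-api:8000"]  # adjust service name/port for your cluster
```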

Catalogue (selected)#

| Metric | Type | Labels | What it tells you |
| --- | --- | --- | --- |
| swarm_pipeline_runs_total | counter | state, template, profile | Throughput, success rate |
| swarm_pipeline_duration_seconds | histogram | template, profile | p50/p99 pipeline latency |
| swarm_pipeline_concurrent_runs | gauge | — | Current concurrency; compare to SWARM_MAX_CONCURRENT_PIPELINES |
| swarm_agent_runs_total | counter | agent, state | Per-agent invocation + success rate |
| swarm_tool_calls_total | counter | tool, state | Per-tool dispatch count |
| swarm_tool_denied_total | counter | tool, reason | Permission denials (legacy pre-W7; still emitted for dashboard compat) |
| swarm_permission_decisions_total | counter | behavior, source | Unified permission engine; one row per evaluated decision |
| swarm_hook_runs_total | counter | event, plugin | Hook invocations by event + plugin |
| swarm_cron_job_runs_total | counter | job, status | Cron execution history |
| swarm_cron_job_duration_seconds | histogram | job | Cron runtime |
| swarm_batch_runs_total | counter | run, status | Batch job starts / completes |
| swarm_batch_records_processed_total | counter | run | Throughput |
| swarm_llm_calls_total | counter | provider, model, tier, state | LLM usage |
| swarm_llm_tokens_total | counter | provider, model, direction (input/output/cached) | Token consumption |
| swarm_llm_cost_usd_total | counter | provider, model | Cost accumulation |
| swarm_llm_call_duration_seconds | histogram | provider, model | LLM latency |
| swarm_retention_pruned_total | counter | artefact_kind | Retention sweep output |
| swarm_plugin_shell_executions_total | counter | plugin, exit_code | Shell-hook invocations (if enabled) |
| swarm_http_requests_total | counter | method, path, status | HTTP traffic |
| swarm_http_request_duration_seconds | histogram | method, path | HTTP p50/p99 |
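The runs counter can be turned into a ready-made success ratio with a recording rule. A sketch; the rule name is illustrative, and the state="succeeded" label value is an assumption (only state="failed" appears elsewhere on this page):

```yaml
groups:
  - name: swarm-recording
    rules:
      # Fraction of pipeline runs succeeding over the last 15m.
      # Assumes state="succeeded" is emitted; verify against your /metrics output.
      - record: swarm:pipeline_success_ratio:rate15m
        expr: |
          rate(swarm_pipeline_runs_total{state="succeeded"}[15m])
          / rate(swarm_pipeline_runs_total[15m])
```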
```yaml
# Prometheus alerting rules (routed via Alertmanager)
groups:
  - name: swarm
    rules:
      - alert: SwarmPipelineFailureRateHigh
        expr: |
          rate(swarm_pipeline_runs_total{state="failed"}[15m])
          / rate(swarm_pipeline_runs_total[15m]) > 0.1
        for: 15m
        labels: { severity: warning }
        annotations:
          summary: "swarm pipeline failure rate > 10%"

      - alert: SwarmAPIP99HighLatency
        expr: |
          histogram_quantile(0.99,
            rate(swarm_http_request_duration_seconds_bucket[5m])) > 2
        for: 5m
        labels: { severity: warning }

      - alert: SwarmPermissionDenialSpike
        expr: |
          rate(swarm_permission_decisions_total{behavior="deny"}[10m]) > 10
        for: 5m
        labels: { severity: info }
        annotations:
          summary: "swarm permission denial rate spike; investigate"

      - alert: SwarmLLMCostPerHourHigh
        expr: |
          rate(swarm_llm_cost_usd_total[1h]) * 3600 > 50
        for: 15m
        labels: { severity: warning }
        annotations:
          summary: "swarm LLM spend > $50/hr"

      - alert: SwarmCronJobFailed
        expr: |
          increase(swarm_cron_job_runs_total{status="failed"}[1h]) > 0
        labels: { severity: info }
```

Grafana dashboard#

A default dashboard ships at deploy/grafana/swarm-overview.json with panels for:

  • Pipeline throughput + duration heatmap
  • Permission denial breakdown by source
  • LLM cost (per hour / per day / per provider)
  • Cron job success rate
  • Agent-level tool call counts

Traces (OpenTelemetry)#

Enable by setting:

```bash
OTEL_SERVICE_NAME=swarm-api
OTEL_EXPORTER_OTLP_ENDPOINT=https://otel-collector.internal:4317
OTEL_RESOURCE_ATTRIBUTES=env=prod,region=ap-south-1
```

swarm emits spans for:

  • pipeline.run — root span for each pipeline
  • agent.invoke — per-agent
  • llm.call — per LLM call
  • tool.dispatch — per tool call
  • permission.check — per decision
  • hook.run — per hook event
  • http.request — every API request

Trace attributes include run_id, agent, tool, model, tokens_in, tokens_out, cost_usd, rule_source (for permission checks).

Example trace#

```text
pipeline.run [1.8s]
├─ agent.invoke (data_profiler) [220ms]
│   ├─ llm.call (sonnet) [180ms] {tokens_in: 420, tokens_out: 92, cost: $0.003}
│   └─ tool.dispatch (profile_data) [30ms]
├─ agent.invoke (algorithm_selector) [340ms]
│   ├─ llm.call (sonnet) [320ms]
│   └─ tool.dispatch (query_algorithm_registry) [12ms]
├─ agent.invoke (trainer) [1,100ms]
│   ├─ permission.check (train_classifier) [2ms] {behavior: allow, source: agent_allowlist}
│   └─ tool.dispatch (train_classifier) [1,080ms]
└─ agent.invoke (model_evaluator) [140ms]
    └─ tool.dispatch (evaluate_model) [135ms]
```

Send to: Jaeger, Tempo, Honeycomb, Datadog, AWS X-Ray — anything OTLP-compatible.

Logs (structured JSON)#

Default format in production:

```json
{
  "ts": "2026-04-15T14:22:03.421Z",
  "level": "INFO",
  "logger": "ml_team.core.agent_runner",
  "event": "agent.turn.complete",
  "run_id": "7f8e9a2b",
  "agent": "trainer",
  "iteration": 2,
  "tool_calls": 3,
  "tokens": {"in": 1240, "out": 420},
  "duration_ms": 1804,
  "correlation_id": "req_xyz123"
}
```

Every log line has run_id + agent + correlation_id — trace a request end-to-end by grepping any of them.
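That correlation can be sketched in a few lines of stdlib Python (the log lines are the JSON shape shown above; the function name is illustrative):

```python
import json

def filter_logs(lines, **fields):
    """Yield parsed log records whose top-level fields all match, e.g. run_id."""
    for line in lines:
        try:
            rec = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip non-JSON lines (e.g. text-format local-dev logs)
        if all(rec.get(k) == v for k, v in fields.items()):
            yield rec

# Usage: all events for one pipeline run
# with open("swarm.log") as f:
#     for rec in filter_logs(f, run_id="7f8e9a2b"):
#         print(rec["event"], rec.get("agent"))
```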

Routing#

  • Local dev: stderr, human-readable (SWARM_LOG_FORMAT=text)
  • Production K8s: stdout JSON → Fluent Bit → Loki / CloudWatch / Elasticsearch
  • Audit-critical logs: also written to the run_events table (queryable via SQL)

Log levels#

Level What goes here
DEBUG Internal state traces (per LLM call internals, per DB query)
INFO Notable events (agent turn complete, pipeline run start, hook fired)
WARNING Recoverable issues (retry, fallback, skill match failure)
ERROR Failed operations, caller should know
CRITICAL System-level incidents (DB unreachable, all LLM providers down)

Default: INFO. Lower to DEBUG only during investigation.
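One way to produce the JSON shape above with stdlib logging; a sketch, not swarm's actual logger (field names follow the example record, and the extra-field list is illustrative):

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""
    def format(self, record):
        out = {
            "ts": self.formatTime(record, "%Y-%m-%dT%H:%M:%S"),
            "level": record.levelname,
            "logger": record.name,
            "event": record.getMessage(),
        }
        # Context fields (run_id, agent, correlation_id) arrive via `extra=`
        for key in ("run_id", "agent", "correlation_id"):
            if hasattr(record, key):
                out[key] = getattr(record, key)
        return json.dumps(out)

logger = logging.getLogger("ml_team.core.agent_runner")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("agent.turn.complete", extra={"run_id": "7f8e9a2b", "agent": "trainer"})
```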

Cost observability#

swarm tracks LLM cost per call. Query it from the CLI:

```bash
# Per-day cost, last 30 days
swarm reports costs --from 2026-03-15 --to 2026-04-15

# By agent
swarm reports costs --from 2026-04-01 --group-by agent
```

Or SQL:

```sql
SELECT
  date_trunc('day', ts)    AS day,
  provider,
  agent,
  SUM(cost_usd)            AS cost
FROM run_events
WHERE kind = 'llm_call'
  AND ts > now() - interval '30 days'
GROUP BY day, provider, agent
ORDER BY day DESC, cost DESC;
```

Budget cap via SWARM_BUDGET_PER_PIPELINE_USD. Exceeding triggers pipeline pause with operator notification.
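The cap reduces to a running-total check before each call. A hypothetical sketch (the function name and call-estimate parameter are illustrative, not swarm's internal API):

```python
def should_pause(spent_usd: float, next_call_estimate_usd: float,
                 budget_usd: float) -> bool:
    """Pause the pipeline before a call would push spend past the cap."""
    return spent_usd + next_call_estimate_usd > budget_usd

# e.g. with SWARM_BUDGET_PER_PIPELINE_USD=5.00
assert should_pause(4.99, 0.02, 5.00) is True
assert should_pause(1.00, 0.02, 5.00) is False
```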

Healthchecks#

  • GET /healthz — returns 200 whenever the process is alive; use as the liveness probe
  • GET /livez — deep check (DB connectivity, object storage reachable, LLM providers reachable); use as the readiness probe
  • GET /metrics — Prometheus exposition format

Kubernetes probe config (in Helm):

```yaml
livenessProbe:
  httpGet: { path: /healthz, port: 8000 }
  initialDelaySeconds: 10
  periodSeconds: 10
readinessProbe:
  httpGet: { path: /livez, port: 8000 }
  initialDelaySeconds: 20
  periodSeconds: 15
  failureThreshold: 3
```

SIEM integration#

Enterprise customers want security events in their SIEM (Splunk, Elastic, Datadog).

Three options:

  1. Prometheus metric + Alertmanager → SIEM webhook — best for volume metrics
  2. permission_denials table via CDC — Debezium or similar → Kafka → SIEM
  3. Structured log tail — Fluent Bit / Filebeat / Vector from API pod stdout

Most common: option 3 for logs + option 1 for metrics.
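For option 1, the Alertmanager side can be sketched as a webhook receiver (the SIEM URL is a placeholder, and the routed alert name comes from the rules earlier on this page):

```yaml
# alertmanager.yml (fragment)
route:
  receiver: default
  routes:
    - matchers:
        - alertname = SwarmPermissionDenialSpike
      receiver: siem-webhook

receivers:
  - name: default
  - name: siem-webhook
    webhook_configs:
      - url: https://siem.example.internal/hooks/alertmanager  # placeholder
        send_resolved: true
```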
