Observability#
Three pillars: metrics (Prometheus), traces (OpenTelemetry), logs (structured JSON).
Metrics (Prometheus)#
swarm exposes `/metrics` on the API (no auth required; restrict access via network policy).
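A minimal Prometheus scrape job for this endpoint could look like the following (the job name, target host, and interval are illustrative, not shipped defaults):

```yaml
scrape_configs:
  - job_name: swarm-api
    scrape_interval: 15s
    metrics_path: /metrics
    static_configs:
      - targets: ["swarm-api.internal:8000"]  # hypothetical host:port
```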
Catalogue (selected)#
| Metric | Type | Labels | What it tells you |
|---|---|---|---|
| `swarm_pipeline_runs_total` | counter | `state`, `template`, `profile` | Throughput, success rate |
| `swarm_pipeline_duration_seconds` | histogram | `template`, `profile` | p50/p99 pipeline latency |
| `swarm_pipeline_concurrent_runs` | gauge | — | Current concurrency; compare to `SWARM_MAX_CONCURRENT_PIPELINES` |
| `swarm_agent_runs_total` | counter | `agent`, `state` | Per-agent invocation + success rate |
| `swarm_tool_calls_total` | counter | `tool`, `state` | Per-tool dispatch count |
| `swarm_tool_denied_total` | counter | `tool`, `reason` | Permission denials (legacy pre-W7; still emitted for dashboard compat) |
| `swarm_permission_decisions_total` | counter | `behavior`, `source` | Unified permission engine; one row per evaluated decision |
| `swarm_hook_runs_total` | counter | `event`, `plugin` | Hook invocations by event + plugin |
| `swarm_cron_job_runs_total` | counter | `job`, `status` | Cron execution history |
| `swarm_cron_job_duration_seconds` | histogram | `job` | Cron runtime |
| `swarm_batch_runs_total` | counter | `run`, `status` | Batch job starts / completes |
| `swarm_batch_records_processed_total` | counter | `run` | Throughput |
| `swarm_llm_calls_total` | counter | `provider`, `model`, `tier`, `state` | LLM usage |
| `swarm_llm_tokens_total` | counter | `provider`, `model`, `direction` (input/output/cached) | Token consumption |
| `swarm_llm_cost_usd_total` | counter | `provider`, `model` | Cost accumulation |
| `swarm_llm_call_duration_seconds` | histogram | `provider`, `model` | LLM latency |
| `swarm_retention_pruned_total` | counter | `artefact_kind` | Retention sweep output |
| `swarm_plugin_shell_executions_total` | counter | `plugin`, `exit_code` | Shell-hook invocations (if enabled) |
| `swarm_http_requests_total` | counter | `method`, `path`, `status` | HTTP traffic |
| `swarm_http_request_duration_seconds` | histogram | `method`, `path` | HTTP p50/p99 |
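As a quick sanity check on the catalogue, a few PromQL queries built only from the metrics and labels listed above:

```promql
# Pipeline failure ratio over the last hour
sum(rate(swarm_pipeline_runs_total{state="failed"}[1h]))
  / sum(rate(swarm_pipeline_runs_total[1h]))

# Headroom: current concurrency (compare to SWARM_MAX_CONCURRENT_PIPELINES)
swarm_pipeline_concurrent_runs

# p99 pipeline duration per template over 15 minutes
histogram_quantile(0.99,
  sum by (template, le) (rate(swarm_pipeline_duration_seconds_bucket[15m])))
```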
Recommended alerts#
```yaml
# Prometheus alerting rules (routed via Alertmanager)
groups:
  - name: swarm
    rules:
      - alert: SwarmPipelineFailureRateHigh
        expr: |
          rate(swarm_pipeline_runs_total{state="failed"}[15m])
            / rate(swarm_pipeline_runs_total[15m]) > 0.1
        for: 15m
        labels: { severity: warning }
        annotations:
          summary: "swarm pipeline failure rate > 10%"
      - alert: SwarmAPIP99HighLatency
        expr: |
          histogram_quantile(0.99,
            rate(swarm_http_request_duration_seconds_bucket[5m])) > 2
        for: 5m
        labels: { severity: warning }
      - alert: SwarmPermissionDenialSpike
        expr: |
          rate(swarm_permission_decisions_total{behavior="deny"}[10m]) > 10
        for: 5m
        labels: { severity: info }
        annotations:
          summary: "swarm denial rate spike; investigate"
      - alert: SwarmLLMCostPerHourHigh
        expr: |
          rate(swarm_llm_cost_usd_total[1h]) * 3600 > 50
        for: 15m
        labels: { severity: warning }
        annotations:
          summary: "swarm LLM spend > $50/hr"
      - alert: SwarmCronJobFailed
        expr: |
          increase(swarm_cron_job_runs_total{status="failed"}[1h]) > 0
        labels: { severity: info }
```
Grafana dashboard#
Ship a default dashboard at `deploy/grafana/swarm-overview.json`:
- Pipeline throughput + duration heatmap
- Permission denial breakdown by source
- LLM cost (per hour / per day / per provider)
- Cron job success rate
- Agent-level tool call counts
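The cost and cron panels can be backed by queries like these. Note that the `succeeded` status value is an assumption (the catalogue only documents the label names); match it to the values your deployment actually emits:

```promql
# LLM cost per provider, extrapolated to $/hour
sum by (provider) (rate(swarm_llm_cost_usd_total[1h])) * 3600

# Cron job success rate over the last day
sum(rate(swarm_cron_job_runs_total{status="succeeded"}[1d]))
  / sum(rate(swarm_cron_job_runs_total[1d]))
```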
Traces (OpenTelemetry)#
Enable by setting:
```bash
OTEL_SERVICE_NAME=swarm-api
OTEL_EXPORTER_OTLP_ENDPOINT=https://otel-collector.internal:4317
OTEL_RESOURCE_ATTRIBUTES=env=prod,region=ap-south-1
```
swarm emits spans for:
- `pipeline.run`: root span for each pipeline
- `agent.invoke`: per agent
- `llm.call`: per LLM call
- `tool.dispatch`: per tool call
- `permission.check`: per decision
- `hook.run`: per hook event
- `http.request`: every API request
Trace attributes include `run_id`, `agent`, `tool`, `model`, `tokens_in`, `tokens_out`, `cost_usd`, and `rule_source` (for permission checks).
Example trace#
```
pipeline.run [1.8s]
├─ agent.invoke (data_profiler) [220ms]
│  ├─ llm.call (sonnet) [180ms] {tokens_in: 420, tokens_out: 92, cost: $0.003}
│  └─ tool.dispatch (profile_data) [30ms]
├─ agent.invoke (algorithm_selector) [340ms]
│  ├─ llm.call (sonnet) [320ms]
│  └─ tool.dispatch (query_algorithm_registry) [12ms]
├─ agent.invoke (trainer) [1,100ms]
│  ├─ permission.check (train_classifier) [2ms] {behavior: allow, source: agent_allowlist}
│  └─ tool.dispatch (train_classifier) [1,080ms]
└─ agent.invoke (model_evaluator) [140ms]
   └─ tool.dispatch (evaluate_model) [135ms]
```
Send to: Jaeger, Tempo, Honeycomb, Datadog, AWS X-Ray — anything OTLP-compatible.
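On the receiving side, a minimal OpenTelemetry Collector pipeline that accepts swarm's OTLP traffic and forwards it to an OTLP-compatible backend could look like this (endpoint names are placeholders for your environment):

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

exporters:
  otlp:
    endpoint: tempo.internal:4317  # hypothetical trace backend

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlp]
```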
Logs (structured JSON)#
Default format in production:
```json
{
  "ts": "2026-04-15T14:22:03.421Z",
  "level": "INFO",
  "logger": "ml_team.core.agent_runner",
  "event": "agent.turn.complete",
  "run_id": "7f8e9a2b",
  "agent": "trainer",
  "iteration": 2,
  "tool_calls": 3,
  "tokens": {"in": 1240, "out": 420},
  "duration_ms": 1804,
  "correlation_id": "req_xyz123"
}
```
Every log line carries `run_id`, `agent`, and `correlation_id`, so a request can be traced end-to-end by grepping on any of them.
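Because the logs are line-delimited JSON, that trace needs nothing more than a filter on those keys. A minimal sketch (field names as in the example record above; the sample lines are illustrative):

```python
import json


def filter_logs(lines, **keys):
    """Yield parsed log records whose fields match every given key=value pair."""
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip non-JSON lines (e.g. interleaved stack traces)
        if all(record.get(k) == v for k, v in keys.items()):
            yield record


# Example: pull all events for one pipeline run out of a mixed stream
log_lines = [
    '{"run_id": "7f8e9a2b", "agent": "trainer", "event": "agent.turn.complete"}',
    '{"run_id": "deadbeef", "agent": "trainer", "event": "agent.turn.complete"}',
    "not json",
]
matches = list(filter_logs(log_lines, run_id="7f8e9a2b"))
# matches contains only the first record
```

The same function filters by `agent` or `correlation_id` unchanged, since every key is just another keyword argument.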
Routing#
- Local dev: stderr, human-readable (`SWARM_LOG_FORMAT=text`)
- Production K8s: stdout JSON → Fluent Bit → Loki / CloudWatch / Elasticsearch
- Audit-critical logs: also written to the `run_events` table (queryable via SQL)
Log levels#
| Level | What goes here |
|---|---|
| DEBUG | Internal state traces (per LLM call internals, per DB query) |
| INFO | Notable events (agent turn complete, pipeline run start, hook fired) |
| WARNING | Recoverable issues (retry, fallback, skill match failure) |
| ERROR | Failed operations, caller should know |
| CRITICAL | System-level incidents (DB unreachable, all LLM providers down) |
Default: INFO. Lower to DEBUG only during investigation.
Cost observability#
swarm tracks LLM cost per-call. Query:
```bash
# Per-day cost, last 30 days
swarm reports costs --from 2026-03-15 --to 2026-04-15

# By agent
swarm reports costs --from 2026-04-01 --group-by agent
```
Or SQL:
```sql
SELECT
  date_trunc('day', ts) AS day,
  provider,
  agent,
  SUM(cost_usd) AS cost
FROM run_events
WHERE kind = 'llm_call'
  AND ts > now() - interval '30 days'
GROUP BY day, provider, agent
ORDER BY day DESC, cost DESC;
```
Set a per-pipeline budget cap via `SWARM_BUDGET_PER_PIPELINE_USD`; exceeding it pauses the pipeline and notifies the operator.
Healthchecks#
- `GET /healthz`: always returns 200 if the process is alive (liveness probe)
- `GET /livez`: detailed checks (DB connectivity, object storage reachable, LLM providers reachable)
- `GET /metrics`: Prometheus format
Kubernetes probe config (in Helm):
```yaml
livenessProbe:
  httpGet: { path: /healthz, port: 8000 }
  initialDelaySeconds: 10
  periodSeconds: 10
readinessProbe:
  httpGet: { path: /livez, port: 8000 }
  initialDelaySeconds: 20
  periodSeconds: 15
  failureThreshold: 3
```
SIEM integration#
Enterprise customers want security events in their SIEM (Splunk, Elastic, Datadog).
Three options:
1. Prometheus metrics + Alertmanager → SIEM webhook: best for volume metrics
2. `permission_denials` table via CDC: Debezium or similar → Kafka → SIEM
3. Structured log tail: Fluent Bit / Filebeat / Vector tailing the API pod's stdout
Most common: option 3 for logs + option 1 for metrics.
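For option 3, a minimal Fluent Bit configuration tailing the API pod's stdout and shipping JSON records over HTTP might look like this. The log path, grep filter, and SIEM endpoint are all placeholders to adapt to your cluster:

```ini
[INPUT]
    Name    tail
    Path    /var/log/containers/swarm-api-*.log
    Parser  docker

[FILTER]
    # Illustrative: forward only permission-related security events
    Name    grep
    Match   *
    Regex   event permission\.

[OUTPUT]
    Name    http
    Match   *
    Host    siem.internal
    Port    443
    URI     /ingest
    Format  json
    tls     On
```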
Next#
- Incident response — what to do when alerts fire
- Retention — how long metrics / traces / logs live