Observability#

Three pillars: metrics (Prometheus), traces (OpenTelemetry), logs (structured JSON).

Metrics (Prometheus)#

swarm exposes /metrics on the API (no auth required; scope via network policy).
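A minimal Prometheus scrape config for that endpoint might look like this (the `swarm-api:8000` target is an assumption, inferred from the probe port later on this page):

```yaml
scrape_configs:
  - job_name: swarm
    metrics_path: /metrics
    scrape_interval: 30s
    static_configs:
      - targets: ["swarm-api:8000"]  # adjust service name/port for your cluster
```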

Catalogue (selected)#

| Metric | Type | Labels | What it tells you |
| --- | --- | --- | --- |
| swarm_pipeline_runs_total | counter | state, template, profile | Throughput, success rate |
| swarm_pipeline_duration_seconds | histogram | template, profile | p50/p99 pipeline latency |
| swarm_pipeline_concurrent_runs | gauge | — | Current concurrency; compare to SWARM_MAX_CONCURRENT_PIPELINES |
| swarm_agent_runs_total | counter | agent, state | Per-agent invocation + success rate |
| swarm_tool_calls_total | counter | tool, state | Per-tool dispatch count |
| swarm_tool_denied_total | counter | tool, reason | Permission denials (legacy pre-W7; still emitted for dashboard compat) |
| swarm_permission_decisions_total | counter | behavior, source | Unified permission engine; one row per evaluated decision |
| swarm_hook_runs_total | counter | event, plugin | Hook invocations by event + plugin |
| swarm_cron_job_runs_total | counter | job, status | Cron execution history |
| swarm_cron_job_duration_seconds | histogram | job | Cron runtime |
| swarm_batch_runs_total | counter | run, status | Batch job starts / completes |
| swarm_batch_records_processed_total | counter | run | Throughput |
| swarm_llm_calls_total | counter | provider, model, tier, state | LLM usage |
| swarm_llm_tokens_total | counter | provider, model, direction (input/output/cached) | Token consumption |
| swarm_llm_cost_usd_total | counter | provider, model | Cost accumulation |
| swarm_llm_call_duration_seconds | histogram | provider, model | LLM latency |
| swarm_retention_pruned_total | counter | artefact_kind | Retention sweep output |
| swarm_plugin_shell_executions_total | counter | plugin, exit_code | Shell-hook invocations (if enabled) |
| swarm_http_requests_total | counter | method, path, status | HTTP traffic |
| swarm_http_request_duration_seconds | histogram | method, path | HTTP p50/p99 |
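The runs counter can be turned into a ready-made success ratio with a recording rule. A sketch; the rule name is illustrative, and the state="succeeded" label value is an assumption (only state="failed" appears elsewhere on this page):

```yaml
groups:
  - name: swarm-recording
    rules:
      # Fraction of pipeline runs succeeding over the last 15m.
      # Assumes state="succeeded" is emitted; verify against your /metrics output.
      - record: swarm:pipeline_success_ratio:rate15m
        expr: |
          rate(swarm_pipeline_runs_total{state="succeeded"}[15m])
          / rate(swarm_pipeline_runs_total[15m])
```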
```yaml
# Prometheus alerting rules (routed via Alertmanager)
groups:
  - name: swarm
    rules:
      - alert: SwarmPipelineFailureRateHigh
        expr: |
          rate(swarm_pipeline_runs_total{state="failed"}[15m])
          / rate(swarm_pipeline_runs_total[15m]) > 0.1
        for: 15m
        labels: { severity: warning }
        annotations:
          summary: "swarm pipeline failure rate > 10%"

      - alert: SwarmAPIP99HighLatency
        expr: |
          histogram_quantile(0.99,
            rate(swarm_http_request_duration_seconds_bucket[5m])) > 2
        for: 5m
        labels: { severity: warning }

      - alert: SwarmPermissionDenialSpike
        expr: |
          rate(swarm_permission_decisions_total{behavior="deny"}[10m]) > 10
        for: 5m
        labels: { severity: info }
        annotations:
          summary: "swarm permission denial rate spike; investigate"

      - alert: SwarmLLMCostPerHourHigh
        expr: |
          rate(swarm_llm_cost_usd_total[1h]) * 3600 > 50
        for: 15m
        labels: { severity: warning }
        annotations:
          summary: "swarm LLM spend > $50/hr"

      - alert: SwarmCronJobFailed
        expr: |
          increase(swarm_cron_job_runs_total{status="failed"}[1h]) > 0
        labels: { severity: info }
```

Grafana dashboard#

A default dashboard ships at deploy/grafana/swarm-overview.json with panels for:

  • Pipeline throughput + duration heatmap
  • Permission denial breakdown by source
  • LLM cost (per hour / per day / per provider)
  • Cron job success rate
  • Agent-level tool call counts

Traces (OpenTelemetry)#

Enable by setting:

```bash
OTEL_SERVICE_NAME=swarm-api
OTEL_EXPORTER_OTLP_ENDPOINT=https://otel-collector.internal:4317
OTEL_RESOURCE_ATTRIBUTES=env=prod,region=ap-south-1
```

swarm emits spans for:

  • pipeline.run — root span for each pipeline
  • agent.invoke — per-agent
  • llm.call — per LLM call
  • tool.dispatch — per tool call
  • permission.check — per decision
  • hook.run — per hook event
  • http.request — every API request

Trace attributes include run_id, agent, tool, model, tokens_in, tokens_out, cost_usd, rule_source (for permission checks).

Example trace#

```text
pipeline.run [1.8s]
├─ agent.invoke (data_profiler) [220ms]
│   ├─ llm.call (sonnet) [180ms] {tokens_in: 420, tokens_out: 92, cost: $0.003}
│   └─ tool.dispatch (profile_data) [30ms]
├─ agent.invoke (algorithm_selector) [340ms]
│   ├─ llm.call (sonnet) [320ms]
│   └─ tool.dispatch (query_algorithm_registry) [12ms]
├─ agent.invoke (trainer) [1,100ms]
│   ├─ permission.check (train_classifier) [2ms] {behavior: allow, source: agent_allowlist}
│   └─ tool.dispatch (train_classifier) [1,080ms]
└─ agent.invoke (model_evaluator) [140ms]
    └─ tool.dispatch (evaluate_model) [135ms]
```

Send to: Jaeger, Tempo, Honeycomb, Datadog, AWS X-Ray — anything OTLP-compatible.

Logs (structured JSON)#

Default format in production:

```json
{
  "ts": "2026-04-15T14:22:03.421Z",
  "level": "INFO",
  "logger": "ml_team.core.agent_runner",
  "event": "agent.turn.complete",
  "run_id": "7f8e9a2b",
  "agent": "trainer",
  "iteration": 2,
  "tool_calls": 3,
  "tokens": {"in": 1240, "out": 420},
  "duration_ms": 1804,
  "correlation_id": "req_xyz123"
}
```

Every log line has run_id + agent + correlation_id — trace a request end-to-end by grepping any of them.
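That correlation can be sketched in a few lines of stdlib Python (the log lines are the JSON shape shown above; the function name is illustrative):

```python
import json

def filter_logs(lines, **fields):
    """Yield parsed log records whose top-level fields all match, e.g. run_id."""
    for line in lines:
        try:
            rec = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip non-JSON lines (e.g. text-format local-dev logs)
        if all(rec.get(k) == v for k, v in fields.items()):
            yield rec

# Usage: all events for one pipeline run
# with open("swarm.log") as f:
#     for rec in filter_logs(f, run_id="7f8e9a2b"):
#         print(rec["event"], rec.get("agent"))
```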

Routing#

  • Local dev: stderr, human-readable (SWARM_LOG_FORMAT=text)
  • Production K8s: stdout JSON → Fluent Bit → Loki / CloudWatch / Elasticsearch
  • Audit-critical logs: also written to the run_events table (queryable via SQL)

Log levels#

Level What goes here
DEBUG Internal state traces (per LLM call internals, per DB query)
INFO Notable events (agent turn complete, pipeline run start, hook fired)
WARNING Recoverable issues (retry, fallback, skill match failure)
ERROR Failed operations, caller should know
CRITICAL System-level incidents (DB unreachable, all LLM providers down)

Default: INFO. Lower to DEBUG only during investigation.
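One way to produce the JSON shape above with stdlib logging; a sketch, not swarm's actual logger (field names follow the example record, and the extra-field list is illustrative):

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""
    def format(self, record):
        out = {
            "ts": self.formatTime(record, "%Y-%m-%dT%H:%M:%S"),
            "level": record.levelname,
            "logger": record.name,
            "event": record.getMessage(),
        }
        # Context fields (run_id, agent, correlation_id) arrive via `extra=`
        for key in ("run_id", "agent", "correlation_id"):
            if hasattr(record, key):
                out[key] = getattr(record, key)
        return json.dumps(out)

logger = logging.getLogger("ml_team.core.agent_runner")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("agent.turn.complete", extra={"run_id": "7f8e9a2b", "agent": "trainer"})
```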

Cost observability#

swarm tracks LLM cost per call. Query it from the CLI:

```bash
# Per-day cost, last 30 days
swarm reports costs --from 2026-03-15 --to 2026-04-15

# By agent
swarm reports costs --from 2026-04-01 --group-by agent
```

Or SQL:

```sql
SELECT
  date_trunc('day', ts)    AS day,
  provider,
  agent,
  SUM(cost_usd)            AS cost
FROM run_events
WHERE kind = 'llm_call'
  AND ts > now() - interval '30 days'
GROUP BY day, provider, agent
ORDER BY day DESC, cost DESC;
```

Budget cap via SWARM_BUDGET_PER_PIPELINE_USD. Exceeding triggers pipeline pause with operator notification.
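The cap reduces to a running-total check before each call. A hypothetical sketch (the function name and call-estimate parameter are illustrative, not swarm's internal API):

```python
def should_pause(spent_usd: float, next_call_estimate_usd: float,
                 budget_usd: float) -> bool:
    """Pause the pipeline before a call would push spend past the cap."""
    return spent_usd + next_call_estimate_usd > budget_usd

# e.g. with SWARM_BUDGET_PER_PIPELINE_USD=5.00
assert should_pause(4.99, 0.02, 5.00) is True
assert should_pause(1.00, 0.02, 5.00) is False
```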

Healthchecks#

  • GET /healthz — returns 200 whenever the process is alive; use as the liveness probe
  • GET /livez — deep check (DB connectivity, object storage reachable, LLM providers reachable); use as the readiness probe
  • GET /metrics — Prometheus exposition format

Kubernetes probe config (in Helm):

```yaml
livenessProbe:
  httpGet: { path: /healthz, port: 8000 }
  initialDelaySeconds: 10
  periodSeconds: 10
readinessProbe:
  httpGet: { path: /livez, port: 8000 }
  initialDelaySeconds: 20
  periodSeconds: 15
  failureThreshold: 3
```

SIEM integration#

Enterprise customers want security events in their SIEM (Splunk, Elastic, Datadog).

Three options:

  1. Prometheus metric + Alertmanager → SIEM webhook — best for volume metrics
  2. permission_denials table via CDC — Debezium or similar → Kafka → SIEM
  3. Structured log tail — Fluent Bit / Filebeat / Vector from API pod stdout

Most common: option 3 for logs + option 1 for metrics.
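For option 1, the Alertmanager side can be sketched as a webhook receiver (the SIEM URL is a placeholder, and the routed alert name comes from the rules earlier on this page):

```yaml
# alertmanager.yml (fragment)
route:
  receiver: default
  routes:
    - matchers:
        - alertname = SwarmPermissionDenialSpike
      receiver: siem-webhook

receivers:
  - name: default
  - name: siem-webhook
    webhook_configs:
      - url: https://siem.example.internal/hooks/alertmanager  # placeholder
        send_resolved: true
```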
