# Incident response
Triage, communicate, resolve, learn. This page is the on-call playbook.
## Severity definitions
| Level | What qualifies | Response time (Enterprise SLA) |
|---|---|---|
| P0 | Platform down; data loss; confirmed breach | Ack 15min, first comms 30min, status-page 1h |
| P1 | Major feature broken for all users; severe degradation | Ack 1h, fix target 4h |
| P2 | Feature broken for some users; workaround exists | Ack 4h, fix next business day |
| P3 | Minor bug, enhancement, question | Ack 24h, next scheduled release |
SLAs differ by customer tier — Enterprise gets tighter response times than Team, Team tighter than Developer, and Community is best-effort.
## Triage flow

```
Alert fires (PagerDuty / email / on-call phone)
        │
        ▼
[Acknowledge] — within the SLA window
        │
        ▼
[Assess severity] — is this really a P0? Verify scope
        │
        ▼
[Declare + communicate]
        │ - status-page update
        │ - Slack Connect to affected Enterprise customers
        │ - initial "we're investigating" note
        │
        ▼
[Investigate] — check metrics, logs, traces, recent deploys
        │
        ▼
[Mitigate] — rollback / feature-flag off / scale / rate-limit
        │
        ▼
[Verify fix] — confirm customer impact is resolved
        │
        ▼
[Document RCA] — root-cause analysis, published within 72h for P0
        │
        ▼
[Schedule post-mortem] — blameless, within 1 week
```
## First-response playbook (any severity)
- Acknowledge the alert — stop the paging
- Open a dedicated Slack thread named `#inc-YYYYMMDD-brief-description`
- Assign roles:
    - Incident commander (IC) — drives the call
    - Communications lead — writes customer and status-page updates
    - Investigator — dives into logs / metrics / code
- Set a 15-minute check-in cadence during active incidents
- Record everything — timestamps, decisions, commands run
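The "record everything" step is easier with a one-line helper defined at the top of the on-call shell session. A minimal sketch — `note` and the `incident.log` path are local conventions for illustration, not part of the swarm CLI:

```shell
# note: append a UTC-timestamped entry to the incident log.
# INC_LOG and the log format are local conventions, not a swarm feature.
note() {
  printf '%s  %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$*" >> "${INC_LOG:-incident.log}"
}

note "declared P1: API 5xx spike on swarm-api"
note "mitigation: helm rollback swarm 41"
```

The resulting transcript pastes straight into the RCA timeline afterwards.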
## Investigation patterns
### "The API is down"

```shell
# Is it actually down?
curl -sSw "%{http_code}\n" https://api.swarm.customer.com/healthz

# Pod status
kubectl -n swarm get pods -o wide

# Recent events
kubectl -n swarm get events --sort-by='.lastTimestamp' | tail -20

# Container logs, last 100 lines
kubectl -n swarm logs -l app=swarm-api --tail=100

# Database reachable?
kubectl -n swarm exec -it <pod> -- swarm admin db-ping

# LLM providers reachable?
kubectl -n swarm exec -it <pod> -- swarm llm ping
```
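When the health endpoint is flapping, a short poll loop gives a timestamped record instead of repeated one-off curls. A generic sketch — the URL is the same placeholder as above, and `poll` is a throwaway helper, not a swarm command:

```shell
# poll <url> [tries]: probe a health endpoint, logging one line per attempt.
poll() {
  url=$1; tries=${2:-5}
  i=0
  while [ "$i" -lt "$tries" ]; do
    # --max-time keeps a dead endpoint from hanging the loop.
    code=$(curl -sS -o /dev/null --max-time 5 -w '%{http_code}' "$url" 2>/dev/null) || code=000
    printf '%s try=%d code=%s\n' "$(date -u +%H:%M:%SZ)" "$i" "$code"
    [ "$code" = "200" ] && return 0
    i=$((i + 1))
    sleep 1
  done
  return 1
}

poll https://api.swarm.customer.com/healthz 1 || true
```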
### "Pipelines are failing"

```sql
-- Recent failures
SELECT run_id, template, failed_agent, failure_reason, ts
FROM run_events
WHERE kind = 'pipeline_failed' AND ts > now() - interval '1 hour'
ORDER BY ts DESC LIMIT 20;

-- Pattern across failures
SELECT failed_agent, COUNT(*) FROM run_events
WHERE kind = 'pipeline_failed' AND ts > now() - interval '1 hour'
GROUP BY failed_agent ORDER BY 2 DESC;
```
Common causes:

- LLM provider 5xx / 429 (check `swarm_llm_calls_total{state="error"}`)
- Tool hang → timeout (check `swarm_tool_calls_total{state="timeout"}`)
- Permission rule misfire (check for a spike in `swarm_permission_decisions_total{behavior="deny"}`)
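Each of the common causes above has a metric, so a quick loop can check all three against Prometheus in one pass. Sketch only — `PROM` is an assumed in-cluster address for your monitoring stack; the metric names come from this page:

```shell
# Assumed Prometheus address — point PROM at your monitoring stack.
PROM="${PROM:-http://prometheus.monitoring.svc:9090}"

# One query per common cause listed above.
QUERIES='sum(rate(swarm_llm_calls_total{state="error"}[5m]))
sum(rate(swarm_tool_calls_total{state="timeout"}[5m]))
sum(rate(swarm_permission_decisions_total{behavior="deny"}[5m]))'

printf '%s\n' "$QUERIES" | while IFS= read -r q; do
  printf 'query: %s\n' "$q"
  # --max-time keeps the loop moving if Prometheus is unreachable.
  curl -sG --max-time 5 "$PROM/api/v1/query" --data-urlencode "query=$q" || true
  echo
done
```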
### "Permission denials spiking"

```shell
# What's being denied?
curl -sG https://api.customer.com/api/v1/permissions/denials \
  --data-urlencode "since=600" \
  | jq '.items | group_by(.rule_source) | map({source: .[0].rule_source, count: length})'
```

Common causes:

- A policy rule mis-authored (wrong tool name)
- A compliance-profile change pushed without a matching pipeline update
- Attacker probing (unlikely; it would also show up in abnormal HTTP patterns)
### "Cost exploding"

```shell
# Which agent / which model is burning cash?
swarm reports costs --from <start_of_incident> --group-by agent
```

```sql
-- Recent LLM calls — are they retrying?
SELECT agent, model, state, COUNT(*) FROM run_events
WHERE kind = 'llm_call' AND ts > now() - interval '1 hour'
GROUP BY agent, model, state;
```

Common cause: a retry storm after LLM timeouts. The per-pipeline budget cap should have engaged and paused the run — verify that `SWARM_BUDGET_PER_PIPELINE_USD` is set.
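To sanity-check whether the call volume you are seeing actually adds up to the spend, a back-of-envelope burn-rate calculation helps. All three inputs below are placeholders — substitute the counts from the query above and your model's real pricing:

```shell
awk 'BEGIN {
  calls_per_hour = 1200   # placeholder: COUNT(*) from the query above
  avg_tokens     = 3000   # placeholder: prompt + completion tokens per call
  usd_per_1k     = 0.01   # placeholder: blended rate for the model
  printf "est. burn: $%.2f/hour\n", calls_per_hour * avg_tokens / 1000 * usd_per_1k
}'
# prints: est. burn: $36.00/hour
```

If the estimate is far below the billed rate, suspect retries or a second, unmetered caller.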
## Mitigation primitives

| Mitigation | How |
|---|---|
| Rollback last deploy | `helm rollback swarm <previous-revision> -n swarm` |
| Disable a feature flag | `swarm features set <flag> off` (immediate) |
| Block a misbehaving plugin | `swarm plugins uninstall <name>` |
| Scale up | `kubectl scale deploy swarm-api --replicas=10 -n swarm` |
| Pause cron scheduler | `swarm features set cron_scheduler off` |
| Stop accepting new pipelines | `kubectl patch deploy swarm-api -n swarm -p '{"spec":{"replicas":0}}'` — drastic; all users impacted |
## Communication templates
### Status-page initial update (P0)

```
Investigating — <timestamp>

We are investigating reports of <symptom>. Customers may experience
<impact>. We will update within 30 minutes.
```
### Status-page update after cause identified

```
Identified — <timestamp>

We've identified the cause: <plain-English description>. We are
implementing a fix. ETA: <time>.

Affected customers: <who>.
Data integrity: <statement — "no data loss" or specific>.
```
### Status-page resolution

```
Resolved — <timestamp>

The issue has been resolved. Root cause: <one-sentence>. A full
post-mortem will be published within 72 hours at <link>.

We apologize for the disruption.
```
### Customer email (Enterprise P0)

```
Subject: [Urgent] swarm service incident — <brief>

Hi <customer name>,

We're writing to inform you of a service incident affecting swarm
starting <start time>. Details:

Impact: <what you experienced>
Scope: <affected services / regions>
Root cause: <if known>
Current status: <investigating / identified / mitigating / resolved>
ETA for resolution: <time>

What we're doing:
- <action 1>
- <action 2>

Your action required: <usually none, but sometimes — e.g. "rotate API key">

We'll update you every hour until resolution. For urgent concerns,
reply to this email or call +91-XXX-XXX-XXXX.

Regards,
<IC name>, Incident Commander
<timestamp>
```
## Post-incident RCA (template)

```markdown
# RCA: <short title>

**Incident:** <ID>
**Date:** <YYYY-MM-DD>
**Duration:** <hh:mm to hh:mm>
**Severity:** P<N>
**Services affected:** <list>
**Customers affected:** <count / segments>

## Summary (3 sentences)
<What happened. What caused it. What we did about it.>

## Impact
- <Measurable impact: X pipelines failed, Y% users degraded, Z minutes downtime>

## Timeline
- <HH:MM> Event X happened
- <HH:MM> Alert fired
- <HH:MM> Incident declared
- <HH:MM> Mitigation applied
- <HH:MM> Fully resolved

## Root cause
<What fundamentally broke. 1-2 paragraphs.>

## What went well
- <specific thing>

## What went poorly
- <specific thing>

## Action items
| # | Action | Owner | Due |
|---|---|---|---|
| 1 | Add alert for X | @person | YYYY-MM-DD |
| 2 | Document Y | @person | YYYY-MM-DD |

## Lessons
<What we now understand better.>
```
Published externally for P0/P1. Internal-only for P2/P3 unless customers specifically ask.
## Forensics for security incidents
For suspected compromise or data exposure:
- Preserve evidence — do NOT restart pods, do NOT roll back deploys. Snapshot first.
- Quarantine — network-policy the affected pod, don't kill it
- Collect artefacts — pod logs, container memory snapshot, network policy state, recent config changes
- Chain of custody — document who touched what, when
- Legal + customer comms — regulatory reporting deadlines may be as tight as 24-72h (DPDPA, HITECH, GDPR)
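The quarantine step above can be done with a deny-all NetworkPolicy keyed to a label you put on the suspect pod: the pod keeps running for forensics but stops talking. A sketch under assumptions — the `swarm` namespace comes from this page, while the `incident/quarantine` label is purely a local convention:

```shell
# Write a deny-all policy scoped to a quarantine label.
# With no ingress/egress rules listed, matching pods are cut off entirely.
cat > quarantine.yaml <<'EOF'
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: incident-quarantine
  namespace: swarm
spec:
  podSelector:
    matchLabels:
      incident/quarantine: "true"
  policyTypes:
    - Ingress
    - Egress
EOF

# Label the pod, then apply — the pod stays up but is isolated:
#   kubectl -n swarm label pod <pod> incident/quarantine=true
#   kubectl apply -f quarantine.yaml
```

Note this only takes effect if your CNI enforces NetworkPolicy (e.g. Calico or Cilium).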
## Next
- SECURITY.md — vuln disclosure process
- Backup & restore — data-loss mitigation
- Observability — how to instrument to reduce MTTR