# Incident response
Triage, communicate, resolve, learn. This page is the on-call playbook.
## Severity definitions
| Level | What qualifies | Response time (Enterprise SLA) |
|---|---|---|
| P0 | Platform down; data loss; confirmed breach | Ack 15min, first comms 30min, status-page 1h |
| P1 | Major feature broken for all users; severe degradation | Ack 1h, fix target 4h |
| P2 | Feature broken for some users; workaround exists | Ack 4h, fix next business day |
| P3 | Minor bug, enhancement, question | Ack 24h, next scheduled release |
SLAs differ by customer tier — Enterprise gets tighter response times than Team, Team tighter than Developer, and Community is best-effort.
## Triage flow

```
Alert fires (PagerDuty / email / on-call phone)
        │
        ▼
[Acknowledge] — within the SLA window
        │
        ▼
[Assess severity] — is this really a P0? Verify scope
        │
        ▼
[Declare + communicate]
        │ - status-page update
        │ - Slack Connect to affected Enterprise customers
        │ - initial "we're investigating" note
        │
        ▼
[Investigate] — check metrics, logs, traces, recent deploys
        │
        ▼
[Mitigate] — rollback / feature-flag off / scale / rate-limit
        │
        ▼
[Verify fix] — confirm customer impact is resolved
        │
        ▼
[Document RCA] — root-cause analysis, published within 72h for P0
        │
        ▼
[Schedule post-mortem] — blameless, within 1 week
```
## First-response playbook (any severity)
- Acknowledge the alert — stop the paging
- Open a dedicated Slack thread named `#inc-YYYYMMDD-brief-description`
- Assign roles:
    - Incident commander (IC) — drives the call
    - Communications lead — writes customer and status-page updates
    - Investigator — dives into logs / metrics / code
- Set a 15-minute check-in cadence during active incidents
- Record everything — timestamps, decisions, commands run
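The "record everything" step is easier with a one-line helper defined at the top of the on-call shell session. A minimal sketch — `note` and the `incident.log` path are local conventions for illustration, not part of the swarm CLI:

```shell
# note: append a UTC-timestamped entry to the incident log.
# INC_LOG and the log format are local conventions, not a swarm feature.
note() {
  printf '%s  %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$*" >> "${INC_LOG:-incident.log}"
}

note "declared P1: API 5xx spike on swarm-api"
note "mitigation: helm rollback swarm 41"
```

The resulting transcript pastes straight into the RCA timeline afterwards.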
## Investigation patterns
### "The API is down"

```shell
# Is it actually down?
curl -sSw "%{http_code}\n" https://api.swarm.customer.com/healthz

# Pod status
kubectl -n swarm get pods -o wide

# Recent events
kubectl -n swarm get events --sort-by='.lastTimestamp' | tail -20

# Container logs, last 100 lines
kubectl -n swarm logs -l app=swarm-api --tail=100

# Database reachable?
kubectl -n swarm exec -it <pod> -- swarm admin db-ping

# LLM providers reachable?
kubectl -n swarm exec -it <pod> -- swarm llm ping
```
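When the health endpoint is flapping, a short poll loop gives a timestamped record instead of repeated one-off curls. A generic sketch — the URL is the same placeholder as above, and `poll` is a throwaway helper, not a swarm command:

```shell
# poll <url> [tries]: probe a health endpoint, logging one line per attempt.
poll() {
  url=$1; tries=${2:-5}
  i=0
  while [ "$i" -lt "$tries" ]; do
    # --max-time keeps a dead endpoint from hanging the loop.
    code=$(curl -sS -o /dev/null --max-time 5 -w '%{http_code}' "$url" 2>/dev/null) || code=000
    printf '%s try=%d code=%s\n' "$(date -u +%H:%M:%SZ)" "$i" "$code"
    [ "$code" = "200" ] && return 0
    i=$((i + 1))
    sleep 1
  done
  return 1
}

poll https://api.swarm.customer.com/healthz 1 || true
```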
### "Pipelines are failing"

```sql
-- Recent failures
SELECT run_id, template, failed_agent, failure_reason, ts
FROM run_events
WHERE kind = 'pipeline_failed' AND ts > now() - interval '1 hour'
ORDER BY ts DESC LIMIT 20;

-- Pattern across failures
SELECT failed_agent, COUNT(*) FROM run_events
WHERE kind = 'pipeline_failed' AND ts > now() - interval '1 hour'
GROUP BY failed_agent ORDER BY 2 DESC;
```
Common causes:

- LLM provider 5xx / 429 (check `swarm_llm_calls_total{state="error"}`)
- Tool hang → timeout (check `swarm_tool_calls_total{state="timeout"}`)
- Permission rule misfire (check for a spike in `swarm_permission_decisions_total{behavior="deny"}`)
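Each of the common causes above has a metric, so a quick loop can check all three against Prometheus in one pass. Sketch only — `PROM` is an assumed in-cluster address for your monitoring stack; the metric names come from this page:

```shell
# Assumed Prometheus address — point PROM at your monitoring stack.
PROM="${PROM:-http://prometheus.monitoring.svc:9090}"

# One query per common cause listed above.
QUERIES='sum(rate(swarm_llm_calls_total{state="error"}[5m]))
sum(rate(swarm_tool_calls_total{state="timeout"}[5m]))
sum(rate(swarm_permission_decisions_total{behavior="deny"}[5m]))'

printf '%s\n' "$QUERIES" | while IFS= read -r q; do
  printf 'query: %s\n' "$q"
  # --max-time keeps the loop moving if Prometheus is unreachable.
  curl -sG --max-time 5 "$PROM/api/v1/query" --data-urlencode "query=$q" || true
  echo
done
```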
### "Permission denials spiking"

```shell
# What's being denied?
curl -sG https://api.customer.com/api/v1/permissions/denials \
  --data-urlencode "since=600" \
  | jq '.items | group_by(.rule_source) | map({source: .[0].rule_source, count: length})'
```

Common causes:

- A policy rule mis-authored (wrong tool name)
- A compliance-profile change pushed without a matching pipeline update
- Attacker probing (unlikely; it would also show up in abnormal HTTP patterns)
### "Cost exploding"

```shell
# Which agent / which model is burning cash?
swarm reports costs --from <start_of_incident> --group-by agent
```

```sql
-- Recent LLM calls — are they retrying?
SELECT agent, model, state, COUNT(*) FROM run_events
WHERE kind = 'llm_call' AND ts > now() - interval '1 hour'
GROUP BY agent, model, state;
```

Common cause: a retry storm after LLM timeouts. The per-pipeline budget cap should have engaged and paused the run — verify that `SWARM_BUDGET_PER_PIPELINE_USD` is set.
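To sanity-check whether the call volume you are seeing actually adds up to the spend, a back-of-envelope burn-rate calculation helps. All three inputs below are placeholders — substitute the counts from the query above and your model's real pricing:

```shell
awk 'BEGIN {
  calls_per_hour = 1200   # placeholder: COUNT(*) from the query above
  avg_tokens     = 3000   # placeholder: prompt + completion tokens per call
  usd_per_1k     = 0.01   # placeholder: blended rate for the model
  printf "est. burn: $%.2f/hour\n", calls_per_hour * avg_tokens / 1000 * usd_per_1k
}'
# prints: est. burn: $36.00/hour
```

If the estimate is far below the billed rate, suspect retries or a second, unmetered caller.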
## Mitigation primitives

| Mitigation | How |
|---|---|
| Rollback last deploy | `helm rollback swarm <previous-revision> -n swarm` |
| Disable a feature flag | `swarm features set <flag> off` (immediate) |
| Block a misbehaving plugin | `swarm plugins uninstall <name>` |
| Scale up | `kubectl scale deploy swarm-api --replicas=10 -n swarm` |
| Pause cron scheduler | `swarm features set cron_scheduler off` |
| Stop accepting new pipelines | `kubectl patch deploy swarm-api -n swarm -p '{"spec":{"replicas":0}}'` — drastic; all users impacted |
## Communication templates
### Status-page initial update (P0)

```
Investigating — <timestamp>

We are investigating reports of <symptom>. Customers may experience
<impact>. We will update within 30 minutes.
```
### Status-page update after cause identified

```
Identified — <timestamp>

We've identified the cause: <plain-English description>. We are
implementing a fix. ETA: <time>.

Affected customers: <who>.
Data integrity: <statement — "no data loss" or specific>.
```
### Status-page resolution

```
Resolved — <timestamp>

The issue has been resolved. Root cause: <one-sentence>. A full
post-mortem will be published within 72 hours at <link>.

We apologize for the disruption.
```
### Customer email (Enterprise P0)

```
Subject: [Urgent] swarm service incident — <brief>

Hi <customer name>,

We're writing to inform you of a service incident affecting swarm
starting <start time>. Details:

Impact: <what you experienced>
Scope: <affected services / regions>
Root cause: <if known>
Current status: <investigating / identified / mitigating / resolved>
ETA for resolution: <time>

What we're doing:
- <action 1>
- <action 2>

Your action required: <usually none, but sometimes — e.g. "rotate API key">

We'll update you every hour until resolution. For urgent concerns,
reply to this email or call +91-XXX-XXX-XXXX.

Regards,
<IC name>, Incident Commander
<timestamp>
```
## Post-incident RCA (template)

```markdown
# RCA: <short title>

**Incident:** <ID>
**Date:** <YYYY-MM-DD>
**Duration:** <hh:mm to hh:mm>
**Severity:** P<N>
**Services affected:** <list>
**Customers affected:** <count / segments>

## Summary (3 sentences)
<What happened. What caused it. What we did about it.>

## Impact
- <Measurable impact: X pipelines failed, Y% users degraded, Z minutes downtime>

## Timeline
- <HH:MM> Event X happened
- <HH:MM> Alert fired
- <HH:MM> Incident declared
- <HH:MM> Mitigation applied
- <HH:MM> Fully resolved

## Root cause
<What fundamentally broke. 1-2 paragraphs.>

## What went well
- <specific thing>

## What went poorly
- <specific thing>

## Action items
| # | Action | Owner | Due |
|---|---|---|---|
| 1 | Add alert for X | @person | YYYY-MM-DD |
| 2 | Document Y | @person | YYYY-MM-DD |

## Lessons
<What we now understand better.>
```
Published externally for P0/P1. Internal-only for P2/P3 unless customers specifically ask.
## Forensics for security incidents
For suspected compromise or data exposure:
- Preserve evidence — do NOT restart pods, do NOT roll back deploys. Snapshot first.
- Quarantine — network-policy the affected pod, don't kill it
- Collect artefacts — pod logs, container memory snapshot, network policy state, recent config changes
- Chain of custody — document who touched what, when
- Legal + customer comms — regulatory reporting deadlines may be as tight as 24-72h (DPDPA, HITECH, GDPR)
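The quarantine step above can be done with a deny-all NetworkPolicy keyed to a label you put on the suspect pod: the pod keeps running for forensics but stops talking. A sketch under assumptions — the `swarm` namespace comes from this page, while the `incident/quarantine` label is purely a local convention:

```shell
# Write a deny-all policy scoped to a quarantine label.
# With no ingress/egress rules listed, matching pods are cut off entirely.
cat > quarantine.yaml <<'EOF'
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: incident-quarantine
  namespace: swarm
spec:
  podSelector:
    matchLabels:
      incident/quarantine: "true"
  policyTypes:
    - Ingress
    - Egress
EOF

# Label the pod, then apply — the pod stays up but is isolated:
#   kubectl -n swarm label pod <pod> incident/quarantine=true
#   kubectl apply -f quarantine.yaml
```

Note this only takes effect if your CNI enforces NetworkPolicy (e.g. Calico or Cilium).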
## Next
- SECURITY.md — vuln disclosure process
- Backup & restore — data-loss mitigation
- Observability — how to instrument to reduce MTTR