Backup & restore#
swarm state is split across Postgres, object storage, and filesystem journals. Each component needs its own backup and restore procedure.
What to back up#
| Component | Contains | Backup method |
|---|---|---|
| Postgres | runs, events, denials, plugin registry, approvals, feature flags | pg_dump + managed snapshots |
| Object storage | pipeline_runs/ (models, reports, audit PDFs, conversations) | bucket replication + lifecycle rules |
| ~/.swarm/ state (cron + batch) | cron jobs.json, batch run_dirs | volume snapshot or tarball |
| Secrets | JWT secret, OIDC credentials, LLM API keys | Secrets Manager / Vault / KMS (not in app backups) |
Backup schedule (recommended)#
| Type | Frequency | Retention |
|---|---|---|
| Postgres logical (pg_dump) | Nightly | 30 days |
| Postgres physical (pg_basebackup) | Weekly | 3 months |
| WAL archiving | Continuous (15-min segments) | 7 days |
| Object-storage replication | Continuous (cross-region) | Matches artefact TTL (7y for audit) |
| ~/.swarm/ volume snapshot | Nightly | 30 days |
Postgres backup#
Option 1 — managed provider (RDS / Cloud SQL / Azure DB)#
Enable in the provider console:

- Automated backups — daily, 14-35 days retention
- Point-in-time recovery — WAL retention 1-7 days
- Cross-region replica — for disaster recovery
Cost-efficient; proven; requires zero app-side work.
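On RDS, the same settings can also be applied from the CLI. A sketch, assuming an instance named swarm-db (the identifier is illustrative):

```bash
# Enable automated daily backups with 14-day retention
aws rds modify-db-instance \
  --db-instance-identifier swarm-db \
  --backup-retention-period 14 \
  --apply-immediately
```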
Option 2 — self-managed#
```bash
# Nightly logical dump
pg_dump -h <host> -U swarm -d swarm -F c -Z 6 \
  -f /backups/swarm-$(date +%F).dump

# Verify
pg_restore -l /backups/swarm-2026-04-15.dump | head

# Upload to S3
aws s3 cp /backups/swarm-2026-04-15.dump \
  s3://acme-swarm-backups/postgres/$(date +%F).dump \
  --storage-class STANDARD_IA
```
Automate via cron or Kubernetes CronJob.
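A cron wrapper should pair the dump with a retention prune so local disks don't fill. A sketch, where prune_dumps and the paths are illustrative:

```bash
#!/bin/sh
# prune_dumps DIR DAYS: delete dumps in DIR older than DAYS days
prune_dumps() {
  find "$1" -name 'swarm-*.dump' -mtime "+$2" -delete
}

# Nightly job body (crontab entry: 0 2 * * * /usr/local/bin/swarm-pg-backup.sh):
#   pg_dump -h "$PGHOST" -U swarm -d swarm -F c -Z 6 -f /backups/swarm-$(date +%F).dump
#   prune_dumps /backups 30
```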
Restore from pg_dump#
```bash
# Download
aws s3 cp s3://acme-swarm-backups/postgres/2026-04-15.dump /tmp/

# Restore to empty DB
createdb -h <host> -U swarm swarm_restored
pg_restore -h <host> -U swarm -d swarm_restored /tmp/2026-04-15.dump

# Verify
psql -h <host> -U swarm -d swarm_restored -c "SELECT COUNT(*) FROM run_events;"

# Swap (requires brief API downtime)
kubectl -n swarm scale deploy swarm-api --replicas=0
# Point swarm at the restored DB
kubectl -n swarm set env deploy/swarm-api SWARM_DB_URL=postgres://.../swarm_restored
kubectl -n swarm scale deploy swarm-api --replicas=3
```
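It's worth gating the swap on a basic sanity check so an empty restore never goes live. A sketch; check_restore, the query, and the threshold are illustrative:

```bash
# check_restore COUNT: fail the cutover if the restored events table looks empty
check_restore() {
  [ "$1" -ge 1 ] || { echo "run_events is empty in the restored DB, aborting swap" >&2; return 1; }
}

# COUNT=$(psql -h <host> -U swarm -d swarm_restored -tAc "SELECT COUNT(*) FROM run_events")
# check_restore "$COUNT" && kubectl -n swarm scale deploy swarm-api --replicas=0
```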
Point-in-time recovery#
If you have WAL archiving + a recent base backup:
```bash
# Restore base backup to a scratch data directory
pg_basebackup -h <replica> -D /tmp/pgdata -P -X stream

# Fetch archived WALs to a staging directory (not pg_wal itself)
aws s3 sync s3://acme-swarm-backups/wal/ /tmp/wal/

# Configure recovery target
echo "restore_command = 'cp /tmp/wal/%f %p'" >> /tmp/pgdata/postgresql.conf
echo "recovery_target_time = '2026-04-15 09:30:00 IST'" >> /tmp/pgdata/postgresql.conf
touch /tmp/pgdata/recovery.signal

# Start postgres pointing at /tmp/pgdata
pg_ctl -D /tmp/pgdata start
```
Verify state; then swap as above.
Object storage backup#
S3#
```yaml
# Bucket versioning — ensures no overwrites/deletes lose data
Versioning: Enabled

# Lifecycle — keeps audit artefacts long, prunes operational data
LifecycleConfiguration:
  Rules:
    - Id: audit-retention
      Filter: { Prefix: "audit/" }
      Transitions:
        - Days: 90
          StorageClass: GLACIER_IR
        - Days: 365
          StorageClass: DEEP_ARCHIVE
      Expiration:
        Days: 2555  # 7 years
    - Id: conversations-retention
      Filter: { Prefix: "conversations/" }
      Expiration:
        Days: 365

# Cross-region replication — disaster recovery
ReplicationConfiguration:
  Role: <IAM role>
  Rules:
    - Status: Enabled
      Destination:
        Bucket: arn:aws:s3:::acme-swarm-drbr
        StorageClass: STANDARD_IA
```
Restore: recover a prior object version, or copy from the DR bucket. No additional tooling needed.
GCS / Azure Blob#
Equivalent features: GCS Object Versioning + Lifecycle; Azure Blob versioning + soft delete + GRS replication.
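For reference, the equivalent toggles from the CLI. A sketch with illustrative bucket and account names; verify the exact flags against your provider's current docs:

```bash
# GCS: enable versioning and apply a lifecycle policy
gsutil versioning set on gs://acme-swarm-artifacts
gsutil lifecycle set lifecycle.json gs://acme-swarm-artifacts

# Azure Blob: enable versioning on the storage account
az storage account blob-service-properties update \
  --account-name acmeswarm --resource-group swarm-rg \
  --enable-versioning true
```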
~/.swarm/ state#
This directory contains cron jobs and batch run state. It lives in:
- Docker Compose: named volume swarm-state
- Kubernetes: PersistentVolumeClaim swarm-state
Backup:
```bash
# Docker Compose
docker run --rm -v swarm_swarm-state:/data -v $(pwd)/backups:/backup \
  alpine tar -czf /backup/swarm-state-$(date +%F).tar.gz -C /data .

# Kubernetes — via a snapshot
kubectl apply -f - <<EOF
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: swarm-state-$(date +%F)
  namespace: swarm
spec:
  volumeSnapshotClassName: csi-snapshot-class
  source:
    persistentVolumeClaimName: swarm-state
EOF
```
Restore: mount the snapshot / untar to a fresh volume, point swarm at it.
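For Compose, the restore is the backup command in reverse. A sketch; the volume and tarball names are illustrative:

```bash
# Extract the tarball into a fresh named volume, then point swarm at it
docker volume create swarm_swarm-state
docker run --rm -v swarm_swarm-state:/data -v $(pwd)/backups:/backup \
  alpine tar -xzf /backup/swarm-state-2026-04-15.tar.gz -C /data
```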
Test restores regularly#
Backups you haven't tested are hopes, not backups.
Quarterly DR drill:
```bash
# 1. Provision a scratch stack
helm install swarm-dr swarm/swarm -n swarm-dr --create-namespace \
  -f values-dr.yaml

# 2. Restore Postgres from yesterday's backup
pg_restore -h swarm-dr-db ... /backups/swarm-yesterday.dump

# 3. Point the stack at the restored data
kubectl -n swarm-dr set env deploy/swarm-api SWARM_DB_URL=...

# 4. Validate — should match yesterday's prod count ±1
curl https://swarm-dr.customer.com/api/v1/pipelines | jq '.total'

# 5. Teardown
helm uninstall swarm-dr -n swarm-dr
```
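The count check in step 4 can be scripted so the drill fails loudly instead of relying on eyeballing. A sketch; within_tolerance is an illustrative helper:

```bash
# within_tolerance PROD DR [TOL]: succeed if |PROD - DR| <= TOL (default 1)
within_tolerance() {
  d=$(( $1 - $2 ))
  [ "${d#-}" -le "${3:-1}" ]
}

# PROD=$(curl -s https://swarm.customer.com/api/v1/pipelines | jq '.total')
# DR=$(curl -s https://swarm-dr.customer.com/api/v1/pipelines | jq '.total')
# within_tolerance "$PROD" "$DR" || echo "DR drill failed: counts diverge" >&2
```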
Document the RTO (time from disaster declared to service restored) + RPO (data-loss window).
Target:

- Enterprise tier: RTO 4h, RPO 1h
- Team tier: RTO 24h, RPO 24h
- Community self-host: best-effort (your own SLAs)
Secrets backup#
SWARM_JWT_SECRET, OIDC client secret, LLM API keys — these must be preserved separately:
- Kubernetes secrets — backed up as part of cluster backup (Velero, etc.)
- AWS Secrets Manager — automatic versioning; point-in-time recovery
- HashiCorp Vault — raft storage; periodic snapshots
If you lose the JWT secret: all existing JWTs become invalid. Users log back in via OIDC (no data loss, but a user-visible disruption). Rotate any API keys issued to external callers.
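Rotation itself is cheap. A sketch for minting a replacement secret; the kubectl rollout commands are illustrative and assume Kubernetes secrets:

```bash
# Generate a fresh 48-byte secret, base64-encoded
NEW_SECRET=$(openssl rand -base64 48)

# Roll it out (adjust to your secret store)
# kubectl -n swarm create secret generic swarm-jwt \
#   --from-literal=SWARM_JWT_SECRET="$NEW_SECRET" --dry-run=client -o yaml | kubectl apply -f -
# kubectl -n swarm rollout restart deploy/swarm-api
```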
If you lose the DB + secrets: catastrophic. Restore Postgres from backup; rotate secrets; users log back in via OIDC.
BYOK key backup#
If using customer-managed keys:
- AWS KMS — auto-backed-up by AWS; multi-region keys for DR
- GCP KMS — same; use key replication
- Azure Key Vault — soft-delete + purge-protection enabled
If the customer revokes / loses the KMS key: data encrypted with that key is unrecoverable. This is the contract. Design the key rotation + archive strategy BEFORE production use.
Migration (prod → dev) for debugging#
Sometimes you need prod data in dev to reproduce an issue:
# 1. Export scrubbed prod data (removes PII)
swarm admin export \
--run-id <suspect_id> \
--scrub \
--output /tmp/run-<id>-scrubbed.tar.gz
# 2. Import into dev
cd dev-swarm
swarm admin import /tmp/run-<id>-scrubbed.tar.gz
The --scrub flag runs field-level redaction on conversations + events: removes PAN, Aadhaar, SSN, email, phone patterns. Model artefacts are preserved (they don't contain raw data). Good for reproducing bugs without PII contamination.
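A cheap belt-and-braces check before importing: scan the scrubbed archive for patterns the redactor should have removed. A sketch; has_email is an illustrative helper and the regex only covers email addresses, not the other PII classes:

```bash
# has_email FILE.tar.gz: exit 0 if any entry still contains an email-like string
has_email() {
  tar -xzOf "$1" | grep -Eq '[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}'
}

# has_email /tmp/run-scrubbed.tar.gz && echo "possible PII leak, do not import" >&2
```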
Next#
- Retention — how long things live before prune
- Incident response — disaster-recovery workflow
- Deployment: Data residency — cross-region replication + BYOK