Backup & restore#
swarm state is split across Postgres, object storage, and filesystem journals. Each component needs its own backup and restore procedure.
What to back up#
| Component | Contains | Backup method |
|---|---|---|
| Postgres | runs, events, denials, plugin registry, approvals, feature flags | pg_dump + managed snapshots |
| Object storage | pipeline_runs/ (models, reports, audit PDFs, conversations) | bucket replication + lifecycle rules |
| ~/.swarm/ state (cron + batch) | cron jobs.json, batch run_dirs | volume snapshot or tarball |
| Secrets | JWT secret, OIDC credentials, LLM API keys | Secrets Manager / Vault / KMS (not in app backups) |
Backup schedule (recommended)#
| Type | Frequency | Retention |
|---|---|---|
| Postgres logical (pg_dump) | Nightly | 30 days |
| Postgres physical (pg_basebackup) | Weekly | 3 months |
| WAL archiving | Continuous (15-min segments) | 7 days |
| Object-storage replication | Continuous (cross-region) | Matches artefact TTL (7y for audit) |
| ~/.swarm/ volume snapshot | Nightly | 30 days |
Postgres backup#
Option 1 — managed provider (RDS / Cloud SQL / Azure DB)#
Enable in the provider console:

- Automated backups — daily, 14-35 days retention
- Point-in-time recovery — WAL retention 1-7 days
- Cross-region replica — for disaster recovery
Cost-efficient; proven; requires zero app-side work.
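On RDS, the same settings can also be applied from the CLI. A sketch, assuming an instance named swarm-db (the identifier is illustrative):

```bash
# Enable automated daily backups with 14-day retention
aws rds modify-db-instance \
  --db-instance-identifier swarm-db \
  --backup-retention-period 14 \
  --apply-immediately
```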
Option 2 — self-managed#
```bash
# Nightly logical dump
pg_dump -h <host> -U swarm -d swarm -F c -Z 6 \
  -f /backups/swarm-$(date +%F).dump

# Verify
pg_restore -l /backups/swarm-2026-04-15.dump | head

# Upload to S3
aws s3 cp /backups/swarm-2026-04-15.dump \
  s3://acme-swarm-backups/postgres/$(date +%F).dump \
  --storage-class STANDARD_IA
```
Automate via cron or Kubernetes CronJob.
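A cron wrapper should pair the dump with a retention prune so local disks don't fill. A sketch, where prune_dumps and the paths are illustrative:

```bash
#!/bin/sh
# prune_dumps DIR DAYS: delete dumps in DIR older than DAYS days
prune_dumps() {
  find "$1" -name 'swarm-*.dump' -mtime "+$2" -delete
}

# Nightly job body (crontab entry: 0 2 * * * /usr/local/bin/swarm-pg-backup.sh):
#   pg_dump -h "$PGHOST" -U swarm -d swarm -F c -Z 6 -f /backups/swarm-$(date +%F).dump
#   prune_dumps /backups 30
```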
Restore from pg_dump#
```bash
# Download
aws s3 cp s3://acme-swarm-backups/postgres/2026-04-15.dump /tmp/

# Restore to empty DB
createdb -h <host> -U swarm swarm_restored
pg_restore -h <host> -U swarm -d swarm_restored /tmp/2026-04-15.dump

# Verify
psql -h <host> -U swarm -d swarm_restored -c "SELECT COUNT(*) FROM run_events;"

# Swap (requires brief API downtime)
kubectl -n swarm scale deploy swarm-api --replicas=0
# Point swarm at the restored DB
kubectl -n swarm set env deploy/swarm-api SWARM_DB_URL=postgres://.../swarm_restored
kubectl -n swarm scale deploy swarm-api --replicas=3
```
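It's worth gating the swap on a basic sanity check so an empty restore never goes live. A sketch; check_restore, the query, and the threshold are illustrative:

```bash
# check_restore COUNT: fail the cutover if the restored events table looks empty
check_restore() {
  [ "$1" -ge 1 ] || { echo "run_events is empty in the restored DB, aborting swap" >&2; return 1; }
}

# COUNT=$(psql -h <host> -U swarm -d swarm_restored -tAc "SELECT COUNT(*) FROM run_events")
# check_restore "$COUNT" && kubectl -n swarm scale deploy swarm-api --replicas=0
```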
Point-in-time recovery#
If you have WAL archiving + a recent base backup:
```bash
# Restore base backup to a scratch data directory
pg_basebackup -h <replica> -D /tmp/pgdata -P -X stream

# Fetch archived WALs to a staging directory (not pg_wal itself)
aws s3 sync s3://acme-swarm-backups/wal/ /tmp/wal/

# Configure recovery target
echo "restore_command = 'cp /tmp/wal/%f %p'" >> /tmp/pgdata/postgresql.conf
echo "recovery_target_time = '2026-04-15 09:30:00 IST'" >> /tmp/pgdata/postgresql.conf
touch /tmp/pgdata/recovery.signal

# Start postgres pointing at /tmp/pgdata
pg_ctl -D /tmp/pgdata start
```
Verify state; then swap as above.
Object storage backup#
S3#
```yaml
# Bucket versioning — ensures no overwrites/deletes lose data
Versioning: Enabled

# Lifecycle — keeps audit artefacts long, prunes operational data
LifecycleConfiguration:
  Rules:
    - Id: audit-retention
      Filter: { Prefix: "audit/" }
      Transitions:
        - Days: 90
          StorageClass: GLACIER_IR
        - Days: 365
          StorageClass: DEEP_ARCHIVE
      Expiration:
        Days: 2555  # 7 years
    - Id: conversations-retention
      Filter: { Prefix: "conversations/" }
      Expiration:
        Days: 365

# Cross-region replication — disaster recovery
ReplicationConfiguration:
  Role: <IAM role>
  Rules:
    - Status: Enabled
      Destination:
        Bucket: arn:aws:s3:::acme-swarm-drbr
        StorageClass: STANDARD_IA
```
Restore: recover a prior object version, or copy from the DR bucket. No additional tooling needed.
GCS / Azure Blob#
Equivalent features: GCS Object Versioning + Lifecycle; Azure Blob versioning + soft delete + GRS replication.
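For reference, the equivalent toggles from the CLI. A sketch with illustrative bucket and account names; verify the exact flags against your provider's current docs:

```bash
# GCS: enable versioning and apply a lifecycle policy
gsutil versioning set on gs://acme-swarm-artifacts
gsutil lifecycle set lifecycle.json gs://acme-swarm-artifacts

# Azure Blob: enable versioning on the storage account
az storage account blob-service-properties update \
  --account-name acmeswarm --resource-group swarm-rg \
  --enable-versioning true
```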
~/.swarm/ state#
This directory contains cron jobs and batch run state. It lives in:
- Docker Compose: named volume swarm-state
- Kubernetes: PersistentVolumeClaim swarm-state
Backup:
```bash
# Docker Compose
docker run --rm -v swarm_swarm-state:/data -v $(pwd)/backups:/backup \
  alpine tar -czf /backup/swarm-state-$(date +%F).tar.gz -C /data .

# Kubernetes — via a snapshot
kubectl apply -f - <<EOF
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: swarm-state-$(date +%F)
  namespace: swarm
spec:
  volumeSnapshotClassName: csi-snapshot-class
  source:
    persistentVolumeClaimName: swarm-state
EOF
```
Restore: mount the snapshot / untar to a fresh volume, point swarm at it.
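For Compose, the restore is the backup command in reverse. A sketch; the volume and tarball names are illustrative:

```bash
# Extract the tarball into a fresh named volume, then point swarm at it
docker volume create swarm_swarm-state
docker run --rm -v swarm_swarm-state:/data -v $(pwd)/backups:/backup \
  alpine tar -xzf /backup/swarm-state-2026-04-15.tar.gz -C /data
```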
Test restores regularly#
Backups you haven't tested are hopes, not backups.
Quarterly DR drill:
```bash
# 1. Provision a scratch stack
helm install swarm-dr swarm/swarm -n swarm-dr --create-namespace \
  -f values-dr.yaml

# 2. Restore Postgres from yesterday's backup
pg_restore -h swarm-dr-db ... /backups/swarm-yesterday.dump

# 3. Point the stack at the restored data
kubectl -n swarm-dr set env deploy/swarm-api SWARM_DB_URL=...

# 4. Validate — should match yesterday's prod count ±1
curl https://swarm-dr.customer.com/api/v1/pipelines | jq '.total'

# 5. Teardown
helm uninstall swarm-dr -n swarm-dr
```
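The count check in step 4 can be scripted so the drill fails loudly instead of relying on eyeballing. A sketch; within_tolerance is an illustrative helper:

```bash
# within_tolerance PROD DR [TOL]: succeed if |PROD - DR| <= TOL (default 1)
within_tolerance() {
  d=$(( $1 - $2 ))
  [ "${d#-}" -le "${3:-1}" ]
}

# PROD=$(curl -s https://swarm.customer.com/api/v1/pipelines | jq '.total')
# DR=$(curl -s https://swarm-dr.customer.com/api/v1/pipelines | jq '.total')
# within_tolerance "$PROD" "$DR" || echo "DR drill failed: counts diverge" >&2
```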
Document the RTO (time from disaster declared to service restored) + RPO (data-loss window).
Target:

- Enterprise tier: RTO 4h, RPO 1h
- Team tier: RTO 24h, RPO 24h
- Community self-host: best-effort (your own SLAs)
Secrets backup#
SWARM_JWT_SECRET, OIDC client secret, LLM API keys — these must be preserved separately:
- Kubernetes secrets — backed up as part of cluster backup (Velero, etc.)
- AWS Secrets Manager — automatic versioning; point-in-time recovery
- HashiCorp Vault — raft storage; periodic snapshots
If you lose the JWT secret: all existing JWTs become invalid. Users log back in via OIDC (no data loss, but a user-visible disruption). Rotate any API keys issued to external callers.
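Rotation itself is cheap. A sketch for minting a replacement secret; the kubectl rollout commands are illustrative and assume Kubernetes secrets:

```bash
# Generate a fresh 48-byte secret, base64-encoded
NEW_SECRET=$(openssl rand -base64 48)

# Roll it out (adjust to your secret store)
# kubectl -n swarm create secret generic swarm-jwt \
#   --from-literal=SWARM_JWT_SECRET="$NEW_SECRET" --dry-run=client -o yaml | kubectl apply -f -
# kubectl -n swarm rollout restart deploy/swarm-api
```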
If you lose the DB + secrets: catastrophic. Restore Postgres from backup; rotate secrets; users log back in via OIDC.
BYOK key backup#
If using customer-managed keys:
- AWS KMS — auto-backed-up by AWS; multi-region keys for DR
- GCP KMS — same; use key replication
- Azure Key Vault — soft-delete + purge-protection enabled
If the customer revokes / loses the KMS key: data encrypted with that key is unrecoverable. This is the contract. Design the key rotation + archive strategy BEFORE production use.
Migration (prod → dev) for debugging#
Sometimes you need prod data in dev to reproduce an issue:
# 1. Export scrubbed prod data (removes PII)
swarm admin export \
--run-id <suspect_id> \
--scrub \
--output /tmp/run-<id>-scrubbed.tar.gz
# 2. Import into dev
cd dev-swarm
swarm admin import /tmp/run-<id>-scrubbed.tar.gz
The --scrub flag runs field-level redaction on conversations + events: removes PAN, Aadhaar, SSN, email, phone patterns. Model artefacts are preserved (they don't contain raw data). Good for reproducing bugs without PII contamination.
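A cheap belt-and-braces check before importing: scan the scrubbed archive for patterns the redactor should have removed. A sketch; has_email is an illustrative helper and the regex only covers email addresses, not the other PII classes:

```bash
# has_email FILE.tar.gz: exit 0 if any entry still contains an email-like string
has_email() {
  tar -xzOf "$1" | grep -Eq '[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}'
}

# has_email /tmp/run-scrubbed.tar.gz && echo "possible PII leak, do not import" >&2
```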
Next#
- Retention — how long things live before prune
- Incident response — disaster-recovery workflow
- Deployment: Data residency — cross-region replication + BYOK