Kubernetes + Helm#
Recommended production deployment. One Helm chart ships all topology variants (single-replica, multi-replica, multi-region, customer-VPC, air-gapped).
Helm chart ships in v0.12
The official Helm chart is targeted for v0.12. Until then, use the Docker Compose path behind a reverse proxy, or hand-roll a Deployment manifest from the shapes below. The values.yaml schema below is stable and will land unchanged.
Quick install (once shipped)#
```shell
helm repo add swarm https://charts.theaisingularity.org
helm repo update
helm install swarm swarm/swarm \
  --namespace swarm \
  --create-namespace \
  --values values.yaml
```
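Before installing, you can render the chart locally and inspect exactly what it would apply. This is standard Helm workflow, using the chart and values file from the command above:

```shell
# Render all manifests locally without touching the cluster
helm template swarm swarm/swarm --namespace swarm --values values.yaml > rendered.yaml

# Optional: validate the rendered manifests against your cluster's API server
kubectl apply --dry-run=server -f rendered.yaml
```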
Chart structure#
```text
deploy/helm/swarm/
├── Chart.yaml
├── values.yaml                    # defaults + schema
├── values.schema.json
├── templates/
│   ├── deployment-api.yaml
│   ├── deployment-dashboard.yaml
│   ├── service-api.yaml
│   ├── service-dashboard.yaml
│   ├── ingress.yaml
│   ├── configmap.yaml
│   ├── secret.yaml
│   ├── hpa.yaml                   # horizontal pod autoscaler
│   ├── pdb.yaml                   # pod disruption budget
│   ├── networkpolicy.yaml
│   ├── serviceaccount.yaml
│   ├── rolebinding.yaml
│   ├── job-migrate.yaml           # one-off DB migration on upgrade
│   └── statefulset-postgres.yaml  # optional — external Postgres recommended
└── charts/                        # sub-charts (kube-prometheus stack, cert-manager integration)
```
values.yaml reference#
```yaml
# Global
global:
  imageRegistry: ghcr.io/theaisingularity
  imagePullSecrets: []          # for private registry
  storageClass: gp3             # fast SSD

# API
api:
  enabled: true
  image:
    repository: swarm-api
    tag: "0.11.0"               # or "latest-lts"
    pullPolicy: IfNotPresent
  replicas: 3
  resources:
    requests: { cpu: 500m, memory: 1Gi }
    limits: { cpu: 2000m, memory: 4Gi }
  hpa:
    enabled: true
    minReplicas: 3
    maxReplicas: 20
    targetCPU: 70
  podDisruptionBudget:
    minAvailable: 2
  service:
    type: ClusterIP
    port: 8000
  env:
    SWARM_LOG_LEVEL: INFO
    SWARM_MAX_CONCURRENT_PIPELINES: "20"
  envFromSecretName: swarm-secrets

# Dashboard
dashboard:
  enabled: true
  image:
    repository: swarm-dashboard
    tag: "0.11.0"
  replicas: 2
  resources:
    requests: { cpu: 100m, memory: 256Mi }
    limits: { cpu: 500m, memory: 512Mi }
  service:
    port: 3000

# Ingress
ingress:
  enabled: true
  className: nginx
  hosts:
    - host: swarm.customer.com
      paths: [{ path: /, pathType: Prefix, backend: dashboard }]
    - host: api.swarm.customer.com
      paths: [{ path: /, pathType: Prefix, backend: api }]
  tls:
    - secretName: swarm-tls
      hosts: [swarm.customer.com, api.swarm.customer.com]
  # For cert-manager integration:
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod

# Database
postgres:
  # External (recommended for production)
  external: true
  host: swarm-db.<region>.rds.amazonaws.com
  port: 5432
  database: swarm
  existingSecretName: swarm-db-credentials
  # OR in-cluster (for PoC only)
  # external: false
  # image: postgres:16-alpine
  # persistence:
  #   enabled: true
  #   size: 50Gi

# Object storage
storage:
  backend: s3                   # s3 | gcs | azure | local
  bucket: swarm-runs
  prefix: prod/
  region: ap-south-1
  existingSecretName: swarm-storage-credentials

# Cron scheduler + batch daemon
cron:
  enabled: true
  replicas: 1                   # never >1; single scheduler
batch:
  enabled: true
  workerReplicas: 4             # separate worker deployment for batch jobs (future)

# Permissions + compliance
compliance:
  profile: rbi_free_ai          # default profile; pipeline override wins
  extraPolicyFiles: []

# Observability
observability:
  prometheus:
    enabled: true
    serviceMonitor: true        # auto-register with kube-prometheus-stack
  otel:
    enabled: true
    endpoint: https://otel-collector:4317
  logs:
    format: json
    level: INFO

# Security
security:
  podSecurityContext:
    runAsNonRoot: true
    runAsUser: 10001
    fsGroup: 10001
  containerSecurityContext:
    readOnlyRootFilesystem: true
    allowPrivilegeEscalation: false
    capabilities:
      drop: ["ALL"]
  networkPolicy:
    enabled: true
    allowedOutbound:
      # LLM providers
      - api.anthropic.com
      - api.openai.com
      - oauth2.googleapis.com
      # object storage (pre-resolved CIDRs via values)

# BYOK (customer-managed keys)
byok:
  enabled: false
  provider: aws-kms             # aws-kms | gcp-kms | azure-keyvault
  keyArn: "arn:aws:kms:..."     # for aws
```
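Once you have assembled a values file, lint it before installing. The chart ships a `values.schema.json`, so unknown keys and wrong types fail fast at lint/install time rather than at runtime (the local path below matches the chart structure shown earlier; use whichever form fits how you obtained the chart):

```shell
# Against a local checkout of the chart source
helm lint deploy/helm/swarm -f values.yaml

# Or against the published chart, without installing anything
helm install swarm swarm/swarm -n swarm -f values.yaml --dry-run
```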
Install + verify#
```shell
helm install swarm swarm/swarm -n swarm --create-namespace -f values.yaml
kubectl -n swarm get pods   # wait for Ready

# Port-forward for initial verify
kubectl -n swarm port-forward svc/swarm-api 8000:8000
curl http://localhost:8000/healthz
```
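To block until the rollout completes instead of polling `get pods`, you can wait on the Deployments directly. The resource names below assume the chart's default release-name prefix; adjust if your release is named differently:

```shell
kubectl -n swarm rollout status deployment/swarm-api --timeout=300s
kubectl -n swarm rollout status deployment/swarm-dashboard --timeout=300s
```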
Upgrade path#
```shell
# Backup first
kubectl -n swarm exec -it <postgres-pod> -- pg_dump ... > backup.sql

# Upgrade
helm upgrade swarm swarm/swarm -n swarm -f values.yaml --atomic --timeout 5m

# Helm runs the migration job automatically; verify:
kubectl -n swarm logs -l job-name=swarm-migrate
```
Rollback:
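A sketch of the standard Helm rollback flow, using the release name from the install above:

```shell
helm -n swarm history swarm          # list revisions
helm -n swarm rollback swarm <N> --wait
```

Note that `helm rollback` restores manifests only; it does not reverse database migrations. Before rolling back across a release that moved the schema, confirm the older version tolerates the new schema or restore from the backup taken above.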
Customer-VPC topology#
For BFSI / healthcare customers who demand single-tenant + BYOK:
```yaml
global:
  imageRegistry: customer-registry.internal   # air-gapped? mirror images here

postgres:
  external: true
  host: customer-rds.internal

storage:
  backend: s3
  bucket: customer-swarm-bucket
  endpoint: https://s3.customer.internal      # for non-AWS S3-compat

byok:
  enabled: true
  provider: aws-kms
  keyArn: <customer-managed-CMK-ARN>

# Deny any egress except the customer's proxies
security:
  networkPolicy:
    enabled: true
    allowedOutbound:
      - customer-llm-proxy.internal
      - customer-mcp-gateway.internal

ingress:
  hosts:
    - host: swarm.customer.internal
      paths: [...]
  tls:
    - secretName: swarm-tls
      hosts: [swarm.customer.internal]
```
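One caveat on `allowedOutbound`: vanilla Kubernetes NetworkPolicy matches pod selectors and IP blocks, not DNS names, so hostname allowlists like the above generally need either pre-resolved CIDRs or a CNI with FQDN support (e.g. Cilium). A hypothetical Cilium rendering of the proxy allowlist, assuming an `app: swarm-api` pod label (the chart's actual rendering may differ):

```yaml
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: swarm-api-egress
  namespace: swarm
spec:
  endpointSelector:
    matchLabels:
      app: swarm-api                # assumed label
  egress:
    - toFQDNs:
        - matchName: customer-llm-proxy.internal
        - matchName: customer-mcp-gateway.internal
    # Allow DNS lookups so the FQDN rules can resolve
    - toEndpoints:
        - matchLabels:
            k8s:io.kubernetes.pod.namespace: kube-system
            k8s-app: kube-dns
      toPorts:
        - ports:
            - port: "53"
              protocol: ANY
```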
Air-gapped#
For zero-egress zones:
- Bundle images: download all images + chart into a tar bundle
- Transfer to the air-gapped environment (physical media / approved transfer)
- Mirror registry: push into the internal registry
- Install pointing at internal registry
- License heartbeat: offline mode; license key validated against install fingerprint
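Until the documented flow ships, the bundling steps above map onto standard tooling roughly like this (chart version, image tags, and the internal registry name are illustrative):

```shell
# On a connected machine: pull the chart and images, bundle to tar
helm pull swarm/swarm --version 0.12.0
docker pull ghcr.io/theaisingularity/swarm-api:0.12.0
docker pull ghcr.io/theaisingularity/swarm-dashboard:0.12.0
docker save -o swarm-images.tar \
  ghcr.io/theaisingularity/swarm-api:0.12.0 \
  ghcr.io/theaisingularity/swarm-dashboard:0.12.0

# Inside the air gap: load, retag, push to the internal registry
docker load -i swarm-images.tar
docker tag ghcr.io/theaisingularity/swarm-api:0.12.0 registry.internal/swarm-api:0.12.0
docker push registry.internal/swarm-api:0.12.0
```

Then install with `global.imageRegistry: registry.internal` so every pod pulls from the mirror.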
Planned for v0.13. Once the chart ships, follow deploy/helm/swarm/AIRGAP.md.
Multi-region#
Pattern: one swarm install per region, data stays in region, control plane coordinates cross-region.
```yaml
# region-ap-south.values.yaml
global:
  imageRegistry: ghcr.io/theaisingularity
postgres:
  host: swarm-ap-south.rds.amazonaws.com
storage:
  bucket: swarm-ap-south
  region: ap-south-1
# Different secret per region
```

```yaml
# region-eu-west.values.yaml
postgres:
  host: swarm-eu-west.rds.amazonaws.com
storage:
  bucket: swarm-eu-west
  region: eu-west-3
```
Install each in its own namespace/cluster. Cross-region traffic never flows by default (networkPolicy denies egress between regions unless explicitly allowed).
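With one values file per region as above, each install is an independent release. Helm merges multiple `-f` files left to right, so shared settings can live in a base file (the `base.values.yaml` name and cluster contexts here are illustrative):

```shell
helm install swarm swarm/swarm -n swarm --create-namespace \
  -f base.values.yaml -f region-ap-south.values.yaml \
  --kube-context ap-south-cluster

helm install swarm swarm/swarm -n swarm --create-namespace \
  -f base.values.yaml -f region-eu-west.values.yaml \
  --kube-context eu-west-cluster
```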
See Data residency for the compliance logic.
Troubleshooting#
Pods crash-loop on startup (CrashLoopBackOff)
Most often the DB migration job failed. Check `kubectl -n swarm logs -l job-name=swarm-migrate`; fix the DB credentials or run the migrations manually, then restart the affected pods.
Dashboard 502s behind ingress
Check `kubectl -n swarm get ingress`. Verify `NEXT_PUBLIC_API_URL` points at the public ingress hostname (not cluster-internal).
HPA doesn't scale
Ensure metrics-server is installed (`kubectl get pods -n kube-system | grep metrics-server`). Without it, CPU/memory metrics aren't available.
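If metrics-server is present but the HPA still sits at minReplicas, inspect what the autoscaler actually sees (the HPA name below assumes the chart's default naming):

```shell
kubectl -n swarm describe hpa swarm-api   # check the Metrics and Events sections
kubectl -n swarm top pods                 # live usage; requires metrics-server
```

A common cause is missing resource requests on the container: CPU-percentage targets are computed against requests, so without them the HPA reports `<unknown>` and never scales.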
Next#
- Data residency — multi-region + BYOK
- Upgrade & versioning — LTS policy
- Operations: Observability — Prometheus, OTel