
Kubernetes + Helm#

Recommended production deployment. One Helm chart ships all topology variants (single-replica, multi-replica, multi-region, customer-VPC, air-gapped).

Helm chart ships in v0.12

The official Helm chart is targeted for v0.12. Today, use the Docker Compose path with a reverse proxy, or hand-roll a Deployment manifest from the shape below. The values.yaml schema below is stable and will land unchanged.
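Until the chart lands, the API Deployment can be hand-rolled from the values on this page. The sketch below takes the image tag, port, secret name, probe path, and resources from this document; everything else is plain-Kubernetes boilerplate and may differ from what the chart eventually renders:

```yaml
# Minimal hand-rolled equivalent of templates/deployment-api.yaml (sketch only)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: swarm-api
  namespace: swarm
spec:
  replicas: 3
  selector:
    matchLabels: { app: swarm-api }
  template:
    metadata:
      labels: { app: swarm-api }
    spec:
      containers:
        - name: api
          image: ghcr.io/theaisingularity/swarm-api:0.11.0
          ports: [{ containerPort: 8000 }]
          envFrom: [{ secretRef: { name: swarm-secrets } }]
          readinessProbe:
            httpGet: { path: /healthz, port: 8000 }
          resources:
            requests: { cpu: 500m, memory: 1Gi }
            limits:   { cpu: 2000m, memory: 4Gi }
```

Pair it with a matching ClusterIP Service on port 8000 and the Docker Compose reverse-proxy guidance for ingress.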

Quick install (once shipped)#

helm repo add swarm https://charts.theaisingularity.org
helm repo update

helm install swarm swarm/swarm \
  --namespace swarm \
  --create-namespace \
  --values values.yaml

Chart structure#

deploy/helm/swarm/
├── Chart.yaml
├── values.yaml              # defaults + schema
├── values.schema.json
├── templates/
│   ├── deployment-api.yaml
│   ├── deployment-dashboard.yaml
│   ├── service-api.yaml
│   ├── service-dashboard.yaml
│   ├── ingress.yaml
│   ├── configmap.yaml
│   ├── secret.yaml
│   ├── hpa.yaml             # horizontal pod autoscaler
│   ├── pdb.yaml             # pod disruption budget
│   ├── networkpolicy.yaml
│   ├── serviceaccount.yaml
│   ├── rolebinding.yaml
│   ├── job-migrate.yaml     # one-off DB migration on upgrade
│   └── statefulset-postgres.yaml   # optional — external Postgres recommended
└── charts/                  # sub-charts (kube-prometheus stack, cert-manager integration)

values.yaml reference#

# Global
global:
  imageRegistry: ghcr.io/theaisingularity
  imagePullSecrets: []           # for private registry
  storageClass: gp3              # fast SSD

# API
api:
  enabled: true
  image:
    repository: swarm-api
    tag: "0.11.0"                # or "latest-lts"
    pullPolicy: IfNotPresent
  replicas: 3
  resources:
    requests: { cpu: 500m, memory: 1Gi }
    limits:   { cpu: 2000m, memory: 4Gi }
  hpa:
    enabled: true
    minReplicas: 3
    maxReplicas: 20
    targetCPU: 70
  podDisruptionBudget:
    minAvailable: 2
  service:
    type: ClusterIP
    port: 8000
  env:
    SWARM_LOG_LEVEL: INFO
    SWARM_MAX_CONCURRENT_PIPELINES: "20"
  envFromSecretName: swarm-secrets

# Dashboard
dashboard:
  enabled: true
  image:
    repository: swarm-dashboard
    tag: "0.11.0"
  replicas: 2
  resources:
    requests: { cpu: 100m, memory: 256Mi }
    limits:   { cpu: 500m, memory: 512Mi }
  service:
    port: 3000

# Ingress
ingress:
  enabled: true
  className: nginx
  hosts:
    - host: swarm.customer.com
      paths: [{ path: /, pathType: Prefix, backend: dashboard }]
    - host: api.swarm.customer.com
      paths: [{ path: /, pathType: Prefix, backend: api }]
  tls:
    - secretName: swarm-tls
      hosts: [swarm.customer.com, api.swarm.customer.com]
  # For cert-manager integration:
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod

# Database
postgres:
  # External (recommended for production)
  external: true
  host: swarm-db.<region>.rds.amazonaws.com
  port: 5432
  database: swarm
  existingSecretName: swarm-db-credentials
  # OR in-cluster (for PoC only)
  # external: false
  # image: postgres:16-alpine
  # persistence:
  #   enabled: true
  #   size: 50Gi

# Object storage
storage:
  backend: s3                    # s3 | gcs | azure | local
  bucket: swarm-runs
  prefix: prod/
  region: ap-south-1
  existingSecretName: swarm-storage-credentials

# Cron scheduler + batch daemon
cron:
  enabled: true
  replicas: 1                    # never >1; single scheduler
batch:
  enabled: true
  workerReplicas: 4              # separate worker deployment for batch jobs (future)

# Permissions + compliance
compliance:
  profile: rbi_free_ai            # default profile; pipeline override wins
  extraPolicyFiles: []

# Observability
observability:
  prometheus:
    enabled: true
    serviceMonitor: true          # auto-register with kube-prometheus-stack
  otel:
    enabled: true
    endpoint: https://otel-collector:4317
  logs:
    format: json
    level: INFO

# Security
security:
  podSecurityContext:
    runAsNonRoot: true
    runAsUser: 10001
    fsGroup: 10001
  containerSecurityContext:
    readOnlyRootFilesystem: true
    allowPrivilegeEscalation: false
    capabilities:
      drop: ["ALL"]
  networkPolicy:
    enabled: true
    allowedOutbound:
      # LLM providers
      - anthropic.com
      - api.openai.com
      - oauth2.googleapis.com
      # object storage (pre-resolved CIDRs via values)

# BYOK (customer-managed keys)
byok:
  enabled: false
  provider: aws-kms               # aws-kms | gcp-kms | azure-keyvault
  keyArn: "arn:aws:kms:..."       # required for aws-kms
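The chart expects the Secrets referenced by `envFromSecretName` and the two `existingSecretName` fields to exist before install. The key names inside each Secret are not specified on this page; the ones below are illustrative placeholders to substitute with your actual wiring:

```shell
# Placeholder key names — replace with whatever keys your deployment expects
kubectl -n swarm create secret generic swarm-secrets \
  --from-literal=ANTHROPIC_API_KEY=changeme
kubectl -n swarm create secret generic swarm-db-credentials \
  --from-literal=username=swarm \
  --from-literal=password=changeme
kubectl -n swarm create secret generic swarm-storage-credentials \
  --from-literal=AWS_ACCESS_KEY_ID=changeme \
  --from-literal=AWS_SECRET_ACCESS_KEY=changeme
```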

Install + verify#

helm install swarm swarm/swarm -n swarm --create-namespace -f values.yaml
kubectl -n swarm get pods          # wait for Ready

# Port-forward for initial verify
kubectl -n swarm port-forward svc/swarm-api 8000:8000

curl http://localhost:8000/healthz

Upgrade path#

# Backup first (in-cluster Postgres shown; for an external DB, use your managed snapshot tooling)
kubectl -n swarm exec -it <postgres-pod> -- pg_dump ... > backup.sql

# Upgrade
helm upgrade swarm swarm/swarm -n swarm -f values.yaml --atomic --timeout 5m

# Helm runs the migration job automatically; verify:
kubectl -n swarm logs -l job-name=swarm-migrate

Rollback:

helm rollback swarm <previous-revision> -n swarm
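To find the revision number for the rollback above, list the release history first:

```shell
# Revisions are numbered; roll back to the last one marked "deployed" before the upgrade
helm history swarm -n swarm
```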

Customer-VPC topology#

For BFSI / healthcare customers who demand single-tenant + BYOK:

global:
  imageRegistry: customer-registry.internal   # air-gapped? mirror images here

postgres:
  external: true
  host: customer-rds.internal

storage:
  backend: s3
  bucket: customer-swarm-bucket
  endpoint: https://s3.customer.internal      # for non-AWS S3-compat

byok:
  enabled: true
  provider: aws-kms
  keyArn: <customer-managed-CMK-ARN>

# Deny any egress except the customer's proxies
security:
  networkPolicy:
    enabled: true
    allowedOutbound:
      - customer-llm-proxy.internal
      - customer-mcp-gateway.internal

ingress:
  hosts:
    - host: swarm.customer.internal
      paths: [...]
  tls:
    - secretName: swarm-tls
      hosts: [swarm.customer.internal]
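For reference, a sketch of the kind of egress policy `security.networkPolicy` implies. The names and selectors here are assumptions, and note that a plain NetworkPolicy cannot match hostnames: the `allowedOutbound` hostnames have to be pre-resolved to CIDRs (as the values comment above notes) or enforced by a DNS-aware CNI such as Cilium:

```yaml
# Illustrative only — the actual template may differ.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: swarm-egress
  namespace: swarm
spec:
  podSelector:
    matchLabels: { app: swarm-api }
  policyTypes: [Egress]
  egress:
    - to:
        # Resolved CIDR of customer-llm-proxy.internal (placeholder value)
        - ipBlock: { cidr: 10.0.42.0/24 }
      ports:
        - { port: 443, protocol: TCP }
```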

Air-gapped#

For zero-egress zones:

  1. Bundle images: download all images + chart into a tar bundle
    helm package swarm/swarm -d ./bundle/charts
    docker save <images> -o ./bundle/images.tar
    
  2. Transfer to the air-gapped environment (physical media / approved transfer)
  3. Mirror registry: push into the internal registry
    docker load -i images.tar
    docker tag ghcr.io/theaisingularity/swarm-api:0.11.0 internal-registry/swarm-api:0.11.0
    docker push internal-registry/swarm-api:0.11.0
    
  4. Install pointing at internal registry
    helm install swarm ./bundle/charts/swarm-0.11.0.tgz \
      --set global.imageRegistry=internal-registry \
      -f values-airgapped.yaml
    
  5. License heartbeat: offline mode; license key validated against install fingerprint

Air-gapped install is planned for v0.13. Once the chart ships, follow deploy/helm/swarm/AIRGAP.md for the full procedure.

Multi-region#

Pattern: one swarm install per region, data stays in region, control plane coordinates cross-region.

# region-ap-south.values.yaml
global:
  imageRegistry: ghcr.io/theaisingularity
postgres:
  host: swarm-ap-south.rds.amazonaws.com
storage:
  bucket: swarm-ap-south
  region: ap-south-1
# Different secret per region
# region-eu-west.values.yaml
postgres:
  host: swarm-eu-west.rds.amazonaws.com
storage:
  bucket: swarm-eu-west
  region: eu-west-3

Install each in its own namespace/cluster. Cross-region traffic never flows by default (networkPolicy denies egress between regions unless explicitly allowed).
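With one values file per region, each install is an independent release pinned to that region's cluster (context names here are placeholders):

```shell
# One release per region, each against its own cluster context and values file
helm install swarm swarm/swarm -n swarm --create-namespace \
  --kube-context ap-south -f region-ap-south.values.yaml
helm install swarm swarm/swarm -n swarm --create-namespace \
  --kube-context eu-west -f region-eu-west.values.yaml
```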

See Data residency for the compliance logic.

Troubleshooting#

Pods stuck in CrashLoopBackOff on startup

Almost always a failed DB migration. Check kubectl -n swarm logs -l job-name=swarm-migrate, then fix the DB credentials or re-run the migration manually and restart the pods.

Dashboard 502s behind ingress

Check kubectl -n swarm get ingress. Verify NEXT_PUBLIC_API_URL points at the public ingress hostname (e.g. api.swarm.customer.com), not a cluster-internal service address.

HPA doesn't scale

Ensure metrics-server is installed (kubectl get pods -n kube-system | grep metrics-server). Without it, CPU/memory metrics aren't available.
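If metrics-server is missing, the upstream release manifest installs it (managed clusters may ship it preinstalled or require their own add-on mechanism):

```shell
# Install metrics-server from the upstream release manifest
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
```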
