# Bring your own LLM
Goal: route swarm's LLM calls to the provider(s) of your choice — Anthropic, OpenAI, Azure OpenAI, vLLM (self-hosted), or a proxy. Time: 10 minutes.
## Supported providers (April 2026)
| Provider | Status | Typical use |
|---|---|---|
| Anthropic (Claude family) | Full — prompt caching, streaming, tool use | Default for reasoning-heavy agents |
| OpenAI (GPT family) | Full — streaming, tool use, Responses API (v2) | Alternative; good for fast tiers |
| Azure OpenAI | Full (via OpenAI SDK with Azure endpoint) | Enterprise customers on Azure |
| vLLM (self-hosted) | Full — OpenAI-compatible endpoint | Air-gapped, cost-sensitive, compliance-driven |
| Ollama | Partial (OpenAI-compat shim) | Local dev only — not production-tested |
| OpenRouter / LiteLLM | Works as OpenAI-compat proxy | Multi-provider abstraction |
## 1. Pick providers and tiers

swarm uses three tiers of LLM, each mapped to a provider + model:

| Tier | Intended use | Example model |
|---|---|---|
| `fast` | Low-latency, low-cost, throwaway decisions | Claude Haiku 4 / GPT-5-mini |
| `standard` | Primary agent reasoning | Claude Sonnet 4.5 / GPT-5 |
| `deep` | Hard problems (algorithm selection, complex eval) | Claude Opus 4.1 / GPT-5-pro |
You don't need all three — a small team can map all tiers to one model. But mapping them differently lets the pipeline be cheap + fast + smart where each matters.
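For example, a minimal single-model setup (all three tiers pointed at one Anthropic model; variable names as used throughout this page):

```bash
# .env — every tier maps to the same model
SWARM_LLM_PROVIDER_DEFAULT=anthropic
ANTHROPIC_API_KEY=sk-ant-...
SWARM_LLM_MODEL_FAST=claude-sonnet-4-5-20260315
SWARM_LLM_MODEL_STANDARD=claude-sonnet-4-5-20260315
SWARM_LLM_MODEL_DEEP=claude-sonnet-4-5-20260315
```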
## 2. Configure

All via environment variables (see Reference: configuration for the full list).

### Anthropic (default)

```bash
# .env
SWARM_LLM_PROVIDER_DEFAULT=anthropic
ANTHROPIC_API_KEY=sk-ant-...
SWARM_LLM_MODEL_FAST=claude-haiku-4-20260315
SWARM_LLM_MODEL_STANDARD=claude-sonnet-4-5-20260315
SWARM_LLM_MODEL_DEEP=claude-opus-4-1-20260315
```
### OpenAI

```bash
SWARM_LLM_PROVIDER_DEFAULT=openai
OPENAI_API_KEY=sk-proj-...
SWARM_LLM_MODEL_FAST=gpt-5-mini
SWARM_LLM_MODEL_STANDARD=gpt-5
SWARM_LLM_MODEL_DEEP=gpt-5-pro
```
### Azure OpenAI

```bash
SWARM_LLM_PROVIDER_DEFAULT=azure
AZURE_OPENAI_API_KEY=...
AZURE_OPENAI_ENDPOINT=https://<your-resource>.openai.azure.com
AZURE_OPENAI_API_VERSION=2026-02-01
SWARM_LLM_MODEL_FAST=gpt-5-mini-deployment-name
SWARM_LLM_MODEL_STANDARD=gpt-5-deployment-name
SWARM_LLM_MODEL_DEEP=gpt-5-pro-deployment-name
```

Note: with Azure, the model variables hold your *deployment names*, not model IDs.
### vLLM (self-hosted, air-gapped)

```bash
SWARM_LLM_PROVIDER_DEFAULT=openai_compat   # vLLM speaks the OpenAI protocol
OPENAI_API_BASE=https://vllm.internal.yourorg.com/v1
OPENAI_API_KEY=<your-internal-token>
SWARM_LLM_MODEL_FAST=mistralai/Mistral-7B-Instruct-v0.4
SWARM_LLM_MODEL_STANDARD=meta-llama/Llama-3.3-70B-Instruct
SWARM_LLM_MODEL_DEEP=deepseek-ai/DeepSeek-V3
```
### Mixed (Anthropic for reasoning + OpenAI for fast)

```bash
SWARM_LLM_PROVIDER_FAST=openai
SWARM_LLM_MODEL_FAST=gpt-5-mini
SWARM_LLM_PROVIDER_STANDARD=anthropic
SWARM_LLM_MODEL_STANDARD=claude-sonnet-4-5-20260315
SWARM_LLM_PROVIDER_DEEP=anthropic
SWARM_LLM_MODEL_DEEP=claude-opus-4-1-20260315
```
Restart the `api` container after changes (e.g. `docker compose restart api` in a Compose deployment).
## 3. Verify

Pings each configured provider + tier with a 1-token prompt; reports latency + a cost estimate:

| Provider | Tier | Model | Latency | Est. cost/1M tokens |
|---|---|---|---|---|
| anthropic | fast | claude-haiku-4-20260315 | 120ms | $0.50 in / $2.00 out |
| anthropic | standard | claude-sonnet-4-5-20260315 | 210ms | $3.00 in / $15.00 out |
| anthropic | deep | claude-opus-4-1-20260315 | 320ms | $15.00 in / $75.00 out |
## 4. Per-agent override

Sometimes one agent needs a different model. Override per agent:

```python
# In config/agent_defs.py
AgentConfig(
    name="ml_director",
    model_tier="deep",  # uses SWARM_LLM_MODEL_DEEP
    ...
)
```
Or force a specific model:
`model_override` takes precedence over `model_tier`.
## 5. Cost + rate-limit controls

### Budgets

```bash
# In .env
SWARM_BUDGET_PER_PIPELINE_USD=5.00   # hard cap per pipeline run
SWARM_BUDGET_PER_DAY_USD=500.00      # rolling 24h cap per tenant
SWARM_BUDGET_ACTION_ON_OVER=pause    # pause | alert_only | deny
```
When a pipeline goes over budget with the `pause` action, it pauses; an operator can approve continuation or kill the run.
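The over-budget behavior maps to the `SWARM_BUDGET_ACTION_ON_OVER` values; a sketch of the decision (illustrative, not swarm internals):

```python
def budget_action(spent_usd: float, cap_usd: float, on_over: str = "pause") -> str:
    """Decide what happens to the next LLM call given current spend vs. the cap."""
    if spent_usd <= cap_usd:
        return "continue"
    # "pause": halt and wait for operator approval; "deny": reject the call;
    # "alert_only": log a warning but let the call through
    return on_over
```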
### Retries + backoff

Configurable per provider:

```bash
SWARM_LLM_RETRY_MAX_ATTEMPTS=3
SWARM_LLM_RETRY_BACKOFF_SECONDS=2   # exponential
SWARM_LLM_RETRY_ON_429=true         # rate limited
SWARM_LLM_RETRY_ON_5xx=true
```
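The delay schedule those settings imply (exponential doubling from the base) can be sketched as follows; this is illustrative, not swarm's internal retry loop:

```python
def backoff_delays(max_attempts: int = 3, base_seconds: float = 2.0) -> list[float]:
    """Delay before each retry: base * 2^attempt (2s, 4s, 8s with the defaults)."""
    return [base_seconds * (2 ** attempt) for attempt in range(max_attempts)]
```

With `SWARM_LLM_RETRY_MAX_ATTEMPTS=3` and a 2-second base, a rate-limited call waits 2s, 4s, then 8s before giving up (or failing over, if a fallback is configured).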
### Prompt caching

swarm uses prompt caching automatically for Anthropic (ephemeral cache control on stable system prompts + tool schemas). OpenAI caches automatically server-side — no config needed.

Savings: 50-90% on input tokens for agent-heavy flows.
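For reference, Anthropic's ephemeral caching works by tagging stable content blocks with a `cache_control` marker; a minimal sketch of building such a system block (hypothetical helper, not swarm code):

```python
def cached_system_block(prompt_text: str) -> dict:
    """Build an Anthropic system content block marked as cacheable."""
    return {
        "type": "text",
        "text": prompt_text,
        "cache_control": {"type": "ephemeral"},  # cache this stable prefix across calls
    }
```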
## 6. Multi-provider failover

For production reliability, configure a secondary:

```bash
SWARM_LLM_PROVIDER_STANDARD=anthropic
SWARM_LLM_PROVIDER_STANDARD_FALLBACK=openai
SWARM_LLM_MODEL_STANDARD_FALLBACK=gpt-5
```

If Anthropic returns 5xx / 429 after retries, swarm automatically retries on OpenAI. The `run_events` log records the provider switch.
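The failover order can be sketched as follows (illustrative Python, not swarm internals): retryable errors (429 / 5xx) exhaust the primary's attempts first, then the call switches to the fallback once.

```python
class ProviderError(Exception):
    def __init__(self, status: int):
        super().__init__(f"provider returned {status}")
        self.status = status

def call_with_failover(primary, fallback, prompt: str, max_attempts: int = 3):
    """Try the primary up to max_attempts times on 429/5xx, then switch to the fallback."""
    for _ in range(max_attempts):
        try:
            return primary(prompt)
        except ProviderError as err:
            retryable = err.status == 429 or 500 <= err.status < 600
            if not retryable:
                raise  # auth errors, bad requests, etc. surface immediately
    return fallback(prompt)  # primary exhausted: switch provider
```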
## 7. Audit + billing

Every LLM call lands in `run_events` with:
- Provider + model
- Token counts (input / output / cached)
- Cost estimate
- Latency
- Agent that made the call
Query for the month's bill:

```sql
SELECT
  provider, model,
  SUM(input_tokens)  AS input,
  SUM(output_tokens) AS output,
  SUM(cost_usd)      AS cost
FROM run_events
WHERE kind = 'llm_call'
  AND ts > date_trunc('month', now())
GROUP BY provider, model
ORDER BY cost DESC;
```
Or via:
## Compliance notes

### BFSI (India)

RBI Master Direction prefers that data stay in India. Anthropic's India-region endpoint is NOT yet available (April 2026). Options:

- Use Azure OpenAI with an India-region deployment (e.g. Central India)
- Self-host vLLM on Indian cloud
- Use OpenAI's standard endpoint with a zero-retention contract (talk to OpenAI enterprise)
### HIPAA (US)

Both Anthropic and OpenAI sign BAAs for their enterprise tiers. Verify:

- The API tier is BAA-eligible
- Zero data retention is configured (`OPENAI_API_BETA=zdr` for OpenAI; an Anthropic enterprise contract for Claude)
### EU (AI Act + GDPR)
- Use EU-region endpoints (Anthropic EU, Azure OpenAI EU)
- Verify sub-processor list at https://trust.anthropic.com / Microsoft Trust Center
- US CLOUD Act exposure: OpenAI (non-Azure) and Anthropic are US companies — data may be subject to CLOUD Act even with EU region. Legal review required.
See Compliance profiles.
## Troubleshooting

**Keep hitting rate limits**

Increase `SWARM_LLM_RETRY_MAX_ATTEMPTS` and lower concurrency (`SWARM_MAX_CONCURRENT_AGENTS`), or negotiate a higher rate-limit tier with your provider.
**Costs are exploding**

- Check `swarm reports costs` for the offending agent
- Downgrade its `model_tier` if viable
- Enable prompt caching (on by default for Anthropic; verify logs show cache hits)
- Set `SWARM_BUDGET_PER_PIPELINE_USD` lower so runaway pipelines pause early
**vLLM endpoint returns 404 for tool calls**

Some vLLM versions don't support `tool_choice` yet. Check `vllm --version` (≥ 0.6.3 required), or put a wrapper like LiteLLM in front to translate tool calls.
**LLM call hangs indefinitely**

The timeout defaults to 120s; lower it with `SWARM_LLM_TIMEOUT_SECONDS=30`. Also check network / DNS — some corporate firewalls block Anthropic's API host.
## Next

- Reference: configuration — every `SWARM_*` env var
- Deployment: data residency — region + sovereignty concerns
- Cost dashboard — ongoing spend monitoring