# Bring your own LLM
Goal: route swarm's LLM calls to the provider(s) of your choice — Anthropic, OpenAI, Azure OpenAI, vLLM (self-hosted), or a proxy. Time: 10 minutes.
## Supported providers (April 2026)
| Provider | Status | Typical use |
|---|---|---|
| Anthropic (Claude family) | Full — prompt caching, streaming, tool use | Default for reasoning-heavy agents |
| OpenAI (GPT family) | Full — streaming, tool use, Responses API (v2) | Alternative; good for fast tiers |
| Azure OpenAI | Full (via OpenAI SDK with Azure endpoint) | Enterprise customers on Azure |
| vLLM (self-hosted) | Full — OpenAI-compatible endpoint | Air-gapped, cost-sensitive, compliance-driven |
| Ollama | Partial (OpenAI-compat shim) | Local dev only — not production-tested |
| OpenRouter / LiteLLM | Works as OpenAI-compat proxy | Multi-provider abstraction |
## 1. Pick providers and tiers

swarm uses three tiers of LLM, each mapped to a provider + model:

| Tier | Intended use | Example model |
|---|---|---|
| `fast` | Low-latency, low-cost, throwaway decisions | Claude Haiku 4 / GPT-5-mini |
| `standard` | Primary agent reasoning | Claude Sonnet 4.5 / GPT-5 |
| `deep` | Hard problems (algorithm selection, complex eval) | Claude Opus 4.1 / GPT-5-pro |
You don't need all three — a small team can map all tiers to one model. But mapping them differently lets the pipeline be cheap + fast + smart where each matters.
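For example, a minimal single-model setup (all three tiers pointed at one Anthropic model; variable names as used throughout this page):

```bash
# .env — every tier maps to the same model
SWARM_LLM_PROVIDER_DEFAULT=anthropic
ANTHROPIC_API_KEY=sk-ant-...
SWARM_LLM_MODEL_FAST=claude-sonnet-4-5-20260315
SWARM_LLM_MODEL_STANDARD=claude-sonnet-4-5-20260315
SWARM_LLM_MODEL_DEEP=claude-sonnet-4-5-20260315
```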
## 2. Configure

All via environment variables (see Reference: configuration for the full list).

### Anthropic (default)

```bash
# .env
SWARM_LLM_PROVIDER_DEFAULT=anthropic
ANTHROPIC_API_KEY=sk-ant-...
SWARM_LLM_MODEL_FAST=claude-haiku-4-20260315
SWARM_LLM_MODEL_STANDARD=claude-sonnet-4-5-20260315
SWARM_LLM_MODEL_DEEP=claude-opus-4-1-20260315
```
### OpenAI

```bash
SWARM_LLM_PROVIDER_DEFAULT=openai
OPENAI_API_KEY=sk-proj-...
SWARM_LLM_MODEL_FAST=gpt-5-mini
SWARM_LLM_MODEL_STANDARD=gpt-5
SWARM_LLM_MODEL_DEEP=gpt-5-pro
```
### Azure OpenAI

```bash
SWARM_LLM_PROVIDER_DEFAULT=azure
AZURE_OPENAI_API_KEY=...
AZURE_OPENAI_ENDPOINT=https://<your-resource>.openai.azure.com
AZURE_OPENAI_API_VERSION=2026-02-01
SWARM_LLM_MODEL_FAST=gpt-5-mini-deployment-name
SWARM_LLM_MODEL_STANDARD=gpt-5-deployment-name
SWARM_LLM_MODEL_DEEP=gpt-5-pro-deployment-name
```

Note: with Azure, the model variables hold your *deployment names*, not model IDs.
### vLLM (self-hosted, air-gapped)

```bash
SWARM_LLM_PROVIDER_DEFAULT=openai_compat   # vLLM speaks the OpenAI protocol
OPENAI_API_BASE=https://vllm.internal.yourorg.com/v1
OPENAI_API_KEY=<your-internal-token>
SWARM_LLM_MODEL_FAST=mistralai/Mistral-7B-Instruct-v0.4
SWARM_LLM_MODEL_STANDARD=meta-llama/Llama-3.3-70B-Instruct
SWARM_LLM_MODEL_DEEP=deepseek-ai/DeepSeek-V3
```
### Mixed (Anthropic for reasoning + OpenAI for fast)

```bash
SWARM_LLM_PROVIDER_FAST=openai
SWARM_LLM_MODEL_FAST=gpt-5-mini
SWARM_LLM_PROVIDER_STANDARD=anthropic
SWARM_LLM_MODEL_STANDARD=claude-sonnet-4-5-20260315
SWARM_LLM_PROVIDER_DEEP=anthropic
SWARM_LLM_MODEL_DEEP=claude-opus-4-1-20260315
```
Restart the `api` container after changes (e.g. `docker compose restart api` in a Compose deployment).
## 3. Verify

Pings each configured provider + tier with a 1-token prompt; reports latency + a cost estimate:

| Provider | Tier | Model | Latency | Est. cost/1M tokens |
|---|---|---|---|---|
| anthropic | fast | claude-haiku-4-20260315 | 120ms | $0.50 in / $2.00 out |
| anthropic | standard | claude-sonnet-4-5-20260315 | 210ms | $3.00 in / $15.00 out |
| anthropic | deep | claude-opus-4-1-20260315 | 320ms | $15.00 in / $75.00 out |
## 4. Per-agent override

Sometimes one agent needs a different model. Override per agent:

```python
# In config/agent_defs.py
AgentConfig(
    name="ml_director",
    model_tier="deep",  # uses SWARM_LLM_MODEL_DEEP
    ...
)
```
Or force a specific model:
`model_override` takes precedence over `model_tier`.
## 5. Cost + rate-limit controls

### Budgets

```bash
# In .env
SWARM_BUDGET_PER_PIPELINE_USD=5.00   # hard cap per pipeline run
SWARM_BUDGET_PER_DAY_USD=500.00      # rolling 24h cap per tenant
SWARM_BUDGET_ACTION_ON_OVER=pause    # pause | alert_only | deny
```
When a pipeline goes over budget with the `pause` action, it pauses; an operator can approve continuation or kill the run.
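The over-budget behavior maps to the `SWARM_BUDGET_ACTION_ON_OVER` values; a sketch of the decision (illustrative, not swarm internals):

```python
def budget_action(spent_usd: float, cap_usd: float, on_over: str = "pause") -> str:
    """Decide what happens to the next LLM call given current spend vs. the cap."""
    if spent_usd <= cap_usd:
        return "continue"
    # "pause": halt and wait for operator approval; "deny": reject the call;
    # "alert_only": log a warning but let the call through
    return on_over
```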
### Retries + backoff

Configurable per provider:

```bash
SWARM_LLM_RETRY_MAX_ATTEMPTS=3
SWARM_LLM_RETRY_BACKOFF_SECONDS=2   # exponential
SWARM_LLM_RETRY_ON_429=true         # rate limited
SWARM_LLM_RETRY_ON_5xx=true
```
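The delay schedule those settings imply (exponential doubling from the base) can be sketched as follows; this is illustrative, not swarm's internal retry loop:

```python
def backoff_delays(max_attempts: int = 3, base_seconds: float = 2.0) -> list[float]:
    """Delay before each retry: base * 2^attempt (2s, 4s, 8s with the defaults)."""
    return [base_seconds * (2 ** attempt) for attempt in range(max_attempts)]
```

With `SWARM_LLM_RETRY_MAX_ATTEMPTS=3` and a 2-second base, a rate-limited call waits 2s, 4s, then 8s before giving up (or failing over, if a fallback is configured).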
### Prompt caching

swarm uses prompt caching automatically for Anthropic (ephemeral cache control on stable system prompts + tool schemas). OpenAI caches automatically server-side — no config needed.

Savings: 50-90% on input tokens for agent-heavy flows.
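For reference, Anthropic's ephemeral caching works by tagging stable content blocks with a `cache_control` marker; a minimal sketch of building such a system block (hypothetical helper, not swarm code):

```python
def cached_system_block(prompt_text: str) -> dict:
    """Build an Anthropic system content block marked as cacheable."""
    return {
        "type": "text",
        "text": prompt_text,
        "cache_control": {"type": "ephemeral"},  # cache this stable prefix across calls
    }
```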
## 6. Multi-provider failover

For production reliability, configure a secondary:

```bash
SWARM_LLM_PROVIDER_STANDARD=anthropic
SWARM_LLM_PROVIDER_STANDARD_FALLBACK=openai
SWARM_LLM_MODEL_STANDARD_FALLBACK=gpt-5
```

If Anthropic returns 5xx / 429 after retries, swarm automatically retries on OpenAI. The `run_events` log records the provider switch.
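The failover order can be sketched as follows (illustrative Python, not swarm internals): retryable errors (429 / 5xx) exhaust the primary's attempts first, then the call switches to the fallback once.

```python
class ProviderError(Exception):
    def __init__(self, status: int):
        super().__init__(f"provider returned {status}")
        self.status = status

def call_with_failover(primary, fallback, prompt: str, max_attempts: int = 3):
    """Try the primary up to max_attempts times on 429/5xx, then switch to the fallback."""
    for _ in range(max_attempts):
        try:
            return primary(prompt)
        except ProviderError as err:
            retryable = err.status == 429 or 500 <= err.status < 600
            if not retryable:
                raise  # auth errors, bad requests, etc. surface immediately
    return fallback(prompt)  # primary exhausted: switch provider
```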
## 7. Audit + billing

Every LLM call lands in `run_events` with:
- Provider + model
- Token counts (input / output / cached)
- Cost estimate
- Latency
- Agent that made the call
Query for the month's bill:

```sql
SELECT
  provider, model,
  SUM(input_tokens)  AS input,
  SUM(output_tokens) AS output,
  SUM(cost_usd)      AS cost
FROM run_events
WHERE kind = 'llm_call'
  AND ts > date_trunc('month', now())
GROUP BY provider, model
ORDER BY cost DESC;
```
Or via:
## Compliance notes

### BFSI (India)

RBI Master Direction prefers that data stay in India. Anthropic's India-region endpoint is NOT yet available (April 2026). Options:

- Use Azure OpenAI with an India-region deployment (e.g. Central India)
- Self-host vLLM on Indian cloud
- Use OpenAI's standard endpoint with a zero-retention contract (talk to OpenAI enterprise)
### HIPAA (US)

Both Anthropic and OpenAI sign BAAs for their enterprise tiers. Verify:

- The API tier is BAA-eligible
- Zero data retention is configured (`OPENAI_API_BETA=zdr` for OpenAI; an Anthropic enterprise contract for Claude)
### EU (AI Act + GDPR)
- Use EU-region endpoints (Anthropic EU, Azure OpenAI EU)
- Verify sub-processor list at https://trust.anthropic.com / Microsoft Trust Center
- US CLOUD Act exposure: OpenAI (non-Azure) and Anthropic are US companies — data may be subject to CLOUD Act even with EU region. Legal review required.
See Compliance profiles.
## Troubleshooting

**Keep hitting rate limits**

Increase `SWARM_LLM_RETRY_MAX_ATTEMPTS` and lower concurrency (`SWARM_MAX_CONCURRENT_AGENTS`), or negotiate a higher rate-limit tier with your provider.
**Costs are exploding**

- Check `swarm reports costs` for the offending agent
- Downgrade its `model_tier` if viable
- Enable prompt caching (on by default for Anthropic; verify logs show cache hits)
- Set `SWARM_BUDGET_PER_PIPELINE_USD` lower so runaway pipelines pause early
**vLLM endpoint returns 404 for tool calls**

Some vLLM versions don't support `tool_choice` yet. Check `vllm --version` (≥ 0.6.3 required), or put a wrapper like LiteLLM in front to translate tool calls.
**LLM call hangs indefinitely**

The timeout defaults to 120s; lower it with `SWARM_LLM_TIMEOUT_SECONDS=30`. Also check network / DNS — some corporate firewalls block Anthropic's API host.
## Next

- Reference: configuration — every `SWARM_*` env var
- Deployment: data residency — region + sovereignty concerns
- Cost dashboard — ongoing spend monitoring