Platform overview (MASTER_README)#
The source-of-truth product + system + operating-model document. Canonical copy lives at MASTER_README.md in the repo; this page is auto-rendered via include-markdown on every push to master.
Source of truth for the product, the system, and the operating model. Read this first when onboarding or resuming work. Engineering quick-start lives in README.md; this document is the layer above — why the platform exists, who it's for, how it's organised, and how to reason about it.
Latest release:
v0.12.0: first signed release. 12-week platform refactor + guardrails subsystem + ship tooling complete. 628 → 1258 tests (+630, zero regressions). See CHANGELOG.md for the full breakdown. BFSI procurement pack in .project/security/ (architecture, STRIDE threat model, pre-filled CAIQ Lite questionnaire). Customer artifact verifiable with one cosign verify-blob command.
Vision#
Make agentic MLOps installable by a regulated mid-market bank in 8 weeks. The ML lifecycle — data prep through deployment, fairness audit, drift monitoring, model cards, audit trail — delivered as a VPC-installable product, not a services engagement, with every step compliance-ready for Indian BFSI regulators.
Core values#
- Compliance is a first-class feature, not a bolt-on. Every subsystem emits audit. Every denial is source-attributed. Every artefact has a retention policy. BFSI regulators don't grade the feature matrix — they grade the evidence.
- Operator control over every observable channel. If an operator can see it, they can turn it off. Three tiers: invariant (regulator-mandated, immutable), flag (operator-togglable), experiment (opt-in).
- No framework lock-in. LangGraph is one backend; the native orchestrator is another; both share the same agent definitions. Swapping providers (OpenAI, Anthropic, local vLLM) is config, not code.
- Explicit over implicit. Permission denials, feature-flag states, plugin manifests, audit PDFs — all live where auditors can grep them, not buried in opaque state machines.
- Measure before you optimise. Every performance claim has a bench baseline in ml_team/tests/bench/ with a nightly diff.
What the platform is#
A Python + FastAPI backend + Next.js dashboard that orchestrates 40 AI agents across 7 teams in a 3-tier hierarchy (director → coordinators → workers) to run the full ML lifecycle. The agents cooperate via a supervisor-worker pattern: tools are dispatched through a per-agent allowlist, and every tool call is audited.
Deliverables per run:
- Trained model (joblib bundle + metrics sidecar)
- Fairness audit JSON (fairlearn, per-protected-attribute group)
- Drift report JSON (PSI + KS + chi²)
- SHAP explanations (tree-native fast path)
- Model card (Markdown, assembled from above)
- Audit PDF (reportlab, tamper-evident SHA-256 hash)
- Deployable Docker image + K8s manifests (HPA, non-root, readOnlyRootFilesystem)
- Full conversation trail (JSONL per agent)
- Prometheus metrics covering every phase
What swarm is NOT#
- Not a generic agent framework: the 40 agents have ML-specific responsibilities defined in ml_team/config/agent_defs.py and ml_team/config/team_definitions.yaml.
- Not an AutoML product: human-in-the-loop gates are intentional; the agents propose, humans approve deployments.
- Not a cloud service: we ship to the customer's VPC. Data never leaves their network.
- Not a dev-tool or IDE agent: we target MLOps workflows, not code editing.
Target users & personas#
| Persona | What they do with swarm | What they need from it |
|---|---|---|
| Tier-2/3 bank Head of Risk | Approves deployments, reviews audit PDFs | One-click model cards, tamper-evident audit, RBI FREE-AI alignment evidence |
| BFSI data scientist (IC) | Submits problem statements, reviews trained models, tunes agents | REST + CLI to kick off runs; /playground to test models; conversation trail to debug |
| Platform operator / ML engineer | Installs + maintains swarm in VPC | /settings feature flags, /cron scheduling, /plugins extensions, observability via Prometheus |
| Compliance auditor (internal) | Pulls denial logs, retention records, fairness reports | /transparency page lists every metric + artefact + endpoint + denial; permission_denials SQLite table is the compliance ledger |
| External auditor (RBI inspection) | Verifies model lifecycle compliance | Audit PDF + model card + signed conversation JSONL, reproducible work_dir |
| CIO / buyer | Evaluates swarm for procurement | MASTER_README.md (this doc), docs/SYSTEM-DESIGN.md, live dashboard demo |
Problems solved#
- Compliance cost. RBI FREE-AI (Aug 2025) and DPDP Act require fairness audit, model cards, explainability, drift monitoring, audit trail — independently building these costs a BFSI IT shop 12–18 months. Swarm ships all seven Sutras on day one.
- Mid-market lock-out. DataRobot / Dataiku start at $300K/year and are cloud-only. Fractal / Tiger engagements are open-ended services. Mid-market banks can't afford either. Swarm is $25K pilot → $75K deployment → $10K/mo retainer, VPC-installable, first model in 8 weeks.
- Talent scarcity. Hiring five ML engineers in India takes 9 months at Tier-2/3 bank compensation. Swarm lets an existing platform team + one ML lead ship production models.
- Audit reconstruction. "Why was tool X blocked for agent Y between T1 and T2?" takes days of log-grepping in ad-hoc systems. Swarm's unified permission engine (W7-1) answers it in one SQL query.
- Scheduled ML ops. Nightly drift checks, periodic retraining, automated audit-PDF generation today live in ad-hoc cron scripts per customer. Swarm's cron scheduler (W7-3) makes them a first-class REST + CLI + dashboard primitive.
Key capabilities (as shipped)#
Pipeline orchestration#
- 40-agent hierarchical team with circuit-breaker-protected delegation
- Two backends (LangGraph supervisor + native orchestrator) sharing one agent config
- Pipeline configs: fast_prototype, default_ml_pipeline, parallel_research
- Intra-agent parallel tool calls (3–5× speedup, experiment-flagged)
- Context compaction at 80% window (keeps long runs alive)
- Evaluator-generator separation (optional per-agent rubric grading)
Compliance & audit (RBI FREE-AI aligned)#
| Sutra | Implementation |
|---|---|
| Safety | Per-agent tool allowlists + unified permission engine (W7-1) + permission_denials audit table |
| Transparency | Model card generator + audit PDF with tamper-evident SHA-256 |
| Accountability | Per-agent conversation JSONL, persistent across restarts |
| Fairness | fairlearn-based bias audit (demographic parity, equal opportunity) |
| Explainability | SHAP TreeExplainer + Explainer fallback |
| Sustainability | Per-agent token + cost metrics + retention sweep |
| Model lifecycle | Drift detection (PSI + KS + chi²) + champion-challenger with shadow log |
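To make the drift row concrete, here is a minimal sketch of the Population Stability Index (PSI) over two histograms that share the same bins. The function name `psi`, the sample counts, and the epsilon smoothing are assumptions for illustration; the platform's own drift tool may bin and smooth differently.

```python
import math


def psi(expected_counts, actual_counts, eps=1e-6):
    """Population Stability Index between two histograms over the same bins."""
    if len(expected_counts) != len(actual_counts):
        raise ValueError("histograms must share the same bins")
    e_total = sum(expected_counts)
    a_total = sum(actual_counts)
    score = 0.0
    for e, a in zip(expected_counts, actual_counts):
        # Convert counts to proportions; eps avoids log(0) on empty bins.
        p = max(e / e_total, eps)
        q = max(a / a_total, eps)
        score += (q - p) * math.log(q / p)
    return score


baseline = [100, 200, 400, 200, 100]                # training-time distribution
same = psi(baseline, baseline)                      # identical data, no drift
shifted = psi(baseline, [50, 100, 300, 300, 250])   # mass moved to upper bins
```

Identical histograms score zero, and the score grows as probability mass moves between bins, which is what lets a scheduled drift check compare it against a threshold.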
Auth & access control#
- 3 roles (admin / operator / viewer) with precedence
- JWT bearer (HS256, 24h TTL)
- OIDC SSO via Okta / Azure AD / Google Workspace (authorization-code + PKCE)
- Legacy X-API-Key preserved for back-compat
- Unified permission engine (W7-1) — RBAC + allowlist + HITL + flags + YAML policy all feed one decision pipeline
HITL (Human-in-the-loop)#
- 6 gate types (deploy, data request, manual, security, cost, custom)
- Pipeline pauses, state persists, resume from checkpoint after approval
- Dashboard surfaces pending gates in real time
Operator surfaces#
- REST API under /api/v1/* (authoritative)
- Next.js dashboard: /pipelines, /agents, /deployments, /plugins, /cron, /settings, /transparency
- CLI (swarm): thin stdlib wrapper over REST; BFSI air-gapped / SSH-only deployments use this
- Programmatic: import ml_team.core.* directly
Plugin ecosystem (Claude Code format)#
- MCP server ingestion via .mcp.json (Phase A, Week 4)
- Skills via skills/<name>/SKILL.md (Phase B, Week 5)
- Hook lifecycle via hooks/hooks.json (W7-2): SessionStart / PreTool / PostTool / PreCompaction / PostCompaction
- Manifest SHA-256 pinning: tamper detection on reload
- Admin-only install; whitelist at ml_team/config/plugins.yaml
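The manifest-pinning idea can be sketched in a few lines: record a SHA-256 digest at install time and compare it on every reload. The helper names `manifest_digest` and `is_tampered` and the file name `plugin.json` are hypothetical; the real plugin loader may store and check the pin differently.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def manifest_digest(path: Path) -> str:
    """Return the hex SHA-256 of the manifest file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def is_tampered(path: Path, pinned: str) -> bool:
    """True when the manifest no longer matches the digest pinned at install."""
    return manifest_digest(path) != pinned


with tempfile.TemporaryDirectory() as tmp:
    manifest = Path(tmp) / "plugin.json"  # hypothetical manifest file name
    manifest.write_text(json.dumps({"name": "hello-swarm", "version": "0.1.0"}))
    pinned = manifest_digest(manifest)      # recorded at install time
    clean = is_tampered(manifest, pinned)   # reload with no changes
    manifest.write_text(json.dumps({"name": "hello-swarm", "version": "6.6.6"}))
    dirty = is_tampered(manifest, pinned)   # reload after an edit
```

Only the manifest bytes are covered in this sketch, which matches the manifest-only scope of the pinning described in this document.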
Ops primitives (Week 7)#
- Unified permission engine with YAML-authored custom rules; source-attributed denial audit
- Plugin hook lifecycle: 5 events, server-side PII masking etc.
- Cron scheduler: schedule strings (30m / every 2h / 0 9 * * * / ISO), 4 task kinds (retrain, drift_check, audit_pdf, custom)
- Batch runner: JSONL dataset through pinned processor; checkpointing + resume; tool-usage aggregator
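As an illustration of the four schedule-string shapes listed for the cron scheduler, the sketch below classifies a string into a kind and a parsed value. The function name `parse_schedule`, the kind labels, and the reading of a bare duration as a one-shot delay versus `every …` as a recurring interval are all assumptions; this document does not spell out those semantics.

```python
import re
from datetime import datetime, timedelta

_UNITS = {"m": "minutes", "h": "hours", "d": "days"}
_DURATION = re.compile(r"^(every\s+)?(\d+)([mhd])$")


def parse_schedule(text: str):
    """Classify a schedule string as (kind, value).

    kind is one of 'delay', 'interval', 'cron', 'once' (labels are assumptions).
    """
    text = text.strip()
    match = _DURATION.match(text)
    if match:
        delta = timedelta(**{_UNITS[match.group(3)]: int(match.group(2))})
        # Assumption: a leading "every" means recurring, otherwise one-shot.
        return ("interval" if match.group(1) else "delay", delta)
    if len(text.split()) == 5:
        # Five whitespace-separated fields: treat as a cron expression.
        return ("cron", text)
    # Anything else must be an ISO timestamp; fromisoformat raises otherwise.
    return ("once", datetime.fromisoformat(text))
```

For example, `parse_schedule("every 2h")` yields an interval of two hours, while `parse_schedule("0 9 * * *")` is passed through as a cron expression for a cron-aware evaluator to handle.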
Observability#
- Prometheus metrics on every subsystem (28+ counters/histograms)
- OpenTelemetry spans (parent/child, tokens, costs)
- Real-time WebSocket streaming of pipeline events
- Conversation JSONL per agent
Customer composability (Track 1 — shipped in v0.12.0)#
Three-layer architecture: core/ engine (never forked) + lib/ versioned shelf (agents, tools, workflows, guardrails, compliance templates) + deployments/<customer>/ (per-customer composition). SWARM_DEPLOYMENT=deployments/hdfc_bank at boot loads only that customer's agents + workflows + branding + compliance profile. BFSI baseline template ships with real RBI FREE-AI permission rules + demographic-parity fairness gate + drift-baseline gate + 2555-day retention; see lib/templates/bfsi_baseline/.
Runtime guardrails subsystem (Track 2 — 15 guardrails, shipped in v0.12.0)#
Six categories: network · execution · input_safety · persistence · platform_integrity · agent_safety. Every guardrail independently versioned in lib/guardrails/<category>/<id>/guardrail.yaml; activation declared per deployment in activate_guardrails + runtime config in guardrail_configs. Auto-configured at API boot via ml_team/core/guardrail_bootstrap.py. Full catalogue: lib/guardrails/README.md; STRIDE mapping: .project/security/threat_model.md.
Tier semantics: invariant (regulator-mandated, beats operator POLICY at priority 60) · flag (operator-togglable at priority 45-50) · off. A BFSI profile sets 12 guardrails as invariant; a generic profile sets 0.
Ship pipeline (swarm deploy, shipped in v0.12.0)#
Four CLI subcommands:
- swarm deploy new <customer> --template=<> — scaffold from a template (generic_ml / bfsi_baseline / hipaa_baseline)
- swarm deploy validate <customer> — lint config + lib refs + customer-name match
- swarm deploy ship <customer> — build-time-filtered tarball (other customers' deployments/ excluded) + MANIFEST.yaml (pinned lib versions + config SHA-256) + auto-generated 5-section security whitepaper
- swarm deploy whitepaper <customer> — emit standalone markdown whitepaper
Release workflow (.github/workflows/release-supply-chain.yml): signed-commit gate (git log %G?) → CycloneDX 1.5 SBOM → Cosign-keyless sign tarball + SBOM → base-image cosign verify → GitHub Release upload. One-command customer verification documented in SECURITY.md § Supply-chain integrity.
Security architecture + BFSI procurement pack#
Three documents written for CISO / procurement / vendor-risk teams — not marketing, rigorous self-attestation with file-path evidence:
- .project/security/architecture.md — 1-page data-flow diagram + defence-in-depth matrix + crypto inventory + customer verification commands
- .project/security/threat_model.md — STRIDE across 9 critical assets + DREAD scoring + top-10 residual risks + compliance-clause mapping (RBI / DPDP / EU AI Act / HIPAA / GDPR / SOC 2 / OWASP LLM / NIST AI RMF)
- .project/security/caiq_lite.md — 60-question pre-filled CAIQ v4.0.3 subset · aggregate ~75% Y/Y+P · top-10 procurement gaps enumerated with rupee estimates
GTM / positioning#
Market#
- Geography: India, expanding to Southeast Asia (similar regulator postures)
- Buyer: CIO / Head of Risk at Tier-2/3 banks, NBFCs, fintechs, insurers
- Segment: Mid-market (₹500Cr–₹25,000Cr AUM) — too small for DataRobot pricing, too big to rely on services
Competitive frame#
- DataRobot / Dataiku: $300K+/yr, cloud-only, US-centric compliance. Swarm is VPC, Indian-regulator-first, $10K/mo.
- Fractal / Tiger / Tredence: Services engagements, 12–18 month projects, no product. Swarm is a product with a services wrapper.
- Build-in-house: 9+ months hiring + 18 months building. Swarm is 8 weeks to first production model.
Pricing#
- $25K pilot (single use case, 8 weeks, fully operator-run)
- $75K deployment (VPC installation + 4 use cases, 6 months)
- $10K/month retainer (support + updates + new agent integrations)
Differentiation#
- Compliance-first: Audit trail, denial log, retention policy, model cards, fairness reports are built-in, not add-ons
- VPC-installable: No data egress; works in air-gapped regulated environments
- Plugin-extensible: Claude Code plugin ecosystem opens hundreds of integrations (Linear, Sentry, Grafana, Slack, Snowflake…) with admin-approval gates
- Operator observability: Transparent feature flags + permission denial log + cron + batch + retention all surfaced in one /transparency page
- Compliance as config: BFSI customers author custom rules in permission_policies.yaml; no code changes, source-attributed in audit
Architecture overview#
┌────────────────────────────── User surfaces ───────────────────────────────┐
│ │
│ Dashboard (Next.js 15 + React 19) │
│ /pipelines · /agents · /deployments · /plugins · /cron │
│ /settings · /transparency · /quality │
│ │
│ CLI (`swarm` — stdlib argparse + httpx over REST) │
│ │
│ REST API (FastAPI, JWT bearer auth + OIDC SSO, /api/v1/*) │
│ │
└────────────────────────────────────────────────────────────────────────────┘
│
▼
┌──────────────────────────── Orchestration layer ───────────────────────────┐
│ │
│ Unified Permission Engine (W7-1) — one decision per tool / HTTP call │
│ Plugin Hook Lifecycle (W7-2) — SessionStart / Pre·Post Tool / Compaction │
│ │
│ Orchestrator ↔ LangGraph supervisor ↔ Native orchestrator │
│ │ │
│ ▼ │
│ ML Director → 7 team leads → 33 worker agents │
│ (tool_set per agent; allowlist gated) │
│ │
└────────────────────────────────────────────────────────────────────────────┘
│
▼
┌────────────────────────────── Agent runtime ───────────────────────────────┐
│ │
│ AgentRunner (ReAct loop) │
│ ├── SESSION_START hook (inject system-prompt overrides) │
│ ├── skill registry match + inject │
│ ├── iteration loop │
│ │ ├── context compaction (80% window, W6) │
│ │ ├── LLM call (prompt cache on anthropic/openai) │
│ │ ├── tool dispatch (parallel where safe) │
│ │ │ ├── PRE_TOOL hook (PII mask, arg mutation) │
│ │ │ ├── Permission engine check → ALLOW / ASK / DENY │
│ │ │ ├── ToolExecutor / MCP composite │
│ │ │ └── POST_TOOL hook (redact result content) │
│ │ └── conversation store write (buffered JSONL) │
│ └── terminal response → evaluator grade (optional, W6) │
│ │
└────────────────────────────────────────────────────────────────────────────┘
│
▼
┌──────────────────────────── Persistence + I/O ─────────────────────────────┐
│ │
│ SQLite (runs, events, deployments, shadow_predictions, │
│ plugins, permission_denials, users) │
│ Filesystem (conversations/*.jsonl, compacted_*.jsonl, │
│ audit_report_*.pdf, cron/jobs.json, batch/{id}/*) │
│ LLM HTTP client (shared per provider+tier, prompt-cached) │
│ MCP clients (stdio + streamable-HTTP, v2025-11-25) │
│ │
└────────────────────────────────────────────────────────────────────────────┘
│
▼
┌──────────────────────────── Background daemons ────────────────────────────┐
│ │
│ Retention sweep (24h) — prunes JSONL / events / audit PDFs past TTL │
│ Cron scheduler (60s) — retrain / drift_check / audit_pdf / custom │
│ │
└────────────────────────────────────────────────────────────────────────────┘
Request lifecycles#
Pipeline run (happy path):
1. Operator POSTs to /api/v1/pipelines with {problem_statement, data_path, pipeline_config}
2. require_role(Role.OPERATOR) → permission engine → ALLOW
3. _start_pipeline_inproc creates run row in SQLite + submits to _executor thread pool
4. _run_pipeline_background boots the backend (native or langgraph), creates RunMemory + ApprovalStore + ConversationStore
5. Director agent runs its ReAct loop, delegates to team leads, workers fire tools
6. Each tool call: permission engine check → hook PRE_TOOL → dispatch → hook POST_TOOL → record
7. Pipeline reaches terminal response → run status updated → audit PDF if auto_audit_pdf flag on
HITL approval pause:
1. Tool calls require_approval(...) → raises ApprovalRequired
2. ToolExecutor.execute re-raises (W7's fix prevents this being swallowed)
3. AgentRunner.run catches it, persists gate into ApprovalStore, appends [AWAITING APPROVAL] message, re-raises
4. Orchestrator catches it, sets pipeline status to awaiting_approval
5. Dashboard surfaces the gate; operator approves via /api/v1/pipelines/{id}/approvals/{gate}
6. _resume_pipeline_after_approval restores state from gate's resumable_state dict and resumes
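The pause path in steps 1 to 4 depends on the approval exception travelling up through each layer without being converted into an ordinary tool error. The sketch below mirrors that propagation with simplified stand-ins (`execute`, `run_agent`, `run_pipeline`, a dict in place of ApprovalStore); these are illustrative assumptions, not the platform's actual classes or signatures.

```python
class ApprovalRequired(Exception):
    """Raised by a tool that needs a human decision before continuing."""

    def __init__(self, gate_id: str, resumable_state: dict):
        super().__init__(gate_id)
        self.gate_id = gate_id
        self.resumable_state = resumable_state


def deploy_tool(model: str):
    # Step 1: the tool asks for approval by raising.
    raise ApprovalRequired("deploy-1", {"model": model, "step": "deploy"})


def execute(tool, *args):
    try:
        return {"ok": True, "result": tool(*args)}
    except ApprovalRequired:
        raise  # Step 2: never turn an approval pause into an error result.
    except Exception as exc:
        return {"ok": False, "error": str(exc)}


def run_agent(store: dict, messages: list):
    try:
        return execute(deploy_tool, "credit_model_v3")
    except ApprovalRequired as gate:
        # Step 3: persist the gate, mark the transcript, re-raise.
        store[gate.gate_id] = gate.resumable_state
        messages.append("[AWAITING APPROVAL]")
        raise


def run_pipeline(store: dict, messages: list) -> str:
    try:
        run_agent(store, messages)
    except ApprovalRequired:
        return "awaiting_approval"  # Step 4: orchestrator records the pause.
    return "completed"


approval_store, transcript = {}, []
status = run_pipeline(approval_store, transcript)
```

The explicit `except ApprovalRequired: raise` ahead of the generic handler is the important part: without it, the broad `except Exception` would swallow the pause, which is the failure mode step 2 refers to.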
Permission denial (compliance query):
1. Agent attempts tool not in its allowlist
2. ToolExecutor.execute builds PermissionContext → permissions.check(ctx)
3. agent_allowlist_source emits DENY rule → check returns PermissionDecision(DENY, …)
4. permission_audit.record_denial writes row to permission_denials table
5. build_denied_result returns tool_not_allowed payload to the agent
6. Auditor later queries GET /api/v1/permissions/denials?tool=run_bash&since=<epoch>
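A rough sketch of how a single decision could be derived from rules that carry a tool-name glob, behaviour, source, priority, and optional argument regex (the fields listed in the glossary). The `Rule` dataclass, the `check` signature, the fail-closed default, and the priority of 30 for the allowlist are assumptions for illustration; priorities 60 and 45 to 50 follow the tier semantics quoted in the guardrails section.

```python
import fnmatch
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Rule:
    tool_glob: str
    behavior: str                  # "ALLOW", "ASK" or "DENY"
    source: str                    # recorded so every denial is attributable
    priority: int
    arg_regex: Optional[str] = None


def check(rules, tool_name: str, args: str = ""):
    """Return (behavior, source) from the highest-priority matching rule."""
    matches = [
        r for r in rules
        if fnmatch.fnmatchcase(tool_name, r.tool_glob)
        and (r.arg_regex is None or re.search(r.arg_regex, args))
    ]
    if not matches:
        return ("DENY", "default")  # assumption: fail closed when nothing matches
    best = max(matches, key=lambda r: r.priority)
    return (best.behavior, best.source)


rules = [
    Rule("run_*", "ALLOW", "operator_policy", 45),
    Rule("run_bash", "DENY", "invariant_guardrail", 60),
    Rule("read_file", "ALLOW", "agent_allowlist", 30),
    Rule("read_file", "DENY", "yaml_policy", 50, arg_regex=r"\.env$"),
]
```

Because the winning rule's `source` is returned with the decision, the caller can write it straight into the denial record, which is what makes the later audit query source-attributed.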
Codebase map#
/swarm
├── MASTER_README.md ← you are here
├── README.md — engineering quick-start
├── AGENTS.md / CLAUDE.md / GEMINI.md — per-harness guidance
├── docs/
│ ├── SYSTEM-DESIGN.md — deep architecture reference
│ ├── EXTENDING.md — how to add agents, tools, roles, plugins
│ ├── BENCHMARKS.md — bench baseline + running instructions
│ └── ...
├── scripts/ — one-off + CI scripts
├── examples/
│ └── plugins/
│ └── hello-swarm/ — reference plugin (MCP + skill + hook)
├── ml_team/ — main Python package (installed as `ml_team.*`)
│ ├── MASTER_README → references this file
│ ├── README.md — engineering quick-start (long-form)
│ ├── pyproject.toml — package metadata + deps
│ ├── core/ [IMPL + LEARNING] — engine primitives
│ │ permissions, hooks, cron, batch, compaction, evaluator,
│ │ plugin_loader, skill_registry, mcp_client, agent_runner,
│ │ conversation_store, approval, feature_flags, retention,
│ │ tool_executor, llm_client, memory, interfaces, team_factory,
│ │ orchestrator, graph_executor, rag, types
│ ├── api/ [IMPL + LEARNING] — FastAPI surface
│ │ ├── app.py, auth.py, users.py, database.py, metrics.py,
│ │ │ rate_limit.py, logging_config.py, oidc.py
│ │ └── routers/ [IMPL + LEARNING] — REST endpoints
│ │ agents, batch, chat, cron, datasets, deployments, docs,
│ │ evaluations, features, inference, knowledge, mcp, models,
│ │ permissions, pipelines, plugins, auth
│ ├── tools/ [IMPL + LEARNING] — agent-callable tools
│ │ ml_tools, training, explainability, fairness, drift,
│ │ model_card, audit_pdf, champion_challenger, deploy,
│ │ docker_tools, http_tools, git_ops, gpu_tools, memory_tools,
│ │ mlflow_tools, search, execution, file_ops
│ ├── backends/ [IMPL + LEARNING] — pipeline execution
│ │ native_backend, langgraph_backend, base
│ ├── config/ [IMPL + LEARNING] — declarations
│ │ agent_defs, team_definitions.yaml, mcp_servers.yaml,
│ │ permission_policies.yaml, plugins.yaml
│ ├── tests/ [IMPL only] — testing strategy
│ │ ├── bench/ — pytest-benchmark suite
│ │ ├── e2e/ — live-LLM end-to-end
│ │ └── test_*.py — unit + integration (550+ tests)
│ ├── dashboard/ [IMPL + LEARNING] — Next.js 15 app
│ │ └── src/
│ │ ├── app/ — page router (pipelines, agents, …)
│ │ ├── components/ — shared React components
│ │ └── lib/ — API client + auth context
│ └── cli.py — `swarm` CLI entry point
├── pipeline_runs/ — generated at runtime (per-run work_dir)
├── .project/
│ ├── decisions.md — ADR log (append-only)
│ ├── architecture.md — evolving architecture notes
│ └── journal.md — chronological narrative
└── .github/
└── workflows/ — CI (pytest, bench-nightly, lint)
Folders marked [IMPL + LEARNING] carry the two-README pair per this doc system. Folders in tests/ carry only IMPLEMENTATION_README.md.
Major workflows#
Train a model end-to-end#
POST /api/v1/pipelines
↓
ML Director → Data Team (profile, clean)
→ Algorithm Team (pick algorithm, fetch repo)
→ Training Team (train + tune)
→ Evaluation Team (metrics, fairness, SHAP)
→ Deployment Team (package Docker, K8s manifests, shadow-log)
→ Quality Team (model card, audit PDF, reproducibility check)
→ Management (summary report)
↓
GET /api/v1/pipelines/{id}/audit-report.pdf
Schedule automated retraining#
POST /api/v1/cron/jobs {"schedule":"0 2 * * *","task_kind":"retrain",
"task_config":{"problem_statement":"...","data_path":"..."}}
↓
cron.tick() runs at the scheduled time
↓
cron_tasks.task_retrain calls _start_pipeline_inproc(...)
↓
Same pipeline flow as above, logged with cron job_id
Score a batch of records#
POST /api/v1/batch {"processor_kind":"inference","records":[...],
"processor_config":{"model_path":"..."}}
↓
ThreadPoolExecutor dispatches records through the inference processor
↓
results.jsonl streams in; checkpoint every 10 records; resume supported
↓
GET /api/v1/batch/{id} returns aggregate stats + tool-usage summary
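The checkpoint-and-resume behaviour can be sketched as follows: results are buffered, flushed to results.jsonl every 10 records, and the checkpoint is advanced only after the flush. The function `run_batch`, the `checkpoint.json` file name, and the `fail_at` crash simulation are hypothetical; the actual batch runner may use threads and a different on-disk layout.

```python
import json
import tempfile
from pathlib import Path

CHECKPOINT_EVERY = 10  # interval quoted in the workflow above


def run_batch(records, processor, out_dir: Path, fail_at=None) -> int:
    """Process records in order, resuming from the last checkpoint if present."""
    ckpt = out_dir / "checkpoint.json"   # hypothetical file name
    results = out_dir / "results.jsonl"
    start = json.loads(ckpt.read_text())["next"] if ckpt.exists() else 0
    buffer = []

    def flush(next_index: int) -> None:
        # Append buffered rows first, then advance the checkpoint.
        with results.open("a") as fh:
            fh.writelines(json.dumps(row) + "\n" for row in buffer)
        buffer.clear()
        ckpt.write_text(json.dumps({"next": next_index}))

    for i in range(start, len(records)):
        if i == fail_at:
            raise RuntimeError("simulated crash")
        buffer.append({"index": i, "output": processor(records[i])})
        if (i + 1) % CHECKPOINT_EVERY == 0:
            flush(i + 1)
    flush(len(records))
    return len(records)


with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp)
    data = list(range(25))
    try:
        run_batch(data, lambda x: x * 2, out, fail_at=17)   # first attempt dies
    except RuntimeError:
        pass
    resumed_from = json.loads((out / "checkpoint.json").read_text())["next"]
    run_batch(data, lambda x: x * 2, out)                   # resume to the end
    rows = [json.loads(line) for line in (out / "results.jsonl").read_text().splitlines()]
```

Unflushed work since the last checkpoint is redone on resume rather than duplicated. A crash between the append and the checkpoint write could still duplicate rows in this sketch, so a production runner would need an extra guard there.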
Install a plugin (MCP + skill + hook)#
admin$ swarm plugins install /path/to/my-plugin
↓
Plugin manifest SHA-256 pinned
↓
MCP servers registered with MCPToolProvider (namespaced: plugin:<name>:<server>)
Skills parsed into SkillRegistry (matched against agent task context)
Hooks registered into HookRegistry (fire on SessionStart / PreTool / PostTool / …)
↓
Subsequent pipeline runs route through the plugin's contributions
Pull a compliance report#
GET /api/v1/permissions/denials?since=<start_of_quarter>
↓
permission_denials SQLite table → JSON rows
↓
Each row: tool_call_id, tool_name, agent_name, rule_source, reason, HTTP context
↓
Feed into regulator-facing Excel / PDF
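For auditors with direct database access, the same question ("why was tool X blocked for agent Y between T1 and T2?") can be asked of the table itself. The sketch below uses an in-memory SQLite database; the column names follow the row fields listed above, while the `created_at` epoch column, the helper `denials`, and the sample rows are assumptions, so the real schema may differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE permission_denials (
           tool_call_id TEXT, tool_name TEXT, agent_name TEXT,
           rule_source TEXT, reason TEXT, created_at REAL)"""
)
conn.executemany(
    "INSERT INTO permission_denials VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("c1", "run_bash", "data_cleaner", "agent_allowlist", "tool_not_allowed", 100.0),
        ("c2", "run_bash", "data_cleaner", "yaml_policy", "blocked by policy", 250.0),
        ("c3", "http_get", "researcher", "agent_allowlist", "tool_not_allowed", 300.0),
    ],
)


def denials(conn, tool: str, agent: str, t1: float, t2: float):
    """Return (rule_source, reason) for denials of `tool` for `agent` in [t1, t2]."""
    return conn.execute(
        "SELECT rule_source, reason FROM permission_denials "
        "WHERE tool_name = ? AND agent_name = ? AND created_at BETWEEN ? AND ? "
        "ORDER BY created_at",
        (tool, agent, t1, t2),
    ).fetchall()


rows = denials(conn, "run_bash", "data_cleaner", 0, 200)
```

The query is parameterised throughout, in line with the SQL rule under Engineering standards.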
Engineering standards#
Code style#
- Python: f-strings, type hints, pathlib, 4-space indent. See ml_team/pyproject.toml [tool.ruff].
- TypeScript: 2-space indent, strict types, no any without comment. Dashboard is Next.js 15 + React 19; see ml_team/dashboard/AGENTS.md for framework notes.
- SQL: parameterised only. Schema changes in ml_team/api/database.py::init_db().
- Config: YAML for human-authored (policies, agents, teams). JSON for machine-generated (plugin manifests, batch checkpoints).
Testing#
- Unit tests colocated under ml_team/tests/test_*.py.
- Benchmarks in ml_team/tests/bench/ (skipped by default; nightly CI).
- E2E live-LLM in ml_team/tests/e2e/ (manual runs only).
- Target: all tests stay green on master. A red test is a P0.
- Regression: .venv/bin/pytest ml_team/tests --tb=short -q --deselect ml_team/tests/test_deploy_tools.py::test_package_model_actually_builds_if_docker_present.
Feature flags#
- Every behaviour toggle lives in ml_team/core/feature_flags.py::_FEATURES_LIST.
- Three tiers: INVARIANT (regulator-mandated, refuses override), FLAG (operator-togglable), EXPERIMENT (default off, opt-in).
- New features default-off unless they're operational primitives (cron, batch).
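The three tiers can be pictured as a small registry in which an invariant refuses any override while flags and experiments can be toggled. This is a sketch under assumptions: the `Tier`, `Feature`, and `FeatureRegistry` names and the `permission_audit` and `parallel_tool_calls` entries are hypothetical (only `auto_audit_pdf` is named elsewhere in this document), and the real registry in feature_flags.py may be structured differently.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    INVARIANT = "invariant"    # regulator-mandated, refuses override
    FLAG = "flag"              # operator-togglable
    EXPERIMENT = "experiment"  # default off, opt-in


@dataclass
class Feature:
    name: str
    tier: Tier
    enabled: bool


class FeatureRegistry:
    def __init__(self, features):
        self._features = {f.name: f for f in features}

    def is_enabled(self, name: str) -> bool:
        return self._features[name].enabled

    def toggle(self, name: str, enabled: bool) -> None:
        feature = self._features[name]
        if feature.tier is Tier.INVARIANT:
            raise PermissionError(f"{name} is invariant and cannot be overridden")
        feature.enabled = enabled


registry = FeatureRegistry([
    Feature("permission_audit", Tier.INVARIANT, True),        # hypothetical name
    Feature("auto_audit_pdf", Tier.FLAG, True),
    Feature("parallel_tool_calls", Tier.EXPERIMENT, False),   # hypothetical name
])
registry.toggle("parallel_tool_calls", True)   # operator opts in to an experiment
try:
    registry.toggle("permission_audit", False)
    refused = False
except PermissionError:
    refused = True
```

Raising on an invariant override, rather than silently ignoring it, keeps the refusal visible to whoever attempted the change.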
Commits#
- Conventional-commit prefix: feat, fix, docs, test, perf, refactor, chore.
- Body explains "why" + impact + test count.
- Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> trailer when AI-assisted.
Documentation#
- Every meaningful folder has IMPLEMENTATION_README.md + LEARNING_README.md.
- Root has this MASTER_README.md.
- Docs update in the same commit as the code change they describe.
- See .project/decisions.md for ADRs (one per significant design decision).
Current priorities#
(Updated with each work cycle. Source: user discussion + .project/decisions.md + most-recent commits.)
Now#
- BFSI pilot readiness — Track 1 (customer composability) + Track 2 (15 guardrails) + ship tooling all in v0.12.0. Procurement pack (architecture / STRIDE / CAIQ Lite) shipped. Remaining: SOC 2 Type I readiness, pen-test, MSA/DPA/DPIA legal templates, first design-partner conversation.
- Documentation system — MASTER_README + README + 4 training modules shipped; 6 more in progress (Track 2 coverage, customer-deployment authoring, verification flow).
Next#
- First BFSI design-partner conversation: innovation labs first, not central procurement (see .project/security/caiq_lite.md § Top 10 gaps for the procurement sequence)
- SOC 2 Type I readiness assessment via Drata or Vanta (free gap analysis; 3-4 month track to certification)
- Pen-test engagement (Lucideus / Cobalt quotes, Q2 2026 target)
- MSA + DPA + DPIA templates via BFSI-savvy Indian tech lawyer
- Profile bootstrap wiring (~40 lines), in progress; auto-activates all 6 configurable guardrails from deployments/<customer>/config.yaml::guardrail_configs
- swarm deploy diff + swarm deploy rotate-secret (Doppler-wired)
Deferred (explicit, from plan file)#
- QueryEngine AsyncGenerator refactor (UX-only, no compliance win)
- ACP adapter for VS Code / JetBrains (dev-tool market, not our buyer)
- Coordinator mode (only valuable >15 teams)
- Semantic compaction v2 (current basic works)
- Hermes multi-platform gateway (Telegram/Discord/Signal — not our surface)
- Async orchestration core (item #9 — revisit when >4 concurrent pipelines needed)
- Auto-retry / HITL escalation on low evaluator grades (needs customer input)
Risks & assumptions#
Technical risks#
| Risk | Mitigation |
|---|---|
| Permission engine adds ~30µs per tool dispatch (~17× baseline) | Absolute cost negligible vs LLM latency (~500ms). Bench tracks regression. |
| Context compaction is heuristic (char-count / 4 ≈ tokens) | Conservative 80% threshold leaves margin. Upgrade path to tiktoken documented. |
| SQLite is single-writer | WAL mode on, per-thread conns. High-concurrency deployments switch to Postgres (migration path via SQLAlchemy planned). |
| Single-node cron + batch | k8s CronJobs or k8s Jobs for multi-host fan-out. Documented non-goal. |
| Plugin SHA-pinning is manifest-only, not server code | Good-enough for Phase A; Phase B may add handler-file SHA when shell-handlers land. |
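The compaction heuristic in the table above (character count divided by four as a token estimate, compaction at 80% of the window) amounts to a check like the one below. The function names and the sample window size are assumptions; the estimate is deliberately crude, which is why the threshold leaves headroom.

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate: one token per four characters."""
    return len(text) // 4


def should_compact(messages, context_window: int, threshold: float = 0.80) -> bool:
    """True once estimated usage reaches the threshold share of the window."""
    used = sum(estimate_tokens(m) for m in messages)
    return used >= threshold * context_window


history = ["x" * 4000] * 7                      # roughly 7,000 estimated tokens
below = should_compact(history, 10_000)         # under the 80% line
above = should_compact(history + ["x" * 4000] * 2, 10_000)  # roughly 9,000
```

Swapping `estimate_tokens` for a real tokenizer would change only the first function, which is the upgrade path the mitigation column mentions.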
Business risks#
| Risk | Mitigation |
|---|---|
| Mid-market CIOs skeptical of LLM reliability | HITL gates, model cards, audit PDFs, signed conversation JSONL — compliance-first framing |
| Data leaves the VPC? | Hard no. Swarm installs in the customer's network; LLM calls optionally go to customer-hosted vLLM/Ollama instead of OpenAI. |
| Regulatory drift (RBI updates FREE-AI) | Feature flag tiers + rule-source attribution let us ship compliance amendments as YAML updates, not code pushes |
| Single-founder bus factor | Documentation system + ADR log + agent-driven workflows mean every design choice is recoverable from git + docs |
Assumptions (worth revisiting if they break)#
- BFSI customers prefer VPC-installable products over cloud SaaS (true today; DPDP + RBI data-localisation reinforce)
- Claude Code plugin ecosystem continues growing (base shared with Claude Code — our plugin consumers win from every community MCP)
- Prompt cache + HTTP pool + schema cache continue to dominate LLM-path latency (verified via W1 bench; re-verify annually)
Glossary#
| Term | Meaning |
|---|---|
| Agent | A single LLM persona with a system prompt + tool allowlist + tier (director/coordinator/worker). Defined in config/agent_defs.py. |
| Team | A group of agents with one coordinator lead + N workers. 7 teams total. Declared in config/team_definitions.yaml. |
| Pipeline config | A named graph of teams to execute (e.g. fast_prototype skips quality-team for speed). |
| Run (pipeline_run) | One invocation of a pipeline for a problem statement. Identified by run_id. |
| Work dir | Per-run filesystem scratch space at pipeline_runs/{run_id}/. Models, logs, audit PDFs land here. |
| Approval gate | A pause point where a human must approve before the pipeline continues. 6 gate types. |
| Tool | A Python function agents can call. Each has a schema; agent invokes via LLM tool-call. |
| Allowlist | Per-agent set of tool names it's permitted to call. Denials go through permission engine. |
| MCP server | External tool provider speaking Model Context Protocol. stdio or streamable-HTTP. |
| Plugin | A Claude Code plugin directory containing any of .mcp.json, skills/, hooks/. |
| Skill | A Markdown file (SKILL.md) with frontmatter; injected into an agent's system prompt on keyword match. |
| Hook | A Python callable registered on a lifecycle event (SESSION_START, PRE_TOOL, POST_TOOL, PRE/POST_COMPACTION). |
| Feature flag | Named operator-togglable behaviour switch. 3 tiers. Registry at feature_flags.py. |
| Permission rule | One row in the decision engine. Has tool_name glob, behavior, source, priority, optional arg regex. |
| Denial | A DENY permission decision, logged to permission_denials table with full context. |
| Cron job | A scheduled task (retrain / drift_check / audit_pdf / custom). File-backed store. |
| Batch run | A JSONL dataset processed through a pinned processor with checkpointing. |
| RBI FREE-AI | Reserve Bank of India's Aug-2025 framework for responsible AI in financial services. 7 Sutras. |
| SR 11-7 | US Federal Reserve's model risk management guidance. Swarm aligns. |
| DPDP | India's Digital Personal Data Protection Act. Data residency + consent + retention requirements. |
Project evolution log#
(High-level timeline. Detailed ADRs in .project/decisions.md; commit-level narrative in .project/journal.md.)
Weeks 1–6 (perf + extensibility)#
- W1: Benchmark harness + prompt caching + shared HTTP client + shared schema cache (664× agent-build speedup, 96× schema-cache)
- W2: Feature flag registry + observability controls + dashboard /settings + /transparency + retention daemon + I/O batching (6.5× SQLite, 3.9× JSONL)
- W3: Intra-agent parallel tool calls (2.9× wall-clock on multi-tool turns)
- W4: MCP streamable-HTTP transport (spec 2025-11-25) + Claude Code plugin loader Phase A (MCP ingestion)
- W5: Plugin skills ingestion (Phase B) + dashboard /plugins page
- W6: Context compaction (80% threshold) + evaluator-generator separation (optional per-agent rubric grading)
Week 7 (compliance + ops pack, current)#
- W7-1: Unified permission engine — RBAC + allowlist + HITL + flags + YAML policy fold into one rule engine with source-attributed denial audit
- W7-2: Plugin hook lifecycle — 5 events, server-side PII masking via reference plugin
- W7-3: Cron scheduler (vendored from Hermes) — schedule strings, 4 task kinds, REST + CLI + dashboard
- W7-4: Batch runner — JSONL input, 3 processors, checkpointing + resume, tool-usage aggregator
- W7-5: ADRs + README + /transparency refresh + cross-feature integration smoke
Running totals: 545 passing tests, 22 shipped commits this resume session, seven dashboard pages, 28 metrics, 25 feature flags.
Pre-W1 (foundation, archived)#
- Security & compliance foundation: tool allowlists, RBAC + OIDC, audit PDF, champion-challenger
- RBI FREE-AI bundle: fairness + drift + SHAP + model cards
- Dashboard: pipelines, agents, deployments, docs browser
- CI: nightly pytest + bench workflow
How to use this file#
- Starting a new work session: read this top-to-bottom once, then jump to the relevant folder's LEARNING_README.md for the mental model + IMPLEMENTATION_README.md for the contract.
- Resuming mid-thread: find the last entry in .project/journal.md, then read only affected folders' READMEs.
- Onboarding a new engineer: this doc → ml_team/README.md (quick-start) → docs/SYSTEM-DESIGN.md → clone and run.
- Onboarding a new CIO buyer: this doc → docs/SYSTEM-DESIGN.md → live demo.
- Investigating a denial: ml_team/api/routers/IMPLEMENTATION_README.md §permissions + ml_team/core/IMPLEMENTATION_README.md §permissions → SQL the permission_denials table.
Maintenance#
This file is the source of truth. Updates are mandatory when:
- A new subsystem lands (folder with its own READMEs)
- GTM / positioning / pricing changes
- Architecture changes (backend swap, new persistence layer, new daemon)
- Risks materialise (or get closed)
- New personas emerge
Changes to this file land in the same commit as the code that prompted them. PRs that touch core subsystems without updating this file are incomplete — the CI guard (see .github/workflows/doc-drift.yml) enforces it.