
Platform overview (MASTER_README)#

The source-of-truth product + system + operating-model document. Canonical copy lives at MASTER_README.md in the repo; this page is auto-rendered via include-markdown on every push to master.

Source of truth for the product, the system, and the operating model. Read this first when onboarding or resuming work. Engineering quick-start lives in README.md; this document is the layer above — why the platform exists, who it's for, how it's organised, and how to reason about it.

Latest release: v0.12.0 — first signed release. 12-week platform refactor + guardrails subsystem + ship tooling complete. 628 → 1258 tests (+630, zero regressions). See CHANGELOG.md for the full breakdown. BFSI procurement pack in .project/security/ (architecture, STRIDE threat model, pre-filled CAIQ Lite questionnaire). Customer artifact verifiable with one cosign verify-blob command.


Vision#

Make agentic MLOps installable by a regulated mid-market bank in 8 weeks. The ML lifecycle — data prep through deployment, fairness audit, drift monitoring, model cards, audit trail — delivered as a VPC-installable product, not a services engagement, with every step compliance-ready for Indian BFSI regulators.

Core values#

  1. Compliance is a first-class feature, not a bolt-on. Every subsystem emits audit. Every denial is source-attributed. Every artefact has a retention policy. BFSI regulators don't grade the feature matrix — they grade the evidence.
  2. Operator control over every observable channel. If an operator can see it, they can turn it off. Three tiers: invariant (regulator-mandated, immutable), flag (operator-togglable), experiment (opt-in).
  3. No framework lock-in. LangGraph is one backend; the native orchestrator is another; both share the same agent definitions. Swapping providers (OpenAI, Anthropic, local vLLM) is config, not code.
  4. Explicit over implicit. Permission denials, feature-flag states, plugin manifests, audit PDFs — all live where auditors can grep them, not buried in opaque state machines.
  5. Measure before you optimise. Every performance claim has a bench baseline in ml_team/tests/bench/ with a nightly diff.

What the platform is#

A Python + FastAPI backend + Next.js dashboard that orchestrates 40 AI agents across 7 teams in a 3-tier hierarchy (director → coordinators → workers) to run the full ML lifecycle. The agents cooperate via a supervisor-worker pattern: tools are dispatched through a per-agent allowlist, and every tool call is audited.

Deliverables per run:

  • Trained model (joblib bundle + metrics sidecar)
  • Fairness audit JSON (fairlearn, per-protected-attribute group)
  • Drift report JSON (PSI + KS + chi²)
  • SHAP explanations (tree-native fast path)
  • Model card (Markdown, assembled from above)
  • Audit PDF (reportlab, tamper-evident SHA-256 hash)
  • Deployable Docker image + K8s manifests (HPA, non-root, readOnlyRootFilesystem)
  • Full conversation trail (JSONL per agent)
  • Prometheus metrics covering every phase

What swarm is NOT#

  • Not a generic agent framework — the 40 agents have ML-specific responsibilities defined in ml_team/config/agent_defs.py and ml_team/config/team_definitions.yaml.
  • Not an AutoML product — human-in-the-loop gates are intentional; the agents propose, humans approve deployments.
  • Not a cloud service — we ship to the customer's VPC. Data never leaves their network.
  • Not a dev-tool or IDE agent — we target MLOps workflows, not code editing.

Target users & personas#

| Persona | What they do with swarm | What they need from it |
|---|---|---|
| Tier-2/3 bank Head of Risk | Approves deployments, reviews audit PDFs | One-click model cards, tamper-evident audit, RBI FREE-AI alignment evidence |
| BFSI data scientist (IC) | Submits problem statements, reviews trained models, tunes agents | REST + CLI to kick off runs; /playground to test models; conversation trail to debug |
| Platform operator / ML engineer | Installs + maintains swarm in VPC | /settings feature flags, /cron scheduling, /plugins extensions, observability via Prometheus |
| Compliance auditor (internal) | Pulls denial logs, retention records, fairness reports | /transparency page lists every metric + artefact + endpoint + denial; permission_denials SQLite table is the compliance ledger |
| External auditor (RBI inspection) | Verifies model lifecycle compliance | Audit PDF + model card + signed conversation JSONL, reproducible work_dir |
| CIO / buyer | Evaluates swarm for procurement | MASTER_README.md (this doc), docs/SYSTEM-DESIGN.md, live dashboard demo |

Problems solved#

  1. Compliance cost. RBI FREE-AI (Aug 2025) and DPDP Act require fairness audit, model cards, explainability, drift monitoring, audit trail — independently building these costs a BFSI IT shop 12–18 months. Swarm ships all seven Sutras on day one.
  2. Mid-market lock-out. DataRobot / Dataiku start at $300K/year and are cloud-only. Fractal / Tiger engagements are open-ended services. Mid-market banks can't afford either. Swarm is $25K pilot → $75K deployment → $10K/mo retainer, VPC-installable, first model in 8 weeks.
  3. Talent scarcity. Hiring five ML engineers in India takes 9 months at Tier-2/3 bank compensation. Swarm lets an existing platform team + one ML lead ship production models.
  4. Audit reconstruction. "Why was tool X blocked for agent Y between T1 and T2?" takes days of log-grepping in ad-hoc systems. Swarm's unified permission engine (W7-1) answers it in one SQL query.
  5. Scheduled ML ops. Nightly drift checks, periodic retraining, automated audit-PDF generation today live in ad-hoc cron scripts per customer. Swarm's cron scheduler (W7-3) makes them a first-class REST + CLI + dashboard primitive.

Key capabilities (as shipped)#

Pipeline orchestration#

  • 40-agent hierarchical team with circuit-breaker-protected delegation
  • Two backends (LangGraph supervisor + native orchestrator) sharing one agent config
  • Pipeline configs: fast_prototype, default_ml_pipeline, parallel_research
  • Intra-agent parallel tool calls (3–5× speedup, experiment-flagged)
  • Context compaction at 80% window (keeps long runs alive)
  • Evaluator-generator separation (optional per-agent rubric grading)

Compliance & audit (RBI FREE-AI aligned)#

| Sutra | Implementation |
|---|---|
| Safety | Per-agent tool allowlists + unified permission engine (W7-1) + permission_denials audit table |
| Transparency | Model card generator + audit PDF with tamper-evident SHA-256 |
| Accountability | Per-agent conversation JSONL, persistent across restarts |
| Fairness | fairlearn-based bias audit (demographic parity, equal opportunity) |
| Explainability | SHAP TreeExplainer + Explainer fallback |
| Sustainability | Per-agent token + cost metrics + retention sweep |
| Model lifecycle | Drift detection (PSI + KS + chi²) + champion-challenger with shadow log |
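Two of the statistics named in the table are simple enough to sketch in plain Python. This is only an illustration of the standard definitions (PSI over pre-binned counts, and demographic parity difference as the largest gap in positive-prediction rate between groups); it is not the platform's drift or fairness code, and the function names and the `eps` floor are assumptions.

```python
import math


def psi(expected_counts, actual_counts, eps=1e-6):
    """Population Stability Index over two histograms with identical bins."""
    if len(expected_counts) != len(actual_counts):
        raise ValueError("histograms must share the same bins")
    e_total, a_total = sum(expected_counts), sum(actual_counts)
    total = 0.0
    for e, a in zip(expected_counts, actual_counts):
        # Floor each fraction at eps so an empty bin does not produce log(0).
        e_frac = max(e / e_total, eps)
        a_frac = max(a / a_total, eps)
        total += (a_frac - e_frac) * math.log(a_frac / e_frac)
    return total


def demographic_parity_difference(predictions, groups):
    """Largest gap in positive-prediction rate between any two groups."""
    totals, positives = {}, {}
    for pred, group in zip(predictions, groups):
        totals[group] = totals.get(group, 0) + 1
        positives[group] = positives.get(group, 0) + (1 if pred else 0)
    rates = [positives[g] / totals[g] for g in totals]
    return max(rates) - min(rates)
```

A PSI of zero means the two histograms have the same shape; any shift in bin proportions pushes it above zero. A parity difference of zero means every group receives positive predictions at the same rate.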

Auth & access control#

  • 3 roles (admin / operator / viewer) with precedence
  • JWT bearer (HS256, 24h TTL)
  • OIDC SSO via Okta / Azure AD / Google Workspace (authorization-code + PKCE)
  • Legacy X-API-Key preserved for back-compat
  • Unified permission engine (W7-1) — RBAC + allowlist + HITL + flags + YAML policy all feed one decision pipeline
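To make the JWT bullet concrete, here is a minimal standard-library sketch of issuing and checking an HS256 bearer token with a 24-hour expiry. The claim names (`sub`, `role`, `iat`, `exp`) and the function names are assumptions for illustration; the platform's own auth module may structure tokens differently.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional

TTL_SECONDS = 24 * 60 * 60  # 24h TTL, matching the bullet above


def _b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _b64url_decode(text: str) -> bytes:
    # Restore the padding that the compact JWT form strips.
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def issue_token(secret: bytes, subject: str, role: str, now: Optional[int] = None) -> str:
    """Build a compact HS256 JWT whose `exp` is 24h after `now`."""
    now = int(time.time()) if now is None else now
    header = {"alg": "HS256", "typ": "JWT"}
    claims = {"sub": subject, "role": role, "iat": now, "exp": now + TTL_SECONDS}
    signing_input = ".".join(
        _b64url(json.dumps(part, separators=(",", ":")).encode("utf-8"))
        for part in (header, claims)
    )
    signature = hmac.new(secret, signing_input.encode("ascii"), hashlib.sha256).digest()
    return f"{signing_input}.{_b64url(signature)}"


def verify_token(secret: bytes, token: str, now: Optional[int] = None) -> dict:
    """Return the claims if signature, algorithm and expiry check out; else raise ValueError."""
    now = int(time.time()) if now is None else now
    parts = token.split(".")
    if len(parts) != 3:
        raise ValueError("malformed token")
    header_b64, claims_b64, sig_b64 = parts
    expected = hmac.new(
        secret, f"{header_b64}.{claims_b64}".encode("ascii"), hashlib.sha256
    ).digest()
    # Constant-time comparison avoids leaking how many signature bytes matched.
    if not hmac.compare_digest(expected, _b64url_decode(sig_b64)):
        raise ValueError("bad signature")
    if json.loads(_b64url_decode(header_b64)).get("alg") != "HS256":
        raise ValueError("unexpected algorithm")
    claims = json.loads(_b64url_decode(claims_b64))
    if now >= claims["exp"]:
        raise ValueError("token expired")
    return claims
```

The `now` parameter exists only so expiry can be exercised deterministically; production code would normally rely on a maintained JWT library rather than hand-rolled encoding.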

HITL (Human-in-the-loop)#

  • 6 gate types (deploy, data request, manual, security, cost, custom)
  • Pipeline pauses, state persists, resume from checkpoint after approval
  • Dashboard surfaces pending gates in real time

Operator surfaces#

  • REST API under /api/v1/* — authoritative
  • Next.js dashboard: /pipelines, /agents, /deployments, /plugins, /cron, /settings, /transparency
  • CLI (swarm) — thin stdlib wrapper over REST; BFSI air-gapped / SSH-only deployments use this
  • Programmatic — import ml_team.core.* directly

Plugin ecosystem (Claude Code format)#

  • MCP server ingestion via .mcp.json (Phase A, Week 4)
  • Skills via skills/<name>/SKILL.md (Phase B, Week 5)
  • Hook lifecycle via hooks/hooks.json (W7-2) — SessionStart / PreTool / PostTool / PreCompaction / PostCompaction
  • Manifest SHA-256 pinning — tamper detection on reload
  • Admin-only install; whitelist at ml_team/config/plugins.yaml
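Manifest pinning can be pictured as "hash at install, re-hash at reload, refuse on mismatch". The sketch below shows that idea with `hashlib`; the function names and the in-memory `pins` dict are hypothetical, and where the platform actually stores its pins is not specified here.

```python
import hashlib
from pathlib import Path


def manifest_digest(manifest_path: Path) -> str:
    """SHA-256 of the manifest file's raw bytes, as a hex string."""
    return hashlib.sha256(manifest_path.read_bytes()).hexdigest()


def pin_manifest(manifest_path: Path, pins: dict) -> None:
    """Record the digest at install time (pins is a hypothetical store)."""
    pins[str(manifest_path)] = manifest_digest(manifest_path)


def verify_manifest(manifest_path: Path, pins: dict) -> bool:
    """True only if the manifest was pinned and still hashes to the pinned value."""
    pinned = pins.get(str(manifest_path))
    return pinned is not None and pinned == manifest_digest(manifest_path)
```

A reload path would call `verify_manifest` first and skip (or flag) the plugin when it returns `False`. Note that this covers the manifest only, not the files it points to.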

Ops primitives (Week 7)#

  • Unified permission engine with YAML-authored custom rules — source-attributed denial audit
  • Plugin hook lifecycle — 5 events, server-side PII masking etc.
  • Cron scheduler — schedule strings (30m / every 2h / 0 9 * * * / ISO), 4 task kinds (retrain, drift_check, audit_pdf, custom)
  • Batch runner — JSONL dataset through pinned processor; checkpointing + resume; tool-usage aggregator

Observability#

  • Prometheus metrics on every subsystem (28+ counters/histograms)
  • OpenTelemetry spans (parent/child, tokens, costs)
  • Real-time WebSocket streaming of pipeline events
  • Conversation JSONL per agent

Customer composability (Track 1 — shipped in v0.12.0)#

Three-layer architecture: core/ engine (never forked) + lib/ versioned shelf (agents, tools, workflows, guardrails, compliance templates) + deployments/<customer>/ (per-customer composition). SWARM_DEPLOYMENT=deployments/hdfc_bank at boot loads only that customer's agents + workflows + branding + compliance profile. BFSI baseline template ships with real RBI FREE-AI permission rules + demographic-parity fairness gate + drift-baseline gate + 2555-day retention; see lib/templates/bfsi_baseline/.

Runtime guardrails subsystem (Track 2 — 15 guardrails, shipped in v0.12.0)#

Six categories: network · execution · input_safety · persistence · platform_integrity · agent_safety. Every guardrail independently versioned in lib/guardrails/<category>/<id>/guardrail.yaml; activation declared per deployment in activate_guardrails + runtime config in guardrail_configs. Auto-configured at API boot via ml_team/core/guardrail_bootstrap.py. Full catalogue: lib/guardrails/README.md; STRIDE mapping: .project/security/threat_model.md.

Tier semantics: invariant (regulator-mandated, beats operator POLICY at priority 60) · flag (operator-togglable at priority 45-50) · off. A BFSI profile sets 12 guardrails as invariant; a generic profile sets 0.

Ship pipeline (swarm deploy, shipped in v0.12.0)#

Four CLI subcommands:

  • swarm deploy new <customer> --template=<>: scaffold from a template (generic_ml / bfsi_baseline / hipaa_baseline)
  • swarm deploy validate <customer>: lint config + lib refs + customer-name match
  • swarm deploy ship <customer>: build-time-filtered tarball (other customers' deployments/ excluded) + MANIFEST.yaml (pinned lib versions + config SHA-256) + auto-generated 5-section security whitepaper
  • swarm deploy whitepaper <customer>: emit standalone markdown whitepaper

Release workflow (.github/workflows/release-supply-chain.yml): signed-commit gate (git log %G?) → CycloneDX 1.5 SBOM → Cosign-keyless sign tarball + SBOM → base-image cosign verify → GitHub Release upload. One-command customer verification documented in SECURITY.md § Supply-chain integrity.

Security architecture + BFSI procurement pack#

Three documents written for CISO / procurement / vendor-risk teams (not marketing; rigorous self-attestation with file-path evidence):

  • .project/security/architecture.md: 1-page data-flow diagram + defence-in-depth matrix + crypto inventory + customer verification commands
  • .project/security/threat_model.md: STRIDE across 9 critical assets + DREAD scoring + top-10 residual risks + compliance-clause mapping (RBI / DPDP / EU AI Act / HIPAA / GDPR / SOC 2 / OWASP LLM / NIST AI RMF)
  • .project/security/caiq_lite.md: 60-question pre-filled CAIQ v4.0.3 subset · aggregate ~75% Y/Y+P · top-10 procurement gaps enumerated with rupee estimates


GTM / positioning#

Market#

  • Geography: India, expanding to Southeast Asia (similar regulator postures)
  • Buyer: CIO / Head of Risk at Tier-2/3 banks, NBFCs, fintechs, insurers
  • Segment: Mid-market (₹500Cr–₹25,000Cr AUM) — too small for DataRobot pricing, too big to rely on services

Competitive frame#

  • DataRobot / Dataiku: $300K+/yr, cloud-only, US-centric compliance. Swarm is VPC, Indian-regulator-first, $10K/mo.
  • Fractal / Tiger / Tredence: Services engagements, 12–18 month projects, no product. Swarm is a product with a services wrapper.
  • Build-in-house: 9+ months hiring + 18 months building. Swarm is 8 weeks to first production model.

Pricing#

  • $25K pilot (single use case, 8 weeks, fully operator-run)
  • $75K deployment (VPC installation + 4 use cases, 6 months)
  • $10K/month retainer (support + updates + new agent integrations)

Differentiation#

  • Compliance-first: Audit trail, denial log, retention policy, model cards, fairness reports are built-in, not add-ons
  • VPC-installable: No data egress; works in air-gapped regulated environments
  • Plugin-extensible: Claude Code plugin ecosystem opens hundreds of integrations (Linear, Sentry, Grafana, Slack, Snowflake…) with admin-approval gates
  • Operator observability: Transparent feature flags + permission denial log + cron + batch + retention all surfaced in one /transparency page
  • Compliance as config: BFSI customers author custom rules in permission_policies.yaml — no code changes, source-attributed in audit

Architecture overview#

┌────────────────────────────── User surfaces ───────────────────────────────┐
│                                                                            │
│   Dashboard (Next.js 15 + React 19)                                        │
│      /pipelines · /agents · /deployments · /plugins · /cron                │
│      /settings · /transparency · /quality                                  │
│                                                                            │
│   CLI (`swarm` — stdlib argparse + httpx over REST)                        │
│                                                                            │
│   REST API (FastAPI, JWT bearer auth + OIDC SSO, /api/v1/*)                │
│                                                                            │
└────────────────────────────────────────────────────────────────────────────┘
┌──────────────────────────── Orchestration layer ───────────────────────────┐
│                                                                            │
│   Unified Permission Engine (W7-1) — one decision per tool / HTTP call     │
│   Plugin Hook Lifecycle (W7-2) — SessionStart / Pre·Post Tool / Compaction │
│                                                                            │
│   Orchestrator  ↔  LangGraph supervisor  ↔  Native orchestrator            │
│        │                                                                   │
│        ▼                                                                   │
│   ML Director  →  7 team leads  →  33 worker agents                        │
│                                    (tool_set per agent; allowlist gated)   │
│                                                                            │
└────────────────────────────────────────────────────────────────────────────┘
┌────────────────────────────── Agent runtime ───────────────────────────────┐
│                                                                            │
│   AgentRunner (ReAct loop)                                                 │
│      ├── SESSION_START hook (inject system-prompt overrides)               │
│      ├── skill registry match + inject                                     │
│      ├── iteration loop                                                    │
│      │     ├── context compaction (80% window, W6)                         │
│      │     ├── LLM call (prompt cache on anthropic/openai)                 │
│      │     ├── tool dispatch (parallel where safe)                         │
│      │     │     ├── PRE_TOOL hook (PII mask, arg mutation)                │
│      │     │     ├── Permission engine check → ALLOW / ASK / DENY          │
│      │     │     ├── ToolExecutor / MCP composite                          │
│      │     │     └── POST_TOOL hook (redact result content)                │
│      │     └── conversation store write (buffered JSONL)                   │
│      └── terminal response → evaluator grade (optional, W6)                │
│                                                                            │
└────────────────────────────────────────────────────────────────────────────┘
┌──────────────────────────── Persistence + I/O ─────────────────────────────┐
│                                                                            │
│   SQLite (runs, events, deployments, shadow_predictions,                   │
│            plugins, permission_denials, users)                             │
│   Filesystem (conversations/*.jsonl, compacted_*.jsonl,                    │
│               audit_report_*.pdf, cron/jobs.json, batch/{id}/*)            │
│   LLM HTTP client (shared per provider+tier, prompt-cached)                │
│   MCP clients (stdio + streamable-HTTP, v2025-11-25)                       │
│                                                                            │
└────────────────────────────────────────────────────────────────────────────┘
┌──────────────────────────── Background daemons ────────────────────────────┐
│                                                                            │
│   Retention sweep (24h)    — prunes JSONL / events / audit PDFs past TTL   │
│   Cron scheduler (60s)     — retrain / drift_check / audit_pdf / custom    │
│                                                                            │
└────────────────────────────────────────────────────────────────────────────┘

Request lifecycles#

Pipeline run (happy path):

  1. Operator POSTs to /api/v1/pipelines with {problem_statement, data_path, pipeline_config}
  2. require_role(Role.OPERATOR) → permission engine → ALLOW
  3. _start_pipeline_inproc creates run row in SQLite + submits to _executor thread pool
  4. _run_pipeline_background boots the backend (native or langgraph), creates RunMemory + ApprovalStore + ConversationStore
  5. Director agent runs its ReAct loop, delegates to team leads, workers fire tools
  6. Each tool call: permission engine check → hook PRE_TOOL → dispatch → hook POST_TOOL → record
  7. Pipeline reaches terminal response → run status updated → audit PDF if auto_audit_pdf flag on

HITL approval pause:

  1. Tool calls require_approval(...) → raises ApprovalRequired
  2. ToolExecutor.execute re-raises (W7's fix prevents this being swallowed)
  3. AgentRunner.run catches it, persists gate into ApprovalStore, appends [AWAITING APPROVAL] message, re-raises
  4. Orchestrator catches it, sets pipeline status to awaiting_approval
  5. Dashboard surfaces the gate; operator approves via /api/v1/pipelines/{id}/approvals/{gate}
  6. _resume_pipeline_after_approval restores state from gate's resumable_state dict and resumes
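The HITL pause is essentially an exception that travels up three layers, being persisted on the way. The toy version below mirrors that shape only; the class and function bodies are stand-ins written for this example, and the gate id, gate payload and return strings are hypothetical.

```python
class ApprovalRequired(Exception):
    """Raised by a tool when a human must approve before work continues."""

    def __init__(self, gate_id: str, gate_type: str, resumable_state: dict):
        super().__init__(f"approval required: {gate_id}")
        self.gate_id = gate_id
        self.gate_type = gate_type
        self.resumable_state = resumable_state


class ApprovalStore:
    """In-memory stand-in for a persistent gate store."""

    def __init__(self):
        self.pending = {}

    def save(self, exc: ApprovalRequired) -> None:
        self.pending[exc.gate_id] = exc.resumable_state

    def approve(self, gate_id: str) -> dict:
        return self.pending.pop(gate_id)


def deploy_tool(model_path: str):
    # The tool asks for approval by raising; nothing below may swallow this.
    raise ApprovalRequired("gate-1", "deploy", {"step": "deploy", "model_path": model_path})


def agent_run(store: ApprovalStore, model_path: str):
    try:
        return deploy_tool(model_path)
    except ApprovalRequired as exc:
        store.save(exc)  # persist the gate, then let the orchestrator see it
        raise


def orchestrate(store: ApprovalStore, model_path: str) -> str:
    try:
        agent_run(store, model_path)
        return "completed"
    except ApprovalRequired:
        return "awaiting_approval"


def resume_after_approval(store: ApprovalStore, gate_id: str) -> str:
    state = store.approve(gate_id)
    return f"resumed at {state['step']} for {state['model_path']}"
```

The key property is that the agent layer persists before re-raising, so the gate survives even if the process stops between the pause and the approval.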

Permission denial (compliance query):

  1. Agent attempts tool not in its allowlist
  2. ToolExecutor.execute builds PermissionContext → permissions.check(ctx)
  3. agent_allowlist_source emits DENY rule → check returns PermissionDecision(DENY, …)
  4. permission_audit.record_denial writes row to permission_denials table
  5. build_denied_result returns tool_not_allowed payload to the agent
  6. Auditor later queries GET /api/v1/permissions/denials?tool=run_bash&since=<epoch>
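A source-attributed decision can be sketched as "collect every rule that matches, let the highest priority win, and report which source it came from". The code below is an illustration under assumptions: the field names follow the glossary's description of a permission rule, but the priority values, the "highest wins" convention, the default-deny fallback and the `yaml_policy` source name are all hypothetical.

```python
import re
from dataclasses import dataclass
from fnmatch import fnmatchcase
from typing import Optional


@dataclass(frozen=True)
class PermissionRule:
    tool_glob: str                   # glob matched against the tool name
    behavior: str                    # "ALLOW", "ASK" or "DENY"
    source: str                      # who contributed the rule
    priority: int                    # assumption: higher priority wins
    arg_regex: Optional[str] = None  # optional extra match on the arguments


@dataclass(frozen=True)
class PermissionDecision:
    behavior: str
    source: str
    reason: str


def allowlist_rules(allowlist):
    """ALLOW each listed tool; a catch-all DENY at lower priority covers the rest."""
    rules = [PermissionRule(tool, "ALLOW", "agent_allowlist", 10) for tool in allowlist]
    rules.append(PermissionRule("*", "DENY", "agent_allowlist", 0))
    return rules


def check(tool_name: str, args_text: str, rules) -> PermissionDecision:
    matching = [
        rule for rule in rules
        if fnmatchcase(tool_name, rule.tool_glob)
        and (rule.arg_regex is None or re.search(rule.arg_regex, args_text))
    ]
    if not matching:
        return PermissionDecision("DENY", "default", "no matching rule")
    winner = max(matching, key=lambda rule: rule.priority)
    reason = f"matched {winner.tool_glob!r} at priority {winner.priority}"
    return PermissionDecision(winner.behavior, winner.source, reason)
```

Because every decision carries its `source`, a denial row can record which rule family produced it without any extra lookup.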


Codebase map#

/swarm
├── MASTER_README.md                ← you are here
├── README.md                       — engineering quick-start
├── AGENTS.md / CLAUDE.md / GEMINI.md — per-harness guidance
├── docs/
│   ├── SYSTEM-DESIGN.md            — deep architecture reference
│   ├── EXTENDING.md                — how to add agents, tools, roles, plugins
│   ├── BENCHMARKS.md               — bench baseline + running instructions
│   └── ...
├── scripts/                        — one-off + CI scripts
├── examples/
│   └── plugins/
│       └── hello-swarm/            — reference plugin (MCP + skill + hook)
├── ml_team/                        — main Python package (installed as `ml_team.*`)
│   ├── MASTER_README → references this file
│   ├── README.md                   — engineering quick-start (long-form)
│   ├── pyproject.toml              — package metadata + deps
│   ├── core/                       [IMPL + LEARNING] — engine primitives
│   │     permissions, hooks, cron, batch, compaction, evaluator,
│   │     plugin_loader, skill_registry, mcp_client, agent_runner,
│   │     conversation_store, approval, feature_flags, retention,
│   │     tool_executor, llm_client, memory, interfaces, team_factory,
│   │     orchestrator, graph_executor, rag, types
│   ├── api/                        [IMPL + LEARNING] — FastAPI surface
│   │   ├── app.py, auth.py, users.py, database.py, metrics.py,
│   │   │ rate_limit.py, logging_config.py, oidc.py
│   │   └── routers/                [IMPL + LEARNING] — REST endpoints
│   │         agents, batch, chat, cron, datasets, deployments, docs,
│   │         evaluations, features, inference, knowledge, mcp, models,
│   │         permissions, pipelines, plugins, auth
│   ├── tools/                      [IMPL + LEARNING] — agent-callable tools
│   │         ml_tools, training, explainability, fairness, drift,
│   │         model_card, audit_pdf, champion_challenger, deploy,
│   │         docker_tools, http_tools, git_ops, gpu_tools, memory_tools,
│   │         mlflow_tools, search, execution, file_ops
│   ├── backends/                   [IMPL + LEARNING] — pipeline execution
│   │         native_backend, langgraph_backend, base
│   ├── config/                     [IMPL + LEARNING] — declarations
│   │         agent_defs, team_definitions.yaml, mcp_servers.yaml,
│   │         permission_policies.yaml, plugins.yaml
│   ├── tests/                      [IMPL only] — testing strategy
│   │   ├── bench/                  — pytest-benchmark suite
│   │   ├── e2e/                    — live-LLM end-to-end
│   │   └── test_*.py               — unit + integration (550+ tests)
│   ├── dashboard/                  [IMPL + LEARNING] — Next.js 15 app
│   │   └── src/
│   │       ├── app/                — page router (pipelines, agents, …)
│   │       ├── components/         — shared React components
│   │       └── lib/                — API client + auth context
│   └── cli.py                      — `swarm` CLI entry point
├── pipeline_runs/                  — generated at runtime (per-run work_dir)
├── .project/
│   ├── decisions.md                — ADR log (append-only)
│   ├── architecture.md             — evolving architecture notes
│   └── journal.md                  — chronological narrative
└── .github/
    └── workflows/                  — CI (pytest, bench-nightly, lint)

Folders marked [IMPL + LEARNING] carry the two-README pair per this doc system. Folders in tests/ carry only IMPLEMENTATION_README.md.


Major workflows#

Train a model end-to-end#

POST /api/v1/pipelines
ML Director → Data Team (profile, clean)
            → Algorithm Team (pick algorithm, fetch repo)
            → Training Team (train + tune)
            → Evaluation Team (metrics, fairness, SHAP)
            → Deployment Team (package Docker, K8s manifests, shadow-log)
            → Quality Team (model card, audit PDF, reproducibility check)
            → Management (summary report)
GET /api/v1/pipelines/{id}/audit-report.pdf
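For readers who want to see the opening POST as code, the sketch below builds (without sending) that request with the standard library. The base URL is hypothetical, and passing `pipeline_config` as a config name is an assumption; the three body fields and the bearer-token auth come from the request lifecycle described earlier.

```python
import json
import urllib.request

# Hypothetical base URL; substitute the address of your own install.
BASE_URL = "http://localhost:8000"


def build_pipeline_request(
    token: str,
    problem_statement: str,
    data_path: str,
    pipeline_config: str = "fast_prototype",
) -> urllib.request.Request:
    """Build the POST that starts a pipeline run; the caller decides when to send it."""
    body = {
        "problem_statement": problem_statement,
        "data_path": data_path,
        "pipeline_config": pipeline_config,  # assumption: referenced by name
    }
    return urllib.request.Request(
        f"{BASE_URL}/api/v1/pipelines",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` would submit the run; the response shape is not documented here, so the sketch stops short of parsing it.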

Schedule automated retraining#

POST /api/v1/cron/jobs {"schedule":"0 2 * * *","task_kind":"retrain",
                        "task_config":{"problem_statement":"...","data_path":"..."}}
cron.tick() runs at the scheduled time
cron_tasks.task_retrain calls _start_pipeline_inproc(...)
Same pipeline flow as above, logged with cron job_id

Score a batch of records#

POST /api/v1/batch {"processor_kind":"inference","records":[...],
                    "processor_config":{"model_path":"..."}}
ThreadPoolExecutor dispatches records through the inference processor
results.jsonl streams in; checkpoint every 10 records; resume supported
GET /api/v1/batch/{id} returns aggregate stats + tool-usage summary
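Checkpoint-and-resume is easiest to see in a single-threaded form. The sketch below writes one JSONL row per record, saves a checkpoint every 10 records, and on restart discards any rows written after the last checkpoint so nothing is duplicated. File names, the row shape and the sequential loop are assumptions; the real runner dispatches through a thread pool.

```python
import json
from pathlib import Path

CHECKPOINT_EVERY = 10  # matches the "checkpoint every 10 records" line above


def run_batch(records, processor, work_dir: Path) -> int:
    """Process records not yet checkpointed; return how many were processed this call."""
    results_path = work_dir / "results.jsonl"
    ckpt_path = work_dir / "checkpoint.json"
    done = json.loads(ckpt_path.read_text())["done"] if ckpt_path.exists() else 0

    # Drop rows written after the last checkpoint so a resume never duplicates output.
    kept = results_path.read_text().splitlines()[:done] if results_path.exists() else []
    results_path.write_text("".join(line + "\n" for line in kept))

    with results_path.open("a") as out:
        for index in range(done, len(records)):
            row = {"index": index, "output": processor(records[index])}
            out.write(json.dumps(row) + "\n")
            if (index + 1) % CHECKPOINT_EVERY == 0 or index + 1 == len(records):
                out.flush()
                ckpt_path.write_text(json.dumps({"done": index + 1}))
    return len(records) - done
```

If the processor fails at record 25, the checkpoint still says 20; the next call keeps the first 20 rows and reprocesses from there.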

Install a plugin (MCP + skill + hook)#

admin$ swarm plugins install /path/to/my-plugin
Plugin manifest SHA-256 pinned
MCP servers registered with MCPToolProvider (namespaced: plugin:<name>:<server>)
Skills parsed into SkillRegistry (matched against agent task context)
Hooks registered into HookRegistry (fire on SessionStart / PreTool / PostTool / …)
Subsequent pipeline runs route through the plugin's contributions
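A hook registry of the kind this flow registers into can be as small as a mapping from event name to an ordered list of callables. The sketch below is an assumption-laden illustration: the event names follow the glossary, but the `register`/`fire` methods, the payload-passing convention and the e-mail masking hook are hypothetical.

```python
import re
from collections import defaultdict

EVENTS = ("SESSION_START", "PRE_TOOL", "POST_TOOL", "PRE_COMPACTION", "POST_COMPACTION")


class HookRegistry:
    """Maps each lifecycle event to the hooks registered for it, in order."""

    def __init__(self):
        self._hooks = defaultdict(list)

    def register(self, event: str, hook) -> None:
        if event not in EVENTS:
            raise ValueError(f"unknown event: {event}")
        self._hooks[event].append(hook)

    def fire(self, event: str, payload):
        # Each hook receives the previous hook's output, so hooks can mutate arguments.
        for hook in self._hooks[event]:
            payload = hook(payload)
        return payload


_EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def mask_emails(args: dict) -> dict:
    """Example PRE_TOOL hook: redact e-mail addresses in string arguments."""
    return {
        key: _EMAIL.sub("[REDACTED]", value) if isinstance(value, str) else value
        for key, value in args.items()
    }
```

With `mask_emails` registered on `PRE_TOOL`, tool arguments are redacted before dispatch, which is the server-side masking pattern the hook lifecycle is meant to enable.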

Pull a compliance report#

GET /api/v1/permissions/denials?since=<start_of_quarter>
permission_denials SQLite table → JSON rows
Each row: tool_call_id, tool_name, agent_name, rule_source, reason, HTTP context
Feed into regulator-facing Excel / PDF
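When the ledger is queried directly rather than through the REST endpoint, the question "why was tool X blocked for agent Y between T1 and T2?" reduces to one parameterised query. The schema below is a guess built from the row fields listed above; the `ts` timestamp column and the helper name are assumptions.

```python
import sqlite3

# Hypothetical schema: the column list follows the row fields above, plus an
# assumed epoch-seconds timestamp column named ts.
SCHEMA = """
CREATE TABLE IF NOT EXISTS permission_denials (
    tool_call_id TEXT PRIMARY KEY,
    tool_name    TEXT NOT NULL,
    agent_name   TEXT NOT NULL,
    rule_source  TEXT NOT NULL,
    reason       TEXT NOT NULL,
    ts           REAL NOT NULL
)
"""


def denials_between(conn: sqlite3.Connection, tool_name: str, agent_name: str,
                    t1: float, t2: float):
    """Return (tool_call_id, rule_source, reason, ts) rows for one tool/agent in [t1, t2]."""
    query = (
        "SELECT tool_call_id, rule_source, reason, ts FROM permission_denials "
        "WHERE tool_name = ? AND agent_name = ? AND ts BETWEEN ? AND ? "
        "ORDER BY ts"
    )
    return conn.execute(query, (tool_name, agent_name, t1, t2)).fetchall()
```

The query is parameterised throughout, in line with the SQL rule under engineering standards.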

Engineering standards#

Code style#

  • Python: f-strings, type hints, pathlib, 4-space indent. See ml_team/pyproject.toml [tool.ruff].
  • TypeScript: 2-space indent, strict types, no any without comment. Dashboard is Next.js 15 + React 19; see ml_team/dashboard/AGENTS.md for framework notes.
  • SQL: parameterised only. Schema changes in ml_team/api/database.py::init_db().
  • Config: YAML for human-authored (policies, agents, teams). JSON for machine-generated (plugin manifests, batch checkpoints).

Testing#

  • Unit tests colocated under ml_team/tests/test_*.py.
  • Benchmarks in ml_team/tests/bench/ (skipped by default; nightly CI).
  • E2E live-LLM in ml_team/tests/e2e/ (manual runs only).
  • Target: all tests stay green on master. A red test is a P0.
  • Regression: .venv/bin/pytest ml_team/tests --tb=short -q --deselect ml_team/tests/test_deploy_tools.py::test_package_model_actually_builds_if_docker_present.

Feature flags#

  • Every behaviour toggle lives in ml_team/core/feature_flags.py::_FEATURES_LIST.
  • Three tiers: INVARIANT (regulator-mandated, refuses override), FLAG (operator-togglable), EXPERIMENT (default off, opt-in).
  • New features default-off unless they're operational primitives (cron, batch).
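The three tiers can be modelled as an enum plus one guard in the setter. This sketch is not the contents of feature_flags.py; the class names, the choice of `PermissionError` and two of the flag names in the usage note are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    INVARIANT = "invariant"    # regulator-mandated, refuses override
    FLAG = "flag"              # operator-togglable
    EXPERIMENT = "experiment"  # default off, opt-in


@dataclass
class Feature:
    name: str
    tier: Tier
    enabled: bool


class FlagRegistry:
    def __init__(self, features):
        self._features = {feature.name: feature for feature in features}

    def is_enabled(self, name: str) -> bool:
        return self._features[name].enabled

    def set(self, name: str, enabled: bool) -> None:
        feature = self._features[name]
        if feature.tier is Tier.INVARIANT:
            # Invariants ignore operator input entirely, whichever way it points.
            raise PermissionError(f"{name} is invariant and cannot be overridden")
        feature.enabled = enabled
```

For example, a registry holding an invariant `audit_trail` flag, the `auto_audit_pdf` flag and an experiment named `parallel_tool_calls` would let an operator flip the last two while any attempt to change the first raises.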

Commits#

  • Conventional-commit prefix: feat, fix, docs, test, perf, refactor, chore.
  • Body explains "why" + impact + test count.
  • Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> trailer when AI-assisted.

Documentation#

  • Every meaningful folder has IMPLEMENTATION_README.md + LEARNING_README.md.
  • Root has this MASTER_README.md.
  • Docs update in the same commit as the code change they describe.
  • See .project/decisions.md for ADRs — one per significant design decision.

Current priorities#

(Updated with each work cycle. Source: user discussion + .project/decisions.md + most-recent commits.)

Now#

  • BFSI pilot readiness — Track 1 (customer composability) + Track 2 (15 guardrails) + ship tooling all in v0.12.0. Procurement pack (architecture / STRIDE / CAIQ Lite) shipped. Remaining: SOC 2 Type I readiness, pen-test, MSA/DPA/DPIA legal templates, first design-partner conversation.
  • Documentation system — MASTER_README + README + 4 training modules shipped; 6 more in progress (Track 2 coverage, customer-deployment authoring, verification flow).

Next#

  • First BFSI design-partner conversation — innovation labs first, not central procurement (see .project/security/caiq_lite.md § Top 10 gaps for the procurement sequence)
  • SOC 2 Type I readiness assessment via Drata or Vanta (free gap analysis; 3-4 month track to certification)
  • Pen-test engagement (Lucideus / Cobalt quotes, Q2 2026 target)
  • MSA + DPA + DPIA templates via BFSI-savvy Indian tech lawyer
  • Profile bootstrap wiring (~40 lines) — in progress; auto-activates all 6 configurable guardrails from deployments/<customer>/config.yaml::guardrail_configs
  • swarm deploy diff + swarm deploy rotate-secret (Doppler-wired)

Deferred (explicit, from plan file)#

  • QueryEngine AsyncGenerator refactor (UX-only, no compliance win)
  • ACP adapter for VS Code / JetBrains (dev-tool market, not our buyer)
  • Coordinator mode (only valuable >15 teams)
  • Semantic compaction v2 (current basic works)
  • Hermes multi-platform gateway (Telegram/Discord/Signal — not our surface)
  • Async orchestration core (item #9 — revisit when >4 concurrent pipelines needed)
  • Auto-retry / HITL escalation on low evaluator grades (needs customer input)

Risks & assumptions#

Technical risks#

| Risk | Mitigation |
|---|---|
| Permission engine adds ~30µs per tool dispatch (~17× baseline) | Absolute cost negligible vs LLM latency (~500ms). Bench tracks regression. |
| Context compaction is heuristic (char-count / 4 ≈ tokens) | Conservative 80% threshold leaves margin. Upgrade path to tiktoken documented. |
| SQLite is single-writer | WAL mode on, per-thread conns. High-concurrency deployments switch to Postgres (migration path via SQLAlchemy planned). |
| Single-node cron + batch | k8s CronJobs or k8s Jobs for multi-host fan-out. Documented non-goal. |
| Plugin SHA-pinning is manifest-only, not server code | Good-enough for Phase A; Phase B may add handler-file SHA when shell-handlers land. |
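The compaction heuristic listed under technical risks (characters divided by four as a token estimate, compared against 80% of the context window) fits in a few lines. The function names and the inclusive `>=` comparison are assumptions made for this sketch.

```python
def estimate_tokens(messages) -> int:
    """Rough token estimate: total characters divided by four."""
    return sum(len(message) for message in messages) // 4


def should_compact(messages, context_window: int, threshold: float = 0.8) -> bool:
    """True once the estimated token count reaches the threshold share of the window."""
    return estimate_tokens(messages) >= threshold * context_window
```

Because the estimate is coarse, the 20% headroom is what absorbs the error; swapping `estimate_tokens` for a real tokenizer would leave `should_compact` unchanged.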

Business risks#

| Risk | Mitigation |
|---|---|
| Mid-market CIOs skeptical of LLM reliability | HITL gates, model cards, audit PDFs, signed conversation JSONL (compliance-first framing) |
| Data leaves the VPC? | Hard no. Swarm installs in the customer's network; LLM calls optionally go to customer-hosted vLLM/Ollama instead of OpenAI. |
| Regulatory drift (RBI updates FREE-AI) | Feature flag tiers + rule-source attribution let us ship compliance amendments as YAML updates, not code pushes |
| Single-founder bus factor | Documentation system + ADR log + agent-driven workflows mean every design choice is recoverable from git + docs |

Assumptions (worth revisiting if they break)#

  • BFSI customers prefer VPC-installable products over cloud SaaS (true today; DPDP + RBI data-localisation reinforce)
  • Claude Code plugin ecosystem continues growing (base shared with Claude Code — our plugin consumers win from every community MCP)
  • Prompt cache + HTTP pool + schema cache continue to account for the dominant LLM-path latency savings (verified via W1 bench; re-verify annually)

Glossary#

| Term | Meaning |
|---|---|
| Agent | A single LLM persona with a system prompt + tool allowlist + tier (director/coordinator/worker). Defined in config/agent_defs.py. |
| Team | A group of agents with one coordinator lead + N workers. 7 teams total. Declared in config/team_definitions.yaml. |
| Pipeline config | A named graph of teams to execute (e.g. fast_prototype skips quality-team for speed). |
| Run (pipeline_run) | One invocation of a pipeline for a problem statement. Identified by run_id. |
| Work dir | Per-run filesystem scratch space at pipeline_runs/{run_id}/. Models, logs, audit PDFs land here. |
| Approval gate | A pause point where a human must approve before the pipeline continues. 6 gate types. |
| Tool | A Python function agents can call. Each has a schema; agent invokes via LLM tool-call. |
| Allowlist | Per-agent set of tool names it's permitted to call. Denials go through permission engine. |
| MCP server | External tool provider speaking Model Context Protocol. stdio or streamable-HTTP. |
| Plugin | A Claude Code plugin directory containing any of .mcp.json, skills/, hooks/. |
| Skill | A Markdown file (SKILL.md) with frontmatter; injected into an agent's system prompt on keyword match. |
| Hook | A Python callable registered on a lifecycle event (SESSION_START, PRE_TOOL, POST_TOOL, PRE/POST_COMPACTION). |
| Feature flag | Named operator-togglable behaviour switch. 3 tiers. Registry at feature_flags.py. |
| Permission rule | One row in the decision engine. Has tool_name glob, behavior, source, priority, optional arg regex. |
| Denial | A DENY permission decision, logged to permission_denials table with full context. |
| Cron job | A scheduled task (retrain / drift_check / audit_pdf / custom). File-backed store. |
| Batch run | A JSONL dataset processed through a pinned processor with checkpointing. |
| RBI FREE-AI | Reserve Bank of India's Aug-2025 framework for responsible AI in financial services. 7 Sutras. |
| SR 11-7 | US Federal Reserve's model risk management guidance. Swarm aligns. |
| DPDP | India's Digital Personal Data Protection Act. Data residency + consent + retention requirements. |

Project evolution log#

(High-level timeline. Detailed ADRs in .project/decisions.md; commit-level narrative in .project/journal.md.)

Weeks 1–6 (perf + extensibility)#

  • W1: Benchmark harness + prompt caching + shared HTTP client + shared schema cache (664× agent-build speedup, 96× schema-cache)
  • W2: Feature flag registry + observability controls + dashboard /settings + /transparency + retention daemon + I/O batching (6.5× SQLite, 3.9× JSONL)
  • W3: Intra-agent parallel tool calls (2.9× wall-clock on multi-tool turns)
  • W4: MCP streamable-HTTP transport (spec 2025-11-25) + Claude Code plugin loader Phase A (MCP ingestion)
  • W5: Plugin skills ingestion (Phase B) + dashboard /plugins page
  • W6: Context compaction (80% threshold) + evaluator-generator separation (optional per-agent rubric grading)

Week 7 (compliance + ops pack, current)#

  • W7-1: Unified permission engine — RBAC + allowlist + HITL + flags + YAML policy fold into one rule engine with source-attributed denial audit
  • W7-2: Plugin hook lifecycle — 5 events, server-side PII masking via reference plugin
  • W7-3: Cron scheduler (vendored from Hermes) — schedule strings, 4 task kinds, REST + CLI + dashboard
  • W7-4: Batch runner — JSONL input, 3 processors, checkpointing + resume, tool-usage aggregator
  • W7-5: ADRs + README + /transparency refresh + cross-feature integration smoke

Running totals: 545 passing tests, 22 shipped commits this resume session, seven dashboard pages, 28 metrics, 25 feature flags.

Pre-W1 (foundation, archived)#

  • Security & compliance foundation: tool allowlists, RBAC + OIDC, audit PDF, champion-challenger
  • RBI FREE-AI bundle: fairness + drift + SHAP + model cards
  • Dashboard: pipelines, agents, deployments, docs browser
  • CI: nightly pytest + bench workflow

How to use this file#

  • Starting a new work session: read this top-to-bottom once, then jump to the relevant folder's LEARNING_README.md for the mental model + IMPLEMENTATION_README.md for the contract.
  • Resuming mid-thread: find the last entry in .project/journal.md, then read only affected folders' READMEs.
  • Onboarding a new engineer: this doc → ml_team/README.md (quick-start) → docs/SYSTEM-DESIGN.md → clone and run.
  • Onboarding a new CIO buyer: this doc → docs/SYSTEM-DESIGN.md → live demo.
  • Investigating a denial: ml_team/api/routers/IMPLEMENTATION_README.md §permissions + ml_team/core/IMPLEMENTATION_README.md §permissions → SQL the permission_denials table.

Maintenance#

This file is the source of truth. Updates are mandatory when:

  • A new subsystem lands (folder with its own READMEs)
  • GTM / positioning / pricing changes
  • Architecture changes (backend swap, new persistence layer, new daemon)
  • Risks materialise (or get closed)
  • New personas emerge

Changes to this file land in the same commit as the code that prompted them. PRs that touch core subsystems without updating this file are incomplete — the CI guard (see .github/workflows/doc-drift.yml) enforces it.