How to add a test#

Every bug fix and feature lands with a regression test. This page documents the conventions; the same patterns cover the 628+ existing tests.

The regression gate#

.venv/bin/pytest ml_team/tests \
  --tb=short -q \
  --deselect ml_team/tests/test_deploy_tools.py::test_package_model_actually_builds_if_docker_present

Target: all green, zero new warnings. The --deselect skips a test that requires Docker and is not part of the default gate.

Where tests live#

| Test type | Location | Rule |
| --- | --- | --- |
| Unit | ml_team/tests/test_<module>.py | One test file per module under test |
| Integration | ml_team/tests/test_<feature>_integration.py | For cross-module flows (W7-style) |
| Regression | same as unit | Add the offending input + assert correct behaviour |
| Benchmark | ml_team/tests/bench/test_bench_<concern>.py | pytest-benchmark framework |
| End-to-end | ml_team/tests/e2e/test_<flow>_live.py | Against a live stack; not in default gate |
| Marketplace matrix | ml_team/tests/test_plugin_marketplace_matrix.py | Parametrised over real CC plugins |

Skeleton for a new unit test#

"""Tests for ml_team.<subsystem>.<module>."""
from __future__ import annotations

import sys
from pathlib import Path

import pytest

sys.path.insert(0, str(Path(__file__).resolve().parent.parent.parent))

from ml_team.<subsystem> import <module>


# ---------------------------------------------------------------------------
# Fixtures
# ---------------------------------------------------------------------------

@pytest.fixture
def tmp_work_dir(tmp_path):
    """From conftest.py — reuse; don't reinvent."""
    d = tmp_path / "work"
    d.mkdir()
    return d


@pytest.fixture(autouse=True)
def _reset_global_state():
    """Many modules have module-level singletons that bleed across tests.
    Reset them in an autouse fixture so tests don't implicitly depend on order."""
    from ml_team.<subsystem>.<module> import <registry_or_singleton>
    <registry_or_singleton>.clear()  # or whatever the reset is
    yield
    <registry_or_singleton>.clear()


# ---------------------------------------------------------------------------
# Test cases
# ---------------------------------------------------------------------------

def test_<aspect_under_test>():
    """One sentence describing what this asserts."""
    # Arrange
    ...
    # Act
    result = <module>.<function>(...)
    # Assert
    assert result == <expected>

Resetting singletons#

The #1 source of flaky tests is shared state between tests. swarm has several module-level singletons:

| Singleton | Reset call |
| --- | --- |
| hooks.registry() | hooks.registry().clear_all() |
| skill_registry.registry() | skill_registry.registry().clear_all() |
| commands_registry.registry() | commands_registry.registry().clear_all() |
| agents_registry.registry() | agents_registry.registry().clear_all() |
| Feature-flag runtime overrides | feature_flags.reset_runtime() |
| Permission engine rule sources | permissions.reset_sources() + permissions.mark_uninitialized() |
| MCP provider | create a fresh MCPToolProvider() per test |
| LLM client pool | llm_client._SHARED_CLIENTS.clear() (rare) |
| Conversation store file locks | conversation_store._file_locks.clear() |

When in doubt: wrap in an autouse=True fixture and always clean up.

Fresh DB per test#

Tests that hit Postgres/SQLite need a per-test database so state doesn't bleed between tests:

@pytest.fixture
def fresh_db(tmp_path, monkeypatch):
    db_path = tmp_path / "test.db"
    monkeypatch.setenv("ML_TEAM_DB", str(db_path))
    from ml_team.api import database as db_mod
    monkeypatch.setattr(db_mod, "_DB_PATH", str(db_path))
    # Close any connection a previous test left open on this thread,
    # otherwise it keeps pointing at the old database file.
    if getattr(db_mod._local, "conn", None) is not None:
        try:
            db_mod._local.conn.close()
        except Exception:
            pass
        db_mod._local.conn = None
    db_mod.init_db()
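The isolation idea in self-contained form — plain sqlite3, with a hypothetical init_db and a toy runs table standing in for the real schema — showing that writes in one test's database are invisible to another's:

```python
import sqlite3
import tempfile
from pathlib import Path

def init_db(db_path: str) -> sqlite3.Connection:
    # Toy stand-in for db_mod.init_db(): fresh file, fresh schema.
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE runs (id INTEGER PRIMARY KEY, status TEXT)")
    conn.commit()
    return conn

with tempfile.TemporaryDirectory() as tmp:
    # Each "test" gets its own path, just as tmp_path gives each
    # pytest test its own directory.
    db_a = init_db(str(Path(tmp) / "a.db"))
    db_b = init_db(str(Path(tmp) / "b.db"))
    db_a.execute("INSERT INTO runs (status) VALUES ('ok')")
    db_a.commit()
    rows_b = db_b.execute("SELECT COUNT(*) FROM runs").fetchone()[0]
    assert rows_b == 0  # b never sees a's writes
    db_a.close()
    db_b.close()
```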

Mocking LLM + HTTP#

Don't hit real LLM endpoints in tests:

from unittest.mock import patch, MagicMock

def test_agent_calls_llm():
    fake_response = MagicMock(
        content=[MagicMock(text="ok")],
        usage=MagicMock(input_tokens=10, output_tokens=5),
    )
    with patch("anthropic.Anthropic") as mock_client:
        mock_client.return_value.messages.create.return_value = fake_response
        # run your code
        ...

For HTTP, use httpx.MockTransport:

import httpx

def test_upstream_call():
    def handler(req):
        return httpx.Response(200, json={"status": "ok"})
    transport = httpx.MockTransport(handler)
    with httpx.Client(transport=transport) as client:
        # inject the client into your code under test
        ...

Parametrising#

Use pytest.mark.parametrize for multi-case tests, with explicit ids= for readable test names:

@pytest.mark.parametrize("input_,expected", [
    ("560034", "Bangalore"),
    ("110001", "Delhi"),
    ("400001", "Mumbai"),
], ids=["blr", "del", "bom"])
def test_pin_geocodes(input_, expected):
    assert geocode_pin(input_)["city"] == expected

Integration + smoke tests#

When you land a multi-module feature:

  1. Write the unit tests per module (fast, isolated)
  2. Write one integration test that exercises the whole flow (slower, realistic)

Example: ml_team/tests/test_w7_integration.py exercises permissions + hooks + cron + batch in one app instance.
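The same two-layer shape in a self-contained sketch (parse and summarize are toy stand-ins for two real modules):

```python
# Toy stand-ins for two modules that each have their own unit tests.
def parse(raw: str) -> list:
    return [p.strip() for p in raw.split(",") if p.strip()]

def summarize(items: list) -> dict:
    return {"count": len(items)}

# Unit tests: fast, isolated, one concern each.
def test_parse_strips_and_splits():
    assert parse(" a, b ") == ["a", "b"]

def test_summarize_counts():
    assert summarize(["a", "b"]) == {"count": 2}

# Integration test: one realistic pass through the whole flow.
def test_flow_integration():
    assert summarize(parse("a,b,c")) == {"count": 3}
```

The unit tests pin down each module's contract; the single integration test catches wiring mistakes between them without duplicating every per-module case.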

Matrix tests (for ecosystem coverage)#

Pattern used by test_plugin_marketplace_matrix.py:

PLUGIN_CASES = [
    (_OFFICIAL / "plugin-dev", {"skills": 7, "cmds": 1, "agents": 3, ...}),
    # ... 15 more
]

@pytest.mark.parametrize("path,expected", PLUGIN_CASES,
                         ids=[p.name for p, _ in PLUGIN_CASES])
def test_plugin_registers_expected(path, expected, isolated_fixtures):
    ...

This gives you one test function that expands to N test cases, each with a clear name.

Bench tests#

# ml_team/tests/bench/test_bench_foo.py
def test_tool_dispatch_speed(benchmark, setup_env):
    """Baseline: 50µs/call. Alert if >60µs."""
    result = benchmark(lambda: tool_executor.execute(tool_call))
    assert result.content

Run:

.venv/bin/pytest ml_team/tests/bench \
  --benchmark-only \
  --benchmark-compare=0001_baseline \
  --benchmark-compare-fail=mean:10%

Alerts on >10% regression.

Common pitfalls#

| Pitfall | Fix |
| --- | --- |
| Test passes locally, fails in CI | Missing singleton reset; add autouse fixture |
| Flaky timing-based test | Mock time.time() / datetime.now() via freezegun |
| Test pollutes filesystem | Use tmp_path; never hard-code paths |
| LLM test makes real API call | Wrap anthropic.Anthropic / openai.OpenAI with mock |
| Test takes >30s | Move to e2e/ — out of default gate |
| Test asserts exact log text | Brittle; assert presence of key fields instead |
| New test file not picked up | Check pytest conftest discovery; file must start with test_ |
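For the timing pitfall, freezegun is the convenient option, but plain unittest.mock works too and needs no extra dependency. A minimal sketch (seconds_since is a hypothetical function under test):

```python
import time
from unittest.mock import patch

def seconds_since(start: float) -> float:
    # Hypothetical code under test that reads the wall clock.
    return time.time() - start

def test_elapsed_is_deterministic():
    # Pin the clock so the assertion can be exact instead of approximate.
    with patch("time.time", return_value=1000.0):
        assert seconds_since(990.0) == 10.0
```

The same patch target idea applies to datetime.now(): patch wherever the code under test looks the name up.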

Doc-drift reminder#

When you add a test for a new module, also update the IMPL README for the relevant subsystem. The CI's doc-drift.yml workflow expects this.

Next#