# How to add a test

Every bug fix and feature lands with a regression test. This page documents the conventions; the patterns below are the ones the 628+ existing tests already follow.
## The regression gate

```bash
.venv/bin/pytest ml_team/tests \
  --tb=short -q \
  --deselect ml_team/tests/test_deploy_tools.py::test_package_model_actually_builds_if_docker_present
```

Target: all green, zero new warnings. The `--deselect` skips a Docker-requiring test that is not part of the default gate.
## Where tests live

| Test type | Location | Rule |
|---|---|---|
| Unit | `ml_team/tests/test_<module>.py` | One test file per module under test |
| Integration | `ml_team/tests/test_<feature>_integration.py` | For cross-module flows (W7-style) |
| Regression | same as unit | Add the offending input + assert correct behaviour |
| Benchmark | `ml_team/tests/bench/test_bench_<concern>.py` | pytest-benchmark framework |
| End-to-end | `ml_team/tests/e2e/test_<flow>_live.py` | Against a live stack; not in default gate |
| Marketplace matrix | `ml_team/tests/test_plugin_marketplace_matrix.py` | Parametrised over real CC plugins |
## Skeleton for a new unit test

```python
"""Tests for ml_team.<subsystem>.<module>."""
from __future__ import annotations

import sys
from pathlib import Path

import pytest

sys.path.insert(0, str(Path(__file__).resolve().parent.parent.parent))

from ml_team.<subsystem> import <module>

# ---------------------------------------------------------------------------
# Fixtures
# ---------------------------------------------------------------------------

@pytest.fixture
def tmp_work_dir(tmp_path):
    """From conftest.py — reuse; don't reinvent."""
    d = tmp_path / "work"
    d.mkdir()
    return d

@pytest.fixture(autouse=True)
def _reset_global_state():
    """Many modules have module-level singletons that bleed across tests.

    Reset them in an autouse fixture so tests don't implicitly depend on order.
    """
    from ml_team.<subsystem>.<module> import <registry_or_singleton>
    <registry_or_singleton>.clear()  # or whatever the reset is
    yield
    <registry_or_singleton>.clear()

# ---------------------------------------------------------------------------
# Test cases
# ---------------------------------------------------------------------------

def test_<aspect_under_test>():
    """One sentence describing what this asserts."""
    # Arrange
    ...
    # Act
    result = <module>.<function>(...)
    # Assert
    assert result == <expected>
```
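A filled-in instance of that skeleton, using a hypothetical `slugify` helper so the example runs standalone (real module imports would replace the inline stand-in):

```python
"""Example: the skeleton above filled in for a toy function."""

def slugify(title: str) -> str:
    """Stand-in for the real module under test (hypothetical helper)."""
    return "-".join(title.lower().split())

def test_slugify_collapses_whitespace():
    """slugify lowercases words and joins them with single hyphens."""
    # Arrange
    title = "Hello   World"
    # Act
    result = slugify(title)
    # Assert
    assert result == "hello-world"
```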
## Resetting singletons

The #1 source of flaky tests is shared state between tests. swarm has several module-level singletons:

| Singleton | Reset call |
|---|---|
| `hooks.registry()` | `hooks.registry().clear_all()` |
| `skill_registry.registry()` | `skill_registry.registry().clear_all()` |
| `commands_registry.registry()` | `commands_registry.registry().clear_all()` |
| `agents_registry.registry()` | `agents_registry.registry().clear_all()` |
| Feature flags runtime overrides | `feature_flags.reset_runtime()` |
| Permission engine rule sources | `permissions.reset_sources()` + `permissions.mark_uninitialized()` |
| MCP provider | create a fresh `MCPToolProvider()` per test |
| LLM client pool | `llm_client._SHARED_CLIENTS.clear()` (rare) |
| Conversation store file locks | `conversation_store._file_locks.clear()` |

When in doubt: wrap the reset in an `autouse=True` fixture and always clean up on teardown.
## Fresh DB per test

Tests that hit Postgres/SQLite need a per-test DB to avoid bleeding:

```python
@pytest.fixture
def fresh_db(tmp_path, monkeypatch):
    db_path = tmp_path / "test.db"
    monkeypatch.setenv("ML_TEAM_DB", str(db_path))

    from ml_team.api import database as db_mod
    monkeypatch.setattr(db_mod, "_DB_PATH", str(db_path))

    # Drop any thread-local connection left over from a previous test,
    # otherwise init_db() would run against the old file.
    if hasattr(db_mod._local, "conn") and db_mod._local.conn is not None:
        try:
            db_mod._local.conn.close()
        except Exception:
            pass
        db_mod._local.conn = None

    db_mod.init_db()
    return db_path  # tests can open their own connections against it
```
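The isolation works because `monkeypatch` undoes every `setenv`/`setattr` at test teardown. A standalone illustration using `pytest.MonkeyPatch` directly (the same object the fixture injects; the env var name is taken from the fixture above):

```python
import os
import pytest

mp = pytest.MonkeyPatch()
mp.setenv("ML_TEAM_DB", "per-test.db")
assert os.environ["ML_TEAM_DB"] == "per-test.db"

mp.undo()  # what pytest runs automatically at the end of each test
assert os.environ.get("ML_TEAM_DB") != "per-test.db"
```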
## Mocking LLM + HTTP

Don't hit real LLM endpoints in tests:

```python
from unittest.mock import patch, MagicMock

def test_agent_calls_llm():
    fake_response = MagicMock(
        content=[MagicMock(text="ok")],
        usage=MagicMock(input_tokens=10, output_tokens=5),
    )
    # Patch where the name is looked up: if the code under test does
    # `from anthropic import Anthropic`, patch that module's reference instead.
    with patch("anthropic.Anthropic") as mock_client:
        mock_client.return_value.messages.create.return_value = fake_response
        # run your code
        ...
```
For HTTP, use `httpx.MockTransport`:

```python
import httpx

def test_upstream_call():
    def handler(req):
        return httpx.Response(200, json={"status": "ok"})

    transport = httpx.MockTransport(handler)
    with httpx.Client(transport=transport) as client:
        # inject the client into your code under test
        ...
```
## Parametrising

Use `pytest.mark.parametrize` for multi-case tests, and explicit `ids=` for readable test names:

```python
@pytest.mark.parametrize("input_,expected", [
    ("560034", "Bangalore"),
    ("110001", "Delhi"),
    ("400001", "Mumbai"),
], ids=["blr", "del", "bom"])
def test_pin_geocodes(input_, expected):
    assert geocode_pin(input_)["city"] == expected
```
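When individual cases need per-case marks (known failures, slow cases), `pytest.param` carries the id and marks together. A sketch with a hypothetical stub in place of the real `geocode_pin`; the invalid-PIN case is invented for illustration:

```python
import pytest

def geocode_pin_stub(pin: str) -> dict:
    """Hypothetical stand-in for the real geocode_pin."""
    return {"city": {"560034": "Bangalore"}.get(pin)}

@pytest.mark.parametrize("pin,city", [
    pytest.param("560034", "Bangalore", id="blr"),
    pytest.param("000000", "Nowhere", id="bad-pin",
                 marks=pytest.mark.xfail(reason="invalid PINs not handled yet")),
])
def test_pin_geocodes(pin, city):
    assert geocode_pin_stub(pin)["city"] == city
```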
## Integration + smoke tests

When you land a multi-module feature:

- Write the unit tests per module (fast, isolated)
- Write one integration test that exercises the whole flow (slower, realistic)

Example: `ml_team/tests/test_w7_integration.py` exercises permissions + hooks + cron + batch in one app instance.
## Matrix tests (for ecosystem coverage)

Pattern used by `test_plugin_marketplace_matrix.py`:

```python
PLUGIN_CASES = [
    (_OFFICIAL / "plugin-dev", {"skills": 7, "cmds": 1, "agents": 3, ...}),
    # ... 15 more
]

@pytest.mark.parametrize("path,expected", PLUGIN_CASES,
                         ids=[p.name for p, _ in PLUGIN_CASES])
def test_plugin_registers_expected(path, expected, isolated_fixtures):
    ...
```

This gives you one test function that expands to N test cases, each with a clear name.
## Bench tests

```python
# ml_team/tests/bench/test_bench_foo.py
def test_tool_dispatch_speed(benchmark, setup_env):
    """Baseline: 50µs/call. Alert if >60µs."""
    result = benchmark(lambda: tool_executor.execute(tool_call))
    assert result.content
```

Run:

```bash
.venv/bin/pytest ml_team/tests/bench \
  --benchmark-only \
  --benchmark-compare=0001_baseline \
  --benchmark-compare-fail=mean:10%
```

This fails the run on a >10% mean-time regression against the saved baseline.
## Common pitfalls

| Pitfall | Fix |
|---|---|
| Test passes locally, fails in CI | Missing singleton reset; add an autouse fixture |
| Flaky timing-based test | Mock `time.time()` / `datetime.now()` via freezegun |
| Test pollutes filesystem | Use `tmp_path`; never hard-code paths |
| LLM test makes real API call | Wrap `anthropic.Anthropic` / `openai.OpenAI` with a mock |
| Test takes >30s | Move to `e2e/` — out of the default gate |
| Test asserts exact log text | Brittle; assert presence of key fields instead |
| New test file not picked up | Check pytest discovery; the file name must start with `test_` |
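freezegun is the nicest fix for the flaky-timing row; a stdlib-only sketch of the same idea, with a hypothetical `seconds_until` as the code under test:

```python
import time
from unittest.mock import patch

def seconds_until(deadline: float) -> float:
    """Hypothetical code under test that reads the wall clock."""
    return deadline - time.time()

def test_seconds_until_is_deterministic():
    # Pin the clock so the assertion can be exact instead of flaky.
    with patch("time.time", return_value=1_000.0):
        assert seconds_until(1_060.0) == 60.0
```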
## Doc-drift reminder

When you add a test for a new module, also update the IMPL README for the relevant subsystem; the CI's `doc-drift.yml` workflow expects this.
## Next

- Release process — what happens after your PR merges
- tests IMPL README — every existing test file categorised