How to add a test#

Every bug fix and feature lands with a regression test. This page documents the conventions; the same patterns cover the 628+ existing tests.

The regression gate#

.venv/bin/pytest ml_team/tests \
  --tb=short -q \
  --deselect ml_team/tests/test_deploy_tools.py::test_package_model_actually_builds_if_docker_present

Target: all green, zero new warnings. The --deselect skips a test that requires Docker and is not part of the default gate.

Where tests live#

| Test type | Location | Rule |
| --- | --- | --- |
| Unit | ml_team/tests/test_<module>.py | One test file per module under test |
| Integration | ml_team/tests/test_<feature>_integration.py | For cross-module flows (W7-style) |
| Regression | same as unit | Add the offending input + assert correct behaviour |
| Benchmark | ml_team/tests/bench/test_bench_<concern>.py | pytest-benchmark framework |
| End-to-end | ml_team/tests/e2e/test_<flow>_live.py | Against a live stack; not in default gate |
| Marketplace matrix | ml_team/tests/test_plugin_marketplace_matrix.py | Parametrised over real CC plugins |

Skeleton for a new unit test#

"""Tests for ml_team.<subsystem>.<module>."""
from __future__ import annotations

import sys
from pathlib import Path

import pytest

sys.path.insert(0, str(Path(__file__).resolve().parent.parent.parent))

from ml_team.<subsystem> import <module>


# ---------------------------------------------------------------------------
# Fixtures
# ---------------------------------------------------------------------------

@pytest.fixture
def tmp_work_dir(tmp_path):
    """From conftest.py — reuse; don't reinvent."""
    d = tmp_path / "work"
    d.mkdir()
    return d


@pytest.fixture(autouse=True)
def _reset_global_state():
    """Many modules have module-level singletons that bleed across tests.
    Reset them in an autouse fixture so tests don't implicitly depend on order."""
    from ml_team.<subsystem>.<module> import <registry_or_singleton>
    <registry_or_singleton>.clear()  # or whatever the reset is
    yield
    <registry_or_singleton>.clear()


# ---------------------------------------------------------------------------
# Test cases
# ---------------------------------------------------------------------------

def test_<aspect_under_test>():
    """One sentence describing what this asserts."""
    # Arrange
    ...
    # Act
    result = <module>.<function>(...)
    # Assert
    assert result == <expected>

Resetting singletons#

The #1 source of flaky tests is shared state between tests. swarm has several module-level singletons:

| Singleton | Reset call |
| --- | --- |
| hooks.registry() | hooks.registry().clear_all() |
| skill_registry.registry() | skill_registry.registry().clear_all() |
| commands_registry.registry() | commands_registry.registry().clear_all() |
| agents_registry.registry() | agents_registry.registry().clear_all() |
| Feature-flag runtime overrides | feature_flags.reset_runtime() |
| Permission engine rule sources | permissions.reset_sources() + permissions.mark_uninitialized() |
| MCP provider | create a fresh MCPToolProvider() per test |
| LLM client pool | llm_client._SHARED_CLIENTS.clear() (rare) |
| Conversation store file locks | conversation_store._file_locks.clear() |

When in doubt: wrap in an autouse=True fixture and always clean up.

Fresh DB per test#

Tests that hit Postgres/SQLite need a per-test database so state doesn't bleed between tests:

@pytest.fixture
def fresh_db(tmp_path, monkeypatch):
    db_path = tmp_path / "test.db"
    monkeypatch.setenv("ML_TEAM_DB", str(db_path))
    from ml_team.api import database as db_mod
    monkeypatch.setattr(db_mod, "_DB_PATH", str(db_path))
    # Close any connection a previous test left open on this thread,
    # otherwise it keeps pointing at the old database file.
    if getattr(db_mod._local, "conn", None) is not None:
        try:
            db_mod._local.conn.close()
        except Exception:
            pass
        db_mod._local.conn = None
    db_mod.init_db()
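The isolation idea in self-contained form — plain sqlite3, with a hypothetical init_db and a toy runs table standing in for the real schema — showing that writes in one test's database are invisible to another's:

```python
import sqlite3
import tempfile
from pathlib import Path

def init_db(db_path: str) -> sqlite3.Connection:
    # Toy stand-in for db_mod.init_db(): fresh file, fresh schema.
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE runs (id INTEGER PRIMARY KEY, status TEXT)")
    conn.commit()
    return conn

with tempfile.TemporaryDirectory() as tmp:
    # Each "test" gets its own path, just as tmp_path gives each
    # pytest test its own directory.
    db_a = init_db(str(Path(tmp) / "a.db"))
    db_b = init_db(str(Path(tmp) / "b.db"))
    db_a.execute("INSERT INTO runs (status) VALUES ('ok')")
    db_a.commit()
    rows_b = db_b.execute("SELECT COUNT(*) FROM runs").fetchone()[0]
    assert rows_b == 0  # b never sees a's writes
    db_a.close()
    db_b.close()
```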

Mocking LLM + HTTP#

Don't hit real LLM endpoints in tests:

from unittest.mock import patch, MagicMock

def test_agent_calls_llm():
    fake_response = MagicMock(
        content=[MagicMock(text="ok")],
        usage=MagicMock(input_tokens=10, output_tokens=5),
    )
    with patch("anthropic.Anthropic") as mock_client:
        mock_client.return_value.messages.create.return_value = fake_response
        # run your code
        ...

For HTTP, use httpx.MockTransport:

import httpx

def test_upstream_call():
    def handler(req):
        return httpx.Response(200, json={"status": "ok"})
    transport = httpx.MockTransport(handler)
    with httpx.Client(transport=transport) as client:
        # inject the client into your code under test
        ...

Parametrising#

Use pytest.mark.parametrize for multi-case tests, with explicit ids= for readable test names:

@pytest.mark.parametrize("input_,expected", [
    ("560034", "Bangalore"),
    ("110001", "Delhi"),
    ("400001", "Mumbai"),
], ids=["blr", "del", "bom"])
def test_pin_geocodes(input_, expected):
    assert geocode_pin(input_)["city"] == expected

Integration + smoke tests#

When you land a multi-module feature:

  1. Write the unit tests per module (fast, isolated)
  2. Write one integration test that exercises the whole flow (slower, realistic)

Example: ml_team/tests/test_w7_integration.py exercises permissions + hooks + cron + batch in one app instance.
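The same two-layer shape in a self-contained sketch (parse and summarize are toy stand-ins for two real modules):

```python
# Toy stand-ins for two modules that each have their own unit tests.
def parse(raw: str) -> list:
    return [p.strip() for p in raw.split(",") if p.strip()]

def summarize(items: list) -> dict:
    return {"count": len(items)}

# Unit tests: fast, isolated, one concern each.
def test_parse_strips_and_splits():
    assert parse(" a, b ") == ["a", "b"]

def test_summarize_counts():
    assert summarize(["a", "b"]) == {"count": 2}

# Integration test: one realistic pass through the whole flow.
def test_flow_integration():
    assert summarize(parse("a,b,c")) == {"count": 3}
```

The unit tests pin down each module's contract; the single integration test catches wiring mistakes between them without duplicating every per-module case.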

Matrix tests (for ecosystem coverage)#

Pattern used by test_plugin_marketplace_matrix.py:

PLUGIN_CASES = [
    (_OFFICIAL / "plugin-dev", {"skills": 7, "cmds": 1, "agents": 3, ...}),
    # ... 15 more
]

@pytest.mark.parametrize("path,expected", PLUGIN_CASES,
                         ids=[p.name for p, _ in PLUGIN_CASES])
def test_plugin_registers_expected(path, expected, isolated_fixtures):
    ...

This gives you one test function that expands to N test cases, each with a clear name.

Bench tests#

# ml_team/tests/bench/test_bench_foo.py
def test_tool_dispatch_speed(benchmark, setup_env):
    """Baseline: 50µs/call. Alert if >60µs."""
    result = benchmark(lambda: tool_executor.execute(tool_call))
    assert result.content

Run:

.venv/bin/pytest ml_team/tests/bench \
  --benchmark-only \
  --benchmark-compare=0001_baseline \
  --benchmark-compare-fail=mean:10%

Alerts on >10% regression.

Common pitfalls#

| Pitfall | Fix |
| --- | --- |
| Test passes locally, fails in CI | Missing singleton reset; add autouse fixture |
| Flaky timing-based test | Mock time.time() / datetime.now() via freezegun |
| Test pollutes filesystem | Use tmp_path; never hard-code paths |
| LLM test makes real API call | Wrap anthropic.Anthropic / openai.OpenAI with mock |
| Test takes >30s | Move to e2e/ — out of default gate |
| Test asserts exact log text | Brittle; assert presence of key fields instead |
| New test file not picked up | Check pytest conftest discovery; file must start with test_ |
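For the timing pitfall, freezegun is the convenient option, but plain unittest.mock works too and needs no extra dependency. A minimal sketch (seconds_since is a hypothetical function under test):

```python
import time
from unittest.mock import patch

def seconds_since(start: float) -> float:
    # Hypothetical code under test that reads the wall clock.
    return time.time() - start

def test_elapsed_is_deterministic():
    # Pin the clock so the assertion can be exact instead of approximate.
    with patch("time.time", return_value=1000.0):
        assert seconds_since(990.0) == 10.0
```

The same patch target idea applies to datetime.now(): patch wherever the code under test looks the name up.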

Doc-drift reminder#

When you add a test for a new module, also update the IMPL README for the relevant subsystem. The CI's doc-drift.yml workflow expects this.

Next#