ADR-009: Error Handling Strategy

The swarm orchestrator has multiple error domains (config, git, messaging, agents, sessions) that surface at different layers. We need a consistent strategy for error propagation, user-facing messages, and recovery.

Decision

Use anyhow for application-level error propagation and thiserror for domain-specific error enums in errors.rs.

Domain Error Types

// All defined in swarm/src/errors.rs
enum ConfigError    { MissingFile, ParseError, ValidationError, VersionMismatch }
enum GitError       { NotARepo, DirtyTree, WorktreeOp, VersionTooOld }
enum MessagingError { DbOpen, DbLocked, UnknownAgent, SelfSend }
enum AgentError     { SpawnFailed, BinaryNotFound, Timeout }
enum SessionError   { StaleLockfile, RecoveryNeeded }

Each variant carries a human-readable message via #[error("...")].
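As a minimal sketch of what a variant's message could look like, the block below hand-writes the Display and Error impls that a thiserror derive would otherwise generate, so it compiles without extra crates. The payload fields (found, required) and the NotARepo wording are assumptions for illustration; the ADR does not specify them.

```rust
use std::fmt;

// Hypothetical payloads; the ADR lists only the variant names.
// With thiserror the same enum would be written roughly as:
//   #[derive(Debug, thiserror::Error)]
//   enum GitError {
//       #[error("Git version {found} is too old. Swarm requires git >= {required}. Please upgrade git.")]
//       VersionTooOld { found: String, required: String },
//   }
#[derive(Debug)]
enum GitError {
    NotARepo,
    VersionTooOld { found: String, required: String },
}

impl fmt::Display for GitError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            // Illustrative wording, not taken from the ADR.
            GitError::NotARepo => write!(
                f,
                "The current directory is not a git repository. Run swarm from inside a repository."
            ),
            // Wording follows the example in "User-Facing Error Messages".
            GitError::VersionTooOld { found, required } => write!(
                f,
                "Git version {found} is too old. Swarm requires git >= {required}. Please upgrade git."
            ),
        }
    }
}

// Implementing std::error::Error is what lets the type flow through `?`
// into a general application-level error.
impl std::error::Error for GitError {}
```

Calling to_string() on a VersionTooOld value built with found "2.17" and required "2.20" yields the example message quoted later in this ADR.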

Propagation Rules

Library-style modules (config.rs, messaging.rs, session.rs, worktree.rs) return Result<T, SpecificError> using domain error types.
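Orchestrator and runner use anyhow::Result and convert domain errors via ?. The conversion relies on anyhow's blanket From impl for any type that implements std::error::Error, which the thiserror derive provides.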
CLI layer (main.rs) catches anyhow::Error, prints user-friendly messages, and sets exit codes.
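The three layers above can be sketched as follows. To keep the example dependency-free, Box<dyn Error + Send + Sync> stands in for anyhow::Error; it converts through ? in the same way. The function names (check_git_version, run_orchestrator, report) and the exit code values are hypothetical.

```rust
use std::error::Error;
use std::fmt;

#[derive(Debug)]
enum GitError {
    VersionTooOld { found: String },
}

impl fmt::Display for GitError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            GitError::VersionTooOld { found } => write!(
                f,
                "Git version {found} is too old. Swarm requires git >= 2.20. Please upgrade git."
            ),
        }
    }
}

impl Error for GitError {}

// Stand-in for anyhow::Result in this sketch.
type AppResult<T> = Result<T, Box<dyn Error + Send + Sync>>;

// Library-style layer: returns the specific domain error.
fn check_git_version(found: (u32, u32)) -> Result<(), GitError> {
    // Tuples compare lexicographically, so (2, 17) < (2, 20).
    if found < (2, 20) {
        return Err(GitError::VersionTooOld {
            found: format!("{}.{}", found.0, found.1),
        });
    }
    Ok(())
}

// Orchestrator layer: `?` converts GitError into the application error.
fn run_orchestrator(git_version: (u32, u32)) -> AppResult<&'static str> {
    check_git_version(git_version)?;
    Ok("started")
}

// CLI layer: print the message and pick an exit code (values assumed).
fn report(result: AppResult<&'static str>) -> i32 {
    match result {
        Ok(_) => 0,
        Err(err) => {
            eprintln!("error: {err}");
            1
        }
    }
}
```

In a real main.rs the returned code would be passed to std::process::exit; the sketch returns it so the behavior is easy to check.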

User-Facing Error Messages

All errors surfaced to the user include: what failed, why, and what to do.
Example: "Git version 2.17 is too old. Swarm requires git >= 2.20. Please upgrade git."
Internal errors (panics, unexpected states) are logged with full context to .swarm/orchestrator.log and shown to the user only as "internal error, see .swarm/orchestrator.log".
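One possible way to separate actionable domain errors from internal ones at the CLI boundary is to downcast the error and fall back to the generic message. This is a sketch under assumptions: user_message is a hypothetical helper, only one domain type is checked, and the MissingFile wording is illustrative. anyhow::Error offers a comparable downcast_ref method.

```rust
use std::error::Error;
use std::fmt;

#[derive(Debug)]
enum ConfigError {
    MissingFile { path: String },
}

impl fmt::Display for ConfigError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            // What failed, why it matters, and what to do (illustrative wording).
            ConfigError::MissingFile { path } => write!(
                f,
                "Config file {path} was not found. Swarm cannot start without it. Create the file and retry."
            ),
        }
    }
}

impl Error for ConfigError {}

// Hypothetical helper: known domain errors are shown as-is, anything else
// is hidden behind the generic message. Real code would also write the
// full error context to the log file before returning.
fn user_message(err: &(dyn Error + 'static)) -> String {
    match err.downcast_ref::<ConfigError>() {
        Some(domain) => domain.to_string(),
        None => "internal error, see .swarm/orchestrator.log".to_string(),
    }
}
```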

Recovery vs Fatal

Category	Behavior
Config errors	Fatal at startup. Fix config and retry.
Git prereq errors	Fatal at startup. Fix environment and retry.
Agent spawn errors	Per-agent retry with backoff (see state machine).
Messaging DB errors	Fatal if DB can't open. Transient locks retried via `busy_timeout`.
Stale session	Prompt the user to either recover the session or clean up and restart.
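The table can be read as a mapping from error category to recovery action. The sketch below encodes it as a single match; ErrorCategory, Recovery, and recovery_for are hypothetical names that mirror the table rows and are not types from errors.rs.

```rust
// Hypothetical categories, one per table row (messaging split in two).
#[derive(Debug, Clone, Copy)]
enum ErrorCategory {
    Config,
    GitPrereq,
    AgentSpawn,
    MessagingDbOpen,
    MessagingDbLocked,
    SessionStale,
}

#[derive(Debug, PartialEq)]
enum Recovery {
    // Stop at startup; the user fixes the cause and retries.
    Fatal,
    // Retry the individual agent with backoff.
    RetryWithBackoff,
    // Left to the database layer's busy_timeout setting.
    RetriedByBusyTimeout,
    // Ask the user whether to recover or clean and restart.
    PromptUser,
}

fn recovery_for(category: ErrorCategory) -> Recovery {
    match category {
        ErrorCategory::Config | ErrorCategory::GitPrereq => Recovery::Fatal,
        ErrorCategory::AgentSpawn => Recovery::RetryWithBackoff,
        ErrorCategory::MessagingDbOpen => Recovery::Fatal,
        ErrorCategory::MessagingDbLocked => Recovery::RetriedByBusyTimeout,
        ErrorCategory::SessionStale => Recovery::PromptUser,
    }
}
```

Because the match is exhaustive, adding a new category forces a decision about its recovery behavior at compile time.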

Alternatives Considered

anyhow only (no domain types): Simpler but loses structured matching for tests and recovery logic.
Custom error type with enum: More boilerplate than thiserror for the same result.
eyre + color-eyre: Better panic reports but adds dependency; anyhow is sufficient with tracing.

Consequences

Domain errors are testable via pattern matching.
anyhow provides clean ?-chaining in orchestrator code.
Error messages are consistent and actionable.
thiserror derives keep boilerplate minimal.
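To illustrate the testability point, here is a minimal sketch of asserting on a specific variant. validate_send and the String payload on UnknownAgent are assumptions made for the example; the ADR names only the variants.

```rust
#[derive(Debug, PartialEq)]
enum MessagingError {
    UnknownAgent(String),
    SelfSend,
}

// Hypothetical validation step in a messaging module.
fn validate_send(from: &str, to: &str, known_agents: &[&str]) -> Result<(), MessagingError> {
    if from == to {
        return Err(MessagingError::SelfSend);
    }
    if !known_agents.contains(&to) {
        return Err(MessagingError::UnknownAgent(to.to_string()));
    }
    Ok(())
}
```

A test can then check the exact variant, for example assert!(matches!(validate_send("a", "a", &["a", "b"]), Err(MessagingError::SelfSend))), instead of comparing message strings.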
