ADR-009: Error Handling Strategy
Status
Accepted
Context
The swarm orchestrator has multiple error domains (config, git, messaging, agents, sessions) that surface at different layers. We need a consistent strategy for error propagation, user-facing messages, and recovery.
Decision
Use anyhow for application-level error propagation and thiserror for domain-specific error enums in errors.rs.
Domain Error Types
```rust
// All defined in swarm/src/errors.rs
enum ConfigError { MissingFile, ParseError, ValidationError, VersionMismatch }
enum GitError { NotARepo, DirtyTree, WorktreeOp, VersionTooOld }
enum MessagingError { DbOpen, DbLocked, UnknownAgent, SelfSend }
enum AgentError { SpawnFailed, BinaryNotFound, Timeout }
enum SessionError { StaleLockfile, RecoveryNeeded }
```
Each variant carries a human-readable message via #[error("...")].
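
As a rough illustration of what those per-variant messages amount to, the sketch below hand-writes the `Display` and `Error` impls that a `thiserror` derive would generate, so it compiles without external crates. The payload fields and every message wording here are assumptions made for the example; the ADR only names the variants.

```rust
use std::error::Error;
use std::fmt;

// Payload fields are assumptions; the ADR lists variant names only.
#[derive(Debug)]
#[allow(dead_code)]
enum MessagingError {
    DbOpen(String),       // path that could not be opened
    DbLocked,
    UnknownAgent(String), // agent name
    SelfSend(String),     // agent name
}

// With thiserror, each arm would instead be an attribute on its variant,
// e.g. #[error("Agent '{0}' is not known. ...")] on UnknownAgent(String).
impl fmt::Display for MessagingError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            MessagingError::DbOpen(path) => write!(
                f,
                "Cannot open message database at {}. Check that the path exists and is writable.",
                path
            ),
            MessagingError::DbLocked => write!(
                f,
                "Message database is locked. Another swarm process may be running; retry shortly."
            ),
            MessagingError::UnknownAgent(name) => write!(
                f,
                "Agent '{}' is not known. Check the agent name in your config.",
                name
            ),
            MessagingError::SelfSend(name) => write!(
                f,
                "Agent '{}' cannot send a message to itself. Choose a different recipient.",
                name
            ),
        }
    }
}

// Marks the enum as a standard error so it can flow through `?` and Box<dyn Error>.
impl Error for MessagingError {}
```

Because the message lives on the variant, callers get the same wording whether the error is printed directly or after being wrapped by a higher layer.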
Propagation Rules
- Library-style modules (config.rs, messaging.rs, session.rs, worktree.rs) return Result<T, SpecificError> using domain error types.
- Orchestrator and runner use anyhow::Result and convert via ? (anyhow's blanket From impl accepts any type that implements std::error::Error, which the thiserror derives provide).
- CLI layer (main.rs) catches anyhow::Error, prints user-friendly messages, and sets exit codes.
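
The three layers above can be sketched roughly as follows. To stay dependency-free, `Box<dyn Error + Send + Sync>` stands in for `anyhow::Error`, and the `Display`/`Error` impls are written by hand where `thiserror` would derive them. The names `load_config`, `run`, `report`, and `AppResult`, the idea that the config holds a single agent count, the variant payloads, and exit code 1 are all illustrative assumptions, not taken from the swarm codebase.

```rust
use std::error::Error;
use std::fmt;

#[derive(Debug)]
enum ConfigError {
    MissingFile(String), // hypothetical payload: the path
    ParseError(String),  // hypothetical payload: parser detail
}

impl fmt::Display for ConfigError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            ConfigError::MissingFile(p) => {
                write!(f, "Config file {} not found. Create it and retry.", p)
            }
            ConfigError::ParseError(d) => {
                write!(f, "Config could not be parsed: {}. Fix the file and retry.", d)
            }
        }
    }
}

impl Error for ConfigError {}

// Library-style layer: returns the specific domain error.
// `contents` is passed in (None = file missing) to keep the sketch free of I/O.
fn load_config(path: &str, contents: Option<&str>) -> Result<u32, ConfigError> {
    let text = contents.ok_or_else(|| ConfigError::MissingFile(path.to_string()))?;
    text.trim()
        .parse::<u32>()
        .map_err(|e| ConfigError::ParseError(e.to_string()))
}

// Orchestrator layer: a boxed error stands in for anyhow::Error.
type AppResult<T> = Result<T, Box<dyn Error + Send + Sync>>;

fn run(path: &str, contents: Option<&str>) -> AppResult<u32> {
    // `?` converts ConfigError into the boxed application error.
    let max_agents = load_config(path, contents)?;
    Ok(max_agents)
}

// CLI layer: turn the outcome into a printable message plus an exit code.
fn report(result: AppResult<u32>) -> (String, i32) {
    match result {
        Ok(n) => (format!("started with {} agents", n), 0),
        Err(e) => (format!("error: {}", e), 1),
    }
}
```

In this shape only `load_config` knows about `ConfigError`; `run` never names it, which is what the `?` conversion buys at the orchestrator layer.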
User-Facing Error Messages
- All errors surfaced to the user include: what failed, why, and what to do.
- Example: "Git version 2.17 is too old. Swarm requires git >= 2.20. Please upgrade git."
- Internal errors (panics, unexpected states) are logged with full context to orchestrator.log and shown to the user as "internal error, see .swarm/orchestrator.log".
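
One way the CLI layer might choose between an actionable message and the generic internal-error line is to downcast to a known domain type. This is a minimal sketch under assumptions: `user_message` and the `found`/`required` fields are hypothetical, only `GitError` is checked, and writing to the log is omitted. With `anyhow`, `anyhow::Error::downcast_ref` would play the role that `dyn Error::downcast_ref` plays here.

```rust
use std::error::Error;
use std::fmt;

#[derive(Debug)]
enum GitError {
    // Hypothetical payload; the ADR names the variant only.
    VersionTooOld { found: String, required: String },
}

impl fmt::Display for GitError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            // What failed, why, and what to do, in one message.
            GitError::VersionTooOld { found, required } => write!(
                f,
                "Git version {} is too old. Swarm requires git >= {}. Please upgrade git.",
                found, required
            ),
        }
    }
}

impl Error for GitError {}

// Decide what the user sees for a given error.
fn user_message(err: &(dyn Error + 'static)) -> String {
    if let Some(git) = err.downcast_ref::<GitError>() {
        // Known domain error: its own message is already actionable.
        git.to_string()
    } else {
        // Anything unrecognized is treated as internal; details belong in the log.
        "internal error, see .swarm/orchestrator.log".to_string()
    }
}
```

A fuller version would check each domain enum in turn, or match on a shared wrapper type, before falling through to the internal-error line.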
Recovery vs Fatal
| Category | Behavior |
|---|---|
| Config errors | Fatal at startup. Fix config and retry. |
| Git prereq errors | Fatal at startup. Fix environment and retry. |
| Agent spawn errors | Per-agent retry with backoff (see state machine). |
| Messaging DB errors | Fatal if DB can't open. Transient locks retried via busy_timeout. |
| Session stale | Prompt user: recover or clean and restart. |
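
The table can be read as a mapping from error category to handling policy. The sketch below encodes that mapping as a single `match`; `SwarmError`, `Disposition`, and `disposition` are hypothetical names introduced for illustration, and the backoff schedule itself is left out because this ADR defers it to the state machine.

```rust
// How the orchestrator reacts to an error category.
#[derive(Debug, PartialEq)]
enum Disposition {
    Fatal,      // abort startup; the user fixes the cause and retries
    Retry,      // retried automatically (agent backoff, or busy_timeout for DB locks)
    PromptUser, // ask the user: recover, or clean and restart
}

// Flattened view of the categories in the table (hypothetical type).
#[derive(Debug)]
enum SwarmError {
    Config,
    GitPrereq,
    AgentSpawn,
    MessagingDbOpen,
    MessagingDbLocked,
    SessionStale,
}

fn disposition(err: &SwarmError) -> Disposition {
    match err {
        SwarmError::Config | SwarmError::GitPrereq | SwarmError::MessagingDbOpen => {
            Disposition::Fatal
        }
        SwarmError::AgentSpawn | SwarmError::MessagingDbLocked => Disposition::Retry,
        SwarmError::SessionStale => Disposition::PromptUser,
    }
}
```

Keeping the policy in one exhaustive `match` means a newly added category fails to compile until someone decides how it should be handled.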
Alternatives Considered
- anyhow only (no domain types): Simpler but loses structured matching for tests and recovery logic.
- Custom error type with enum: More boilerplate than thiserror for the same result.
- eyre + color-eyre: Better panic reports but adds dependency; anyhow is sufficient with tracing.
Consequences
- Domain errors are testable via pattern matching.
- anyhow provides clean ?-chaining in orchestrator code.
- Error messages are consistent and actionable.
- thiserror derives keep boilerplate minimal.
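
To make the first consequence concrete, here is a small sketch of a test that asserts on the error variant rather than on message text. `check_session` and its three boolean inputs are invented for the example; the real session logic is not described in this ADR.

```rust
#[derive(Debug)]
enum SessionError {
    StaleLockfile,
    RecoveryNeeded,
}

// Hypothetical check: the inputs and the rules are assumptions.
fn check_session(
    lockfile_exists: bool,
    owner_alive: bool,
    clean_shutdown: bool,
) -> Result<(), SessionError> {
    if lockfile_exists && !owner_alive {
        // A lockfile whose owning process is gone is considered stale.
        return Err(SessionError::StaleLockfile);
    }
    if !clean_shutdown {
        return Err(SessionError::RecoveryNeeded);
    }
    Ok(())
}

#[test]
fn stale_lockfile_is_reported_as_its_own_variant() {
    // Matching on the variant keeps the test stable if the message wording changes.
    let result = check_session(true, false, true);
    assert!(matches!(result, Err(SessionError::StaleLockfile)));
}
```

With an `anyhow`-only design the same test would have to compare strings, which is the structured-matching loss noted under Alternatives Considered.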