ADR-009: Error Handling Strategy

Status

Accepted

Context

The swarm orchestrator has multiple error domains (config, git, messaging, agents, sessions) that surface at different layers. We need a consistent strategy for error propagation, user-facing messages, and recovery.

Decision

Use anyhow for application-level error propagation and thiserror for domain-specific error enums in errors.rs.

Domain Error Types

// All defined in swarm/src/errors.rs
enum ConfigError    { MissingFile, ParseError, ValidationError, VersionMismatch }
enum GitError       { NotARepo, DirtyTree, WorktreeOp, VersionTooOld }
enum MessagingError { DbOpen, DbLocked, UnknownAgent, SelfSend }
enum AgentError     { SpawnFailed, BinaryNotFound, Timeout }
enum SessionError   { StaleLockfile, RecoveryNeeded }

Each variant carries a human-readable message via #[error("...")].
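As an illustration of how a variant might carry its message, the sketch below writes out by hand roughly what #[derive(thiserror::Error)] with #[error("...")] attributes would generate, so it compiles with the standard library alone. Only the variant names and the version message come from this ADR. The fields (path, found, required) and the NotARepo wording are assumptions made for the example.

```rust
use std::fmt;

// Hypothetical payloads: the ADR lists only variant names, so the
// fields below are assumptions made for illustration.
#[derive(Debug)]
pub enum GitError {
    NotARepo { path: String },
    VersionTooOld { found: String, required: String },
}

// With thiserror, this impl would be replaced by an attribute on each
// variant, e.g. #[error("Git version {found} is too old. ...")].
impl fmt::Display for GitError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            // Hypothetical wording, not taken from the ADR.
            GitError::NotARepo { path } => write!(
                f,
                "{path} is not a git repository. Run swarm from inside a repository."
            ),
            // Wording taken from the ADR's user-facing example.
            GitError::VersionTooOld { found, required } => write!(
                f,
                "Git version {found} is too old. Swarm requires git >= {required}. Please upgrade git."
            ),
        }
    }
}

// thiserror's derive also provides this impl.
impl std::error::Error for GitError {}
```

Calling to_string() on VersionTooOld with found "2.17" and required "2.20" yields the example message from the User-Facing Error Messages section.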

Propagation Rules

  1. Library-style modules (config.rs, messaging.rs, session.rs, worktree.rs) return Result<T, SpecificError> using domain error types.
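  2. Orchestrator and runner use anyhow::Result and convert domain errors with ?. This works because thiserror derives std::error::Error for each enum, and anyhow::Error implements From for any such error type that is Send + Sync + 'static.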
  3. CLI layer (main.rs) catches anyhow::Error, prints user-friendly messages, and sets exit codes.
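The three layers above can be sketched as follows. To keep the example free of external crates, Box<dyn Error + Send + Sync> stands in for anyhow::Error; both accept any std::error::Error type through ?, and both allow downcasting back to the domain type. The function names (load_config, orchestrate, run_cli), the exit codes, and the choice to treat every I/O failure as MissingFile are assumptions for illustration, not the project's actual code.

```rust
use std::error::Error;
use std::fmt;

// Layer 1: a library-style module returns its own domain error.
#[derive(Debug, PartialEq)]
pub enum ConfigError {
    MissingFile(String),
}

impl fmt::Display for ConfigError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            ConfigError::MissingFile(path) => write!(f, "config file {path} not found"),
        }
    }
}

impl Error for ConfigError {}

// Sketch only: every I/O failure is reported as a missing file.
pub fn load_config(path: &str) -> Result<String, ConfigError> {
    std::fs::read_to_string(path).map_err(|_| ConfigError::MissingFile(path.to_string()))
}

// Layer 2: std-only stand-in for anyhow::Result.
pub type AppResult<T> = Result<T, Box<dyn Error + Send + Sync + 'static>>;

pub fn orchestrate(path: &str) -> AppResult<usize> {
    let raw = load_config(path)?; // `?` boxes the ConfigError
    Ok(raw.len())
}

// Layer 3: the CLI prints a message and picks an exit code.
// The codes 1 and 2 are hypothetical.
pub fn run_cli(path: &str) -> i32 {
    match orchestrate(path) {
        Ok(_) => 0,
        Err(err) => {
            eprintln!("error: {err}");
            // The domain type is still recoverable by downcasting.
            if err.downcast_ref::<ConfigError>().is_some() {
                2
            } else {
                1
            }
        }
    }
}
```

A real main.rs would pass the value returned by run_cli to std::process::exit.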

User-Facing Error Messages

  • All errors surfaced to the user include: what failed, why, and what to do.
  • Example: "Git version 2.17 is too old. Swarm requires git >= 2.20. Please upgrade git."
  • Internal errors (panics, unexpected states) are logged with full context to orchestrator.log and shown to the user as "internal error, see .swarm/orchestrator.log".
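One way to keep the "what failed, why, what to do" shape consistent is to build messages from three named parts. The sketch below is a hypothetical helper (UserMessage and INTERNAL_ERROR_NOTICE are illustrative names, not part of the codebase) that reproduces the two message forms quoted above.

```rust
use std::fmt;

// Hypothetical helper: one field per required part of a user-facing message.
pub struct UserMessage<'a> {
    pub what: &'a str, // what failed
    pub why: &'a str,  // why it failed
    pub fix: &'a str,  // what the user should do
}

impl fmt::Display for UserMessage<'_> {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // Each part becomes one sentence, in a fixed order.
        write!(f, "{}. {}. {}.", self.what, self.why, self.fix)
    }
}

// Fixed text shown for panics and unexpected states; details go to the log.
pub const INTERNAL_ERROR_NOTICE: &str = "internal error, see .swarm/orchestrator.log";
```

With what set to "Git version 2.17 is too old", why set to "Swarm requires git >= 2.20", and fix set to "Please upgrade git", the Display output matches the example message in the list above.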

Recovery vs Fatal

| Category | Behavior |
|---|---|
| Config errors | Fatal at startup. Fix config and retry. |
| Git prereq errors | Fatal at startup. Fix environment and retry. |
| Agent spawn errors | Per-agent retry with backoff (see state machine). |
| Messaging DB errors | Fatal if DB can't open. Transient locks retried via busy_timeout. |
| Session stale | Prompt user: recover or clean and restart. |
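The recovery policy can be expressed as a small classification function. The sketch below is an assumption-laden illustration: Failure, Action, classify, and backoff_delay are hypothetical names, and the exponential schedule is only one possible backoff, since this ADR defers the actual retry policy to the state machine document.

```rust
use std::time::Duration;

// Hypothetical flattening of the categories into one enum.
#[derive(Debug, PartialEq)]
pub enum Failure {
    Config,
    GitPrereq,
    AgentSpawn,
    DbOpen,
    DbLocked,
    SessionStale,
}

#[derive(Debug, PartialEq)]
pub enum Action {
    Fatal,      // stop at startup; the user fixes the cause and reruns
    Retry,      // transient; try again
    PromptUser, // ask: recover, or clean and restart
}

pub fn classify(failure: &Failure) -> Action {
    match failure {
        Failure::Config | Failure::GitPrereq | Failure::DbOpen => Action::Fatal,
        // For DbLocked the retry is performed by the database layer itself
        // through busy_timeout; it is grouped here only as "not fatal".
        Failure::AgentSpawn | Failure::DbLocked => Action::Retry,
        Failure::SessionStale => Action::PromptUser,
    }
}

// Assumed schedule: base * 2^attempt, never exceeding cap.
pub fn backoff_delay(attempt: u32, base: Duration, cap: Duration) -> Duration {
    let factor = 2u32.saturating_pow(attempt);
    base.saturating_mul(factor).min(cap)
}
```

With a 100 ms base and a 5 s cap, attempts 0 through 3 wait 100, 200, 400, and 800 ms, and later attempts settle at the cap.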

Alternatives Considered

  1. anyhow only (no domain types): Simpler but loses structured matching for tests and recovery logic.
  2. Hand-written error enum (manual Display and std::error::Error impls): More boilerplate than thiserror for the same result.
  3. eyre + color-eyre: Better panic reports but adds dependency; anyhow is sufficient with tracing.

Consequences

  • Domain errors are testable via pattern matching.
  • anyhow provides clean ?-chaining in orchestrator code.
  • Error messages are consistent and actionable.
  • thiserror derives keep boilerplate minimal.
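To illustrate the first consequence, here is a minimal sketch of a test that matches on a specific variant instead of comparing message strings. The validate_send function and the String payload on UnknownAgent are hypothetical; only the variant names come from this ADR.

```rust
#[derive(Debug)]
pub enum MessagingError {
    UnknownAgent(String), // payload is an assumption
    SelfSend,
}

// Hypothetical validation step run before a message is queued.
pub fn validate_send(from: &str, to: &str, known: &[&str]) -> Result<(), MessagingError> {
    if from == to {
        return Err(MessagingError::SelfSend);
    }
    if !known.contains(&to) {
        return Err(MessagingError::UnknownAgent(to.to_string()));
    }
    Ok(())
}
```

A test can then assert on structure, for example assert!(matches!(validate_send("a", "a", &["a", "b"]), Err(MessagingError::SelfSend))), which keeps passing if the message wording changes later.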