Files
Codewalkers/docs/test-inventory.md
Lukas May a2ab4c4a84 docs: Add comprehensive test inventory with coverage gaps and redundancy map
Audited all 44 test files one by one. Documents what each test verifies,
identifies 12 redundant test pairs, 13 coverage gaps (prioritized), fragility
assessment, and mock style inconsistencies.
2026-03-02 12:23:39 +09:00

27 KiB
Raw Blame History

Test Inventory

Comprehensive catalog of every test in the suite, what it verifies, and where coverage is weak or redundant.

44 test files | 35 unit | 7 E2E | 4 integration (mocked) | 5 integration (real providers)

Last audited: 2026-03-02


Coverage Matrix

Subjective 010 scores. 0 = no coverage, 5 = happy paths only, 10 = comprehensive with edge cases.

Module Happy Path Error/Edge Integration Race/Concurrency Notes
Agent Manager 8 6 6 9 Missing: getResult with actual result, multi-provider spawn
Agent OutputHandler 6 4 5 8 Missing: processAgentOutput with error status, no-result-line case
Agent FileIO 8 7 8 Missing: round-trip write→read, writeInputFiles content verification
Agent ProcessManager 6 4 3 0 Heavy mocking; missing: killProcess, cleanupWorktree, error paths
Agent Lifecycle 8 7 3 0 Missing: controller integration, analyzeError→shouldRetry chain
DB Repositories 9 7 9 task.test.ts thin: missing findByParentTaskId, approval, category queries
DB Cascades 10 10 Excellent: full graph verified
Dispatch Manager 7 5 8 0 "dependency scenario" test is misleading (deps not implemented)
Phase Dispatch 9 7 8 0 Best dependency tests (diamond pattern, blocking propagation)
Coordination/Merge 7 5 7 0 Overlap with ConflictResolutionService
Git Manager 9 7 9 0 Real git repos, no mocks. Excellent.
Git BranchManager 6 5 6 Only remoteBranchExists tested. Other methods untested.
Process Manager/Registry 8 6 0 Registry is pure and clean. Manager has mock fragility.
Logging 8 6 8 0 Real filesystem. Good.
Events Bus 8 3 0 Missing: handler-throws, handler ordering
Server/HTTP 8 6 9 0 Real HTTP + real PID files
Shutdown 5 2 0 Missing: signal→handler→shutdown chain, error resilience
tRPC Router 3 1 3 0 Only health and status tested. All domain procedures untested.
Preview 7 5 4 0 Missing: discoverConfig, docker-client, health-checker unit tests
E2E Workflows 8 6 8 0 Good lifecycle coverage. Output parsing never tested E2E.

Test Files — Detailed Inventory

Agent Module (src/agent/)

manager.test.ts — MultiProviderAgentManager

Test Verifies
spawn / creates agent record with provided name repository.create called with name
spawn / rejects duplicate agent names Unique name constraint enforced
spawn / emits AgentSpawned event with name Event has agent name in payload
spawn / writes diagnostic files for workdir verification spawn-diagnostic.json written with metadata
spawn / uses custom cwd if provided child_process.spawn gets custom cwd
stop / stops running agent and updates status DB updated to stopped
stop / kills detached process if running Status update after spawn+stop (does NOT verify kill)
stop / throws if agent not found Error on missing ID
stop / emits AgentStopped event Event has user_requested reason
list / returns all agents with names findAll returns correct data
get / finds agent by id findById returns agent
get / returns null if not found findById returns null
getByName / finds agent by name findByName returns agent
getByName / returns null if not found findByName returns null
resume / resumes agent waiting for input Spawn called with --resume + session args
resume / rejects if not waiting for input Error on wrong status
resume / rejects if no session Error on null sessionId
resume / emits AgentResumed event Event has sessionId
getResult / returns null when no result Agent result is null
delete / deletes agent and clears active state repository.delete called
delete / emits agent:deleted event Event has agent name
delete / throws if not found Error on missing ID
delete / handles missing workdir gracefully No error when workdir absent

Mocking: 5 vi.mock() calls (child_process, git/manager, node:fs, node:fs/promises, file-tailer). Brittle. Gaps: No test for getResult with actual result. No multi-provider tests. findByTaskId/findBySessionId mocked but never exercised.


output-handler.test.ts — OutputHandler

Test Verifies
processAgentOutput / parse questions from Claude output waiting_for_input status + questions stored
processAgentOutput / handle malformed questions Falls to crashed status
processAgentOutput / handle "done" status No waiting_for_input set
getPendingQuestions / retrieve stored questions JSON round-trip from DB
getPendingQuestions / return null when none Null for no questions
formatAnswersAsPrompt / format normal answers String formatting correct
formatAnswersAsPrompt / instruction-enhanced answers __instruction__ prepended
formatAnswersAsPrompt / instruction with whitespace Trimmed correctly
formatAnswersAsPrompt / only instruction, no answers Instruction + header only
formatAnswersAsPrompt / empty answers Header with no entries

Gaps: No test for processAgentOutput with "error" status. No test for output with no result line. Uses real provider registry (fragile).


completion-detection.test.ts — readSignalCompletion (private method)

Test Verifies
detects "questions" status signal.json parsed, returns content
detects "done" status Same
detects "error" status Same
returns null when no signal.json Missing file returns null
returns null for "running" status Non-terminal status ignored
handles malformed signal.json Bad JSON returns null

Concern: All 6 tests access private method via (outputHandler as any).readSignalCompletion. Real filesystem (good). Partially redundant with completion-race-condition.test.ts tests 2-3 which test the same scenarios through the public handleCompletion method.


completion-race-condition.test.ts — handleCompletion mutex

Test Verifies
prevent concurrent completion via mutex Only 1 of 2 concurrent calls executes
handle completion with questions status Final status waiting_for_input
handle completion when done Final status idle
clean up lock even on failure Mutex released after error

Concern: Module-level mock repo mutated per-test without beforeEach reset. Tests 2-3 overlap with completion-detection.test.ts but are strictly better (test public API).


mutex-completion.test.ts — handleCompletion mutex (focused)

Test Verifies
prevent concurrent completion Same agentId blocked
allow sequential completion Same agentId allowed after first finishes
clean up on exception Lock released on error
per-agent mutex isolation Different agentIds run concurrently

Excellent focused design. 4 complementary scenarios. Uses real timers (50ms delays). Partial overlap with completion-race-condition.test.ts test 1 — both test the mutex concurrent blocking.


file-io.test.ts — File I/O utilities

Test Verifies
generateId / non-empty string Returns truthy
generateId / unique values 100 unique IDs
writeInputFiles / initiative.md File created
writeInputFiles / phase.md File created
writeInputFiles / task.md File created
writeInputFiles / pages Files in pages/ subdir
writeInputFiles / empty options No crash
readSummary / with frontmatter Parsed correctly
readSummary / missing file Returns null
readSummary / no frontmatter Body only
readSummary / empty files_modified Empty array
readPhaseFiles / from phases/ dir Parsed with frontmatter
readPhaseFiles / missing dir Returns []
readPhaseFiles / no dependencies Defaults to []
readTaskFiles / from tasks/ dir Parsed with frontmatter
readTaskFiles / defaults category/type execute/auto defaults
readTaskFiles / missing dir Returns []
readDecisionFiles / from decisions/ dir Parsed
readDecisionFiles / missing dir Returns []
readPageFiles / from pages/ dir Parsed with pageId
readPageFiles / missing dir Returns []
readPageFiles / ignores non-.md Filter works

Real filesystem, no mocks. Good. Gaps: writeInputFiles only checks existsSync, not file content. No write→read round-trip test.


process-manager.test.ts — ProcessManager

Test Verifies
getAgentWorkdir Path computation
createProjectWorktrees Worktree creation for initiative projects
createProjectWorktrees / throws on failure Error propagation
createProjectWorktrees / logs details Completes without error (weak)
createStandaloneWorktree Path returned
createStandaloneWorktree / throws on failure Error propagation
spawnDetached / validates cwd exists existsSync called
spawnDetached / throws if cwd missing Error thrown
spawnDetached / passes correct cwd spawn args verified
spawnDetached / logs spawn info Completes (weak)
spawnDetached / writes prompt file PROMPT.md written
buildSpawnCommand / native prompt mode Command+args assembled
buildSpawnCommand / flag prompt mode Command+args assembled
buildResumeCommand / flag style Resume args assembled
buildResumeCommand / no resume support Error thrown

6 vi.mock() calls. Most brittle test file. "Logs comprehensive" tests are empty assertions. Missing: killProcess, cleanupWorktree.


mock-manager.test.ts — MockAgentManager (testing the test double)

39 tests covering: default scenario, configured delay, error/questions scenarios, resume, stop, list, get/getByName, scenario overrides, event emission order, name uniqueness, constructor options, agent modes (execute/discuss/plan/detail), structured questions.

Thorough. One duplicate: "should emit stopped event with detail_complete reason" appears twice with "(second test)" suffix.


Agent Lifecycle (src/agent/lifecycle/)

error-analyzer.test.ts — 20 tests. Auth failure (4), usage limit (4), timeout (2), missing signal (3), process crash (4), unknown (2), context preservation (2). Pure behavioral classification tests.

instructions.test.ts — 6 tests. Constant content verification + addInstructionToPrompt pure function tests. Appropriate for 28-line module.

retry-policy.test.ts — 12 tests. Config (2), shouldRetry per error type (8), backoff delays (2). Clean decision-function tests. Missing: AgentExhaustedError/AgentFailureError class tests.

signal-manager.test.ts — 18 tests. clearSignal (2), checkSignalExists (2), readSignal (7), waitForSignal (3), validateSignalFile (4). Real filesystem, no mocks. waitForSignal has real-timer flakiness risk.


DB Repositories (src/db/repositories/drizzle/)

All use real in-memory SQLite via createTestDatabase(). No mocks. Integration tests against real SQL.

agent.test.ts — 20 tests. Full CRUD + findByStatus, findByTaskId, findBySessionId. Gap: never tests initiativeId field.

cascade.test.ts — 4 tests. High value. Validates entire FK cascade graph: initiative→phases→tasks→pages→junction rows. Verifies SET NULL on agents/conversations. Before/after assertions prove setup correctness.

initiative.test.ts — 12 tests. Standard CRUD + findByStatus. Simplest repo, clean.

message.test.ts — 22 tests. Most comprehensive repo test. Covers sender/recipient polymorphism, threading (parentMessageId), status lifecycle (pending→read→responded), filtering (findPendingForUser, findRequiringResponse, findReplies).

phase.test.ts — 21 tests. CRUD + 12 dependency tests (createDependency, getDependencies, getDependents, findDependenciesByInitiativeId). Direct-only (not transitive) verified. Cross-initiative scoping verified.

task.test.ts — 12 tests. Thinnest relative to feature surface. Missing: findByParentTaskId, findByInitiativeId, findPendingApproval, category filtering, requiresApproval behavior.

Shared pattern: 8 tests use setTimeout(resolve, 10) for timestamp differentiation (~80ms total delay).


Dispatch (src/dispatch/)

manager.test.ts — 13 tests. Queue, getNextDispatchable (priority, queuedAt ordering), completeTask, blockTask, dispatchNext (no tasks, no agents, success), getQueueState, dependency scenario. Real DB + mock event bus + mock agent manager. Gap: "dependency scenario" test admits dependencies aren't implemented — just retests priority ordering.

phase-manager.test.ts — 17 tests. queuePhase (status validation, deps), getNextDispatchablePhase (dependency blocking), dispatchNextPhase, completePhase, blockPhase, diamond dependency scenario. Best dependency tests: full diamond pattern (A→B, A→C, B→D, C→D) with step-by-step verification. Gap: mock DispatchManager never asserted (task-level dispatch not verified).


Coordination (src/coordination/)

manager.test.ts — 14 tests. queueMerge, getNextMergeable (priority ordering), processMerges (success path, conflict detection, resolution task creation, message creation), getQueueState, handleConflict, error handling (undefined repos).

conflict-resolution-service.test.ts — 10 tests. handleConflict (resolution task properties, original task blocking, agent message, event emission, optional deps, error cases, multiple conflicts, parentTaskId preservation).

Overlap: 5+ tests in manager.test.ts duplicate conflict-handling logic from conflict-resolution-service.test.ts.


Git (src/git/)

manager.test.ts — 22 tests. create, remove, list, get, diff (clean/added/modified/deleted), merge (success/conflict/abort/cleanup), edge cases. Real git repos on filesystem. No mocks. Excellent. Most confident test file in the suite.

simple-git-branch-manager.test.ts — 4 tests. remoteBranchExists only. Real git with bare repo + clones. Gap: only one method tested; other methods untested.


Process (src/process/)

manager.test.ts — 18 tests. spawn, stop, stopAll, restart, isRunning (live/dead/stopped), event emission. Uses mocked execa. Good separation of event bus concern.

registry.test.ts — 14 tests. Pure unit tests, no mocks. register, get, getAll, updateStatus, unregister, getByPid, clear, size. Exemplary.


Logging (src/logging/)

manager.test.ts — 13 tests. getBaseDir, ensureLogDir, ensureProcessDir, getProcessDir, getLogPath, listLogs (filtering), cleanOldLogs (retainDays). Real filesystem.

writer.test.ts — 17 tests. open, writeStdout/writeStderr (string/Buffer, timestamps, multi-line), close (flush, idempotent), write-after-close, append mode, event emission, backwards compatibility. Real filesystem + real EventBus.


Events (src/events/)

bus.test.ts — 10 tests. emit/on, event isolation, once, off, multiple handlers, typed events, timestamp preservation, factory. Pure behavioral tests. Gap: no handler-throws test, no handler ordering test.


Server (src/server/)

index.test.ts — 16 tests. Lifecycle (start/stop/double-start/PID conflict/idempotent stop/isRunning), HTTP endpoints (health/status/404/405), PID file management (create/remove/stale cleanup), event emission. Real HTTP server + real fetch. Gap: no port-already-in-use test.

shutdown.test.ts — 6 tests. shutdown() call order (server.stop before processManager.stopAll), signal handler installation (SIGTERM/SIGINT/SIGHUP). Gap: doesn't verify signal→handler→shutdown chain. Doesn't test error resilience (what if server.stop() throws).


tRPC Router (src/trpc/)

router.test.ts — 14 tests. health procedure (shape, Zod, uptime calc, null startedAt, processCount), status procedure (shape, Zod, startedAt, pid, uptime), Zod schema validation.

Major gap: Only health and status are tested. All domain procedures untested: initiatives, tasks, phases, agents, pages, proposals, accounts, conversations, previews.


Preview (src/preview/)

compose-generator.test.ts — 10 tests. Pure function tests. YAML generation, build config handling, Caddy deps, Caddyfile routing (single/multi/internal/specificity), labels. No mocks. Strong.

config-reader.test.ts — 6 tests. parseCwPreviewConfig only. Minimal/multi-service, validation errors, internal services, build normalization. Gap: discoverConfig (filesystem discovery) is untested — always mocked in manager.test.ts.

manager.test.ts — 20 tests. start (full lifecycle, phaseId, Docker unavailable, project not found, compose-up failure, health-check failure, no healthcheck), stop (cleanup, missing labels), list (all/filtered/skip non-preview/skip incomplete), getStatus (running/failed/stopped/building/null), stopAll (all/error resilience/empty). 6 vi.mock() calls.

port-allocator.test.ts — 4 tests. Base port, skip used ports, gap filling, real TCP port contention. One real TCP server test. Gap: no upper-bound/exhaustion test.


E2E Tests (src/test/e2e/)

All use createTestHarness() with in-memory DB, MockAgentManager, MockWorktreeManager, real dispatch/coordination managers.

happy-path.test.ts — 6 tests. Single task flow, sequential deps, parallel dispatch, full merge flow, complex dependency flow (diamond), fixture dependency DB verification. Gap: merge flow requires manual agent+worktree bridging (harness not fully integrated).

architect-workflow.test.ts — 8 tests. Discuss mode (complete, pause/resume), plan mode (complete, persist phases), plan conflict detection (reject dup, auto-dismiss stale, different initiatives), full discuss→plan→phases workflow. Gap: architect output is never parsed — mock discards structured data.

decompose-workflow.test.ts — 10 tests. Detail mode (complete, pause/resume, multiple questions), detail conflict detection (3 tests), task persistence (create, all types, dependencies), full workflow. Gap: filename says "decompose" but describe says "Detail". Task dependency rows never verified in DB.

phase-dispatch.test.ts — 7 tests. Cleanest E2E file. Independent parallel dispatch, sequential deps, diamond dependency (A→B,C→D), approval gate (pending/in_progress rejection), blocked phase (no dispatch, downstream blocking). No fake timers, no agent involvement.

recovery-scenarios.test.ts — 9 tests. Queue state in DB, in-progress task recovery after crash, blocked state persistence, merge queue recovery, agent Q&A (question/resume, structured questions, status transitions, multiple questions).

edge-cases.test.ts — 14 tests. Agent crash (events, task status, error message), agent waiting/resume (3 tests), task blocking (4 tests), merge conflict handling (4 tests).

extended-scenarios.test.ts — 6 tests. Conflict hand-back round-trip (full lifecycle, context preservation, sequential conflicts), multi-agent parallel work (parallel completion, merge ordering, mixed outcomes).


Integration Tests (src/test/integration/)

agent-workdir-verification.test.ts — 2 tests. Skipped by default (REAL_WORKDIR_TESTS). Spawns real Claude agent, verifies working directory path via diagnostic files and agent-created files.

crash-race-condition.test.ts — 4 tests. Always runs. Tests signal.json-based completion detection vs crash marking. Tests 2 and 4 overlap (same setup, test 4 has weaker assertions). Accesses private methods via as any.

real-claude.test.ts — 3 tests. Skipped by default. Direct Claude CLI contract tests: done/questions/error status outputs validated against Zod schema. ~$0.10.

real-e2e-crash.test.ts — 1 test. Dead test: expect(true).toBe(true). Hardcoded absolute paths. Diagnostic script, not a regression test.


Real Provider Tests (src/test/integration/real-providers/)

All skipped by default. ~$0.50 total per full run.

claude-manager.test.ts — 7 tests. Output parsing (text_delta, init/session, result), questions flow, session resume (2 tests), error handling. Resume test accepts crashed as valid (too lenient).

codex-manager.test.ts — 3 tests. Spawn, output parsing, provider config. All accept crashed as valid — pass vacuously when Codex isn't installed.

conversation.test.ts — 1 test. Most ambitious: two real Claude agents doing actual coding while communicating via cw ask/listen/answer. Verifies 2 conversations, file creation, content, and interleaved coding+conversation. ~$0.30, 5min timeout.

crash-recovery.test.ts — 4 tests. Server restart simulation (resume streaming, mark dead agent, process output for dead agent), event idempotency (no duplicate events on restart).

schema-retry.test.ts — 9 tests. Valid output (done, questions, multiple questions), retry logic (bad→good, code blocks, surrounding text), mode-specific schemas (discuss, plan, detail). Tests 5-6 have conditional assertions (silently pass on crash).


Redundancy Map

Tests covering the same scenario multiple times. Bold = the better version to keep.

Scenario Tests
Signal.json completion detection (questions→waiting) completion-detection.test.ts #1, completion-race-condition.test.ts #2
Signal.json completion detection (done→idle) completion-detection.test.ts #2, completion-race-condition.test.ts #3
Mutex concurrent blocking completion-race-condition.test.ts #1, mutex-completion.test.ts #1 (more focused)
Mutex cleanup on error completion-race-condition.test.ts #4, mutex-completion.test.ts #3 (more focused)
Agent Q&A resume flow edge-cases.test.ts #5, recovery-scenarios.test.ts #5 (has options)
Agent status transitions (running→waiting→idle) edge-cases.test.ts #6, recovery-scenarios.test.ts #8 (more thorough)
Task blocking (DB status) edge-cases.test.ts #10, recovery-scenarios.test.ts #3
Merge conflict → resolution task edge-cases.test.ts #13, extended-scenarios.test.ts #1 (full round-trip)
Conflict resolution task creation coordination/manager.test.ts (5 tests), conflict-resolution-service.test.ts (10 tests)
MockAgentManager detail_complete reason mock-manager.test.ts has duplicate test "(second test)"
Crash marking with no signal crash-race-condition.test.ts #2, crash-race-condition.test.ts #4 (weaker)
real-e2e-crash.test.ts #1 Dead: expect(true).toBe(true) — contributes nothing

Coverage Gaps (Priority Order)

Critical

  1. tRPC domain procedures entirely untested — initiatives, tasks, phases, agents, pages, proposals, accounts, conversations, previews. Only health/status have tests. This is the single largest gap.

  2. Agent output parsing never tested E2E — Architect helpers (setArchitectPlanComplete, setArchitectDetailComplete) discard structured output. The pipeline from agent output → DB persistence is exercised nowhere.

  3. task.test.ts thin relative to feature surface — Missing: findByParentTaskId, findByInitiativeId, findPendingApproval, approveTask, category filtering, requiresApproval field behavior. These features exist in production but have zero direct unit test coverage.

Important

  1. discoverConfig (preview config discovery) untested — The filesystem walk logic (.cw-preview.yml → docker-compose.yml → Dockerfile) is always mocked away.

  2. docker-client.js and health-checker.js have no unit tests — Always mocked. No verification of the Docker shell-out layer.

  3. Signal handler → shutdown chain untestedshutdown.test.ts verifies listeners are installed and shutdown() works when called directly, but never tests signal → handler → shutdown.

  4. No concurrent dispatch/merge tests — For a multi-agent system, racing dispatchNext() or concurrent merges are never exercised.

  5. processAgentOutput missing error status test — "done" and "questions" tested, "error" not tested.

Minor

  1. EventBus handler-throws behavior untested — Does emit swallow or propagate handler errors?

  2. Port allocator upper-bound/exhaustion untested — What happens when all ports 9100-9200 are taken?

  3. error-analyzerretry-policy integration untested — Each tested independently but the chain (analyzeError output fed to shouldRetry) is never verified end-to-end.

  4. Git BranchManager incomplete — Only remoteBranchExists tested. Other methods (if any) untested.

  5. ProcessManager.killProcess and cleanupWorktree untested — No unit test for these methods.


Fragility Assessment

High Fragility (refactoring will break tests)

  • agent/process-manager.test.ts — 6 vi.mock() calls. Any import restructuring breaks tests.
  • agent/manager.test.ts — 5 vi.mock() calls. Same concern.
  • preview/manager.test.ts — 6 vi.mock() calls. Same concern.
  • completion-detection.test.ts — Tests private method readSignalCompletion via as any.
  • completion-race-condition.test.ts — Monkey-patches private methods. Module-level mock mutated without reset.
  • crash-race-condition.test.ts — Accesses private handleCompletion via as any.

Low Fragility (behavioral tests)

  • DB repository tests — Real in-memory SQLite, test public API only.
  • git/manager.test.ts — Real git repos, no mocks.
  • file-io.test.ts — Real filesystem, pure function tests.
  • process/registry.test.ts — Pure unit tests, no mocks.
  • preview/compose-generator.test.ts — Pure function tests.
  • events/bus.test.ts — Pure behavioral tests.
  • E2E tests — Test through public harness API.

Mock Style Inconsistencies

The test suite uses 4 different mocking patterns with no shared convention:

Pattern Files Concern
vi.mock() module-level manager.test.ts, process-manager.test.ts, preview/manager.test.ts Import-structure coupled
Factory function (createMockX()) output-handler.test.ts, coordination/.test.ts, server/.test.ts Cleaner but copy-pasted
Inline const object completion-race-condition.test.ts, mutex-completion.test.ts Simple but mutated per-test
Hand-rolled class harness.ts (MockWorktreeManager, CapturingEventBus) Best for shared infrastructure

createMockEventBus() is independently defined in 5+ files. Should be extracted to shared test utility.


Dead / Worthless Tests

Test File Reason
expect(true).toBe(true) real-e2e-crash.test.ts Can never fail. Hardcoded paths.
"logs comprehensive spawn information" process-manager.test.ts Asserts nothing meaningful
"logs comprehensive worktree creation details" process-manager.test.ts Asserts nothing meaningful
"should demonstrate the race condition exists" crash-race-condition.test.ts #4 Only asserts updateCalls.length > 0
"should emit stopped event (second test)" mock-manager.test.ts Exact duplicate of previous test