tas/Codewalkers

Lukas May e9ec5143fd docs: Document cassette testing system in docs/testing.md and CLAUDE.md

2026-03-02 12:22:46 +09:00

Testing

src/test/ — Test infrastructure, fixtures, and test suites.

Framework

vitest (Vite-native test runner)

Test Categories

Unit Tests

Located alongside source files (*.test.ts):

src/agent/*.test.ts — Manager, output handler, completion detection, file I/O, process manager
src/db/repositories/drizzle/*.test.ts — Repository adapters
src/dispatch/*.test.ts — Dispatch manager
src/git/manager.test.ts — Worktree operations
src/process/*.test.ts — Process registry and manager
src/logging/*.test.ts — Log manager and writer

E2E Tests (Mocked Agents)

src/test/e2e/:

File	Scenarios
`happy-path.test.ts`	Single task, parallel, complex flows
`architect-workflow.test.ts`	Discussion + plan agent workflows
`detail-workflow.test.ts`	Task detail with child tasks
`phase-dispatch.test.ts`	Phase-level dispatch with dependencies
`recovery-scenarios.test.ts`	Crash recovery, agent resume
`edge-cases.test.ts`	Boundary conditions
`extended-scenarios.test.ts`	Advanced multi-phase workflows

These use MockAgentManager, which bypasses the real subprocess pipeline, so they test dispatch/coordination logic only.

Cassette Tests (Pipeline Integration, Zero API Cost)

src/test/cassette/ — Tests the full agent execution pipeline using pre-recorded cassettes.

Unlike E2E tests, cassette tests exercise the real ProcessManager → FileTailer → OutputHandler → SignalManager path. Unlike real provider tests, they cost nothing to run in CI.

See Cassette System below for full documentation.

Integration Tests (Real Providers)

src/test/integration/real-providers/ contains tests that are skipped by default because they cost real money:

File	Provider	Cost
`claude-manager.test.ts`	Claude CLI	~$0.10
`codex-manager.test.ts`	Codex	varies
`schema-retry.test.ts`	Claude CLI	~$0.10
`crash-recovery.test.ts`	Claude CLI	~$0.10

Enable with env vars: REAL_CLAUDE_TESTS=1, REAL_CODEX_TESTS=1

Test Infrastructure

TestHarness (`src/test/harness.ts`)

Central test utility providing:

In-memory SQLite database with schema applied
All 10 repository instances
MockAgentManager — simulates agent behavior (done, questions, error)
MockWorktreeManager — in-memory worktree simulator
CapturingEventBus — captures events for assertions
DefaultDispatchManager and DefaultPhaseDispatchManager
25+ helper methods for test scenarios
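
The capturing event bus idea can be sketched as below. This is a minimal illustration rather than the harness's actual implementation: the `DomainEventSketch` shape, the class name, and the `emit`/`clear` methods are assumptions; only `getEventsByType` is named elsewhere in this document.

```typescript
// Hypothetical event shape; the real event types live in the project source.
interface DomainEventSketch {
  type: string;
  payload?: unknown;
}

// Records every emitted event so a test can assert on them afterwards.
class CapturingEventBusSketch {
  private readonly events: DomainEventSketch[] = [];

  emit(event: DomainEventSketch): void {
    this.events.push(event);
  }

  // Returns the captured events of one type, in emission order.
  getEventsByType(type: string): DomainEventSketch[] {
    return this.events.filter((e) => e.type === type);
  }

  // Forget everything captured so far (useful between test cases).
  clear(): void {
    this.events.length = 0;
  }
}
```

Because events are stored rather than dispatched, a test can run a whole scenario first and inspect the sequence afterwards.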

Fixtures (`src/test/fixtures.ts`)

Pre-built task hierarchies for testing:

Fixture	Structure
`SIMPLE_FIXTURE`	1 initiative → 1 phase → 1 group → 3 tasks (A→B, A→C deps)
`PARALLEL_FIXTURE`	1 initiative → 1 phase → 2 groups → 4 independent tasks
`COMPLEX_FIXTURE`	1 initiative → 2 phases → 4 groups → cross-phase dependencies
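
To make the fixture shapes concrete, here is a hedged sketch of what a structure like SIMPLE_FIXTURE could look like. All type names, field names, and the `readyTasks` helper are hypothetical, and the sketch assumes that "A→B" means task B depends on task A; the real definitions are in src/test/fixtures.ts.

```typescript
// Hypothetical shapes for illustration only.
interface TaskSpec { id: string; dependsOn: string[] }
interface GroupSpec { name: string; tasks: TaskSpec[] }
interface PhaseSpec { name: string; groups: GroupSpec[] }
interface InitiativeSpec { name: string; phases: PhaseSpec[] }

// 1 initiative -> 1 phase -> 1 group -> 3 tasks.
// Assumption: "A→B, A→C" means B and C each depend on A.
const simpleFixtureSketch: InitiativeSpec = {
  name: 'Simple initiative',
  phases: [
    {
      name: 'Phase 1',
      groups: [
        {
          name: 'Group 1',
          tasks: [
            { id: 'A', dependsOn: [] },
            { id: 'B', dependsOn: ['A'] },
            { id: 'C', dependsOn: ['A'] },
          ],
        },
      ],
    },
  ],
};

// Tasks that are not done yet and whose dependencies are all done.
function readyTasks(initiative: InitiativeSpec, done: Set<string>): string[] {
  return initiative.phases
    .flatMap((phase) => phase.groups)
    .flatMap((group) => group.tasks)
    .filter((t) => !done.has(t.id) && t.dependsOn.every((d) => done.has(d)))
    .map((t) => t.id);
}
```

Under that assumption only A is dispatchable at the start, and B and C become dispatchable together once A completes.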

Real Provider Harness (`src/test/integration/real-providers/harness.ts`)

Creates real database, real agent manager with real CLI tools
Provides describeRealClaude() / describeRealCodex(), which skip their suites when the corresponding env var is not set
MINIMAL_PROMPTS — cheap prompts for testing output parsing
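
Env-gated suites of this kind can be built with a small helper. The sketch below is an assumption about how such gating might work, not the harness's code: `gatedDescribe` and `DescribeLike` are hypothetical names, and the check assumes the value `1` is what enables a suite. `DescribeLike` is a structural stand-in for the part of vitest's `describe` that is used (`describe` itself and `describe.skip`).

```typescript
// Structural stand-in for vitest's `describe` (callable, with a `.skip` variant).
interface DescribeLike {
  (name: string, body: () => void): void;
  skip(name: string, body: () => void): void;
}

type Env = Record<string, string | undefined>;

// Returns the real `describe` when the env var is "1", otherwise `describe.skip`.
function gatedDescribe(
  describe: DescribeLike,
  envVar: string,
  env: Env,
): (name: string, body: () => void) => void {
  return env[envVar] === '1' ? describe : describe.skip;
}
```

With vitest this could be used as `const describeRealClaude = gatedDescribe(describe, 'REAL_CLAUDE_TESTS', process.env);`, so the suite shows up as skipped instead of silently disappearing.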

Test Inventory

See test-inventory.md for a complete catalog of every test, what it verifies, coverage gaps, redundancy map, and fragility assessment.

Running Tests

# Unit + E2E tests (no API cost)
npm test

# Specific test file
npm test -- src/agent/manager.test.ts

# Cassette tests — replay pre-recorded cassettes (no API cost)
npm test -- src/test/cassette/

# Record new cassettes locally (requires real Claude CLI)
CW_CASSETTE_RECORD=1 npm test -- src/test/integration/real-providers/claude-manager.test.ts

# Real provider tests (costs money!)
REAL_CLAUDE_TESTS=1 npm test -- src/test/integration/real-providers/ --test-timeout=300000

Cassette System

src/test/cassette/ — VCR-style recording and replay for the agent subprocess pipeline.

Why it exists

The MockAgentManager used in E2E tests skips from "spawn called" directly to "agent:stopped emitted". It never exercises ProcessManager, FileTailer, OutputHandler, or SignalManager. Bugs in those layers (signal.json race conditions, JSONL parsing failures, crash detection) are invisible to E2E tests.

Real provider tests do exercise those layers, but they are slow, expensive, and can't run in CI without credentials.

Cassette tests bridge this gap: they run the real MultiProviderAgentManager pipeline but replace the live Claude/Codex subprocess with a replay worker that writes pre-recorded output.

Coverage the cassette layer adds

FileTailer — fs.watch + poll cycle, incremental JSONL reading
OutputHandler — stream event parsing, signal detection, result capture
SignalManager — signal.json read/write/timing
LifecycleController — retry logic, missing signal recovery
ProcessManager — subprocess PID tracking, poll-for-completion
Prompt normalization drift detection — key mismatch = re-record = visible diff
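
As an illustration of the kind of logic the incremental JSONL path has to get right, the sketch below buffers partial lines between reads and only parses complete ones. It is not the project's FileTailer or OutputHandler; `JsonlLineBuffer` is a hypothetical name, and silently skipping malformed lines is a simplification made for the example.

```typescript
// Buffers partial lines between reads and returns only complete, parseable records.
class JsonlLineBuffer {
  private pending = '';

  // Feed one chunk as read incrementally from the output file.
  push(chunk: string): unknown[] {
    this.pending += chunk;
    const lines = this.pending.split('\n');
    // The last element is '' if the chunk ended on a newline, else an incomplete line.
    this.pending = lines.pop() ?? '';

    const records: unknown[] = [];
    for (const line of lines) {
      const trimmed = line.trim();
      if (trimmed === '') continue;
      try {
        records.push(JSON.parse(trimmed));
      } catch {
        // Malformed line: skipped in this sketch; real code would likely report it.
      }
    }
    return records;
  }
}
```

A record split across two reads is a typical case that a mock skipping the file layer never produces, while a replayed cassette can.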

Key generation

Each cassette is identified by a SHA256 hash of four components:

Component	What it captures
`normalizedPrompt`	Prompt with UUIDs, temp paths, timestamps, session numbers replaced with placeholders
`providerName`	e.g. `claude`, `codex`
`modelArgs`	Provider CLI args with the prompt value stripped (sorted for stability)
`worktreeHash`	SHA256 of all non-hidden files in the agent worktree at spawn time

The worktreeHash is what detects content drift for execute-mode agents: if the worktree changes, the key no longer matches, so replay fails until the cassette is re-recorded.
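
The key derivation described above could look roughly like the following. This is a sketch under several assumptions, not the contents of key.ts: the function names are hypothetical, files are passed as an in-memory path-to-content map instead of being read from disk, a file counts as hidden if any path segment starts with a dot, an empty worktree hashes to the literal `empty` (as in the example cassette in this document), and the digest is truncated to 32 hex characters to match the file naming.

```typescript
import { createHash } from 'node:crypto';

interface CassetteKeySketch {
  normalizedPrompt: string;
  providerName: string;
  modelArgs: string[];
  worktreeHash: string;
}

// Hash non-hidden files in sorted order so the result does not depend on listing order.
function hashWorktreeSketch(files: Map<string, string>): string {
  const visible = Array.from(files.keys())
    .filter((path) => !path.split('/').some((segment) => segment.startsWith('.')))
    .sort();
  if (visible.length === 0) return 'empty'; // assumption: marker for an empty worktree

  const hash = createHash('sha256');
  for (const path of visible) {
    // NUL separators keep "ab" + "c" distinct from "a" + "bc".
    hash.update(path).update('\0').update(files.get(path) ?? '').update('\0');
  }
  return hash.digest('hex');
}

// Combine the four components into one stable identifier.
function cassetteFileKeySketch(key: CassetteKeySketch): string {
  const canonical = JSON.stringify([
    key.normalizedPrompt,
    key.providerName,
    [...key.modelArgs].sort(), // sorted for stability, as described above
    key.worktreeHash,
  ]);
  // Assumption: the 32-character file name is a truncated SHA256 hex digest.
  return createHash('sha256').update(canonical, 'utf8').digest('hex').slice(0, 32);
}
```

Serializing the components as a JSON array before hashing avoids ambiguity between, say, a prompt ending in `claude` and a provider named `claude`.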

Normalization (src/test/cassette/normalizer.ts) strips dynamic content that varies between runs but doesn't affect agent behavior:

UUIDs → __UUID__
Workspace root path → __WORKSPACE__
ISO 8601 timestamps → __TIMESTAMP__
Unix epoch milliseconds → __EPOCH__
Session numbers → session__N__

If a prompt template changes (e.g. someone edits buildExecutePrompt()), the normalized hash changes → cassette miss → test fails in CI → developer must re-record → the diff shows the new agent response in the PR. This makes prompt drift auditable.
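
A normalizer along these lines can be sketched with a few regular expressions. The patterns below are assumptions chosen for illustration and are not taken from normalizer.ts: in particular, the sketch treats any 13-digit number as epoch milliseconds and matches session numbers written as `session`, an optional `-`, `_` or space, then digits. `normalizePromptSketch` is a hypothetical name.

```typescript
// Replace run-specific values with placeholders so equal prompts hash equally.
function normalizePromptSketch(prompt: string, workspaceRoot: string): string {
  // Replace the workspace root first: the path itself may contain UUIDs or digits.
  const withoutRoot =
    workspaceRoot.length > 0 ? prompt.split(workspaceRoot).join('__WORKSPACE__') : prompt;

  return withoutRoot
    // UUIDs in the 8-4-4-4-12 hex form.
    .replace(/[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}/gi, '__UUID__')
    // ISO 8601 timestamps with optional fractional seconds and a Z or numeric offset.
    .replace(
      /\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(?:\.\d+)?(?:Z|[+-]\d{2}:\d{2})/g,
      '__TIMESTAMP__',
    )
    // Assumption: a standalone 13-digit number is Unix epoch milliseconds.
    .replace(/\b\d{13}\b/g, '__EPOCH__')
    // Assumption about how session numbers appear in prompts.
    .replace(/\bsession[-_ ]?\d+\b/gi, 'session__N__');
}
```

Two prompts that differ only in these dynamic values normalize to the same string, so they map to the same cassette; an edit to the template text survives normalization and therefore changes the key.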

Cassette file format

Cassettes live in src/test/cassettes/<32-char-hash>.json and are committed to git.

{
  "version": 1,
  "key": {
    "normalizedPrompt": "You are a Worker agent...",
    "providerName": "claude",
    "modelArgs": ["--dangerously-skip-permissions", "--verbose", "--output-format", "stream-json"],
    "worktreeHash": "empty"
  },
  "recording": {
    "jsonlLines": [
      "{\"type\":\"system\",\"session_id\":\"abc\"}",
      "{\"type\":\"result\",\"subtype\":\"success\",\"result\":\"ok\"}"
    ],
    "signalJson": { "status": "done", "message": "Task complete" },
    "exitCode": 0,
    "recordedAt": "2026-03-02T12:00:00.000Z"
  }
}
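
The JSON layout above maps naturally onto a few TypeScript interfaces. The interface names match those listed for types.ts in this document, but the field types and the `parseCassetteSketch` validator are assumptions inferred from the example, not the project's actual definitions.

```typescript
interface CassetteKey {
  normalizedPrompt: string;
  providerName: string;
  modelArgs: string[];
  worktreeHash: string;
}

interface CassetteRecording {
  jsonlLines: string[];
  // Assumption: null when the agent wrote no signal.json.
  signalJson: Record<string, unknown> | null;
  exitCode: number;
  recordedAt: string; // ISO 8601 timestamp
}

interface CassetteEntry {
  version: number;
  key: CassetteKey;
  recording: CassetteRecording;
}

function isStringArray(value: unknown): value is string[] {
  return Array.isArray(value) && value.every((item) => typeof item === 'string');
}

// Parse and minimally validate a cassette file; throws on an unexpected shape.
function parseCassetteSketch(json: string): CassetteEntry {
  const value: unknown = JSON.parse(json);
  if (typeof value !== 'object' || value === null) {
    throw new Error('cassette must be a JSON object');
  }
  const entry = value as Partial<CassetteEntry>;
  if (entry.version !== 1) {
    throw new Error(`unsupported cassette version: ${String(entry.version)}`);
  }
  const { key, recording } = entry;
  if (
    !key ||
    typeof key.normalizedPrompt !== 'string' ||
    typeof key.providerName !== 'string' ||
    !isStringArray(key.modelArgs) ||
    typeof key.worktreeHash !== 'string'
  ) {
    throw new Error('cassette key is malformed');
  }
  if (
    !recording ||
    !isStringArray(recording.jsonlLines) ||
    typeof recording.exitCode !== 'number' ||
    typeof recording.recordedAt !== 'string'
  ) {
    throw new Error('cassette recording is malformed');
  }
  return { version: entry.version, key, recording };
}
```

Note that each element of `jsonlLines` is itself a JSON document stored as a string, which preserves the recorded output byte for byte.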

How replay works

CassetteProcessManager (extends ProcessManager) overrides two methods:

spawnDetached() — on a cache hit, spawns replay-worker.mjs instead of the real CLI. The worker writes the recorded JSONL lines to stdout (which spawnDetached redirects to the output file via fd) and writes signal.json relative to its cwd. Everything above — FileTailer, OutputHandler, poll loop — runs unmodified.
pollForCompletion() — on a cache miss (record mode), wraps the onComplete callback to read the output file and signal.json after the process exits, then saves the cassette before handing off to OutputHandler.
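
The replay side of this can be sketched as a small function. It is an illustration of the idea, not the contents of replay-worker.mjs: `replayRecordingSketch` is a hypothetical name, the stdout and file writers are injected so the sketch does not touch the real process or disk, and the signal path is a parameter because the exact location relative to the working directory is not specified here.

```typescript
interface RecordingSketch {
  jsonlLines: string[];
  signalJson: unknown;
  exitCode: number;
}

// Emit the recorded output the way the real CLI would have, then report the exit code.
function replayRecordingSketch(
  recording: RecordingSketch,
  writeStdout: (chunk: string) => void,
  writeFile: (path: string, contents: string) => void,
  signalPath: string, // assumption: caller supplies the path relative to cwd
): number {
  // One JSON document per line, newline-terminated, exactly as a JSONL stream.
  for (const line of recording.jsonlLines) {
    writeStdout(line + '\n');
  }
  // Only write a signal file if one was captured during recording.
  if (recording.signalJson !== null && recording.signalJson !== undefined) {
    writeFile(signalPath, JSON.stringify(recording.signalJson, null, 2));
  }
  return recording.exitCode;
}
```

In a standalone worker, `writeStdout` could wrap `process.stdout.write`, `writeFile` could be `fs.writeFileSync`, and the returned code could be passed to `process.exit`. Since the parent already redirects the child's stdout to the output file, the worker needs no knowledge of that file.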

MultiProviderAgentManager accepts an optional processManagerOverride constructor parameter so CassetteProcessManager can be injected without changing production callers.

Mode control

Env var	Mode	Behaviour
(none)	`replay`	Cassette must exist; throws if missing. Safe for CI.
`CW_CASSETTE_RECORD=1`	`auto`	Replays if cassette exists, runs real agent and records if missing.
`CW_CASSETTE_FORCE_RECORD=1`	`record`	Always runs real agent; overwrites existing cassette. Use when prompt changed intentionally.
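
The table can be read as a two-step decision: pick a mode from the environment, then decide what to do for a given cassette lookup. The sketch below encodes that reading. The function and type names are hypothetical, and two details are assumptions: only the value `1` enables a mode, and CW_CASSETTE_FORCE_RECORD takes precedence when both variables are set.

```typescript
type CassetteMode = 'replay' | 'auto' | 'record';
type SpawnAction = 'replay-cassette' | 'run-real-and-record';
type Env = Record<string, string | undefined>;

// Map environment variables to a mode, following the table above.
function resolveCassetteMode(env: Env): CassetteMode {
  if (env.CW_CASSETTE_FORCE_RECORD === '1') return 'record'; // assumption: force wins
  if (env.CW_CASSETTE_RECORD === '1') return 'auto';
  return 'replay';
}

// Decide what a spawn should do, given the mode and whether a cassette was found.
function decideSpawnAction(mode: CassetteMode, cassetteExists: boolean): SpawnAction {
  if (mode === 'record') return 'run-real-and-record'; // always overwrite
  if (cassetteExists) return 'replay-cassette';
  if (mode === 'auto') return 'run-real-and-record';
  // replay mode with no cassette: fail loudly so CI never calls a real provider
  throw new Error('cassette missing in replay mode; re-record with CW_CASSETTE_RECORD=1');
}
```

Keeping the default as `replay` means a forgotten env var can only cause a test failure, never an unexpected paid API call.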

Writing cassette tests

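// Assumes vitest globals are not enabled; drop this import if they are.
import { afterAll, beforeAll, describe, expect, it } from 'vitest';
import { createCassetteHarness } from '../cassette/index.js';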
import { MINIMAL_PROMPTS } from '../integration/real-providers/prompts.js';
import type { RealProviderHarness } from '../integration/real-providers/harness.js';

describe('agent pipeline (cassette)', () => {
  let harness: RealProviderHarness;

  beforeAll(async () => {
    harness = await createCassetteHarness({ provider: 'claude' });
  });

  afterAll(() => harness.cleanup());

  it('completes a task and emits agent:stopped', async () => {
    const agent = await harness.agentManager.spawn({
      taskId: null,
      prompt: MINIMAL_PROMPTS.done,
      mode: 'execute',
      provider: 'claude',
    });

    const result = await harness.waitForAgentCompletion(agent.id);
    expect(result?.success).toBe(true);

    const stopped = harness.getEventsByType('agent:stopped');
    expect(stopped).toHaveLength(1);
  });
});

createCassetteHarness() returns a RealProviderHarness, so tests written for real providers work unchanged.

Cassette directory

src/test/cassettes/
  <hash>.json     ← committed to git; one file per recorded scenario
  .gitkeep

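Cassettes are committed so CI can run without any AI API credentials. When a cassette needs updating, re-record locally and commit the result: CW_CASSETTE_RECORD=1 records any cassette whose key is missing (for example after a prompt change), while CW_CASSETTE_FORCE_RECORD=1 is needed to overwrite a cassette whose key is unchanged (for example after a provider output format change).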

Files

File	Purpose
`types.ts`	`CassetteKey`, `CassetteRecording`, `CassetteEntry` interfaces
`normalizer.ts`	`normalizePrompt()`, `stripPromptFromArgs()`
`key.ts`	`hashWorktreeFiles()`, `buildCassetteKey()`
`store.ts`	`CassetteStore` — find/save cassette JSON files
`replay-worker.mjs`	Subprocess that replays a cassette (plain JS ESM, no build step)
`process-manager.ts`	`CassetteProcessManager` — overrides `spawnDetached` and `pollForCompletion`
`harness.ts`	`createCassetteHarness()` — factory returning `RealProviderHarness`
`index.ts`	Barrel exports
`cassette.test.ts`	Unit tests for normalizer, key generation, and store
