# Testing

`apps/server/test/` — Test infrastructure, fixtures, and test suites.

## Framework

**vitest** (Vite-native test runner)

## Test Categories

### Unit Tests
Located alongside source files (`*.test.ts`):
- `apps/server/agent/*.test.ts` — Manager, output handler, completion detection, file I/O, process manager
- `apps/server/db/repositories/drizzle/*.test.ts` — Repository adapters
- `apps/server/dispatch/*.test.ts` — Dispatch manager
- `apps/server/git/manager.test.ts` — Worktree operations
- `apps/server/process/*.test.ts` — Process registry and manager
- `apps/server/logging/*.test.ts` — Log manager and writer

### E2E Tests (Mocked Agents)
`apps/server/test/e2e/`:
| File | Scenarios |
|------|-----------|
| `happy-path.test.ts` | Single task, parallel, complex flows |
| `architect-workflow.test.ts` | Discussion + plan agent workflows |
| `detail-workflow.test.ts` | Task detail with child tasks |
| `phase-dispatch.test.ts` | Phase-level dispatch with dependencies |
| `recovery-scenarios.test.ts` | Crash recovery, agent resume |
| `edge-cases.test.ts` | Boundary conditions |
| `extended-scenarios.test.ts` | Advanced multi-phase workflows |

These tests use `MockAgentManager`, which bypasses the real subprocess pipeline, so they cover dispatch and coordination logic only.

### Cassette Tests (Pipeline Integration, Zero API Cost)
`apps/server/test/cassette/` — Tests the full agent execution pipeline using pre-recorded cassettes.

Unlike E2E tests, cassette tests exercise the real `ProcessManager → FileTailer → OutputHandler → SignalManager` path. Unlike real provider tests, they cost nothing to run in CI.

See **[Cassette System](#cassette-system)** below for full documentation.

### Integration Tests (Real Providers)
`apps/server/test/integration/real-providers/` — **skipped by default** (cost real money):
| File | Provider | Cost |
|------|----------|------|
| `claude-manager.test.ts` | Claude CLI | ~$0.10 |
| `codex-manager.test.ts` | Codex | varies |
| `schema-retry.test.ts` | Claude CLI | ~$0.10 |
| `crash-recovery.test.ts` | Claude CLI | ~$0.10 |

Enable with env vars: `REAL_CLAUDE_TESTS=1`, `REAL_CODEX_TESTS=1`

## Test Infrastructure

### TestHarness (`apps/server/test/harness.ts`)
Central test utility providing:
- In-memory SQLite database with schema applied
- All 10 repository instances
- `MockAgentManager` — simulates agent behavior (done, questions, error)
- `MockWorktreeManager` — in-memory worktree simulator
- `CapturingEventBus` — captures events for assertions
- `DefaultDispatchManager` and `DefaultPhaseDispatchManager`
- 25+ helper methods for test scenarios
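
A capturing bus of the kind listed above can be sketched as follows. This is a minimal illustration, not the project's `CapturingEventBus`: the `DomainEvent` shape and the `on`, `emit`, and `clear` methods are assumptions. Only the `getEventsByType` name appears elsewhere in this document, where it is called on the harness.

```typescript
// Hypothetical event shape; the real event types live in the server code.
interface DomainEvent {
  type: string;
  payload?: unknown;
}

type Listener = (event: DomainEvent) => void;

class CapturingEventBusSketch {
  private readonly captured: DomainEvent[] = [];
  private readonly listeners = new Map<string, Listener[]>();

  // Register a listener for one event type.
  on(type: string, listener: Listener): void {
    const existing = this.listeners.get(type) ?? [];
    this.listeners.set(type, [...existing, listener]);
  }

  // Record the event first, then deliver it to listeners of that type.
  emit(event: DomainEvent): void {
    this.captured.push(event);
    for (const listener of this.listeners.get(event.type) ?? []) {
      listener(event);
    }
  }

  // Return every captured event of the given type, in emission order.
  getEventsByType(type: string): DomainEvent[] {
    return this.captured.filter((e) => e.type === type);
  }

  // Forget captured events, e.g. between test cases.
  clear(): void {
    this.captured.length = 0;
  }
}
```

Recording before delivery means an event still shows up in assertions even if a listener throws.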

### Fixtures (`apps/server/test/fixtures.ts`)
Pre-built task hierarchies for testing:
| Fixture | Structure |
|---------|-----------|
| `SIMPLE_FIXTURE` | 1 initiative → 1 phase → 1 group → 3 tasks (A→B, A→C deps) |
| `PARALLEL_FIXTURE` | 1 initiative → 1 phase → 2 groups → 4 independent tasks |
| `COMPLEX_FIXTURE` | 1 initiative → 2 phases → 4 groups → cross-phase dependencies |

### Real Provider Harness (`apps/server/test/integration/real-providers/harness.ts`)
- Creates real database, real agent manager with real CLI tools
- Provides `describeRealClaude()` / `describeRealCodex()`, which skip the suite when the corresponding env var is not set
- `MINIMAL_PROMPTS` — cheap prompts for testing output parsing
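
The env-var gating behind `describeRealClaude()` and `describeRealCodex()` might look roughly like the sketch below. The helper names `isFlagEnabled` and `gatedDescribe` are hypothetical, and treating only the value `1` as enabled is an assumption based on the `REAL_CLAUDE_TESTS=1` usage shown in this document.

```typescript
// A suite-registering function, e.g. vitest's `describe` or `describe.skip`.
type SuiteFn = (name: string, body: () => void) => void;

// Assumption: a flag counts as enabled only when it is exactly "1".
function isFlagEnabled(env: Record<string, string | undefined>, flag: string): boolean {
  return env[flag] === '1';
}

// Pick the running or the skipping variant based on an env flag.
function gatedDescribe(
  flag: string,
  env: Record<string, string | undefined>,
  run: SuiteFn,
  skip: SuiteFn,
): SuiteFn {
  return isFlagEnabled(env, flag) ? run : skip;
}

// With vitest this could be wired up as:
//   const describeRealClaude =
//     gatedDescribe('REAL_CLAUDE_TESTS', process.env, describe, describe.skip);
```

Passing `run` and `skip` in as parameters keeps the gating logic testable without importing the test runner.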

## Test Inventory

See **[test-inventory.md](test-inventory.md)** for a complete catalog of every test, what it verifies, coverage gaps, redundancy map, and fragility assessment.

## Running Tests

```sh
# Unit + E2E tests (no API cost)
npm test

# Specific test file
npm test -- apps/server/agent/manager.test.ts

# Cassette tests — replay pre-recorded cassettes (no API cost)
npm test -- apps/server/test/cassette/

# Record new cassettes locally (requires real Claude CLI)
CW_CASSETTE_RECORD=1 npm test -- apps/server/test/integration/real-providers/claude-manager.test.ts

# Real provider tests (costs money!)
REAL_CLAUDE_TESTS=1 npm test -- apps/server/test/integration/real-providers/ --test-timeout=300000
```

---

## Cassette System

`apps/server/test/cassette/` — VCR-style recording and replay for the agent subprocess pipeline.

### Why it exists

The `MockAgentManager` used in E2E tests skips from "spawn called" directly to "agent:stopped emitted". It never exercises `ProcessManager`, `FileTailer`, `OutputHandler`, or `SignalManager`. Bugs in those layers (signal.json race conditions, JSONL parsing failures, crash detection) are invisible to E2E tests.

Real provider tests do exercise those layers, but they are slow, expensive, and can't run in CI without credentials.

Cassette tests bridge this gap: they run the **real** `MultiProviderAgentManager` pipeline but replace the live Claude/Codex subprocess with a replay worker that writes pre-recorded output.

### Coverage the cassette layer adds

- `FileTailer` — fs.watch + poll cycle, incremental JSONL reading
- `OutputHandler` — stream event parsing, signal detection, result capture
- `SignalManager` — signal.json read/write/timing
- `LifecycleController` — retry logic, missing signal recovery
- `ProcessManager` — subprocess PID tracking, poll-for-completion
- Prompt normalization drift detection — a changed prompt template produces a key mismatch, which forces a re-record and a visible cassette diff in the PR

### Key generation

Each cassette is identified by a SHA256 hash of four components:

| Component | What it captures |
|-----------|-----------------|
| `normalizedPrompt` | Prompt with UUIDs, temp paths, timestamps, session numbers replaced with placeholders |
| `providerName` | e.g. `claude`, `codex` |
| `modelArgs` | Provider CLI args with the prompt value stripped (sorted for stability) |
| `worktreeHash` | SHA256 of all non-hidden files in the agent worktree at spawn time |

The `worktreeHash` is what detects content drift for execute-mode agents: if the worktree changes, the key no longer matches and the cassette must be re-recorded.
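
As a rough sketch of how the four components could be combined into a cassette identifier: the functions `hashWorktreeSketch` and `cassetteId` below are hypothetical, not the contents of `key.ts`. Truncating the digest to 32 hex characters, using an in-memory file map instead of the real filesystem, and the `"empty"` sentinel for a worktree with no visible files are assumptions drawn from the file naming and JSON example in this document.

```typescript
import { createHash } from 'node:crypto';

interface CassetteKey {
  normalizedPrompt: string;
  providerName: string;
  modelArgs: string[];
  worktreeHash: string;
}

// Hash worktree content from a map of relative path -> file content.
// Any path with a dot-prefixed segment (e.g. ".git/config") is treated as hidden.
function hashWorktreeSketch(files: Record<string, string>): string {
  const visible = Object.keys(files)
    .filter((p) => !p.split('/').some((segment) => segment.startsWith('.')))
    .sort(); // sort so the result does not depend on enumeration order
  if (visible.length === 0) return 'empty';
  const hash = createHash('sha256');
  for (const path of visible) {
    // NUL separators keep "ab" + "c" distinct from "a" + "bc".
    hash.update(path).update('\0').update(files[path] ?? '').update('\0');
  }
  return hash.digest('hex');
}

// Derive a stable identifier from the four key components.
function cassetteId(key: CassetteKey): string {
  const canonical = JSON.stringify([
    key.normalizedPrompt,
    key.providerName,
    [...key.modelArgs].sort(), // sorted for stability, without mutating the input
    key.worktreeHash,
  ]);
  return createHash('sha256').update(canonical, 'utf8').digest('hex').slice(0, 32);
}
```

Serializing the components as a JSON array gives unambiguous boundaries between them, so a change in any single component changes the identifier.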

**Normalization** (`apps/server/test/cassette/normalizer.ts`) strips dynamic content that varies between runs but doesn't affect agent behavior:
- UUIDs → `__UUID__`
- Workspace root path → `__WORKSPACE__`
- ISO 8601 timestamps → `__TIMESTAMP__`
- Unix epoch milliseconds → `__EPOCH__`
- Session numbers → `session__N__`

If a prompt *template* changes (e.g. someone edits `buildExecutePrompt()`), the normalized hash changes → cassette miss → test fails in CI → developer must re-record → the diff shows the new agent response in the PR. This makes prompt drift auditable.
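
The replacements listed above could be implemented along these lines. This is a sketch under assumptions, not the code in `normalizer.ts`: the regular expressions are illustrative, and the session pattern in particular (the word `session` followed by digits) is a guess at what "session numbers" look like in real prompts.

```typescript
const UUID_RE = /[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}/gi;
const ISO_TIMESTAMP_RE = /\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(?:\.\d+)?(?:Z|[+-]\d{2}:\d{2})?/g;
const EPOCH_MS_RE = /\b\d{13}\b/g; // 13 digits covers millisecond epochs from 2001 to 2286
const SESSION_RE = /\bsession\s*\d+\b/gi; // assumed format, e.g. "session 42" or "session42"

function normalizePromptSketch(prompt: string, workspaceRoot: string): string {
  let out = prompt;
  // Replace the workspace root first: the path itself may contain UUIDs or
  // digits that the later patterns would otherwise rewrite.
  if (workspaceRoot.length > 0) {
    out = out.split(workspaceRoot).join('__WORKSPACE__');
  }
  return out
    .replace(UUID_RE, '__UUID__')
    .replace(ISO_TIMESTAMP_RE, '__TIMESTAMP__')
    .replace(EPOCH_MS_RE, '__EPOCH__')
    .replace(SESSION_RE, 'session__N__');
}
```

Two prompts that differ only in run-specific values normalize to the same string, so they map to the same cassette. An edit to the surrounding template text survives normalization and changes the key.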

### Cassette file format

Cassettes live in `apps/server/test/cassettes/<32-char-hash>.json` and are committed to git.

```json
{
  "version": 1,
  "key": {
    "normalizedPrompt": "You are a Worker agent...",
    "providerName": "claude",
    "modelArgs": ["--dangerously-skip-permissions", "--verbose", "--output-format", "stream-json"],
    "worktreeHash": "empty"
  },
  "recording": {
    "jsonlLines": [
      "{\"type\":\"system\",\"session_id\":\"abc\"}",
      "{\"type\":\"result\",\"subtype\":\"success\",\"result\":\"ok\"}"
    ],
    "signalJson": { "status": "done", "message": "Task complete" },
    "exitCode": 0,
    "recordedAt": "2026-03-02T12:00:00.000Z"
  }
}
```
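
The JSON layout above maps onto TypeScript types roughly as follows. The interface names match those listed for `types.ts` later in this document, but the field types are inferred from the example, and `parseCassette` is a hypothetical validation helper rather than part of the cassette module.

```typescript
interface CassetteKey {
  normalizedPrompt: string;
  providerName: string;
  modelArgs: string[];
  worktreeHash: string;
}

interface CassetteRecording {
  jsonlLines: string[]; // one raw JSONL line per entry, as written by the provider CLI
  signalJson: unknown; // shape of signal.json is left open in this sketch
  exitCode: number;
  recordedAt: string; // ISO 8601 timestamp
}

interface CassetteEntry {
  version: number;
  key: CassetteKey;
  recording: CassetteRecording;
}

function isRecord(v: unknown): v is Record<string, unknown> {
  return typeof v === 'object' && v !== null && !Array.isArray(v);
}

function isStringArray(v: unknown): v is string[] {
  return Array.isArray(v) && v.every((x) => typeof x === 'string');
}

// Parse and validate a cassette file's content; throws on any mismatch.
function parseCassette(json: string): CassetteEntry {
  const raw: unknown = JSON.parse(json);
  if (!isRecord(raw) || raw['version'] !== 1) {
    throw new Error('unsupported cassette version');
  }
  const key = raw['key'];
  const rec = raw['recording'];
  if (!isRecord(key) || !isRecord(rec)) {
    throw new Error('cassette is missing key or recording');
  }
  const { normalizedPrompt, providerName, modelArgs, worktreeHash } = key;
  const { jsonlLines, signalJson, exitCode, recordedAt } = rec;
  if (typeof normalizedPrompt !== 'string' || typeof providerName !== 'string' ||
      !isStringArray(modelArgs) || typeof worktreeHash !== 'string') {
    throw new Error('malformed cassette key');
  }
  if (!isStringArray(jsonlLines) || typeof exitCode !== 'number' ||
      typeof recordedAt !== 'string') {
    throw new Error('malformed cassette recording');
  }
  return {
    version: 1,
    key: { normalizedPrompt, providerName, modelArgs, worktreeHash },
    recording: { jsonlLines, signalJson, exitCode, recordedAt },
  };
}
```

Validating on load turns a hand-edited or truncated cassette into an immediate, descriptive failure instead of a confusing replay error later in the pipeline.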

### How replay works

`CassetteProcessManager` (extends `ProcessManager`) overrides two methods:

1. **`spawnDetached()`** — on a cache hit, spawns `replay-worker.mjs` instead of the real CLI. The worker writes the recorded JSONL lines to stdout (which `spawnDetached` redirects to the output file via fd) and writes `signal.json` relative to its cwd. Everything above — `FileTailer`, `OutputHandler`, poll loop — runs unmodified.

2. **`pollForCompletion()`** — on a cache miss (record mode), wraps the `onComplete` callback to read the output file and `signal.json` after the process exits, then saves the cassette before handing off to `OutputHandler`.

`MultiProviderAgentManager` accepts an optional `processManagerOverride` constructor parameter so `CassetteProcessManager` can be injected without changing production callers.
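
The override pattern can be illustrated with a heavily simplified model. Everything here is hypothetical: the real `ProcessManager` spawns subprocesses and polls for completion, whereas this sketch only decides which command would run and uses an in-memory `Map` as the cassette store. Passing the cassette id to the worker as a command-line argument is also an assumption.

```typescript
interface SpawnPlan {
  command: string;
  args: string[];
}

interface RecordingSketch {
  jsonlLines: string[];
  exitCode: number;
}

// Stand-in for the production process manager: always runs the real CLI.
class ProcessManagerSketch {
  planSpawn(command: string, args: string[], _cassetteId: string): SpawnPlan {
    return { command, args };
  }
}

class CassetteProcessManagerSketch extends ProcessManagerSketch {
  private readonly store: Map<string, RecordingSketch>;
  private readonly workerPath: string;
  private readonly pending = new Set<string>();

  constructor(store: Map<string, RecordingSketch>, workerPath: string) {
    super();
    this.store = store;
    this.workerPath = workerPath;
  }

  // Cache hit: run the replay worker. Cache miss: run the real CLI and
  // remember that this cassette still has to be saved.
  planSpawn(command: string, args: string[], cassetteId: string): SpawnPlan {
    if (this.store.has(cassetteId)) {
      return { command: 'node', args: [this.workerPath, cassetteId] };
    }
    this.pending.add(cassetteId);
    return super.planSpawn(command, args, cassetteId);
  }

  // Record-mode half: save the output before handing off to the caller.
  onProcessExit(cassetteId: string, output: string, exitCode: number, onComplete: () => void): void {
    if (this.pending.delete(cassetteId)) {
      const jsonlLines = output.split('\n').filter((line) => line.length > 0);
      this.store.set(cassetteId, { jsonlLines, exitCode });
    }
    onComplete();
  }
}
```

Because the subclass changes only what gets spawned and what happens at exit, the code that consumes the output file does not need to know whether it is watching a real agent or a replay.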

### Mode control

| Env var | Mode | Behaviour |
|---------|------|-----------|
| *(none)* | `replay` | Cassette must exist; throws if missing. Safe for CI. |
| `CW_CASSETTE_RECORD=1` | `auto` | Replays if cassette exists, runs real agent and records if missing. |
| `CW_CASSETTE_FORCE_RECORD=1` | `record` | Always runs real agent; overwrites existing cassette. Use when prompt changed intentionally. |
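
The table above amounts to a small decision function. The sketch below is illustrative only: `resolveCassetteMode` and `decideAction` are hypothetical names, and letting `CW_CASSETTE_FORCE_RECORD` win when both variables are set is an assumption, since this document does not state a precedence.

```typescript
type CassetteMode = 'replay' | 'auto' | 'record';
type CassetteAction = 'replay' | 'record';

// Map environment variables to a mode (assumed: force-record takes precedence).
function resolveCassetteMode(env: Record<string, string | undefined>): CassetteMode {
  if (env['CW_CASSETTE_FORCE_RECORD'] === '1') return 'record';
  if (env['CW_CASSETTE_RECORD'] === '1') return 'auto';
  return 'replay';
}

// Decide what to do for one spawn, given whether its cassette already exists.
function decideAction(mode: CassetteMode, cassetteExists: boolean): CassetteAction {
  switch (mode) {
    case 'record':
      return 'record'; // always run the real agent and overwrite
    case 'auto':
      return cassetteExists ? 'replay' : 'record';
    case 'replay':
      if (!cassetteExists) {
        throw new Error('cassette missing in replay mode; re-record locally');
      }
      return 'replay';
  }
}
```

Throwing in `replay` mode is what makes the default safe for CI: a missing cassette fails the test instead of silently invoking a paid provider.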

### Writing cassette tests

```ts
import { createCassetteHarness } from '../cassette/index.js';
import { MINIMAL_PROMPTS } from '../integration/real-providers/prompts.js';
import type { RealProviderHarness } from '../integration/real-providers/harness.js';

describe('agent pipeline (cassette)', () => {
  let harness: RealProviderHarness;

  beforeAll(async () => {
    harness = await createCassetteHarness({ provider: 'claude' });
  });

  afterAll(() => harness.cleanup());

  it('completes a task and emits agent:stopped', async () => {
    const agent = await harness.agentManager.spawn({
      taskId: null,
      prompt: MINIMAL_PROMPTS.done,
      mode: 'execute',
      provider: 'claude',
    });

    const result = await harness.waitForAgentCompletion(agent.id);
    expect(result?.success).toBe(true);

    const stopped = harness.getEventsByType('agent:stopped');
    expect(stopped).toHaveLength(1);
  });
});
```

`createCassetteHarness()` returns a `RealProviderHarness`, so tests written for real providers work unchanged.

### Cassette directory

```
apps/server/test/cassettes/
  <hash>.json     ← committed to git; one file per recorded scenario
  .gitkeep
```

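Cassettes are committed so CI can run without any AI API credentials. When a cassette needs updating (prompt changed, provider output format changed), re-record locally and commit the updated file: `CW_CASSETTE_RECORD=1` records any cassette whose key is missing, and `CW_CASSETTE_FORCE_RECORD=1` overwrites a cassette whose key still matches.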

### Files

| File | Purpose |
|------|---------|
| `types.ts` | `CassetteKey`, `CassetteRecording`, `CassetteEntry` interfaces |
| `normalizer.ts` | `normalizePrompt()`, `stripPromptFromArgs()` |
| `key.ts` | `hashWorktreeFiles()`, `buildCassetteKey()` |
| `store.ts` | `CassetteStore` — find/save cassette JSON files |
| `replay-worker.mjs` | Subprocess that replays a cassette (plain JS ESM, no build step) |
| `process-manager.ts` | `CassetteProcessManager` — overrides `spawnDetached` and `pollForCompletion` |
| `harness.ts` | `createCassetteHarness()` — factory returning `RealProviderHarness` |
| `index.ts` | Barrel exports |
| `cassette.test.ts` | Unit tests for normalizer, key generation, and store |