docs: Document cassette testing system in docs/testing.md and CLAUDE.md

This commit is contained in:
Lukas May
2026-03-02 12:22:46 +09:00
parent ec031211a2
commit e9ec5143fd
2 changed files with 170 additions and 3 deletions

View File

@@ -44,11 +44,12 @@ Run after any change to server-side code (`src/**`).
## Testing
```sh
npm test # Unit tests
npm test # Unit + E2E tests (no API cost)
CW_CASSETTE_RECORD=1 npm test -- <test-file> # Record new cassettes locally
REAL_CLAUDE_TESTS=1 npm test -- src/test/integration/real-providers/ --test-timeout=300000 # Real provider tests (~$0.50)
```
See [docs/testing.md](docs/testing.md) for details.
See [docs/testing.md](docs/testing.md) for details, including the **cassette system** for pipeline integration tests that run without API costs.
## Documentation Maintenance

View File

@@ -29,6 +29,15 @@ Located alongside source files (`*.test.ts`):
| `edge-cases.test.ts` | Boundary conditions |
| `extended-scenarios.test.ts` | Advanced multi-phase workflows |
These use `MockAgentManager` which bypasses the real subprocess pipeline. They test dispatch/coordination logic only.
### Cassette Tests (Pipeline Integration, Zero API Cost)
`src/test/cassette/` — Tests the full agent execution pipeline using pre-recorded cassettes.
Unlike E2E tests, cassette tests exercise the real `ProcessManager → FileTailer → OutputHandler → SignalManager` path. Unlike real provider tests, they cost nothing to run in CI.
See **[Cassette System](#cassette-system)** below for full documentation.
### Integration Tests (Real Providers)
`src/test/integration/real-providers/`**skipped by default** (cost real money):
| File | Provider | Cost |
@@ -65,15 +74,172 @@ Pre-built task hierarchies for testing:
- Provides `describeRealClaude()` / `describeRealCodex()` that skip when env var not set
- `MINIMAL_PROMPTS` — cheap prompts for testing output parsing
## Test Inventory
See **[test-inventory.md](test-inventory.md)** for a complete catalog of every test, what it verifies, coverage gaps, redundancy map, and fragility assessment.
## Running Tests
```sh
# Unit tests
# Unit + E2E tests (no API cost)
npm test
# Specific test file
npm test -- src/agent/manager.test.ts
# Cassette tests — replay pre-recorded cassettes (no API cost)
npm test -- src/test/cassette/
# Record new cassettes locally (requires real Claude CLI)
CW_CASSETTE_RECORD=1 npm test -- src/test/integration/real-providers/claude-manager.test.ts
# Real provider tests (costs money!)
REAL_CLAUDE_TESTS=1 npm test -- src/test/integration/real-providers/ --test-timeout=300000
```
---
## Cassette System
`src/test/cassette/` — VCR-style recording and replay for the agent subprocess pipeline.
### Why it exists
The `MockAgentManager` used in E2E tests skips from "spawn called" directly to "agent:stopped emitted". It never exercises `ProcessManager`, `FileTailer`, `OutputHandler`, or `SignalManager`. Bugs in those layers (signal.json race conditions, JSONL parsing failures, crash detection) are invisible to E2E tests.
Real provider tests do exercise those layers, but they are slow, expensive, and can't run in CI without credentials.
Cassette tests bridge this gap: they run the **real** `MultiProviderAgentManager` pipeline but replace the live Claude/Codex subprocess with a replay worker that writes pre-recorded output.
### Coverage the cassette layer adds
- `FileTailer` — fs.watch + poll cycle, incremental JSONL reading
- `OutputHandler` — stream event parsing, signal detection, result capture
- `SignalManager` — signal.json read/write/timing
- `LifecycleController` — retry logic, missing signal recovery
- `ProcessManager` — subprocess PID tracking, poll-for-completion
- Prompt normalization drift detection — key mismatch = re-record = visible diff
### Key generation
Each cassette is identified by a SHA256 hash of four components:
| Component | What it captures |
|-----------|-----------------|
| `normalizedPrompt` | Prompt with UUIDs, temp paths, timestamps, session numbers replaced with placeholders |
| `providerName` | e.g. `claude`, `codex` |
| `modelArgs` | Provider CLI args with the prompt value stripped (sorted for stability) |
| `worktreeHash` | SHA256 of all non-hidden files in the agent worktree at spawn time |
The `worktreeHash` is what detects content drift for execute-mode agents: if the worktree changes, the key misses and the cassette is re-recorded.
**Normalization** (`src/test/cassette/normalizer.ts`) strips dynamic content that varies between runs but doesn't affect agent behavior:
- UUIDs → `__UUID__`
- Workspace root path → `__WORKSPACE__`
- ISO 8601 timestamps → `__TIMESTAMP__`
- Unix epoch milliseconds → `__EPOCH__`
- Session numbers → `session__N__`
If a prompt *template* changes (e.g. someone edits `buildExecutePrompt()`), the normalized hash changes → cassette miss → test fails in CI → developer must re-record → the diff shows the new agent response in the PR. This makes prompt drift auditable.
### Cassette file format
Cassettes live in `src/test/cassettes/<32-char-hash>.json` and are committed to git.
```json
{
"version": 1,
"key": {
"normalizedPrompt": "You are a Worker agent...",
"providerName": "claude",
"modelArgs": ["--dangerously-skip-permissions", "--verbose", "--output-format", "stream-json"],
"worktreeHash": "empty"
},
"recording": {
"jsonlLines": [
"{\"type\":\"system\",\"session_id\":\"abc\"}",
"{\"type\":\"result\",\"subtype\":\"success\",\"result\":\"ok\"}"
],
"signalJson": { "status": "done", "message": "Task complete" },
"exitCode": 0,
"recordedAt": "2026-03-02T12:00:00.000Z"
}
}
```
### How replay works
`CassetteProcessManager` (extends `ProcessManager`) overrides two methods:
1. **`spawnDetached()`** — on a cache hit, spawns `replay-worker.mjs` instead of the real CLI. The worker writes the recorded JSONL lines to stdout (which `spawnDetached` redirects to the output file via fd) and writes `signal.json` relative to its cwd. Everything above — `FileTailer`, `OutputHandler`, poll loop — runs unmodified.
2. **`pollForCompletion()`** — on a cache miss (record mode), wraps the `onComplete` callback to read the output file and `signal.json` after the process exits, then saves the cassette before handing off to `OutputHandler`.
`MultiProviderAgentManager` accepts an optional `processManagerOverride` constructor parameter so `CassetteProcessManager` can be injected without changing production callers.
### Mode control
| Env var | Mode | Behaviour |
|---------|------|-----------|
| *(none)* | `replay` | Cassette must exist; throws if missing. Safe for CI. |
| `CW_CASSETTE_RECORD=1` | `auto` | Replays if cassette exists, runs real agent and records if missing. |
| `CW_CASSETTE_FORCE_RECORD=1` | `record` | Always runs real agent; overwrites existing cassette. Use when prompt changed intentionally. |
### Writing cassette tests
```ts
import { createCassetteHarness } from '../cassette/index.js';
import { MINIMAL_PROMPTS } from '../integration/real-providers/prompts.js';
import type { RealProviderHarness } from '../integration/real-providers/harness.js';
describe('agent pipeline (cassette)', () => {
let harness: RealProviderHarness;
beforeAll(async () => {
harness = await createCassetteHarness({ provider: 'claude' });
});
afterAll(() => harness.cleanup());
it('completes a task and emits agent:stopped', async () => {
const agent = await harness.agentManager.spawn({
taskId: null,
prompt: MINIMAL_PROMPTS.done,
mode: 'execute',
provider: 'claude',
});
const result = await harness.waitForAgentCompletion(agent.id);
expect(result?.success).toBe(true);
const stopped = harness.getEventsByType('agent:stopped');
expect(stopped).toHaveLength(1);
});
});
```
`createCassetteHarness()` returns a `RealProviderHarness`, so tests written for real providers work unchanged.
### Cassette directory
```
src/test/cassettes/
<hash>.json ← committed to git; one file per recorded scenario
.gitkeep
```
Cassettes are committed so CI can run without any AI API credentials. When a cassette needs updating (prompt changed, provider output format changed), re-record locally with `CW_CASSETTE_RECORD=1` and commit the updated file.
### Files
| File | Purpose |
|------|---------|
| `types.ts` | `CassetteKey`, `CassetteRecording`, `CassetteEntry` interfaces |
| `normalizer.ts` | `normalizePrompt()`, `stripPromptFromArgs()` |
| `key.ts` | `hashWorktreeFiles()`, `buildCassetteKey()` |
| `store.ts` | `CassetteStore` — find/save cassette JSON files |
| `replay-worker.mjs` | Subprocess that replays a cassette (plain JS ESM, no build step) |
| `process-manager.ts` | `CassetteProcessManager` — overrides `spawnDetached` and `pollForCompletion` |
| `harness.ts` | `createCassetteHarness()` — factory returning `RealProviderHarness` |
| `index.ts` | Barrel exports |
| `cassette.test.ts` | Unit tests for normalizer, key generation, and store |