Phase 13: Real Claude E2E Tests - 1 plan in 1 wave - Validates JSON schemas with actual Claude CLI - Fixes structured_output parsing in ClaudeAgentManager - Tests skipped by default (expensive)
| phase | plan | type | wave | depends_on | files_modified | autonomous |
|---|---|---|---|---|---|---|
| 13-real-claude-e2e-tests | 01 | execute | 1 | | | true |
Purpose: Verify that the JSON schemas defined in src/agent/schema.ts work correctly with the actual Claude CLI, confirming that MockAgentManager accurately simulates real behavior.
Output: Integration test file with real Claude CLI tests (skipped by default due to cost and time), plus documented findings.
<execution_context>
@/.claude/get-shit-done/workflows/execute-plan.md
@/.claude/get-shit-done/templates/summary.md
</execution_context>
Key source files
@src/agent/manager.ts @src/agent/schema.ts @src/agent/prompts.ts @src/test/harness.ts
Task 1: Create real Claude CLI integration test file
src/test/integration/real-claude.test.ts
Create an integration test file for real Claude CLI validation. Structure:
- Create the src/test/integration/ directory if it does not exist
- Create the test file with a describe.skip wrapper (tests are expensive, run manually)
- Add a helper function to call the Claude CLI directly using execa:
  - Takes prompt and JSON schema
  - Returns parsed structured_output from CLI response
  - Handles timeout (30s default)
- Add test cases for each agent mode:
  - Execute mode: done status with result
  - Execute mode: questions status with array
  - Discuss mode: context_complete with decisions
  - Breakdown mode: breakdown_complete with phases
  - Decompose mode: decompose_complete with tasks
- Each test should:
  - Use a minimal prompt that triggers the expected output
  - Verify the structured_output field is populated
  - Verify the output passes Zod schema validation
  - Log cost for documentation
Use describe.skip so tests don't run in CI. Add comment explaining how to run manually:
REAL_CLAUDE_TESTS=1 npm test -- --grep "Real Claude"
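A plain describe.skip ignores environment variables, so the REAL_CLAUDE_TESTS=1 command above only has an effect if the suite is gated on that variable. One possible way to do this is sketched below. The helper name `pickDescribe` and the `DescribeLike` type are hypothetical, and the sketch assumes only that the test runner's `describe` exposes a `.skip` variant.

```typescript
// Minimal, framework-agnostic sketch (names are illustrative, not from the codebase).
type SuiteFn = (name: string, body: () => void) => void;

interface DescribeLike {
  (name: string, body: () => void): void;
  skip: SuiteFn;
}

/** Returns the real describe only when REAL_CLAUDE_TESTS=1 is set; otherwise the skipping variant. */
function pickDescribe(
  describe: DescribeLike,
  env: Record<string, string | undefined>,
): SuiteFn {
  if (env.REAL_CLAUDE_TESTS === "1") {
    return (name, body) => describe(name, body);
  }
  // Call through describe.skip rather than detaching it, in case the runner binds context.
  return (name, body) => describe.skip(name, body);
}

// Usage inside the test file (assumes the runner's global describe):
// const describeReal = pickDescribe(describe, process.env);
// describeReal("Real Claude CLI Integration", () => { /* expensive tests */ });
```

With this gate the suite stays skipped in CI unless the variable is set explicitly, so no one has to edit the file to run it.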
Key insight from validation: Claude CLI returns structured_output field (not result) when using --json-schema.
File exists at src/test/integration/real-claude.test.ts with skipped test suite
Integration test file created with all mode tests, skipped by default
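The helper described in Task 1 could look roughly like the sketch below. It is a sketch under assumptions, not the project's implementation: the binary name `claude` and the `-p` and `--output-format json` flags are assumed (only `--json-schema` is named in this plan), and `Runner`, `buildClaudeArgs`, `parseStructuredOutput`, and `callClaude` are hypothetical names. The command runner is injected so the parsing logic can be exercised without spending money on a real call.

```typescript
// Fields of the CLI's JSON response that this plan relies on.
interface CliResponse {
  result: string;
  structured_output?: unknown;
  total_cost_usd?: number;
}

// Hypothetical injection point: runs a command and resolves with its stdout.
type Runner = (cmd: string, args: string[], timeoutMs: number) => Promise<string>;

/** Builds the argument list. "-p" and "--output-format json" are assumptions. */
function buildClaudeArgs(prompt: string, schema: object): string[] {
  return ["-p", prompt, "--output-format", "json", "--json-schema", JSON.stringify(schema)];
}

/** Parses stdout and returns structured_output, failing loudly if it is absent. */
function parseStructuredOutput(stdout: string): { output: unknown; costUsd: number | undefined } {
  const response = JSON.parse(stdout) as CliResponse;
  if (response.structured_output === undefined || response.structured_output === null) {
    throw new Error("structured_output missing from CLI response");
  }
  return { output: response.structured_output, costUsd: response.total_cost_usd };
}

/** Calls the CLI through the injected runner with a 30s default timeout. */
async function callClaude(run: Runner, prompt: string, schema: object, timeoutMs = 30_000) {
  const stdout = await run("claude", buildClaudeArgs(prompt, schema), timeoutMs);
  return parseStructuredOutput(stdout);
}

// In the real test file the runner could wrap execa, for example:
// const run: Runner = async (cmd, args, timeoutMs) =>
//   (await execa(cmd, args, { timeout: timeoutMs })).stdout;
```

Each test can then pass the returned `output` to the matching Zod schema and log `costUsd` for the findings comment.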
Task 2: Fix structured_output parsing in ClaudeAgentManager
Current code (line ~190):
const rawOutput = JSON.parse(cliResult.result);
The Claude CLI with --json-schema returns:
{
  "type": "result",
  "result": "",
  "structured_output": { "status": "done", "result": "..." }
}
Update to:
// When --json-schema is used, structured output is in structured_output field
const rawOutput = cliResult.structured_output ?? JSON.parse(cliResult.result);
Also update ClaudeCliResult interface to include structured_output:
interface ClaudeCliResult {
  type: 'result';
  subtype: 'success' | 'error';
  is_error: boolean;
  session_id: string;
  result: string;
  structured_output?: unknown; // Add this
  total_cost_usd?: number;
}
This is backwards compatible: if structured_output is missing, the code falls back to parsing result.
npm run build passes, existing tests still pass
ClaudeAgentManager correctly reads structured_output from Claude CLI response
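To make the fallback concrete, here is a small self-contained sketch of both paths. The function name `extractRawOutput` and the sample values are illustrative only. The interface mirrors the one in this task.

```typescript
interface ClaudeCliResult {
  type: "result";
  subtype: "success" | "error";
  is_error: boolean;
  session_id: string;
  result: string;
  structured_output?: unknown;
  total_cost_usd?: number;
}

/** Prefers structured_output; falls back to parsing result as JSON when it is absent. */
function extractRawOutput(cliResult: ClaudeCliResult): unknown {
  return cliResult.structured_output ?? JSON.parse(cliResult.result);
}

// Illustrative inputs (not captured from a real run).
const base = { type: "result", subtype: "success", is_error: false, session_id: "s1" } as const;

// Path 1: --json-schema was used, so result is empty and structured_output carries the data.
const fromStructured = extractRawOutput({ ...base, result: "", structured_output: { status: "done" } });

// Path 2: no structured_output, so the JSON string in result is parsed instead.
const fromResult = extractRawOutput({ ...base, result: '{"status":"done"}' });
```

One caveat with this fallback: if structured_output is absent and result is an empty or non-JSON string, `JSON.parse` throws a `SyntaxError`, so the caller may still want to handle that case.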
Task 3: Run real Claude tests and document findings
src/test/integration/real-claude.test.ts
Run the real Claude tests manually and document findings:
- Enable tests temporarily by removing .skip or setting env var
- Run: npm test -- src/test/integration/real-claude.test.ts
- Capture results:
  - Which tests pass/fail
  - Response times
  - Costs per test
  - Any unexpected behavior
- Add findings as comments in test file:
  /**
   * Real Claude CLI Integration Tests
   *
   * Findings from validation run (DATE):
   * - Execute mode: Works, ~$X.XX, ~Xs
   * - Multi-question: Works, array format validated
   * - Discuss mode: Works, decisions array validated
   * - Breakdown mode: Works, phases array validated
   * - Decompose mode: Works, tasks array validated
   *
   * Total validation cost: $X.XX
   *
   * Conclusion: MockAgentManager accurately simulates real CLI behavior.
   * JSON schemas work correctly with Claude CLI --json-schema flag.
   */
- Re-add .skip to prevent accidental runs in CI
Tests run successfully when enabled, findings documented in file
Real Claude CLI behavior validated, findings documented, tests skipped for CI
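If the per-test cost and timing are collected during the run, the findings lines can be generated rather than typed by hand. The sketch below is one possible approach. `RunRecord`, `formatFinding`, and `formatTotalCost` are hypothetical names, and the cost is assumed to come from the `total_cost_usd` field of the CLI response.

```typescript
// One record per real test; values are filled in during the validation run.
interface RunRecord {
  label: string;      // e.g. "Execute mode"
  passed: boolean;
  costUsd: number;    // assumed to be read from total_cost_usd
  durationMs: number; // measured around the CLI call
}

/** Formats a single findings line in the style of the comment template. */
function formatFinding(r: RunRecord): string {
  const status = r.passed ? "Works" : "Fails";
  const seconds = Math.round(r.durationMs / 1000);
  return ` * - ${r.label}: ${status}, ~$${r.costUsd.toFixed(2)}, ~${seconds}s`;
}

/** Sums the per-test costs into the "Total validation cost" line. */
function formatTotalCost(records: RunRecord[]): string {
  const sum = records.reduce((acc, r) => acc + r.costUsd, 0);
  return ` * Total validation cost: $${sum.toFixed(2)}`;
}
```

The generated lines can be pasted into the header comment in place of the $X.XX and Xs placeholders.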
<success_criteria>
- Integration test file created with real Claude CLI tests
- Tests are skipped by default (cost/time)
- ClaudeAgentManager correctly parses structured_output
- At least one real test run validates expected behavior
- Findings documented in test file comments
</success_criteria>