Architect agents (discuss, plan, detail, refine) were producing generic
analysis disconnected from the actual codebase. They had full tool access
in their worktrees but were never instructed to explore the code.
- Add CODEBASE_EXPLORATION shared constant: read project docs, explore
structure, check existing patterns, use subagents for parallel exploration
- Inject into all 4 architect prompts after INPUT_FILES
- Strengthen discuss prompt: analysis method references codebase, examples
cite specific paths, definition_of_done requires codebase references
- Fix spawnArchitectDiscuss to pass full context (pages/phases/tasks) via
gatherInitiativeContext() — was only passing bare initiative metadata
- Update docs/agent.md with new tag ordering and shared block table
Move drizzle/, dist/, and coverage/ into apps/server/ so all
server-specific artifacts live alongside the source they belong to.
- git mv drizzle/ → apps/server/drizzle/
- drizzle.config.ts: out → ./apps/server/drizzle
- tsconfig.json: outDir → ./apps/server/dist, exclude drizzle dir
- package.json: main/bin/clean point to apps/server/dist/
- vitest.config.ts: reportsDirectory → ./apps/server/coverage
- .gitignore: add coverage/ entry
- ensure-schema.ts: update getMigrationsPath() for new layout
- docs/database-migrations.md: update drizzle/ references
Move src/ → apps/server/ and packages/web/ → apps/web/ to adopt
standard monorepo conventions (apps/ for runnable apps, packages/
for reusable libraries). Update all config files, shared package
imports, test fixtures, and documentation to reflect new paths.
Key fixes:
- Update workspace config to ["apps/*", "packages/*"]
- Update tsconfig.json rootDir/include for apps/server/
- Add apps/web/** to vitest exclude list
- Update drizzle.config.ts schema path
- Fix ensure-schema.ts migration path detection (3 levels up in dev,
2 levels up in dist)
- Fix tests/integration/cli-server.test.ts import paths
- Update packages/shared imports to apps/server/ paths
- Update all docs/ files with new paths
The cassette-backed test (full-flow-cassette.test.ts) covers the same
discuss→plan→detail→execute pipeline without API cost. The real-agent
test added no unique value once cassettes were committed, and the
Stage 6 npm-test validation it included was soft (warn, not fail).
Also removes the now-unused shouldRunFullFlowTests export and the
FULL_FLOW_TESTS=1 entry from CLAUDE.md.
Three issues discovered and fixed after initial recording:
1. Agent workdir names not normalized — random animal names (e.g.
"available-sheep") embedded in workspace paths caused key drift.
Added AGENT_WORKDIR_RE to replace agent-workdirs/<name> with
agent-workdirs/__AGENT__ in normalizer.ts.
2. Phase/task files missing on replay — plan/detail agents write output
to .cw/output/ (phases/, tasks/) which the server reads on completion.
The replay worker only emits JSONL; it doesn't re-execute file writes.
Extended cassette format with outputFiles field and added capture
(walkOutputDir) + restore (restoreOutputFiles) logic to process-manager.
3. Recording timeout too short — fixed CASSETTE_FLOW_TIMEOUT to be
mode-aware: 60 min for recording runs, 5 min for replay.
Also commit the 4 recorded cassettes (discuss/plan/detail/execute)
that make the full-flow cassette test runnable in CI without API costs.
14 files in docs/wireframes/v2/ addressing 13 UX gaps from v1:
- Theme spec with indigo brand, status tokens, terminal/diff tokens,
dark mode, Geist typography, 6px radius, layered shadows
- Wireframes for all pages with loading/error/empty states
- Shared component specs (SaveIndicator, EmptyState, ErrorState,
CommandPalette, ThemeToggle)
- normalizer.ts: Add NANOID_RE (21-char alphanumeric) → __ID__ as step 2.5,
fixing cassette key instability from nanoid agent IDs in prompts
- harness.ts: Add FullFlowHarnessOptions.processManagerFactory for injecting
CassetteProcessManager without duplicating harness setup
- full-flow-cassette.test.ts: New cassette-backed variant of full-flow test;
skips automatically when no cassettes exist (fresh clone), runs in ~seconds
once cassettes are recorded and committed
- CLAUDE.md: Document cassette recording command for the full-flow test
- driveToCompletion() now catches inner waitForAgentAttention timeouts
instead of letting them propagate — long-running execute/detail agents
(>3 min without transitioning to waiting_for_input) no longer crash the
polling loop; the outer deadline handles termination correctly
- Switch execute stage from waitForAgentCompletion to driveToCompletion
so any clarifying questions get auto-answered
- Increase DETAIL_TIMEOUT_MS 8→15 min, PLAN_TIMEOUT_MS 8→12 min,
EXECUTE_TIMEOUT_MS 10→20 min — architect agents are variable in
practice; these are upper bounds not expectations
- Raise FULL_FLOW_TIMEOUT 30→60 min to cover worst-case stacking
- Update CLAUDE.md test command with correct --test-timeout=3600000
Verified: full pipeline (discuss→plan→detail→execute) passes in ~499s
Replace ## Heading sections with descriptive XML tags (<role>, <task>,
<execution_protocol>, <examples>, etc.) for unambiguous first-order vs
second-order delimiter separation per Anthropic best practices.
- shared.ts: All constants wrapped in their XML tag
- Mode prompts: Consistent tag vocabulary and ordering across all 5 modes
- Examples use <examples> > <example label="good/bad"> nesting
- workspace.ts: Output wrapped in <workspace> tags
- Delete dead src/agent/prompts.ts (zero imports)
- Update docs/agent.md with XML tag documentation
Adds a complete multi-agent workflow test gated behind FULL_FLOW_TESTS=1:
- src/test/fixtures/todo-api/ — minimal JS project with missing complete()
method and failing tests; gives execute agents a concrete, verifiable task
- src/test/integration/full-flow/harness.ts — FullFlowHarness wiring all 11
repos + real MultiProviderAgentManager + tRPC caller + driveToCompletion()
helper for Q&A loops
- src/test/integration/full-flow/report.ts — stage-by-stage console formatters
(discuss/plan/detail/execute/git diff/final summary)
- src/test/integration/full-flow/full-flow.test.ts — staged integration test
that validates breakdown granularity, agent output quality, and that npm test
passes in the project worktree after execution
Run with:
FULL_FLOW_TESTS=1 npm test -- src/test/integration/full-flow/ --test-timeout=1800000
Audited all 44 test files one by one. Documents what each test verifies,
identifies 12 redundant test pairs, 13 coverage gaps (prioritized), fragility
assessment, and mock style inconsistencies.
Implements cassette recording/replay to test the full agent execution
pipeline (ProcessManager → FileTailer → OutputHandler → SignalManager)
without real AI API calls.
Key components:
- `CassetteProcessManager`: extends ProcessManager, intercepts spawnDetached
to replay cassettes or record real runs on completion
- `replay-worker.mjs`: standalone node script that replays JSONL + signal.json
as a subprocess, exercising the complete file-based output pipeline
- `CassetteStore`: reads/writes cassette JSON files keyed by SHA256 hash
- `normalizer.ts`: strips dynamic content (UUIDs, temp paths, timestamps,
session numbers) from prompts for stable cassette keys
- `key.ts`: hashes normalized prompt + provider args + worktree file content
(worktree hash detects content drift for execute-mode agents)
- `createCassetteHarness()`: wraps RealProviderHarness with cassette support,
same interface so existing real-provider tests work unchanged
Mode control via env vars:
(default) → replay: cassette must exist (safe for CI)
CW_CASSETTE_RECORD=1 → auto: replay if exists, record if missing
CW_CASSETTE_FORCE_RECORD=1 → record: always run real agent, overwrite cassette
MultiProviderAgentManager gains an optional `processManagerOverride` constructor
parameter for clean dependency injection without changing existing callers.
Cassette files live in src/test/cassettes/ and are intended to be committed
to git so CI runs without API access.
- Add withFakeTimers(fn) helper to TestHarness for scoped timer control
- Replace all vi.runAllTimersAsync() with harness.advanceTimers() in E2E
and harness tests (37 call sites across 5 files)
- Keep vi.useFakeTimers() per-test activation pattern (intentional)
- Add @vitest/coverage-v8 dep so `npm run test:coverage` actually works
- Add exclude patterns to vitest config (node_modules, dist, packages)
- Replace dynamic import('vitest') in advanceTimers with direct vi import
Nulls out agents.initiativeId before deleting the initiative row,
ensuring the delete succeeds even on databases where migration 0025
(which adds ON DELETE SET NULL to the FK) hasn't been applied.
Dispatch manager now wraps task descriptions with buildExecutePrompt()
so agents receive the full execution protocol. Update test to match
prompt wrapping. Add worktree isolation note to workspace layout.
Drop redundant Specificity Test section (covered by examples and checklist),
remove Task Design Rules (implied by entire prompt), flatten frontmatter
docs, trim good example, tighten sizing/checkpoint/context sections.
Remove CODEBASE_VERIFICATION references, document new shared constants
(TEST_INTEGRITY, SESSION_STARTUP, PROGRESS_TRACKING), update mode prompt
descriptions with TDD protocol, Definition of Done checklists, and
mandatory test specifications.
Replace the weak 7-step execution protocol with an explicit red-green-refactor
cycle that requires agents to write failing tests before implementing. Move
anti-patterns and scope rules above deviation/git sections so critical
constraints get more attention. Add session startup verification, progress
tracking, and a mandatory definition-of-done checklist that must pass before
signaling completion. Remove dead CODEBASE_VERIFICATION import.
Detail: Replace vague "how to verify" requirement with mandatory test specification
(file path, scenarios, run command) for execute-category tasks. Update good-task
example to demonstrate the new format. Add Definition of Done checklist.
Plan: Add Testing Strategy section requiring tests within each implementation phase
instead of trailing test phases. Add Definition of Done checklist.
Anchor on ~150 lines changed as the sweet spot based on SWE-bench Pro
data (107 lines / 4.1 files = 46% success for best agents). Old rules
used file count as the primary proxy which correlates poorly with task
difficulty compared to lines changed.
Add CONTEXT_MANAGEMENT shared block to plan and detail mode prompts so
architect agents also benefit from compaction awareness and parallel
execution hints. Update index.ts re-exports and agent docs.
- Add CONTEXT_MANAGEMENT constant: tells agents to keep working through
context compaction and parallelize reads
- Add "why" reasoning to each GIT_WORKFLOW rule so agents understand the
purpose, not just the rule
- Slim buildInterAgentCommunication: replace verbose bash code blocks with
a brief usage pattern paragraph, condense CLI docs to bullet list
Standalone agents (no initiative or 0 linked projects) run in a
workspace/ subdirectory, but signal.json lookups used the parent
directory. This caused all standalone agents to be marked "crashed"
despite successful completion.
Track the actual agent cwd at spawn time via ActiveAgent.agentCwd
and probe for the workspace/ subdirectory during reconciliation and
crash detection paths.
- Add deleteTask tRPC mutation (repo already had delete method)
- Add X button to TaskRow, hidden until hover, with confirmation dialog
- Shift+click bypasses confirmation for fast bulk deletion
- Invalidates listInitiativeTasks on success
- Document shift+click pattern in CLAUDE.md as standard for destructive actions