14 files in docs/wireframes/v2/ addressing 13 UX gaps from v1:
- Theme spec with indigo brand, status tokens, terminal/diff tokens,
dark mode, Geist typography, 6px radius, layered shadows
- Wireframes for all pages with loading/error/empty states
- Shared component specs (SaveIndicator, EmptyState, ErrorState,
CommandPalette, ThemeToggle)
- normalizer.ts: Add NANOID_RE (21-char alphanumeric) → __ID__ as step 2.5,
fixing cassette key instability from nanoid agent IDs in prompts
- harness.ts: Add FullFlowHarnessOptions.processManagerFactory for injecting
CassetteProcessManager without duplicating harness setup
- full-flow-cassette.test.ts: New cassette-backed variant of full-flow test;
skips automatically when no cassettes exist (fresh clone), runs in seconds
once cassettes are recorded and committed
- CLAUDE.md: Document cassette recording command for the full-flow test
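A minimal sketch of the ID normalization step, assuming a pattern that follows the "21-char alphanumeric" description above. The real NANOID_RE in normalizer.ts may differ (nanoid's default alphabet also includes `_` and `-`), and `normalizeIds` is a hypothetical name used only for illustration.

```typescript
// Assumed pattern: a standalone run of exactly 21 alphanumeric characters.
// The actual regex in normalizer.ts is not shown in this log.
const NANOID_RE = /\b[A-Za-z0-9]{21}\b/g;

// Hypothetical helper: replaces every ID-shaped token with a fixed placeholder
// so two prompts that differ only in agent IDs normalize to the same text.
function normalizeIds(prompt: string): string {
  return prompt.replace(NANOID_RE, "__ID__");
}

const first = normalizeIds("Agent abcdefghijklmnopqrstu is ready");
const second = normalizeIds("Agent ABCDEFGHIJ0123456789X is ready");
console.log(first === second);
```

With the placeholder in place, the hash computed from the prompt no longer changes from run to run just because a new agent ID was generated.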
- driveToCompletion() now catches inner waitForAgentAttention timeouts
instead of letting them propagate — long-running execute/detail agents
(>3 min without transitioning to waiting_for_input) no longer crash the
polling loop; the outer deadline handles termination correctly
- Switch execute stage from waitForAgentCompletion to driveToCompletion
so any clarifying questions get auto-answered
- Increase DETAIL_TIMEOUT_MS 8→15 min, PLAN_TIMEOUT_MS 8→12 min,
EXECUTE_TIMEOUT_MS 10→20 min; architect agents are variable in
practice, and these are upper bounds, not expectations
- Raise FULL_FLOW_TIMEOUT 30→60 min to cover worst-case stacking
- Update CLAUDE.md test command with correct --test-timeout=3600000
Verified: full pipeline (discuss→plan→detail→execute) passes in ~499s
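The polling change can be sketched roughly as follows. This is an illustration under stated assumptions, not the harness code: the `Deps` shape, the status names other than waiting_for_input, and the return values are all hypothetical.

```typescript
// Hypothetical shapes; the real harness API is not shown in this log.
type AgentStatus = "running" | "waiting_for_input" | "done";

interface Deps {
  // Resolves with the agent's next status, or rejects when its own
  // (inner) timeout fires before the agent needs attention.
  waitForAttention: () => Promise<AgentStatus>;
  answerQuestions: () => Promise<void>;
  now: () => number;
}

async function driveToCompletion(deps: Deps, deadlineMs: number): Promise<"done" | "timeout"> {
  const deadline = deps.now() + deadlineMs;
  while (deps.now() < deadline) {
    let status: AgentStatus;
    try {
      status = await deps.waitForAttention();
    } catch {
      // Inner timeout: treat the agent as still working and poll again.
      // (This sketch swallows every error; real code may want to rethrow
      // anything that is not a timeout.)
      continue;
    }
    if (status === "done") return "done";
    if (status === "waiting_for_input") await deps.answerQuestions();
  }
  // Only the outer deadline ends the loop without completion.
  return "timeout";
}
```

The design point is that a slow agent only costs another loop iteration; termination is decided in one place, by the outer deadline.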
Replace ## Heading sections with descriptive XML tags (<role>, <task>,
<execution_protocol>, <examples>, etc.) for unambiguous first-order vs
second-order delimiter separation per Anthropic best practices.
- shared.ts: All constants wrapped in their XML tag
- Mode prompts: Consistent tag vocabulary and ordering across all 5 modes
- Examples use <examples> > <example label="good/bad"> nesting
- workspace.ts: Output wrapped in <workspace> tags
- Delete dead src/agent/prompts.ts (zero imports)
- Update docs/agent.md with XML tag documentation
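As a rough illustration of the tag convention, a prompt constant could be wrapped with a helper like the one below. The `tag` function and the example texts are hypothetical; shared.ts may build its constants differently.

```typescript
// Hypothetical helper, not taken from shared.ts.
// Wraps a prompt fragment in a named XML tag, with optional attributes.
function tag(name: string, body: string, attrs: Record<string, string> = {}): string {
  const attrText = Object.keys(attrs)
    .map((key) => ` ${key}="${attrs[key]}"`)
    .join("");
  return `<${name}${attrText}>\n${body.trim()}\n</${name}>`;
}

// The <examples> > <example label="good/bad"> nesting described above.
const examples = tag(
  "examples",
  [
    tag("example", "Add complete() to the store and cover it with a test.", { label: "good" }),
    tag("example", "Fix the todo stuff.", { label: "bad" }),
  ].join("\n"),
);
console.log(examples);
```

Markdown headings inside the wrapped body then stay second-order content, while the tags mark the first-order sections of the prompt.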
Adds a complete multi-agent workflow test gated behind FULL_FLOW_TESTS=1:
- src/test/fixtures/todo-api/ — minimal JS project with missing complete()
method and failing tests; gives execute agents a concrete, verifiable task
- src/test/integration/full-flow/harness.ts — FullFlowHarness wiring all 11
repos + real MultiProviderAgentManager + tRPC caller + driveToCompletion()
helper for Q&A loops
- src/test/integration/full-flow/report.ts — stage-by-stage console formatters
(discuss/plan/detail/execute/git diff/final summary)
- src/test/integration/full-flow/full-flow.test.ts — staged integration test
that validates breakdown granularity, agent output quality, and that npm test
passes in the project worktree after execution
Run with:
FULL_FLOW_TESTS=1 npm test -- src/test/integration/full-flow/ --test-timeout=1800000
Audited all 44 test files one by one. Documents what each test verifies,
identifies 12 redundant test pairs and 13 coverage gaps (prioritized), and
adds a fragility assessment and a list of mock style inconsistencies.
Implements cassette recording/replay to test the full agent execution
pipeline (ProcessManager → FileTailer → OutputHandler → SignalManager)
without real AI API calls.
Key components:
- `CassetteProcessManager`: extends ProcessManager, intercepts spawnDetached
to replay cassettes or record real runs on completion
- `replay-worker.mjs`: standalone node script that replays JSONL + signal.json
as a subprocess, exercising the complete file-based output pipeline
- `CassetteStore`: reads/writes cassette JSON files keyed by SHA256 hash
- `normalizer.ts`: strips dynamic content (UUIDs, temp paths, timestamps,
session numbers) from prompts for stable cassette keys
- `key.ts`: hashes normalized prompt + provider args + worktree file content
(worktree hash detects content drift for execute-mode agents)
- `createCassetteHarness()`: wraps RealProviderHarness with cassette support,
same interface so existing real-provider tests work unchanged
Mode control via env vars:
  (default)                   → replay: cassette must exist (safe for CI)
  CW_CASSETTE_RECORD=1        → auto: replay if exists, record if missing
  CW_CASSETTE_FORCE_RECORD=1  → record: always run real agent, overwrite cassette
MultiProviderAgentManager gains an optional `processManagerOverride` constructor
parameter for clean dependency injection without changing existing callers.
Cassette files live in src/test/cassettes/ and are intended to be committed
to git so CI runs without API access.
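The mode selection above could look roughly like this. It is a sketch only: the function names are hypothetical, and it assumes CW_CASSETTE_FORCE_RECORD takes precedence when both variables are set, which this log does not state.

```typescript
type CassetteMode = "replay" | "auto" | "record";
type CassetteAction = "replay" | "record" | "fail";

// Hypothetical helper. Assumption: FORCE_RECORD wins if both are set.
function resolveCassetteMode(env: Record<string, string | undefined>): CassetteMode {
  if (env["CW_CASSETTE_FORCE_RECORD"] === "1") return "record";
  if (env["CW_CASSETTE_RECORD"] === "1") return "auto";
  return "replay";
}

// Decides what to do for one agent spawn given the mode and cassette state.
function decideAction(mode: CassetteMode, cassetteExists: boolean): CassetteAction {
  if (mode === "record") return "record";
  if (cassetteExists) return "replay";
  // Default mode never calls the real API, so a missing cassette is an error.
  return mode === "auto" ? "record" : "fail";
}

console.log(decideAction(resolveCassetteMode({}), false));
```

Failing on a missing cassette in the default mode is what keeps CI from silently making real API calls.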
- Add withFakeTimers(fn) helper to TestHarness for scoped timer control
- Replace all vi.runAllTimersAsync() with harness.advanceTimers() in E2E
and harness tests (37 call sites across 5 files)
- Keep vi.useFakeTimers() per-test activation pattern (intentional)
- Add @vitest/coverage-v8 dep so `npm run test:coverage` actually works
- Add exclude patterns to vitest config (node_modules, dist, packages)
- Replace dynamic import('vitest') in advanceTimers with direct vi import
Nulls out agents.initiativeId before deleting the initiative row,
ensuring the delete succeeds even on databases where migration 0025
(which adds ON DELETE SET NULL to the FK) hasn't been applied.
Dispatch manager now wraps task descriptions with buildExecutePrompt()
so agents receive the full execution protocol. Update test to match
prompt wrapping. Add worktree isolation note to workspace layout.
Drop redundant Specificity Test section (covered by examples and checklist),
remove Task Design Rules (implied by entire prompt), flatten frontmatter
docs, trim good example, tighten sizing/checkpoint/context sections.
Remove CODEBASE_VERIFICATION references, document new shared constants
(TEST_INTEGRITY, SESSION_STARTUP, PROGRESS_TRACKING), update mode prompt
descriptions with TDD protocol, Definition of Done checklists, and
mandatory test specifications.
Replace the weak 7-step execution protocol with an explicit red-green-refactor
cycle that requires agents to write failing tests before implementing. Move
anti-patterns and scope rules above deviation/git sections so critical
constraints get more attention. Add session startup verification, progress
tracking, and a mandatory definition-of-done checklist that must pass before
signaling completion. Remove dead CODEBASE_VERIFICATION import.
Detail: Replace vague "how to verify" requirement with mandatory test specification
(file path, scenarios, run command) for execute-category tasks. Update good-task
example to demonstrate the new format. Add Definition of Done checklist.
Plan: Add Testing Strategy section requiring tests within each implementation phase
instead of trailing test phases. Add Definition of Done checklist.
Anchor on ~150 lines changed as the sweet spot based on SWE-bench Pro
data (107 lines / 4.1 files: 46% success for best agents). Old rules
used file count as the primary proxy, which correlates poorly with task
difficulty compared to lines changed.
Add CONTEXT_MANAGEMENT shared block to plan and detail mode prompts so
architect agents also benefit from compaction awareness and parallel
execution hints. Update index.ts re-exports and agent docs.
- Add CONTEXT_MANAGEMENT constant: tells agents to keep working through
context compaction and parallelize reads
- Add "why" reasoning to each GIT_WORKFLOW rule so agents understand the
purpose, not just the rule
- Slim buildInterAgentCommunication: replace verbose bash code blocks with
a brief usage pattern paragraph, condense CLI docs to bullet list
Standalone agents (no initiative or 0 linked projects) run in a
workspace/ subdirectory, but signal.json lookups used the parent
directory. This caused all standalone agents to be marked "crashed"
despite successful completion.
Track the actual agent cwd at spawn time via ActiveAgent.agentCwd
and probe for the workspace/ subdirectory during reconciliation and
crash detection paths.
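A sketch of the lookup order described above, assuming a tracked cwd is preferred and the workspace/ probe is the fallback. `resolveAgentCwd`, `ActiveAgentLike`, and the injected `exists` check are hypothetical stand-ins, and the location of signal.json below the resolved directory is left out because it is not given here.

```typescript
// Hypothetical shape; only agentCwd is named in this log.
interface ActiveAgentLike {
  agentCwd?: string;
}

// Returns the directory the agent actually ran in.
// `exists` stands in for a filesystem check so the sketch stays testable.
function resolveAgentCwd(
  agent: ActiveAgentLike,
  agentDir: string,
  exists: (path: string) => boolean,
): string {
  // Prefer the cwd recorded at spawn time.
  if (agent.agentCwd) return agent.agentCwd;
  // Otherwise probe for the workspace/ subdirectory used by standalone agents.
  const workspaceDir = `${agentDir}/workspace`;
  return exists(workspaceDir) ? workspaceDir : agentDir;
}

console.log(resolveAgentCwd({}, "/agents/a1", (p) => p === "/agents/a1/workspace"));
```

The probe matters for agents that were spawned before the cwd was tracked, for example during reconciliation after a restart.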
- Add deleteTask tRPC mutation (repo already had delete method)
- Add X button to TaskRow, hidden until hover, with confirmation dialog
- Shift+click bypasses confirmation for fast bulk deletion
- Invalidates listInitiativeTasks on success
- Document shift+click pattern in CLAUDE.md as standard for destructive actions
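The shift+click pattern reduces to a small decision, sketched here with hypothetical names (`handleDeleteClick`, `ClickLike`); the real TaskRow handler and its dialog are not shown in this log.

```typescript
// Minimal stand-in for a mouse event; only shiftKey matters here.
interface ClickLike {
  shiftKey: boolean;
}

// Hypothetical handler: skips the confirmation dialog when Shift is held.
// Returns true if the delete was carried out.
function handleDeleteClick(
  event: ClickLike,
  confirm: () => boolean,
  deleteTask: () => void,
): boolean {
  // Short-circuit: confirm() is never called on shift+click.
  const confirmed = event.shiftKey || confirm();
  if (confirmed) deleteTask();
  return confirmed;
}

let deleted = 0;
handleDeleteClick({ shiftKey: true }, () => false, () => { deleted++; });
console.log(deleted);
```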
- New onConversationAnswer subscription: listens for conversation:answered
events matching a specific conversation ID, yields the answer text
- cw ask now subscribes via SSE instead of polling getConversation
- Removed --poll-interval and --timeout flags from cw ask
- Updated prompt to reflect SSE-based cw ask (no polling options)
- INTER_AGENT_COMMUNICATION constant → buildInterAgentCommunication(agentId) function
- Manager injects actual agent ID into prompt after DB record creation
- Agent ID hardcoded in cw listen/ask commands — no manifest.json indirection
- cw listen now uses onPendingConversation SSE subscription instead of polling
- CLI trpc-client upgraded with splitLink for subscription support
- All CLI flags (--agent-id, --from, --timeout, --poll-interval) documented in prompt
- conversation:created/answered added to ALL_EVENT_TYPES
- Execution mode badge toggles between YOLO/REVIEW on click
- Branch badge opens inline editor (input + save/cancel)
- Branch editing locked once any task has left pending status
- Server-side guard rejects branch changes after work has started
- getInitiative returns branchLocked flag
- updateInitiativeConfig now accepts optional branch field
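The branch lock rule can be sketched as a pure check plus a guard. The names `isBranchLocked` and `applyBranchUpdate`, the error text, and the task shape are assumptions for illustration; only the "locked once any task has left pending" rule comes from the notes above.

```typescript
// Hypothetical task shape; only the status field matters for the lock.
interface TaskLike {
  status: string;
}

// The branch is locked as soon as any task is no longer pending.
function isBranchLocked(tasks: TaskLike[]): boolean {
  return tasks.some((task) => task.status !== "pending");
}

// Server-side guard: rejects a branch change once work has started.
function applyBranchUpdate(current: string, next: string | undefined, tasks: TaskLike[]): string {
  if (next === undefined || next === current) return current;
  if (isBranchLocked(tasks)) {
    throw new Error("Branch cannot be changed after work has started");
  }
  return next;
}

console.log(applyBranchUpdate("main", "feature/x", [{ status: "pending" }]));
```

Returning the same flag from getInitiative lets the UI disable the editor, while the guard covers clients that skip the UI.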
Body: height 100vh + overflow hidden instead of min-height 100vh,
so the browser never shows a scrollbar on html/body.
AppLayout: h-screen flex column with shrink-0 header and flex-1
min-h-0 overflow-auto main. Pages like initiatives scroll within
main; agents page uses h-full with internal panel scrollers.
Left agent list gets min-h-0 for proper overflow containment in grid.
Right output panel gets overflow-hidden so AgentOutputViewer stays
within the available grid cell height.
Replace full CoordinationServer with a lightweight mock that serves only
conversation tRPC procedures backed by an in-memory repository. Agents
now have real coding tasks (write spec, ask questions, create summary)
and the two-question flow proves the listen→answer→re-listen cycle works.
Two-session test: Agent A listens for questions and answers, Agent B
asks a question and captures the response. Also fixes missing
conversationRepository passthrough in tRPC adapter.
Three bugs fixed in auto-cleanup commit retry flow:
1. resumeForCommit now calls getDirtyWorktreePaths() to include specific
project subdirectory names in the prompt, so the agent knows which
dirs to cd into and commit (instead of running git from the non-repo
parent dir).
2. Removed finally block in tryAutoCleanup that reset the retry counter
after every call, making MAX_COMMIT_RETRIES ineffective. Counter is
now only cleaned up on success, max retries, or error.
3. resumeForCommit returns false early if no worktrees are actually
dirty, preventing unnecessary commit retries for clean agents.
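The counter fix (item 2) and the early return (item 3) can be illustrated with a simplified, synchronous sketch. The value of MAX_COMMIT_RETRIES, the `Outcome` names, and the function signature are assumptions; the real tryAutoCleanup resumes an agent and is not shown here.

```typescript
const MAX_COMMIT_RETRIES = 3; // assumed value for illustration
const retryCounts = new Map<string, number>();

type Outcome = "clean" | "retrying" | "gave_up";

// Simplified stand-in: `dirtyPaths` plays the role of getDirtyWorktreePaths().
function tryAutoCleanup(agentId: string, dirtyPaths: string[]): Outcome {
  if (dirtyPaths.length === 0) {
    // Success: nothing left to commit, so the counter can be dropped.
    retryCounts.delete(agentId);
    return "clean";
  }
  const attempt = (retryCounts.get(agentId) ?? 0) + 1;
  if (attempt > MAX_COMMIT_RETRIES) {
    // Max retries reached: stop and clean up.
    retryCounts.delete(agentId);
    return "gave_up";
  }
  // No `finally` reset here: the count must survive until the next call,
  // otherwise the limit above could never be reached.
  retryCounts.set(agentId, attempt);
  return "retrying";
}

console.log(tryAutoCleanup("agent-1", ["todo-api"]));
```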
Enables parallel agents to communicate through a CLI-based conversation
mechanism coordinated via tRPC. Agents can ask peers questions and
receive answers, with target resolution by agent ID, task ID, or phase ID.
Preview deployments let reviewers spin up the app at a specific branch
in local Docker containers, accessible through a single Caddy reverse
proxy port. Docker is the source of truth — no database table needed.
New module: src/preview/ with config discovery (.cw-preview.yml →
compose → Dockerfile fallback), compose generation, Docker CLI wrapper,
health checking, and port allocation (9100-9200 range).
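Port allocation in that range can be sketched as a first-free scan. This assumes both ends of 9100-9200 are usable and that the set of ports in use is supplied by the caller (for example, derived from running containers); `allocatePort` is a hypothetical name.

```typescript
const PORT_MIN = 9100;
const PORT_MAX = 9200; // assumed inclusive

// Returns the lowest free port in the range, or undefined if all are taken.
function allocatePort(inUse: ReadonlySet<number>): number | undefined {
  for (let port = PORT_MIN; port <= PORT_MAX; port++) {
    if (!inUse.has(port)) return port;
  }
  return undefined;
}

console.log(allocatePort(new Set([9100, 9101])));
```

Because Docker is the source of truth, the in-use set would be rebuilt on each call rather than stored anywhere.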
Setup tokens from `claude setup-token` can't query the usage API,
resulting in a useless "Usage API request failed" message. Now shows
the actual HTTP status and guides users to complete OAuth setup.
Also distinguishes warning state (yellow) from error state (red)
in the AccountCard UI.
DB log chunk insertion is now the sole trigger for agent:output events.
Eliminates triple emission (FileTailer, handleStreamEvent, output buffer)
in favor of: FileTailer.onRawContent → DB insert → EventBus emit.
- createLogChunkCallback emits agent:output after successful DB insert
- spawnInternal now wires onRawContent callback (fixes session 1 gap)
- Remove eventBus from FileTailer (no longer touches EventBus)
- Remove eventBus from ProcessManager constructor (dead parameter)
- Remove agent:output emission from handleStreamEvent text_delta
- Remove outputBuffers map and all buffer helpers from manager/handler
- Remove getOutputBuffer from AgentManager interface and implementations
- getAgentOutput tRPC: DB-only, no file fallback
- onAgentOutput subscription: no initial buffer yield, events only
- AgentOutputViewer: accumulates raw JSONL chunks, parses uniformly
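The single emission path (raw content, then DB insert, then event) can be sketched as below. The `InsertChunk` and `EmitOutput` types are in-memory stand-ins, and the callback signature is an assumption; only the ordering comes from the notes above.

```typescript
// Stand-in types; the real repository and EventBus are not shown in this log.
type InsertChunk = (agentId: string, content: string) => Promise<void>;
type OutputEvent = { type: "agent:output"; agentId: string; content: string };
type EmitOutput = (event: OutputEvent) => void;

// Returns the callback a file tailer would invoke for each raw chunk.
function createLogChunkCallback(agentId: string, insertChunk: InsertChunk, emit: EmitOutput) {
  return async (content: string): Promise<void> => {
    // If the insert rejects, the emit below is skipped, so subscribers
    // never see output that is missing from the database.
    await insertChunk(agentId, content);
    emit({ type: "agent:output", agentId, content });
  };
}

const events: OutputEvent[] = [];
const onRawContent = createLogChunkCallback("agent-1", async () => {}, (e) => { events.push(e); });
onRawContent('{"type":"text"}').then(() => console.log(events.length));
```

With one emitter, a subscriber that replays from the DB and then listens for events sees each chunk exactly once.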