Phase model changes:
- Drop `number` column (ordering now by createdAt + dependency DAG)
- Replace `description` (plain text) with `content` (Tiptap JSON)
- Add `approved` status as dispatch gate
- Add phase dependency management (list, remove, dependents)
- Approval gate in PhaseDispatchManager.queuePhase()
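A minimal sketch of how the approval gate and the new ordering could fit together. The `Phase` shape, the `dependsOn` field, the statuses other than `approved`, and both helper functions are hypothetical illustrations, not the actual PhaseDispatchManager code:

```typescript
// Hypothetical phase shape; only the `approved` status and createdAt ordering come from the notes above.
interface Phase {
  id: string;
  status: "pending" | "approved" | "completed";
  createdAt: number; // epoch milliseconds
  dependsOn: string[]; // ids of phases that must be ordered first
}

// Approval gate: reject any phase that has not been explicitly approved.
function assertApproved(phase: Phase): void {
  if (phase.status !== "approved") {
    throw new Error(`Phase ${phase.id} is not approved (status: ${phase.status})`);
  }
}

// Order phases so that dependencies come first, breaking ties by createdAt.
function orderPhases(phases: Phase[]): Phase[] {
  const known = new Set(phases.map((p) => p.id));
  const placed = new Set<string>();
  const pending = [...phases].sort((a, b) => a.createdAt - b.createdAt);
  const ordered: Phase[] = [];
  while (pending.length > 0) {
    // Pick the earliest-created phase whose known dependencies are already placed.
    const idx = pending.findIndex((p) =>
      p.dependsOn.every((d) => !known.has(d) || placed.has(d)),
    );
    if (idx === -1) throw new Error("Dependency cycle detected");
    const next = pending[idx] as Phase;
    pending.splice(idx, 1);
    placed.add(next.id);
    ordered.push(next);
  }
  return ordered;
}
```

In this sketch a caller would run `assertApproved(phase)` before queuing, and use `orderPhases` wherever the removed `number` column previously determined display or dispatch order.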
Agent log chunks:
- New `agent_log_chunks` table for DB-first output persistence
- LogChunkRepository port + DrizzleLogChunkRepository adapter
- FileTailer onRawContent callback streams chunks to DB
- getAgentOutput reads from DB first, falls back to file
Agent lifecycle module (src/agent/lifecycle/):
- SignalManager: atomic signal.json read/write/wait operations
- RetryPolicy: exponential backoff with error-specific strategies
- ErrorAnalyzer: pattern-based error classification
- CleanupStrategy: debug archival vs production cleanup
- AgentLifecycleController: orchestrates retry/recovery flow
- Missing signal recovery with instruction injection
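The retry flow described above can be sketched roughly as follows. The error categories, regex patterns, default delays, and the function names `classifyError` and `nextRetry` are all assumptions made for illustration; the real RetryPolicy and ErrorAnalyzer may differ:

```typescript
// Hypothetical error categories; the real ErrorAnalyzer patterns are not shown in these notes.
type ErrorKind = "rate_limit" | "auth" | "missing_signal" | "unknown";

interface RetryDecision {
  retry: boolean;
  delayMs: number;
}

// Pattern-based classification of an error message.
function classifyError(message: string): ErrorKind {
  if (/rate limit|429/i.test(message)) return "rate_limit";
  if (/unauthorized|401/i.test(message)) return "auth";
  if (/signal\.json/i.test(message)) return "missing_signal";
  return "unknown";
}

// Exponential backoff: baseMs * 2^attempt, capped at maxMs. `attempt` is zero-based.
function nextRetry(
  kind: ErrorKind,
  attempt: number,
  baseMs = 1000,
  maxMs = 30000,
  maxAttempts = 3,
): RetryDecision {
  // Assumed strategy: authentication failures are never retried.
  if (kind === "auth" || attempt >= maxAttempts) return { retry: false, delayMs: 0 };
  // Assumed strategy: a missing signal is retried immediately (for example with injected instructions).
  if (kind === "missing_signal") return { retry: true, delayMs: 0 };
  return { retry: true, delayMs: Math.min(maxMs, baseMs * 2 ** attempt) };
}
```

Keeping classification separate from the delay calculation means each error kind can get its own strategy without touching the backoff arithmetic.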
Completion detection fixes:
- Read signal.json file instead of parsing stdout as JSON
- Cancellable pollForCompletion with { cancel } handle
- Centralized state cleanup via cleanupAgentState()
- Credential handler consolidation (prepareProcessEnv)
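One possible shape for a cancellable poll with a `{ cancel }` handle is sketched below. The `PollHandle` interface, the `check` callback, and the default interval are assumptions; only the idea of returning a cancel handle comes from the notes above:

```typescript
// Hypothetical handle shape: a promise for the outcome plus a cancel function.
interface PollHandle {
  done: Promise<boolean>; // resolves true on completion, false if cancelled
  cancel: () => void;
}

function pollForCompletion(check: () => Promise<boolean>, intervalMs = 1000): PollHandle {
  let cancelled = false;
  let timer: ReturnType<typeof setTimeout> | undefined;
  let wake: (() => void) | undefined;

  const done = (async () => {
    while (!cancelled) {
      if (await check()) return true;
      // cancel() may have been called while check() was in flight.
      if (cancelled) break;
      await new Promise<void>((resolve) => {
        wake = resolve;
        timer = setTimeout(resolve, intervalMs);
      });
    }
    return false;
  })();

  const cancel = (): void => {
    cancelled = true;
    if (timer !== undefined) clearTimeout(timer);
    // Resolve the pending sleep so the loop exits without waiting for the interval.
    if (wake) wake();
  };

  return { done, cancel };
}
```

A caller would keep the handle alongside the agent's other state and call `handle.cancel()` during cleanup, so no timer outlives the agent it was polling for.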
Prompts refactor:
- Split monolithic prompts.ts into per-mode modules
- Add workspace layout section to agent prompts
- Fix markdown-to-tiptap double-serialization
Server/tRPC:
- Subscription heartbeat (30s) and bounded queue (1000 max)
- Phase CRUD: approvePhase, deletePhase, dependency queries
- Page: findByIds, getPageUpdatedAtMap
- Wire new repositories through container and context
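A bounded subscription queue might look like the sketch below. The class name and the drop-oldest overflow behavior are assumptions; the notes above only state the 1000-item cap, not what happens when it is reached:

```typescript
// Hypothetical bounded FIFO queue; the overflow policy (drop oldest) is an assumption.
class BoundedQueue<T> {
  private items: T[] = [];
  private readonly maxSize: number;

  constructor(maxSize = 1000) {
    this.maxSize = maxSize;
  }

  // Returns false when an older item had to be dropped to make room.
  push(item: T): boolean {
    let dropped = false;
    if (this.items.length >= this.maxSize) {
      this.items.shift();
      dropped = true;
    }
    this.items.push(item);
    return !dropped;
  }

  // Removes and returns the oldest item, or undefined when empty.
  shift(): T | undefined {
    return this.items.shift();
  }

  get size(): number {
    return this.items.length;
  }
}
```

Capping the queue keeps a slow or stalled subscriber from growing server memory without limit; whether dropping the oldest or the newest event is preferable depends on what the client needs to see.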
Crash Marking Race Condition Fix
Problem
Agents were being incorrectly marked as "crashed" despite completing successfully with valid signal.json files. This happened because of a race condition in the output polling logic.
Root Cause
In src/agent/output-handler.ts, the handleCompletion() method had two code paths for completion detection:
- Main path (line 273): Checked for signal.json completion - ✅ WORKED
- Error path (line 306): Marked agent as crashed when no new output detected - ❌ PROBLEM
The race condition occurred when:
- Agent completes and writes signal.json
- Polling happens before all output is flushed to disk
- No new output detected in output.jsonl
- Error path triggers handleAgentError() → marks agent as crashed
- Completion detection never runs because agent already marked as crashed
Specific Case: slim-wildebeest
- Agent ID: t9itQywbC0aZBZyc_SL0V
- Status: crashed (incorrect)
- Exit Code: NULL (agent marked crashed before process exit)
- Signal File: Valid signal.json with status: "questions" ✅
- Problem: Error path triggered before completion detection
Solution
Added completion detection in the error path before marking agent as crashed.
Code Change
In src/agent/output-handler.ts around line 305:
// BEFORE (line 306)
log.warn({ agentId }, 'no result text from stream or file');
await this.handleAgentError(agentId, new Error('No output received'), provider, getAgentWorkdir);

// AFTER
// Before marking as crashed, check if the agent actually completed successfully
const agentWorkdir = getAgentWorkdir(agentId);
if (await this.checkSignalCompletion(agentWorkdir)) {
  const signalPath = join(agentWorkdir, '.cw/output/signal.json');
  const signalContent = await readFile(signalPath, 'utf-8');
  log.debug({ agentId, signalPath }, 'detected completion via signal.json in error path');
  this.filePositions.delete(agentId); // Clean up tracking
  await this.processSignalAndFiles(agentId, signalContent, agent.mode as AgentMode, getAgentWorkdir, active?.streamSessionId);
  return;
}
log.warn({ agentId }, 'no result text from stream or file');
await this.handleAgentError(agentId, new Error('No output received'), provider, getAgentWorkdir);
Logic
The fix adds a final completion check immediately before an agent is marked as crashed. If the agent has a valid signal.json with status done, questions, or error, the handler processes the completion instead of marking the agent as crashed.
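The status check at the heart of this logic can be sketched as a pure function. `parseCompletionStatus` is a hypothetical helper written for illustration and is not the actual `checkSignalCompletion()` implementation; the three statuses are the ones listed above:

```typescript
// Statuses that count as a completed run, per the description above.
const COMPLETION_STATUSES = ["done", "questions", "error"] as const;
type CompletionStatus = (typeof COMPLETION_STATUSES)[number];

// Hypothetical helper: returns the completion status found in raw signal.json
// content, or null when the content is malformed or the status is not terminal.
function parseCompletionStatus(raw: string): CompletionStatus | null {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return null; // partially written or corrupt file
  }
  if (typeof parsed !== "object" || parsed === null) return null;
  const status = (parsed as { status?: unknown }).status;
  return COMPLETION_STATUSES.find((s) => s === status) ?? null;
}
```

Treating unparseable content as "not complete" matters here: a signal.json that is still being written should fall through to the existing crash handling or a later poll, not be processed as a finished run.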
Impact
- ✅ Prevents false crash marking for agents that completed successfully
- ✅ Negligible performance impact - the extra check only runs when no new output is detected (rare)
- ✅ Backward compatible - still marks truly crashed agents as crashed
- ✅ Comprehensive - uses same robust checkSignalCompletion() logic as main path
Testing
Verified with completion detection tests:
npm test -- src/agent/completion-detection.test.ts
Manual verification showed slim-wildebeest's signal.json would be detected:
- Signal file exists: true
- Status: questions
- Should complete: true ✅
Future Improvements
- Unified completion detection - Consider consolidating all completion logic into a single method
- Enhanced logging - Add more detailed logs for completion vs crash decisions
- Metrics - Track completion detection success rates to identify remaining edge cases
Related Files
- src/agent/output-handler.ts - Main fix location
- src/agent/completion-detection.test.ts - Existing test coverage
- src/agent/manager.ts - Secondary crash handling (different logic)