feat: Phase schema refactor, agent lifecycle module, and log chunks

Phase model changes:
- Drop `number` column (ordering now by createdAt + dependency DAG)
- Replace `description` (plain text) with `content` (Tiptap JSON)
- Add `approved` status as dispatch gate
- Add phase dependency management (list, remove, dependents)
- Approval gate in PhaseDispatchManager.queuePhase()

Agent log chunks:
- New `agent_log_chunks` table for DB-first output persistence
- LogChunkRepository port + DrizzleLogChunkRepository adapter
- FileTailer onRawContent callback streams chunks to DB
- getAgentOutput reads from DB first, falls back to file

Agent lifecycle module (src/agent/lifecycle/):
- SignalManager: atomic signal.json read/write/wait operations
- RetryPolicy: exponential backoff with error-specific strategies
- ErrorAnalyzer: pattern-based error classification
- CleanupStrategy: debug archival vs production cleanup
- AgentLifecycleController: orchestrates retry/recovery flow
- Missing signal recovery with instruction injection

Completion detection fixes:
- Read signal.json file instead of parsing stdout as JSON
- Cancellable pollForCompletion with { cancel } handle
- Centralized state cleanup via cleanupAgentState()
- Credential handler consolidation (prepareProcessEnv)

Prompts refactor:
- Split monolithic prompts.ts into per-mode modules
- Add workspace layout section to agent prompts
- Fix markdown-to-tiptap double-serialization

Server/tRPC:
- Subscription heartbeat (30s) and bounded queue (1000 max)
- Phase CRUD: approvePhase, deletePhase, dependency queries
- Page: findByIds, getPageUpdatedAtMap
- Wire new repositories through container and context
2026-02-09 22:33:28 +01:00
parent 43e2c8b0ba
commit fab7706f5c
81 changed files with 4113 additions and 495 deletions
--- a/docs/crash-marking-fix.md
+++ b/docs/crash-marking-fix.md
@@ -0,0 +1,91 @@
+# Crash Marking Race Condition Fix
+
+## Problem
+
+Agents were being incorrectly marked as "crashed" despite completing successfully with valid `signal.json` files. This happened because of a **race condition** in the output polling logic.
+
+### Root Cause
+
+In `src/agent/output-handler.ts`, the `handleCompletion()` method had two code paths for completion detection:
+
+1. **Main path** (line 273): Checked for `signal.json` completion - ✅ **WORKED**
+2. **Error path** (line 306): Marked agent as crashed when no new output detected - ❌ **PROBLEM**
+
+The race condition occurred when:
+1. Agent completes and writes `signal.json`
+2. Polling runs before all output has been flushed to disk
+3. No new output is detected in `output.jsonl`
+4. Error path triggers `handleAgentError()` → marks the agent as crashed
+5. Completion detection never runs because the agent is already marked as crashed
+
+### Specific Case: slim-wildebeest
+
+- **Agent ID**: `t9itQywbC0aZBZyc_SL0V`
+- **Status**: `crashed` (incorrect)
+- **Exit Code**: `NULL` (agent marked crashed before process exit)
+- **Signal File**: Valid `signal.json` with `status: "questions"` ✅
+- **Problem**: Error path triggered before completion detection
+
+## Solution
+
+**Added completion detection in the error path** before marking agent as crashed.
+
+### Code Change
+
+In `src/agent/output-handler.ts` around line 305:
+
+```typescript
+// BEFORE (line 306)
+log.warn({ agentId }, 'no result text from stream or file');
+await this.handleAgentError(agentId, new Error('No output received'), provider, getAgentWorkdir);
+
+// AFTER
+// Before marking as crashed, check if the agent actually completed successfully
+const agentWorkdir = getAgentWorkdir(agentId);
+if (await this.checkSignalCompletion(agentWorkdir)) {
+  const signalPath = join(agentWorkdir, '.cw/output/signal.json');
+  const signalContent = await readFile(signalPath, 'utf-8');
+  log.debug({ agentId, signalPath }, 'detected completion via signal.json in error path');
+  this.filePositions.delete(agentId); // Clean up tracking
+  await this.processSignalAndFiles(agentId, signalContent, agent.mode as AgentMode, getAgentWorkdir, active?.streamSessionId);
+  return;
+}
+
+log.warn({ agentId }, 'no result text from stream or file');
+await this.handleAgentError(agentId, new Error('No output received'), provider, getAgentWorkdir);
+```
+
+### Logic
+
+The fix adds a **final completion check** right before an agent is marked as crashed. If the agent has a valid `signal.json` with status `done`, `questions`, or `error`, the handler processes the completion and returns instead of marking the agent as crashed.
+
+## Impact
+
+- ✅ **Prevents false crash marking** for agents that completed successfully
+- ✅ **Negligible performance impact** - the extra check only runs when no new output is detected (rare)
+- ✅ **Backward compatible** - still marks truly crashed agents as crashed
+- ✅ **Consistent** - reuses the same `checkSignalCompletion()` logic as the main path
+
+## Testing
+
+Verified with completion detection tests:
+```bash
+npm test -- src/agent/completion-detection.test.ts
+```
+
+Manual verification showed slim-wildebeest's signal.json would be detected:
+- Signal file exists: `true`
+- Status: `questions`
+- Should complete: `true` ✅
+
+## Future Improvements
+
+1. **Unified completion detection** - Consider consolidating all completion logic into a single method
+2. **Enhanced logging** - Add more detailed logs for completion vs crash decisions
+3. **Metrics** - Track completion detection success rates to identify remaining edge cases
+
+## Related Files
+
+- `src/agent/output-handler.ts` - Main fix location
+- `src/agent/completion-detection.test.ts` - Existing test coverage
+- `src/agent/manager.ts` - Secondary crash handling (different logic)
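
The check-before-crash ordering described in the doc above can be sketched in isolation. This is a minimal illustration, not the code in `src/agent/output-handler.ts`: `readSignalStatus`, `resolveNoOutput`, and the `Outcome` type are hypothetical names, it uses synchronous reads for brevity, and it assumes `signal.json` is a JSON object with a top-level `status` field. The `.cw/output/signal.json` path and the three statuses are taken from the doc.

```typescript
import { mkdirSync, mkdtempSync, readFileSync, writeFileSync } from 'node:fs';
import { tmpdir } from 'node:os';
import { join } from 'node:path';

// Statuses the doc lists as valid completion signals.
const TERMINAL_STATUSES = ['done', 'questions', 'error'] as const;
type SignalStatus = (typeof TERMINAL_STATUSES)[number];

// Hypothetical result type for the "no new output" branch.
type Outcome =
  | { kind: 'completed'; status: SignalStatus }
  | { kind: 'crashed' };

// Hypothetical helper: returns the status from a valid signal.json, else null.
function readSignalStatus(agentWorkdir: string): SignalStatus | null {
  const signalPath = join(agentWorkdir, '.cw/output/signal.json');
  try {
    const parsed: unknown = JSON.parse(readFileSync(signalPath, 'utf-8'));
    const status = (parsed as { status?: unknown } | null)?.status;
    return TERMINAL_STATUSES.includes(status as SignalStatus)
      ? (status as SignalStatus)
      : null;
  } catch {
    // Missing file or partially written JSON: treat as "no signal".
    return null;
  }
}

// Decide the outcome when polling saw no new output:
// consult signal.json first, and only fall back to "crashed" if it is absent or invalid.
function resolveNoOutput(agentWorkdir: string): Outcome {
  const status = readSignalStatus(agentWorkdir);
  return status ? { kind: 'completed', status } : { kind: 'crashed' };
}

// Usage: reproduce the slim-wildebeest situation (signal written, no new output).
const workdir = mkdtempSync(join(tmpdir(), 'agent-'));
mkdirSync(join(workdir, '.cw/output'), { recursive: true });
writeFileSync(
  join(workdir, '.cw/output/signal.json'),
  JSON.stringify({ status: 'questions' }),
);
console.log(resolveNoOutput(workdir)); // completed, with status 'questions'
```

Returning a tagged outcome rather than calling the crash handler directly keeps the decision testable on its own; the real handler additionally processes the signal content and clears per-agent tracking state.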