feat: Phase schema refactor, agent lifecycle module, and log chunks

Phase model changes:
- Drop `number` column (ordering now by createdAt + dependency DAG)
- Replace `description` (plain text) with `content` (Tiptap JSON)
- Add `approved` status as dispatch gate
- Add phase dependency management (list, remove, dependents)
- Approval gate in PhaseDispatchManager.queuePhase()

Agent log chunks:
- New `agent_log_chunks` table for DB-first output persistence
- LogChunkRepository port + DrizzleLogChunkRepository adapter
- FileTailer onRawContent callback streams chunks to DB
- getAgentOutput reads from DB first, falls back to file

Agent lifecycle module (src/agent/lifecycle/):
- SignalManager: atomic signal.json read/write/wait operations
- RetryPolicy: exponential backoff with error-specific strategies
- ErrorAnalyzer: pattern-based error classification
- CleanupStrategy: debug archival vs production cleanup
- AgentLifecycleController: orchestrates retry/recovery flow
- Missing signal recovery with instruction injection

Completion detection fixes:
- Read signal.json file instead of parsing stdout as JSON
- Cancellable pollForCompletion with { cancel } handle
- Centralized state cleanup via cleanupAgentState()
- Credential handler consolidation (prepareProcessEnv)

Prompts refactor:
- Split monolithic prompts.ts into per-mode modules
- Add workspace layout section to agent prompts
- Fix markdown-to-tiptap double-serialization

Server/tRPC:
- Subscription heartbeat (30s) and bounded queue (1000 max)
- Phase CRUD: approvePhase, deletePhase, dependency queries
- Page: findByIds, getPageUpdatedAtMap
- Wire new repositories through container and context
Lukas May
2026-02-09 22:33:28 +01:00
parent 43e2c8b0ba
commit fab7706f5c
81 changed files with 4113 additions and 495 deletions

docs/crash-marking-fix.md Normal file

@@ -0,0 +1,91 @@
# Crash Marking Race Condition Fix
## Problem
Agents were being incorrectly marked as "crashed" despite completing successfully with valid `signal.json` files. This happened because of a **race condition** in the output polling logic.
### Root Cause
In `src/agent/output-handler.ts`, the `handleCompletion()` method had two code paths for completion detection:
1. **Main path** (line 273): Checked for `signal.json` completion - ✅ **WORKED**
2. **Error path** (line 306): Marked the agent as crashed when no new output was detected - ❌ **PROBLEM**
The race condition occurred when:
1. Agent completes and writes `signal.json`
2. Polling happens before all output is flushed to disk
3. No new output detected in `output.jsonl`
4. Error path triggers `handleAgentError()` → marks agent as crashed
5. Completion detection never runs because the agent is already marked as crashed
### Specific Case: slim-wildebeest
- **Agent ID**: `t9itQywbC0aZBZyc_SL0V`
- **Status**: `crashed` (incorrect)
- **Exit Code**: `NULL` (agent marked crashed before process exit)
- **Signal File**: Valid `signal.json` with `status: "questions"`
- **Problem**: Error path triggered before completion detection
## Solution
**Added completion detection in the error path** before marking agent as crashed.
### Code Change
In `src/agent/output-handler.ts` around line 305:
```typescript
// BEFORE (line 306)
log.warn({ agentId }, 'no result text from stream or file');
await this.handleAgentError(agentId, new Error('No output received'), provider, getAgentWorkdir);

// AFTER
// Before marking as crashed, check if the agent actually completed successfully
const agentWorkdir = getAgentWorkdir(agentId);
if (await this.checkSignalCompletion(agentWorkdir)) {
  const signalPath = join(agentWorkdir, '.cw/output/signal.json');
  const signalContent = await readFile(signalPath, 'utf-8');
  log.debug({ agentId, signalPath }, 'detected completion via signal.json in error path');
  this.filePositions.delete(agentId); // Clean up tracking
  await this.processSignalAndFiles(agentId, signalContent, agent.mode as AgentMode, getAgentWorkdir, active?.streamSessionId);
  return;
}
log.warn({ agentId }, 'no result text from stream or file');
await this.handleAgentError(agentId, new Error('No output received'), provider, getAgentWorkdir);
```
### Logic
The fix adds a **final completion check** right before marking an agent as crashed. If the agent has a valid `signal.json` with status `done`, `questions`, or `error`, the handler processes the completion instead of marking the agent as crashed.
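To make the check concrete, here is a minimal sketch of what "a valid `signal.json`" could mean. It is not the actual `checkSignalCompletion()` implementation, which this document does not show. The helper names `parseTerminalStatus` and `readSignalStatus` are hypothetical, and the sketch assumes the file is a JSON object with a top-level `status` field.

```typescript
import { readFile } from 'node:fs/promises';
import { join } from 'node:path';

// The three statuses this document lists as valid completions.
const TERMINAL_STATUSES = ['done', 'questions', 'error'] as const;
type TerminalStatus = (typeof TERMINAL_STATUSES)[number];

// Hypothetical helper: returns the status when `raw` is a JSON object whose
// `status` is one of the terminal statuses, otherwise null.
export function parseTerminalStatus(raw: string): TerminalStatus | null {
  try {
    const parsed: unknown = JSON.parse(raw);
    if (typeof parsed !== 'object' || parsed === null) return null;
    const status = (parsed as { status?: unknown }).status;
    return TERMINAL_STATUSES.includes(status as TerminalStatus)
      ? (status as TerminalStatus)
      : null;
  } catch {
    return null; // malformed or partially written JSON
  }
}

// Hypothetical helper: reads <workdir>/.cw/output/signal.json (the path used
// in the code change above) and classifies it. A missing or unreadable file
// is treated the same as "no valid signal".
export async function readSignalStatus(agentWorkdir: string): Promise<TerminalStatus | null> {
  try {
    const raw = await readFile(join(agentWorkdir, '.cw/output/signal.json'), 'utf-8');
    return parseTerminalStatus(raw);
  } catch {
    return null;
  }
}
```

Treating malformed JSON as "no signal" rather than throwing lets the caller fall through to the existing crash handling when the file is only partially written.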
## Impact
- **Prevents false crash marking** for agents that completed successfully
- **No performance impact** - only runs when no new output detected (rare)
- **Backward compatible** - still marks truly crashed agents as crashed
- **Comprehensive** - uses same robust `checkSignalCompletion()` logic as main path
## Testing
Verified with completion detection tests:
```bash
npm test -- src/agent/completion-detection.test.ts
```
Manual verification showed slim-wildebeest's signal.json would be detected:
- Signal file exists: `true`
- Status: `questions`
- Should complete: `true`
## Future Improvements
1. **Unified completion detection** - Consider consolidating all completion logic into a single method
2. **Enhanced logging** - Add more detailed logs for completion vs crash decisions
3. **Metrics** - Track completion detection success rates to identify remaining edge cases
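
As one possible shape for the first two items, the sketch below folds the completion-versus-crash decision into a single pure function that also returns a human-readable reason for logging. Everything here is hypothetical (`PollSnapshot`, `Decision`, `decideOutcome`) and does not exist in the codebase. It only encodes the ordering this fix relies on: a valid signal always wins over "no new output".

```typescript
type SignalStatus = 'done' | 'questions' | 'error';

// Hypothetical snapshot of what a single poll observed.
interface PollSnapshot {
  hasNewOutput: boolean;             // new data seen in output.jsonl this poll
  signalStatus: SignalStatus | null; // status from signal.json, or null if absent/invalid
}

type Decision =
  | { kind: 'completed'; status: SignalStatus; reason: string }
  | { kind: 'keep-polling'; reason: string }
  | { kind: 'crashed'; reason: string };

// Single decision point: check the signal first, so an agent that finished
// before its output was flushed is never classified as crashed.
export function decideOutcome(snapshot: PollSnapshot): Decision {
  if (snapshot.signalStatus !== null) {
    return {
      kind: 'completed',
      status: snapshot.signalStatus,
      reason: `signal.json reports status "${snapshot.signalStatus}"`,
    };
  }
  if (snapshot.hasNewOutput) {
    return { kind: 'keep-polling', reason: 'new output detected, no signal yet' };
  }
  return { kind: 'crashed', reason: 'no signal.json and no new output' };
}
```

Because the function is pure, the slim-wildebeest case (no new output, `status: "questions"`) can be covered by a plain unit test, and the `reason` string can be passed straight to the logger.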
## Related Files
- `src/agent/output-handler.ts` - Main fix location
- `src/agent/completion-detection.test.ts` - Existing test coverage
- `src/agent/manager.ts` - Secondary crash handling (different logic)