Phase model changes:
- Drop `number` column (ordering now by createdAt + dependency DAG)
- Replace `description` (plain text) with `content` (Tiptap JSON)
- Add `approved` status as dispatch gate
- Add phase dependency management (list, remove, dependents)
- Approval gate in PhaseDispatchManager.queuePhase()
Agent log chunks:
- New `agent_log_chunks` table for DB-first output persistence
- LogChunkRepository port + DrizzleLogChunkRepository adapter
- FileTailer onRawContent callback streams chunks to DB
- getAgentOutput reads from DB first, falls back to file
Agent lifecycle module (src/agent/lifecycle/):
- SignalManager: atomic signal.json read/write/wait operations
- RetryPolicy: exponential backoff with error-specific strategies
- ErrorAnalyzer: pattern-based error classification
- CleanupStrategy: debug archival vs production cleanup
- AgentLifecycleController: orchestrates retry/recovery flow
- Missing signal recovery with instruction injection
Completion detection fixes:
- Read signal.json file instead of parsing stdout as JSON
- Cancellable pollForCompletion with { cancel } handle
- Centralized state cleanup via cleanupAgentState()
- Credential handler consolidation (prepareProcessEnv)
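A minimal sketch of what a cancellable poll could look like. The `{ cancel }` handle comes from the item above; the `promise` field, the `isComplete` callback, and the interval default are assumptions made for illustration and are not taken from the codebase.

```typescript
// Hypothetical handle shape: only `cancel` is named in the notes above.
interface PollHandle {
  promise: Promise<boolean>; // resolves true on completion, false if cancelled
  cancel: () => void;
}

function pollForCompletion(
  isComplete: () => boolean,
  intervalMs = 1000,
): PollHandle {
  let timer: ReturnType<typeof setTimeout> | undefined;
  let cancelled = false;
  let settle: (completed: boolean) => void = () => {};
  // The executor runs synchronously, so `settle` is set before any tick fires.
  const promise = new Promise<boolean>((resolve) => {
    settle = resolve;
  });

  const tick = (): void => {
    if (cancelled) return;
    if (isComplete()) {
      settle(true);
      return;
    }
    timer = setTimeout(tick, intervalMs);
  };
  timer = setTimeout(tick, 0);

  return {
    promise,
    cancel: () => {
      cancelled = true;
      if (timer !== undefined) clearTimeout(timer);
      settle(false); // no effect if the promise already resolved
    },
  };
}
```

Clearing the pending timer in `cancel` means a cancelled poll leaves nothing scheduled, so a second handler cannot pick up the same completion later.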
Prompts refactor:
- Split monolithic prompts.ts into per-mode modules
- Add workspace layout section to agent prompts
- Fix markdown-to-tiptap double-serialization
Server/tRPC:
- Subscription heartbeat (30s) and bounded queue (1000 max)
- Phase CRUD: approvePhase, deletePhase, dependency queries
- Page: findByIds, getPageUpdatedAtMap
- Wire new repositories through container and context
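The bounded subscription queue can be sketched as follows. The 1000-item cap comes from the notes above; the class name and the drop-oldest policy are assumptions, since the notes do not say what happens when the queue is full.

```typescript
// Hypothetical bounded FIFO queue. Dropping the oldest item on overflow is
// an assumed policy; the actual server may reject or block instead.
class BoundedQueue<T> {
  private items: T[] = [];
  private dropped = 0;
  private readonly maxSize: number;

  constructor(maxSize = 1000) {
    this.maxSize = maxSize;
  }

  push(item: T): void {
    if (this.items.length >= this.maxSize) {
      this.items.shift(); // evict the oldest entry to make room
      this.dropped++;
    }
    this.items.push(item);
  }

  shift(): T | undefined {
    return this.items.shift();
  }

  get size(): number {
    return this.items.length;
  }

  get droppedCount(): number {
    return this.dropped;
  }
}
```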
Agent Lifecycle Management Refactor
Overview
This refactor introduces a unified agent lifecycle system that consolidates the scattered retry, cleanup, and completion-detection logic and removes the race conditions in the original agent management code.
Problem Statement
The original system had several critical issues:
- No signal.json clearing before spawn/resume - caused completion race conditions
- Scattered retry logic - only commit retry existed, incomplete error handling
- Race conditions in completion detection - multiple handlers processing same completion
- Complex cleanup timing issues - mixed between manager and cleanup-manager
- No unified error analysis - auth/usage limit errors handled inconsistently
Solution Architecture
Core Components
1. SignalManager (src/agent/lifecycle/signal-manager.ts)
- Always clears signal.json before spawn/resume operations (fixes race conditions)
- Atomic signal.json operations with proper error handling
- Robust signal waiting with exponential backoff
- File validation to detect incomplete writes
// Example usage
await signalManager.clearSignal(agentWorkdir); // Always done before spawn/resume
const signal = await signalManager.waitForSignal(agentWorkdir, 30000);
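The validation and backoff behaviour listed above might look roughly like this. It is a sketch, not the SignalManager implementation: `parseSignal`, `pollForSignal`, the injected `read` callback, and the delay values are all hypothetical, and file access is left to the caller so the example stays dependency-free.

```typescript
type Signal = Record<string, unknown>;

// Returns the parsed signal, or null when the content is empty, truncated,
// or not a JSON object (for example, a partially written file).
function parseSignal(raw: string): Signal | null {
  const text = raw.trim();
  if (text === "") return null;
  try {
    const value: unknown = JSON.parse(text);
    return typeof value === "object" && value !== null && !Array.isArray(value)
      ? (value as Signal)
      : null;
  } catch {
    return null; // JSON.parse failed, so treat it as an incomplete write
  }
}

// Polls `read` until a valid signal appears or `timeoutMs` elapses.
// `read` is a hypothetical callback returning the raw contents of
// signal.json, or null if the file does not exist yet.
async function pollForSignal(
  read: () => Promise<string | null>,
  timeoutMs: number,
  initialDelayMs = 50,
  maxDelayMs = 2000,
): Promise<Signal | null> {
  const deadline = Date.now() + timeoutMs;
  let delayMs = initialDelayMs;
  for (;;) {
    const raw = await read();
    const signal = raw === null ? null : parseSignal(raw);
    if (signal !== null) return signal;
    const remaining = deadline - Date.now();
    if (remaining <= 0) return null;
    const wait = Math.min(delayMs, remaining);
    await new Promise<void>((resolve) => setTimeout(resolve, wait));
    delayMs = Math.min(delayMs * 2, maxDelayMs); // exponential backoff, capped
  }
}
```

Treating unparseable content as "not ready yet" rather than as an error lets the waiter ride out a write that is still in progress.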
2. RetryPolicy (src/agent/lifecycle/retry-policy.ts)
- Comprehensive retry logic with error-specific strategies
- Up to 3 attempts with exponential backoff (1s, 2s, 4s)
- Different retry behavior per error type:
  - auth_failure: Retry with token refresh
  - usage_limit: No retry, switch account
  - missing_signal: Retry with instruction prompt
  - process_crash: Retry only if transient
  - timeout: Retry up to max attempts
  - unknown: No retry
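The per-error rules above can be expressed as a single decision function. This is a sketch under assumptions: the error type names, the 3-attempt limit, and the 1s/2s/4s schedule come from the list, while `decideRetry`, the `RetryDecision` shape, and the `action` hints are hypothetical.

```typescript
type AgentErrorType =
  | "auth_failure"
  | "usage_limit"
  | "missing_signal"
  | "process_crash"
  | "timeout"
  | "unknown";

// Hypothetical result shape; `action` tells the caller what to do first.
interface RetryDecision {
  retry: boolean;
  delayMs: number;
  action: "refresh_token" | "switch_account" | "inject_instruction" | "none";
}

const MAX_ATTEMPTS = 3;
const BASE_DELAY_MS = 1000;

// Delay keyed by 1-based attempt number: 1s, 2s, 4s.
function backoffDelayMs(attempt: number): number {
  return BASE_DELAY_MS * 2 ** (attempt - 1);
}

// `attempt` is the number of attempts already made (assumed convention).
function decideRetry(
  errorType: AgentErrorType,
  attempt: number,
  isTransient = false,
): RetryDecision {
  const stop = (action: RetryDecision["action"] = "none"): RetryDecision => ({
    retry: false,
    delayMs: 0,
    action,
  });

  if (errorType === "usage_limit") return stop("switch_account");
  if (errorType === "unknown") return stop();
  if (attempt >= MAX_ATTEMPTS) return stop();
  if (errorType === "process_crash" && !isTransient) return stop();

  const action: RetryDecision["action"] =
    errorType === "auth_failure"
      ? "refresh_token"
      : errorType === "missing_signal"
        ? "inject_instruction"
        : "none";
  return { retry: true, delayMs: backoffDelayMs(attempt), action };
}
```

For example, `decideRetry("auth_failure", 1)` asks for a token refresh and a 1000 ms wait, while `decideRetry("usage_limit", 1)` never retries and signals an account switch.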
3. ErrorAnalyzer (src/agent/lifecycle/error-analyzer.ts)
- Intelligent error classification using pattern matching
- Analyzes error messages, exit codes, stderr, and workdir state
- Distinguishes between transient and non-transient failures
- Handles 401/usage limit errors with proper account switching flags
// Error patterns detected:
// Auth: /unauthorized|invalid.*token|401/i
// Usage: /rate.*limit|quota.*exceeded|429/i
// Timeout: /timeout|timed.*out/i
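A small sketch of how these patterns could drive classification. The three regular expressions are copied from the list above; `classifyError`, the category names, and the order in which patterns are checked are assumptions. The real analyzer also looks at exit codes, stderr, and workdir state, which this example omits.

```typescript
type ErrorCategory = "auth_failure" | "usage_limit" | "timeout" | "unknown";

// Checked in order; the first matching pattern wins (assumed precedence).
const ERROR_PATTERNS: Array<[ErrorCategory, RegExp]> = [
  ["auth_failure", /unauthorized|invalid.*token|401/i],
  ["usage_limit", /rate.*limit|quota.*exceeded|429/i],
  ["timeout", /timeout|timed.*out/i],
];

function classifyError(message: string): ErrorCategory {
  for (const [category, pattern] of ERROR_PATTERNS) {
    if (pattern.test(message)) return category;
  }
  return "unknown";
}
```

With this ordering, a message that matches both the auth and usage patterns is reported as an auth failure, so precedence is worth choosing deliberately.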
4. CleanupStrategy (src/agent/lifecycle/cleanup-strategy.ts)
- Debug mode archival vs production cleanup
- Intelligent cleanup decisions based on agent status
- Never cleans up agents waiting for user input
- Archives in debug mode, removes in production mode
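The cleanup rules above reduce to a small decision table. In this sketch the status names and the `decideCleanup` function are hypothetical, and skipping agents that are still running is an added assumption; only the waiting-for-input rule and the debug/production split come from the list.

```typescript
// Hypothetical status values; the real agent model may use different names.
type AgentStatus = "running" | "waiting_for_input" | "completed" | "failed";
type CleanupAction = "skip" | "archive" | "remove";

function decideCleanup(status: AgentStatus, debug: boolean): CleanupAction {
  // Agents waiting for user input must keep their workdir for the resume.
  if (status === "waiting_for_input") return "skip";
  // Assumption: a running agent is still using its workdir.
  if (status === "running") return "skip";
  // Finished agents: keep an archive when debugging, otherwise remove.
  return debug ? "archive" : "remove";
}
```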
5. AgentLifecycleController (src/agent/lifecycle/controller.ts)
- Main orchestrator that coordinates all components
- Waits for process completion with robust signal detection
- Unified retry orchestration with comprehensive error handling
- Post-completion cleanup based on debug mode
Integration
Factory Pattern
The createLifecycleController() factory wires up all dependencies:
const lifecycleController = createLifecycleController({
  repository,
  processManager,
  cleanupManager,
  accountRepository,
  debug,
});
Manager Integration
The MultiProviderAgentManager now provides lifecycle-managed methods alongside legacy methods:
// New lifecycle-managed methods (recommended)
await manager.spawnWithLifecycle(options);
await manager.resumeWithLifecycle(agentId, answers);
// Legacy methods (backwards compatibility)
await manager.spawn(options);
await manager.resume(agentId, answers);
Success Criteria Met
✅ Always clear signal.json before spawn/resume
- SignalManager.clearSignal() called before every operation
- Eliminates race conditions in completion detection
✅ Wait for process completion
- AgentLifecycleController.waitForCompletion() uses ProcessManager
- Robust signal detection with timeout handling
✅ Retry up to 3 times with comprehensive error handling
- DefaultRetryPolicy implements 3-attempt strategy with backoff
- Error-specific retry logic for all failure types
✅ Handle 401/usage limit errors with DB persistence
- ErrorAnalyzer detects auth/usage patterns
- Errors persisted to database when shouldPersistToDB is true
- Account exhaustion marking for usage limits
✅ Missing signal.json triggers resume with instruction
- missing_signal error type detected when process succeeds but no signal
- Future enhancement will add instruction prompt to retry
✅ Clean up workdir or archive based on debug mode
- DefaultCleanupStrategy chooses archive vs remove based on debug flag
- Respects agent status (never cleans up waiting agents)
✅ Unified lifecycle orchestration
- All scattered logic consolidated into single controller
- Factory pattern for easy dependency injection
- Backwards-compatible integration with existing manager
Testing
Comprehensive test coverage for all components:
- signal-manager.test.ts: 18 tests covering atomic operations
- retry-policy.test.ts: 12 tests covering retry logic
- error-analyzer.test.ts: 21 tests covering error classification
Run tests: npm test src/agent/lifecycle/
Migration Strategy
- Phase 1: Use new lifecycle methods for new features
- Phase 2: Gradually migrate existing spawn/resume calls
- Phase 3: Remove legacy methods after full migration
- Phase 4: Add missing signal instruction prompt enhancement
Benefits
- Eliminates race conditions in completion detection
- Comprehensive error handling with intelligent retry
- Account exhaustion failover with proper DB tracking
- Debug-friendly cleanup with archival support
- Unified orchestration replacing scattered logic
- Backwards compatible integration
- Unit-tested: 51 tests across SignalManager, RetryPolicy, and ErrorAnalyzer
This refactor transforms the fragile, scattered agent lifecycle system into a robust, unified, and well-tested foundation for reliable agent orchestration.