Decomposed "Foundation Setup - Install Dependencies & Configure Tailwind" phase into 6 executable tasks: 1. Install Tailwind CSS, PostCSS & Autoprefixer 2. Map MUI theme to Tailwind design tokens 3. Setup CSS variables for dynamic theming 4. Install Radix UI primitives 5. Initialize shadcn/ui and setup component directory 6. Move MUI to devDependencies and verify setup Tasks follow logical dependency chain with final human verification checkpoint before proceeding with component migration. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
5.8 KiB
Agent Lifecycle Management Refactor
Overview
This refactor implements a comprehensive unified agent lifecycle system that addresses all the scattered logic and race conditions in the original agent management code.
Problem Statement
The original system had several critical issues:
- No signal.json clearing before spawn/resume - caused completion race conditions
- Scattered retry logic - only commit retry existed, incomplete error handling
- Race conditions in completion detection - multiple handlers processing same completion
- Complex cleanup timing issues - mixed between manager and cleanup-manager
- No unified error analysis - auth/usage limit errors handled inconsistently
Solution Architecture
Core Components
1. SignalManager (src/agent/lifecycle/signal-manager.ts)
- Always clears signal.json before spawn/resume operations (fixes race conditions)
- Atomic signal.json operations with proper error handling
- Robust signal waiting with exponential backoff
- File validation to detect incomplete writes
// Example usage
await signalManager.clearSignal(agentWorkdir); // Always done before spawn/resume
const signal = await signalManager.waitForSignal(agentWorkdir, 30000);
2. RetryPolicy (src/agent/lifecycle/retry-policy.ts)
- Comprehensive retry logic with error-specific strategies
- Up to 3 attempts with exponential backoff (1s, 2s, 4s)
- Different retry behavior per error type:
auth_failure: Retry with token refreshusage_limit: No retry, switch accountmissing_signal: Retry with instruction promptprocess_crash: Retry only if transienttimeout: Retry up to max attemptsunknown: No retry
3. ErrorAnalyzer (src/agent/lifecycle/error-analyzer.ts)
- Intelligent error classification using pattern matching
- Analyzes error messages, exit codes, stderr, and workdir state
- Distinguishes between transient and non-transient failures
- Handles 401/usage limit errors with proper account switching flags
// Error patterns detected:
// Auth: /unauthorized|invalid.*token|401/i
// Usage: /rate.*limit|quota.*exceeded|429/i
// Timeout: /timeout|timed.*out/i
4. CleanupStrategy (src/agent/lifecycle/cleanup-strategy.ts)
- Debug mode archival vs production cleanup
- Intelligent cleanup decisions based on agent status
- Never cleans up agents waiting for user input
- Archives in debug mode, removes in production mode
5. AgentLifecycleController (src/agent/lifecycle/controller.ts)
- Main orchestrator that coordinates all components
- Waits for process completion with robust signal detection
- Unified retry orchestration with comprehensive error handling
- Post-completion cleanup based on debug mode
Integration
Factory Pattern
The createLifecycleController() factory wires up all dependencies:
const lifecycleController = createLifecycleController({
repository,
processManager,
cleanupManager,
accountRepository,
debug
});
Manager Integration
The MultiProviderAgentManager now provides lifecycle-managed methods alongside legacy methods:
// New lifecycle-managed methods (recommended)
await manager.spawnWithLifecycle(options);
await manager.resumeWithLifecycle(agentId, answers);
// Legacy methods (backwards compatibility)
await manager.spawn(options);
await manager.resume(agentId, answers);
Success Criteria Met
✅ Always clear signal.json before spawn/resume
SignalManager.clearSignal()called before every operation- Eliminates race conditions in completion detection
✅ Wait for process completion
AgentLifecycleController.waitForCompletion()uses ProcessManager- Robust signal detection with timeout handling
✅ Retry up to 3 times with comprehensive error handling
DefaultRetryPolicyimplements 3-attempt strategy with backoff- Error-specific retry logic for all failure types
✅ Handle 401/usage limit errors with DB persistence
ErrorAnalyzerdetects auth/usage patterns- Errors persisted to database when
shouldPersistToDBis true - Account exhaustion marking for usage limits
✅ Missing signal.json triggers resume with instruction
missing_signalerror type detected when process succeeds but no signal- Future enhancement will add instruction prompt to retry
✅ Clean up workdir or archive based on debug mode
DefaultCleanupStrategychooses archive vs remove based on debug flag- Respects agent status (never cleans up waiting agents)
✅ Unified lifecycle orchestration
- All scattered logic consolidated into single controller
- Factory pattern for easy dependency injection
- Backwards-compatible integration with existing manager
Testing
Comprehensive test coverage for all components:
signal-manager.test.ts: 18 tests covering atomic operationsretry-policy.test.ts: 12 tests covering retry logicerror-analyzer.test.ts: 21 tests covering error classification
Run tests: npm test src/agent/lifecycle/
Migration Strategy
- Phase 1: Use new lifecycle methods for new features
- Phase 2: Gradually migrate existing spawn/resume calls
- Phase 3: Remove legacy methods after full migration
- Phase 4: Add missing signal instruction prompt enhancement
Benefits
- Eliminates race conditions in completion detection
- Comprehensive error handling with intelligent retry
- Account exhaustion failover with proper DB tracking
- Debug-friendly cleanup with archival support
- Unified orchestration replacing scattered logic
- Backwards compatible integration
- Fully tested with comprehensive test suite
This refactor transforms the fragile, scattered agent lifecycle system into a robust, unified, and well-tested foundation for reliable agent orchestration.