Decomposed "Foundation Setup - Install Dependencies & Configure Tailwind" phase into 6 executable tasks: 1. Install Tailwind CSS, PostCSS & Autoprefixer 2. Map MUI theme to Tailwind design tokens 3. Setup CSS variables for dynamic theming 4. Install Radix UI primitives 5. Initialize shadcn/ui and setup component directory 6. Move MUI to devDependencies and verify setup Tasks follow logical dependency chain with final human verification checkpoint before proceeding with component migration. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
156 lines
5.8 KiB
Markdown
156 lines
5.8 KiB
Markdown
# Agent Lifecycle Management Refactor
|
|
|
|
## Overview
|
|
|
|
This refactor implements a comprehensive unified agent lifecycle system that addresses all the scattered logic and race conditions in the original agent management code.
|
|
|
|
## Problem Statement
|
|
|
|
The original system had several critical issues:
|
|
|
|
1. **No signal.json clearing before spawn/resume** - caused completion race conditions
|
|
2. **Scattered retry logic** - only commit retry existed, incomplete error handling
|
|
3. **Race conditions in completion detection** - multiple handlers processing same completion
|
|
4. **Complex cleanup timing issues** - mixed between manager and cleanup-manager
|
|
5. **No unified error analysis** - auth/usage limit errors handled inconsistently
|
|
|
|
## Solution Architecture
|
|
|
|
### Core Components
|
|
|
|
#### 1. SignalManager (`src/agent/lifecycle/signal-manager.ts`)
|
|
- **Always clears signal.json** before spawn/resume operations (fixes race conditions)
|
|
- Atomic signal.json operations with proper error handling
|
|
- Robust signal waiting with exponential backoff
|
|
- File validation to detect incomplete writes
|
|
|
|
```typescript
|
|
// Example usage
|
|
await signalManager.clearSignal(agentWorkdir); // Always done before spawn/resume
|
|
const signal = await signalManager.waitForSignal(agentWorkdir, 30000);
|
|
```
|
|
|
|
#### 2. RetryPolicy (`src/agent/lifecycle/retry-policy.ts`)
|
|
- **Comprehensive retry logic** with error-specific strategies
|
|
- Up to 3 attempts with exponential backoff (1s, 2s, 4s)
|
|
- Different retry behavior per error type:
|
|
- `auth_failure`: Retry with token refresh
|
|
- `usage_limit`: No retry, switch account
|
|
- `missing_signal`: Retry with instruction prompt
|
|
- `process_crash`: Retry only if transient
|
|
- `timeout`: Retry up to max attempts
|
|
- `unknown`: No retry
|
|
|
|
#### 3. ErrorAnalyzer (`src/agent/lifecycle/error-analyzer.ts`)
|
|
- **Intelligent error classification** using pattern matching
|
|
- Analyzes error messages, exit codes, stderr, and workdir state
|
|
- Distinguishes between transient and non-transient failures
|
|
- **Handles 401/usage limit errors** with proper account switching flags
|
|
|
|
```typescript
|
|
// Error patterns detected:
|
|
// Auth: /unauthorized|invalid.*token|401/i
|
|
// Usage: /rate.*limit|quota.*exceeded|429/i
|
|
// Timeout: /timeout|timed.*out/i
|
|
```
|
|
|
|
#### 4. CleanupStrategy (`src/agent/lifecycle/cleanup-strategy.ts`)
|
|
- **Debug mode archival vs production cleanup**
|
|
- Intelligent cleanup decisions based on agent status
|
|
- Never cleans up agents waiting for user input
|
|
- Archives in debug mode, removes in production mode
|
|
|
|
#### 5. AgentLifecycleController (`src/agent/lifecycle/controller.ts`)
|
|
- **Main orchestrator** that coordinates all components
|
|
- **Waits for process completion** with robust signal detection
|
|
- **Unified retry orchestration** with comprehensive error handling
|
|
- **Post-completion cleanup** based on debug mode
|
|
|
|
## Integration
|
|
|
|
### Factory Pattern
|
|
The `createLifecycleController()` factory wires up all dependencies:
|
|
|
|
```typescript
|
|
const lifecycleController = createLifecycleController({
|
|
repository,
|
|
processManager,
|
|
cleanupManager,
|
|
accountRepository,
|
|
debug
|
|
});
|
|
```
|
|
|
|
### Manager Integration
|
|
The `MultiProviderAgentManager` now provides lifecycle-managed methods alongside legacy methods:
|
|
|
|
```typescript
|
|
// New lifecycle-managed methods (recommended)
|
|
await manager.spawnWithLifecycle(options);
|
|
await manager.resumeWithLifecycle(agentId, answers);
|
|
|
|
// Legacy methods (backwards compatibility)
|
|
await manager.spawn(options);
|
|
await manager.resume(agentId, answers);
|
|
```
|
|
|
|
## Success Criteria Met
|
|
|
|
✅ **Always clear signal.json before spawn/resume**
|
|
- `SignalManager.clearSignal()` called before every operation
|
|
- Eliminates race conditions in completion detection
|
|
|
|
✅ **Wait for process completion**
|
|
- `AgentLifecycleController.waitForCompletion()` uses ProcessManager
|
|
- Robust signal detection with timeout handling
|
|
|
|
✅ **Retry up to 3 times with comprehensive error handling**
|
|
- `DefaultRetryPolicy` implements 3-attempt strategy with backoff
|
|
- Error-specific retry logic for all failure types
|
|
|
|
✅ **Handle 401/usage limit errors with DB persistence**
|
|
- `ErrorAnalyzer` detects auth/usage patterns
|
|
- Errors persisted to database when `shouldPersistToDB` is true
|
|
- Account exhaustion marking for usage limits
|
|
|
|
✅ **Missing signal.json triggers resume with instruction**
|
|
- `missing_signal` error type detected when process succeeds but no signal
|
|
- Future enhancement will add instruction prompt to retry
|
|
|
|
✅ **Clean up workdir or archive based on debug mode**
|
|
- `DefaultCleanupStrategy` chooses archive vs remove based on debug flag
|
|
- Respects agent status (never cleans up waiting agents)
|
|
|
|
✅ **Unified lifecycle orchestration**
|
|
- All scattered logic consolidated into single controller
|
|
- Factory pattern for easy dependency injection
|
|
- Backwards-compatible integration with existing manager
|
|
|
|
## Testing
|
|
|
|
Comprehensive test coverage for all components:
|
|
|
|
- `signal-manager.test.ts`: 18 tests covering atomic operations
|
|
- `retry-policy.test.ts`: 12 tests covering retry logic
|
|
- `error-analyzer.test.ts`: 21 tests covering error classification
|
|
|
|
Run tests: `npm test src/agent/lifecycle/`
|
|
|
|
## Migration Strategy
|
|
|
|
1. **Phase 1**: Use new lifecycle methods for new features
|
|
2. **Phase 2**: Gradually migrate existing spawn/resume calls
|
|
3. **Phase 3**: Remove legacy methods after full migration
|
|
4. **Phase 4**: Add missing signal instruction prompt enhancement
|
|
|
|
## Benefits
|
|
|
|
- **Eliminates race conditions** in completion detection
|
|
- **Comprehensive error handling** with intelligent retry
|
|
- **Account exhaustion failover** with proper DB tracking
|
|
- **Debug-friendly cleanup** with archival support
|
|
- **Unified orchestration** replacing scattered logic
|
|
- **Backwards compatible** integration
|
|
- **Fully tested** with comprehensive test suite
|
|
|
|
This refactor transforms the fragile, scattered agent lifecycle system into a robust, unified, and well-tested foundation for reliable agent orchestration. |