Codewalkers/docs/archive/agent-lifecycle-refactor.md
Lukas May 342b490fe7 feat: Task decomposition for Tailwind/Radix/shadcn foundation setup
2026-02-10 09:48:51 +01:00


Agent Lifecycle Management Refactor

Overview

This refactor introduces a unified agent lifecycle system. It consolidates the logic that was scattered across the original agent management code and addresses the race conditions listed below.

Problem Statement

The original system had several critical issues:

  1. No signal.json clearing before spawn/resume - caused completion race conditions
  2. Scattered retry logic - only commit retry existed, incomplete error handling
  3. Race conditions in completion detection - multiple handlers processing same completion
  4. Complex cleanup timing issues - mixed between manager and cleanup-manager
  5. No unified error analysis - auth/usage limit errors handled inconsistently

Solution Architecture

Core Components

1. SignalManager (src/agent/lifecycle/signal-manager.ts)

  • Always clears signal.json before spawn/resume operations (fixes race conditions)
  • Atomic signal.json operations with proper error handling
  • Robust signal waiting with exponential backoff
  • File validation to detect incomplete writes

// Example usage
await signalManager.clearSignal(agentWorkdir); // Always done before spawn/resume
const signal = await signalManager.waitForSignal(agentWorkdir, 30000);
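
The validation and backoff bullets above can be sketched roughly as follows. This is a minimal illustration, not the actual SignalManager code: it assumes signal.json holds a JSON object with a string `status` field (the real schema is not given here), and `parseSignal`, `backoffDelays`, the injected `read` callback, and the 100 ms / 2000 ms polling bounds are all hypothetical.

```typescript
// Hypothetical signal shape; the real signal.json schema is not shown in this document.
interface Signal {
  status: string;
}

// Returns the parsed signal, or null when the content is empty, truncated,
// or missing the expected field (for example, the writer has not finished yet).
function parseSignal(raw: string): Signal | null {
  if (raw.trim() === "") return null;
  try {
    const value: unknown = JSON.parse(raw);
    if (
      typeof value === "object" &&
      value !== null &&
      typeof (value as { status?: unknown }).status === "string"
    ) {
      return value as Signal;
    }
    return null;
  } catch {
    return null; // JSON.parse throws on a partially written file
  }
}

// Polling delays that double each time, capped at maxMs, until the timeout budget is spent.
function backoffDelays(timeoutMs: number, initialMs = 100, maxMs = 2000): number[] {
  const delays: number[] = [];
  let elapsed = 0;
  let next = initialMs;
  while (elapsed < timeoutMs) {
    const delay = Math.min(next, maxMs, timeoutMs - elapsed);
    delays.push(delay);
    elapsed += delay;
    next *= 2;
  }
  return delays;
}

// Polls `read` until a valid signal appears or the timeout elapses.
// `read` is assumed to resolve to null while the file does not exist.
async function waitForSignal(
  read: () => Promise<string | null>,
  timeoutMs: number,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<Signal | null> {
  const check = async (): Promise<Signal | null> => {
    const raw = await read();
    return raw === null ? null : parseSignal(raw);
  };
  for (const delay of backoffDelays(timeoutMs)) {
    const signal = await check();
    if (signal !== null) return signal;
    await sleep(delay);
  }
  return check(); // final check once the timeout budget is spent
}
```

Treating unparseable content as "not there yet" rather than as an error is what lets the waiter tolerate a writer that is still mid-write.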

2. RetryPolicy (src/agent/lifecycle/retry-policy.ts)

  • Comprehensive retry logic with error-specific strategies
  • Up to 3 attempts with exponential backoff (1s, 2s, 4s)
  • Different retry behavior per error type:
    • auth_failure: Retry with token refresh
    • usage_limit: No retry, switch account
    • missing_signal: Retry (the accompanying instruction prompt is a planned enhancement, see Migration Strategy)
    • process_crash: Retry only if transient
    • timeout: Retry up to max attempts
    • unknown: No retry
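
The per-error rules above amount to a small decision table. The sketch below is one way to express it; `decideRetry`, `RetryDecision`, and its field names are hypothetical and not taken from retry-policy.ts. It reads "up to 3 attempts" as three tries in total, so only the 1s and 2s delays are ever returned. If the real policy allows three retries after the initial attempt, the 4s delay would come into play as well.

```typescript
type ErrorType =
  | "auth_failure"
  | "usage_limit"
  | "missing_signal"
  | "process_crash"
  | "timeout"
  | "unknown";

// Hypothetical result shape for illustration.
interface RetryDecision {
  retry: boolean;
  delayMs: number; // 0 when not retrying
  refreshToken: boolean;
  switchAccount: boolean;
}

const MAX_ATTEMPTS = 3;
const BASE_DELAY_MS = 1000;

// `attempt` is 1-based and refers to the attempt that just failed.
function decideRetry(
  errorType: ErrorType,
  attempt: number,
  isTransient = false,
): RetryDecision {
  const stop: RetryDecision = {
    retry: false,
    delayMs: 0,
    refreshToken: false,
    switchAccount: false,
  };
  if (errorType === "usage_limit") return { ...stop, switchAccount: true };
  if (errorType === "unknown") return stop;
  if (errorType === "process_crash" && !isTransient) return stop;
  if (attempt >= MAX_ATTEMPTS) return stop;
  return {
    retry: true,
    delayMs: BASE_DELAY_MS * 2 ** (attempt - 1), // doubles per failed attempt: 1s, 2s, ...
    refreshToken: errorType === "auth_failure",
    switchAccount: false,
  };
}
```

For example, `decideRetry("timeout", 2)` asks for a retry after 2000 ms, while `decideRetry("usage_limit", 1)` declines to retry and flags an account switch instead.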

3. ErrorAnalyzer (src/agent/lifecycle/error-analyzer.ts)

  • Intelligent error classification using pattern matching
  • Analyzes error messages, exit codes, stderr, and workdir state
  • Distinguishes between transient and non-transient failures
  • Handles 401/usage limit errors with proper account switching flags

// Error patterns detected:
// Auth: /unauthorized|invalid.*token|401/i
// Usage: /rate.*limit|quota.*exceeded|429/i
// Timeout: /timeout|timed.*out/i
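
A minimal sketch of how these patterns could drive classification. The regular expressions are the ones listed above and `shouldPersistToDB` is named later in this document, but `analyzeError`, `shouldSwitchAccount`, and the order in which the patterns are checked are assumptions made for illustration. The real analyzer also considers exit codes and workdir state, which this sketch leaves out.

```typescript
interface ErrorAnalysis {
  type: "auth_failure" | "usage_limit" | "timeout" | "unknown";
  shouldSwitchAccount: boolean; // hypothetical flag name
  shouldPersistToDB: boolean;
}

const PATTERNS = {
  auth: /unauthorized|invalid.*token|401/i,
  usage: /rate.*limit|quota.*exceeded|429/i,
  timeout: /timeout|timed.*out/i,
};

// Classifies an error from its message and stderr text.
// Check order (auth, then usage, then timeout) is an assumption.
function analyzeError(message: string, stderr = ""): ErrorAnalysis {
  // "." does not match newlines, so a pattern cannot span message and stderr.
  const text = `${message}\n${stderr}`;
  if (PATTERNS.auth.test(text)) {
    return { type: "auth_failure", shouldSwitchAccount: false, shouldPersistToDB: true };
  }
  if (PATTERNS.usage.test(text)) {
    return { type: "usage_limit", shouldSwitchAccount: true, shouldPersistToDB: true };
  }
  if (PATTERNS.timeout.test(text)) {
    return { type: "timeout", shouldSwitchAccount: false, shouldPersistToDB: false };
  }
  return { type: "unknown", shouldSwitchAccount: false, shouldPersistToDB: false };
}
```

Note that the bare `401` and `429` alternatives match those digits anywhere in the text, so unrelated numbers such as a port or byte count can trigger a false positive. Anchoring them with word boundaries would narrow the match.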

4. CleanupStrategy (src/agent/lifecycle/cleanup-strategy.ts)

  • Debug mode archival vs production cleanup
  • Intelligent cleanup decisions based on agent status
  • Never cleans up agents waiting for user input
  • Archives in debug mode, removes in production mode
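
The cleanup rules above reduce to a short decision function. The sketch below is illustrative only: the `AgentStatus` values, the `CleanupAction` names, and the choice to skip agents that are still running are assumptions, not taken from cleanup-strategy.ts.

```typescript
// Hypothetical status and action names for illustration.
type AgentStatus = "running" | "waiting_for_input" | "completed" | "failed";
type CleanupAction = "skip" | "archive" | "remove";

function decideCleanup(status: AgentStatus, debug: boolean): CleanupAction {
  // Agents waiting for user input are never cleaned up.
  if (status === "waiting_for_input") return "skip";
  // Assumption: an agent that is still running keeps its workdir too.
  if (status === "running") return "skip";
  // Finished agents: archive in debug mode, remove in production mode.
  return debug ? "archive" : "remove";
}
```

Keeping the decision separate from the filesystem work makes the policy easy to test without touching a real workdir.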

5. AgentLifecycleController (src/agent/lifecycle/controller.ts)

  • Main orchestrator that coordinates all components
  • Waits for process completion with robust signal detection
  • Unified retry orchestration with comprehensive error handling
  • Post-completion cleanup based on debug mode
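
To make the orchestration order concrete, here is a simplified sketch of the loop such a controller might run: clear the signal, run the agent, check for completion, consult the retry decision, and finally hand off to cleanup. `runWithLifecycle`, `LifecycleDeps`, and `AttemptResult` are hypothetical names, every collaborator is injected as a plain callback, and backoff delays and error analysis are reduced to a single `shouldRetry` hook to keep the example short.

```typescript
interface AttemptResult {
  exitCode: number;
  signal: string | null; // raw signal content, or null if none was written
}

// Hypothetical dependency bundle; the real controller wires in concrete components.
interface LifecycleDeps {
  clearSignal: () => Promise<void>;
  runAgent: () => Promise<AttemptResult>; // spawn or resume, then wait for process exit
  shouldRetry: (reason: string, attempt: number) => boolean;
  cleanup: () => Promise<void>; // decides internally whether to skip, archive, or remove
}

interface LifecycleOutcome {
  ok: boolean;
  attempts: number;
  reason?: string;
}

async function runWithLifecycle(
  deps: LifecycleDeps,
  maxAttempts = 3,
): Promise<LifecycleOutcome> {
  let attempt = 0;
  let reason = "unknown";
  try {
    while (attempt < maxAttempts) {
      attempt += 1;
      // A stale signal must never be mistaken for a fresh completion.
      await deps.clearSignal();
      const result = await deps.runAgent();
      if (result.exitCode === 0 && result.signal !== null) {
        return { ok: true, attempts: attempt };
      }
      // A clean exit without a signal is the missing_signal case.
      reason = result.exitCode === 0 ? "missing_signal" : "process_crash";
      if (!deps.shouldRetry(reason, attempt)) break;
    }
    return { ok: false, attempts: attempt, reason };
  } finally {
    await deps.cleanup(); // runs on success and on failure
  }
}
```

Clearing the signal at the top of every iteration, rather than once before the loop, is what keeps a retry from reading the previous attempt's output.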

Integration

Factory Pattern

The createLifecycleController() factory wires up all dependencies:

const lifecycleController = createLifecycleController({
  repository,
  processManager,
  cleanupManager,
  accountRepository,
  debug
});

Manager Integration

The MultiProviderAgentManager now provides lifecycle-managed methods alongside legacy methods:

// New lifecycle-managed methods (recommended)
await manager.spawnWithLifecycle(options);
await manager.resumeWithLifecycle(agentId, answers);

// Legacy methods (backwards compatibility)
await manager.spawn(options);
await manager.resume(agentId, answers);

Success Criteria Met

Always clear signal.json before spawn/resume

  • SignalManager.clearSignal() called before every operation
  • Eliminates race conditions in completion detection

Wait for process completion

  • AgentLifecycleController.waitForCompletion() uses ProcessManager
  • Robust signal detection with timeout handling

Retry up to 3 times with comprehensive error handling

  • DefaultRetryPolicy implements 3-attempt strategy with backoff
  • Error-specific retry logic for all failure types

Handle 401/usage limit errors with DB persistence

  • ErrorAnalyzer detects auth/usage patterns
  • Errors persisted to database when shouldPersistToDB is true
  • Account exhaustion marking for usage limits

Missing signal.json triggers resume with instruction

  • missing_signal error type detected when process succeeds but no signal
  • The instruction prompt for the retry is not implemented yet; it is planned as a future enhancement (Migration Strategy, Phase 4)

Clean up workdir or archive based on debug mode

  • DefaultCleanupStrategy chooses archive vs remove based on debug flag
  • Respects agent status (never cleans up waiting agents)

Unified lifecycle orchestration

  • All scattered logic consolidated into single controller
  • Factory pattern for easy dependency injection
  • Backwards-compatible integration with existing manager

Testing

Test suites exist for the following components:

  • signal-manager.test.ts: 18 tests covering atomic operations
  • retry-policy.test.ts: 12 tests covering retry logic
  • error-analyzer.test.ts: 21 tests covering error classification

Run tests: npm test src/agent/lifecycle/

Migration Strategy

  1. Phase 1: Use new lifecycle methods for new features
  2. Phase 2: Gradually migrate existing spawn/resume calls
  3. Phase 3: Remove legacy methods after full migration
  4. Phase 4: Add missing signal instruction prompt enhancement

Benefits

  • Eliminates race conditions in completion detection
  • Comprehensive error handling with intelligent retry
  • Account exhaustion failover with proper DB tracking
  • Debug-friendly cleanup with archival support
  • Unified orchestration replacing scattered logic
  • Backwards compatible integration
  • Covered by 51 tests across signal handling, retry policy, and error classification

This refactor transforms the fragile, scattered agent lifecycle system into a robust, unified, and well-tested foundation for reliable agent orchestration.