Lukas May 342b490fe7 feat: Task decomposition for Tailwind/Radix/shadcn foundation setup

2026-02-10 09:48:51 +01:00

Agent Lifecycle Management Refactor

Overview

This refactor introduces a unified agent lifecycle system that consolidates the scattered lifecycle logic and removes the race conditions in the original agent management code.

Problem Statement

The original system had several critical issues:

No signal.json clearing before spawn/resume - caused completion race conditions
Scattered retry logic - only commit retry existed, incomplete error handling
Race conditions in completion detection - multiple handlers processing same completion
Complex cleanup timing issues - mixed between manager and cleanup-manager
No unified error analysis - auth/usage limit errors handled inconsistently

Solution Architecture

Core Components

1. SignalManager (`src/agent/lifecycle/signal-manager.ts`)

Always clears signal.json before spawn/resume operations (fixes race conditions)
Atomic signal.json operations with proper error handling
Robust signal waiting with exponential backoff
File validation to detect incomplete writes

// Example usage
await signalManager.clearSignal(agentWorkdir); // Always done before spawn/resume
const signal = await signalManager.waitForSignal(agentWorkdir, 30000);
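The clearing, atomic write, and backoff-based waiting described above could look roughly like the sketch below. It is an illustration under stated assumptions, not the contents of signal-manager.ts: the function signatures, the `Signal` shape, the temp-file naming, and the default delays are all hypothetical, and validation is reduced to "parses as a JSON object".

```typescript
import { readFile, rename, rm, writeFile } from "node:fs/promises";
import { join } from "node:path";

// Hypothetical shape: the real signal.json schema is not shown in this document.
type Signal = Record<string, unknown>;
const SIGNAL_FILE = "signal.json";

// Remove any stale signal so an earlier run cannot be mistaken for a new completion.
async function clearSignal(workdir: string): Promise<void> {
  await rm(join(workdir, SIGNAL_FILE), { force: true }); // force: a missing file is not an error
}

// Write to a temp file, then rename it into place. A rename within one
// filesystem is atomic, so readers never observe a half-written signal.json.
async function writeSignal(workdir: string, signal: Signal): Promise<void> {
  const target = join(workdir, SIGNAL_FILE);
  const temp = `${target}.${process.pid}.tmp`;
  await writeFile(temp, JSON.stringify(signal), "utf8");
  await rename(temp, target);
}

// Returns the parsed signal, or null when the file is missing or not valid JSON yet.
async function readSignal(workdir: string): Promise<Signal | null> {
  try {
    const parsed: unknown = JSON.parse(await readFile(join(workdir, SIGNAL_FILE), "utf8"));
    return typeof parsed === "object" && parsed !== null ? (parsed as Signal) : null;
  } catch {
    return null; // any read or parse failure is treated as "not ready" in this sketch
  }
}

// Poll with exponentially growing delays until a valid signal appears or the timeout elapses.
async function waitForSignal(
  workdir: string,
  timeoutMs: number,
  initialDelayMs = 50,
  maxDelayMs = 2000,
): Promise<Signal | null> {
  const deadline = Date.now() + timeoutMs;
  let delay = initialDelayMs;
  for (;;) {
    const signal = await readSignal(workdir);
    if (signal) return signal;
    const remaining = deadline - Date.now();
    if (remaining <= 0) return null;
    await new Promise((resolve) => setTimeout(resolve, Math.min(delay, remaining)));
    delay = Math.min(delay * 2, maxDelayMs);
  }
}
```

Returning null on timeout keeps the "no signal" case distinguishable from a thrown I/O error; a real implementation may prefer to throw or to return a richer result.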

2. RetryPolicy (`src/agent/lifecycle/retry-policy.ts`)

Comprehensive retry logic with error-specific strategies
Up to 3 attempts with exponential backoff (1s, 2s, 4s)
Different retry behavior per error type:
- auth_failure: Retry with token refresh
- usage_limit: No retry, switch account
- missing_signal: Retry with instruction prompt
- process_crash: Retry only if transient
- timeout: Retry up to max attempts
- unknown: No retry
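The per-error-type table above can be sketched as a pure decision function. The error type names come from the list; everything else (`AnalyzedError`, `RetryDecision`, `decideRetry`, the action names) is hypothetical. The sketch also assumes that "3 attempts" counts the first run, so only the 1s and 2s delays are ever used between attempts; the source does not say how the three listed delays map onto three attempts.

```typescript
type AgentErrorType =
  | "auth_failure"
  | "usage_limit"
  | "missing_signal"
  | "process_crash"
  | "timeout"
  | "unknown";

// Hypothetical shapes for illustration only.
type RetryAction = "refresh_token" | "switch_account" | "add_instruction";
interface AnalyzedError { type: AgentErrorType; isTransient: boolean }
interface RetryDecision { retry: boolean; delayMs: number; action?: RetryAction }

const MAX_ATTEMPTS = 3;
const BASE_DELAY_MS = 1000;
const STOP: RetryDecision = { retry: false, delayMs: 0 };

// attempt is 1-based: 1 -> 1000 ms, 2 -> 2000 ms, 3 -> 4000 ms.
function backoffDelayMs(attempt: number): number {
  return BASE_DELAY_MS * 2 ** (attempt - 1);
}

// Decide what to do after the given (1-based) attempt failed with this error.
function decideRetry(error: AnalyzedError, attempt: number): RetryDecision {
  const retryWith = (action?: RetryAction): RetryDecision => {
    if (attempt >= MAX_ATTEMPTS) return STOP; // attempts exhausted
    const delayMs = backoffDelayMs(attempt);
    return action ? { retry: true, delayMs, action } : { retry: true, delayMs };
  };
  switch (error.type) {
    case "auth_failure":
      return retryWith("refresh_token");
    case "usage_limit":
      // Never retried on the same account; the caller is told to switch.
      return { retry: false, delayMs: 0, action: "switch_account" };
    case "missing_signal":
      return retryWith("add_instruction");
    case "process_crash":
      return error.isTransient ? retryWith() : STOP;
    case "timeout":
      return retryWith();
    case "unknown":
      return STOP;
  }
}
```

Keeping the decision free of side effects (no sleeping, no token refresh) makes each row of the table checkable with a one-line unit test; the caller performs the delay and the action.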

3. ErrorAnalyzer (`src/agent/lifecycle/error-analyzer.ts`)

Intelligent error classification using pattern matching
Analyzes error messages, exit codes, stderr, and workdir state
Distinguishes between transient and non-transient failures
Handles 401/usage limit errors with proper account switching flags

// Error patterns detected:
// Auth: /unauthorized|invalid.*token|401/i
// Usage: /rate.*limit|quota.*exceeded|429/i
// Timeout: /timeout|timed.*out/i
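As a rough illustration of pattern-based classification, the sketch below applies the three regular expressions listed above in order and falls back to "unknown". The function name `classifyError`, its parameters, and the first-match-wins ordering are assumptions; exit codes and workdir state, which the real analyzer also inspects, are left out.

```typescript
type ClassifiedError = "auth_failure" | "usage_limit" | "timeout" | "unknown";

// Patterns copied from the list above; the first match wins (ordering is an assumption).
const ERROR_PATTERNS: ReadonlyArray<readonly [ClassifiedError, RegExp]> = [
  ["auth_failure", /unauthorized|invalid.*token|401/i],
  ["usage_limit", /rate.*limit|quota.*exceeded|429/i],
  ["timeout", /timeout|timed.*out/i],
];

// Classify from the error message and stderr text only.
function classifyError(message: string, stderr = ""): ClassifiedError {
  // "." does not match newlines, so a pattern cannot span message and stderr.
  const haystack = `${message}\n${stderr}`;
  for (const [type, pattern] of ERROR_PATTERNS) {
    if (pattern.test(haystack)) return type;
  }
  return "unknown";
}
```

One caveat with the patterns as written: a bare 401 or 429 matches those digits anywhere, including inside a longer number such as a duration or port. Anchoring them with word boundaries (for example `\b401\b`) would reduce false positives, if that fits the error output actually seen.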

4. CleanupStrategy (`src/agent/lifecycle/cleanup-strategy.ts`)

Debug mode archival vs production cleanup
Intelligent cleanup decisions based on agent status
Never cleans up agents waiting for user input
Archives in debug mode, removes in production mode
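The cleanup rules above reduce to a small decision function. The sketch below is one possible reading: the status names and the `decideCleanup` function are hypothetical, and skipping cleanup for still-running agents is an added assumption rather than something stated in this document.

```typescript
// Hypothetical status names; the real agent status values are not listed here.
type AgentStatus = "running" | "waiting_for_input" | "completed" | "failed";
type CleanupAction = "skip" | "archive" | "remove";

function decideCleanup(status: AgentStatus, debug: boolean): CleanupAction {
  // Agents waiting for user input still need their workdir to resume later.
  if (status === "waiting_for_input") return "skip";
  // Assumption: a running agent is never cleaned up either.
  if (status === "running") return "skip";
  // Finished agents: keep the workdir for inspection in debug mode, delete otherwise.
  return debug ? "archive" : "remove";
}
```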

5. AgentLifecycleController (`src/agent/lifecycle/controller.ts`)

Main orchestrator that coordinates all components
Waits for process completion with robust signal detection
Unified retry orchestration with comprehensive error handling
Post-completion cleanup based on debug mode
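To make the orchestration order concrete, here is a minimal sketch of the loop the controller description implies: clear the signal, run and wait, decide whether to retry, then clean up once. The `LifecycleDeps` interface and `runWithLifecycle` function are hypothetical stand-ins for the real components, and backoff delays between attempts are omitted for brevity.

```typescript
interface RunOutcome { ok: boolean; error?: string }

// Hypothetical dependency surface standing in for SignalManager, ProcessManager,
// RetryPolicy, and CleanupStrategy.
interface LifecycleDeps {
  clearSignal(): Promise<void>;
  runAndWait(): Promise<RunOutcome>;
  shouldRetry(error: string, attempt: number): boolean;
  cleanup(outcome: RunOutcome): Promise<void>;
}

async function runWithLifecycle(deps: LifecycleDeps, maxAttempts = 3): Promise<RunOutcome> {
  let outcome: RunOutcome = { ok: false, error: "not started" };
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    await deps.clearSignal(); // always before spawn/resume, so stale signals are never read
    outcome = await deps.runAndWait();
    if (outcome.ok) break;
    if (!deps.shouldRetry(outcome.error ?? "unknown", attempt)) break;
  }
  await deps.cleanup(outcome); // runs once, after the final attempt
  return outcome;
}
```

Passing the collaborators in as an interface mirrors the dependency injection the factory provides and lets the loop be exercised with in-memory fakes.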

Integration

Factory Pattern

The createLifecycleController() factory wires up all dependencies:

const lifecycleController = createLifecycleController({
  repository,
  processManager,
  cleanupManager,
  accountRepository,
  debug
});

Manager Integration

The MultiProviderAgentManager now provides lifecycle-managed methods alongside legacy methods:

// New lifecycle-managed methods (recommended)
await manager.spawnWithLifecycle(options);
await manager.resumeWithLifecycle(agentId, answers);

// Legacy methods (backwards compatibility)
await manager.spawn(options);
await manager.resume(agentId, answers);

Success Criteria Met

✅ Always clear signal.json before spawn/resume

SignalManager.clearSignal() called before every operation
Eliminates race conditions in completion detection

✅ Wait for process completion

AgentLifecycleController.waitForCompletion() uses ProcessManager
Robust signal detection with timeout handling

✅ Retry up to 3 times with comprehensive error handling

DefaultRetryPolicy implements 3-attempt strategy with backoff
Error-specific retry logic for all failure types

✅ Handle 401/usage limit errors with DB persistence

ErrorAnalyzer detects auth/usage patterns
Errors persisted to database when shouldPersistToDB is true
Account exhaustion marking for usage limits

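✅ Missing signal.json is detected and retried (instruction prompt pending)

missing_signal error type detected when the process exits successfully but writes no signal.json
The retry does not yet include an instruction prompt; that is planned as a future enhancement (see Migration Strategy, Phase 4)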

✅ Clean up workdir or archive based on debug mode

DefaultCleanupStrategy chooses archive vs remove based on debug flag
Respects agent status (never cleans up waiting agents)

✅ Unified lifecycle orchestration

All scattered logic consolidated into single controller
Factory pattern for easy dependency injection
Backwards-compatible integration with existing manager

Testing

Unit tests cover the following components (51 tests in total):

signal-manager.test.ts: 18 tests covering atomic operations
retry-policy.test.ts: 12 tests covering retry logic
error-analyzer.test.ts: 21 tests covering error classification

Run tests: npm test src/agent/lifecycle/

Migration Strategy

Phase 1: Use new lifecycle methods for new features
Phase 2: Gradually migrate existing spawn/resume calls
Phase 3: Remove legacy methods after full migration
Phase 4: Add missing signal instruction prompt enhancement

Benefits

Eliminates race conditions in completion detection
Error-specific retry handling (up to 3 attempts with exponential backoff)
Account exhaustion failover with proper DB tracking
Debug-friendly cleanup with archival support
Unified orchestration replacing scattered logic
Backwards compatible integration
Covered by unit tests for SignalManager, RetryPolicy, and ErrorAnalyzer

This refactor replaces the fragile, scattered agent lifecycle logic with a unified, tested foundation for agent orchestration.
