Codewalkers/docs/agent-lifecycle-refactor.md
Lukas May fab7706f5c feat: Phase schema refactor, agent lifecycle module, and log chunks
Phase model changes:
- Drop `number` column (ordering now by createdAt + dependency DAG)
- Replace `description` (plain text) with `content` (Tiptap JSON)
- Add `approved` status as dispatch gate
- Add phase dependency management (list, remove, dependents)
- Approval gate in PhaseDispatchManager.queuePhase()

Agent log chunks:
- New `agent_log_chunks` table for DB-first output persistence
- LogChunkRepository port + DrizzleLogChunkRepository adapter
- FileTailer onRawContent callback streams chunks to DB
- getAgentOutput reads from DB first, falls back to file

Agent lifecycle module (src/agent/lifecycle/):
- SignalManager: atomic signal.json read/write/wait operations
- RetryPolicy: exponential backoff with error-specific strategies
- ErrorAnalyzer: pattern-based error classification
- CleanupStrategy: debug archival vs production cleanup
- AgentLifecycleController: orchestrates retry/recovery flow
- Missing signal recovery with instruction injection

Completion detection fixes:
- Read signal.json file instead of parsing stdout as JSON
- Cancellable pollForCompletion with { cancel } handle
- Centralized state cleanup via cleanupAgentState()
- Credential handler consolidation (prepareProcessEnv)

Prompts refactor:
- Split monolithic prompts.ts into per-mode modules
- Add workspace layout section to agent prompts
- Fix markdown-to-tiptap double-serialization

Server/tRPC:
- Subscription heartbeat (30s) and bounded queue (1000 max)
- Phase CRUD: approvePhase, deletePhase, dependency queries
- Page: findByIds, getPageUpdatedAtMap
- Wire new repositories through container and context
2026-02-09 22:33:28 +01:00

Agent Lifecycle Management Refactor

Overview

This refactor introduces a unified agent lifecycle system. It consolidates the lifecycle logic that was scattered across the original agent management code and removes the race conditions listed below.

Problem Statement

The original system had several critical issues:

  1. No signal.json clearing before spawn/resume - caused completion race conditions
  2. Scattered retry logic - only commit retry existed, incomplete error handling
  3. Race conditions in completion detection - multiple handlers processing same completion
  4. Complex cleanup timing issues - mixed between manager and cleanup-manager
  5. No unified error analysis - auth/usage limit errors handled inconsistently

Solution Architecture

Core Components

1. SignalManager (src/agent/lifecycle/signal-manager.ts)

  • Always clears signal.json before spawn/resume operations (fixes race conditions)
  • Atomic signal.json operations with proper error handling
  • Robust signal waiting with exponential backoff
  • File validation to detect incomplete writes

// Example usage
await signalManager.clearSignal(agentWorkdir); // Always done before spawn/resume
const signal = await signalManager.waitForSignal(agentWorkdir, 30000);
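
The validation and backoff behavior listed above can be sketched roughly as follows. This is a minimal illustration, not the actual SignalManager: the Signal shape, the readRaw callback, and the 100 ms starting delay with a 5 s cap are all assumptions, since the real signal.json schema and timing constants are not shown in this document.

```typescript
// Hypothetical signal shape; the real signal.json schema is not shown here.
interface Signal {
  status: string;
  [key: string]: unknown;
}

// Returns null for empty, truncated, or structurally invalid content.
function parseSignal(raw: string): Signal | null {
  try {
    const value: unknown = JSON.parse(raw);
    if (
      typeof value === "object" &&
      value !== null &&
      typeof (value as Signal).status === "string"
    ) {
      return value as Signal;
    }
    return null;
  } catch {
    return null; // A partially written file usually fails to parse.
  }
}

type ReadRaw = () => Promise<string | null>; // null means "no file yet"
type Sleep = (ms: number) => Promise<void>;

const realSleep: Sleep = (ms) =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Polls with exponential backoff until a valid signal appears or the
// time budget is used up. Assumed constants: 100 ms start, 5 s cap.
async function waitForSignal(
  readRaw: ReadRaw,
  timeoutMs: number,
  sleep: Sleep = realSleep
): Promise<Signal | null> {
  let delay = 100;
  let waited = 0;
  while (true) {
    const raw = await readRaw();
    const signal = raw === null ? null : parseSignal(raw);
    if (signal) return signal;
    if (waited >= timeoutMs) return null;
    const step = Math.min(delay, timeoutMs - waited);
    await sleep(step);
    waited += step;
    delay = Math.min(delay * 2, 5000);
  }
}
```

Injecting readRaw and sleep keeps the polling logic testable without touching the file system. A truncated file is treated the same as a missing one, so a reader never acts on a half-written signal.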

2. RetryPolicy (src/agent/lifecycle/retry-policy.ts)

  • Comprehensive retry logic with error-specific strategies
  • Up to 3 attempts with exponential backoff (1s, 2s, 4s)
  • Different retry behavior per error type:
    • auth_failure: Retry with token refresh
    • usage_limit: No retry, switch account
    • missing_signal: Retry with instruction prompt
    • process_crash: Retry only if transient
    • timeout: Retry up to max attempts
    • unknown: No retry
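
The per-error behavior above amounts to a small decision table. The sketch below is one possible reading, not the DefaultRetryPolicy itself: the ClassifiedError and RetryDecision shapes, the action names, and the convention that attempt is the 1-based number of the attempt that just failed are assumptions. Under that convention only the 1s and 2s delays are ever used with a cap of 3 attempts; whether the 4s delay applies depends on how the real policy counts attempts.

```typescript
// Error types as listed in this section.
type ErrorType =
  | "auth_failure"
  | "usage_limit"
  | "missing_signal"
  | "process_crash"
  | "timeout"
  | "unknown";

// Hypothetical shapes; the real interfaces are not shown in this document.
interface ClassifiedError {
  type: ErrorType;
  isTransient: boolean;
}

interface RetryDecision {
  retry: boolean;
  delayMs: number;
  action?: "refresh_token" | "switch_account" | "inject_instruction";
}

const MAX_ATTEMPTS = 3;
const BASE_DELAY_MS = 1000;

// Exponential schedule: attempt 1 -> 1000 ms, 2 -> 2000 ms, 3 -> 4000 ms.
function backoffDelay(attempt: number): number {
  return BASE_DELAY_MS * 2 ** (attempt - 1);
}

// attempt is the 1-based number of the attempt that just failed (assumption).
function decideRetry(error: ClassifiedError, attempt: number): RetryDecision {
  const stop: RetryDecision = { retry: false, delayMs: 0 };

  // Usage limits are never retried on the same account.
  if (error.type === "usage_limit") {
    return { retry: false, delayMs: 0, action: "switch_account" };
  }
  if (attempt >= MAX_ATTEMPTS) return stop;

  const delayMs = backoffDelay(attempt);
  switch (error.type) {
    case "auth_failure":
      return { retry: true, delayMs, action: "refresh_token" };
    case "missing_signal":
      return { retry: true, delayMs, action: "inject_instruction" };
    case "process_crash":
      return error.isTransient ? { retry: true, delayMs } : stop;
    case "timeout":
      return { retry: true, delayMs };
    default:
      return stop; // "unknown" is never retried
  }
}
```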

3. ErrorAnalyzer (src/agent/lifecycle/error-analyzer.ts)

  • Intelligent error classification using pattern matching
  • Analyzes error messages, exit codes, stderr, and workdir state
  • Distinguishes between transient and non-transient failures
  • Handles 401/usage limit errors with proper account switching flags

// Error patterns detected:
// Auth: /unauthorized|invalid.*token|401/i
// Usage: /rate.*limit|quota.*exceeded|429/i
// Timeout: /timeout|timed.*out/i
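
As a rough illustration of how these patterns could drive classification, the sketch below applies the three regular expressions quoted above and then falls back to exit-code checks. The FailureInfo shape, the order of the checks, and the rule that any non-zero exit code counts as process_crash are assumptions; only the patterns and the missing_signal condition (process succeeds but no signal) come from this document.

```typescript
type ErrorType =
  | "auth_failure"
  | "usage_limit"
  | "missing_signal"
  | "process_crash"
  | "timeout"
  | "unknown";

// Hypothetical input shape for the analyzer.
interface FailureInfo {
  message: string; // error message and/or stderr text
  exitCode: number | null; // null when the exit code is not known
  signalFound: boolean; // whether signal.json was present in the workdir
}

// Patterns copied from the list above; checked in this order (assumption).
const PATTERNS: Array<[ErrorType, RegExp]> = [
  ["auth_failure", /unauthorized|invalid.*token|401/i],
  ["usage_limit", /rate.*limit|quota.*exceeded|429/i],
  ["timeout", /timeout|timed.*out/i],
];

function classifyError(info: FailureInfo): ErrorType {
  for (const [type, pattern] of PATTERNS) {
    if (pattern.test(info.message)) return type;
  }
  // The process exited cleanly but never wrote signal.json.
  if (info.exitCode === 0 && !info.signalFound) return "missing_signal";
  // Any other non-zero exit is treated as a crash in this sketch.
  if (info.exitCode !== null && info.exitCode !== 0) return "process_crash";
  return "unknown";
}
```

Note that bare numeric alternatives such as 401 and 429 match those digits anywhere in the text, so a real analyzer may want to combine them with exit codes or stricter context before setting account-switching flags.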

4. CleanupStrategy (src/agent/lifecycle/cleanup-strategy.ts)

  • Debug mode archival vs production cleanup
  • Intelligent cleanup decisions based on agent status
  • Never cleans up agents waiting for user input
  • Archives in debug mode, removes in production mode
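
The rules above reduce to a small decision function. The following is a sketch only: the status names and the CleanupAction values are hypothetical, and the real strategy may take more agent states into account.

```typescript
// Hypothetical status names; the real agent status enum is not shown here.
type AgentStatus = "waiting_for_input" | "completed" | "failed";
type CleanupAction = "skip" | "archive" | "remove";

function decideCleanup(status: AgentStatus, debug: boolean): CleanupAction {
  // An agent waiting for user input still needs its workdir to resume.
  if (status === "waiting_for_input") return "skip";
  // Debug mode keeps the workdir for inspection; production removes it.
  return debug ? "archive" : "remove";
}
```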

5. AgentLifecycleController (src/agent/lifecycle/controller.ts)

  • Main orchestrator that coordinates all components
  • Waits for process completion with robust signal detection
  • Unified retry orchestration with comprehensive error handling
  • Post-completion cleanup based on debug mode
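
To make the orchestration order concrete, here is a simplified sketch of the loop the controller coordinates: clear the signal, run the process, check for a signal, classify the failure, consult the retry policy, and clean up at the end. The LifecycleDeps interface and all of its member names are hypothetical stand-ins for the real components, and the sketch ignores details such as agents waiting for user input.

```typescript
type Outcome =
  | { ok: true; signal: object; attempts: number }
  | { ok: false; errorType: string; attempts: number };

// Hypothetical dependency surface standing in for the real components.
interface LifecycleDeps {
  clearSignal: () => Promise<void>;
  runProcess: (attempt: number) => Promise<number>; // resolves with exit code
  readSignal: () => Promise<object | null>;
  classify: (exitCode: number, hasSignal: boolean) => string;
  shouldRetry: (
    errorType: string,
    attempt: number
  ) => { retry: boolean; delayMs: number };
  sleep: (ms: number) => Promise<void>;
  cleanup: (debug: boolean) => Promise<void>;
}

async function runWithLifecycle(
  deps: LifecycleDeps,
  debug: boolean
): Promise<Outcome> {
  let attempt = 0;
  try {
    while (true) {
      attempt += 1;
      // A stale signal must never be mistaken for a fresh completion.
      await deps.clearSignal();
      const exitCode = await deps.runProcess(attempt);
      const signal = await deps.readSignal();
      if (exitCode === 0 && signal !== null) {
        return { ok: true, signal, attempts: attempt };
      }
      const errorType = deps.classify(exitCode, signal !== null);
      const decision = deps.shouldRetry(errorType, attempt);
      if (!decision.retry) {
        return { ok: false, errorType, attempts: attempt };
      }
      await deps.sleep(decision.delayMs);
    }
  } finally {
    // Runs once, after the final attempt, whether it succeeded or not.
    await deps.cleanup(debug);
  }
}
```

Keeping the loop in one function is what removes the duplicate-handler problem described earlier: there is a single place where a completion is observed and acted on.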

Integration

Factory Pattern

The createLifecycleController() factory wires up all dependencies:

const lifecycleController = createLifecycleController({
  repository,
  processManager,
  cleanupManager,
  accountRepository,
  debug
});

Manager Integration

The MultiProviderAgentManager now provides lifecycle-managed methods alongside legacy methods:

// New lifecycle-managed methods (recommended)
await manager.spawnWithLifecycle(options);
await manager.resumeWithLifecycle(agentId, answers);

// Legacy methods (backwards compatibility)
await manager.spawn(options);
await manager.resume(agentId, answers);

Success Criteria Met

Always clear signal.json before spawn/resume

  • SignalManager.clearSignal() is called before every spawn and resume
  • Eliminates race conditions in completion detection

Wait for process completion

  • AgentLifecycleController.waitForCompletion() uses ProcessManager
  • Robust signal detection with timeout handling

Retry up to 3 times with comprehensive error handling

  • DefaultRetryPolicy implements 3-attempt strategy with backoff
  • Error-specific retry logic for all failure types

Handle 401/usage limit errors with DB persistence

  • ErrorAnalyzer detects auth/usage patterns
  • Errors persisted to database when shouldPersistToDB is true
  • Account exhaustion marking for usage limits
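
The persistence and failover flow described here might look roughly like the sketch below. Only the shouldPersistToDB flag comes from this document; the AnalyzedError and Account shapes, the InMemoryAccounts class (a stand-in for the account repository), and handleAccountError are hypothetical names used for illustration.

```typescript
interface AnalyzedError {
  type: "auth_failure" | "usage_limit";
  message: string;
  shouldPersistToDB: boolean; // flag named in this document
}

interface Account {
  id: string;
  exhausted: boolean;
  lastError?: string;
}

// In-memory stand-in for the account repository (hypothetical API).
class InMemoryAccounts {
  private accounts: Account[];

  constructor(ids: string[]) {
    this.accounts = ids.map((id) => ({ id, exhausted: false }));
  }

  get(id: string): Account | undefined {
    return this.accounts.filter((a) => a.id === id)[0];
  }

  recordError(id: string, message: string): void {
    const account = this.get(id);
    if (account) account.lastError = message;
  }

  markExhausted(id: string): void {
    const account = this.get(id);
    if (account) account.exhausted = true;
  }

  nextAvailable(): Account | undefined {
    return this.accounts.filter((a) => !a.exhausted)[0];
  }
}

// Returns the id of the account to use next, or null if none is left.
function handleAccountError(
  repo: InMemoryAccounts,
  accountId: string,
  error: AnalyzedError
): string | null {
  if (error.shouldPersistToDB) repo.recordError(accountId, error.message);
  if (error.type === "usage_limit") {
    repo.markExhausted(accountId);
    const next = repo.nextAvailable();
    return next ? next.id : null;
  }
  // auth_failure: keep the account; the caller refreshes the token.
  return accountId;
}
```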

Missing signal.json triggers resume with instruction

  • missing_signal error type is detected when the process exits successfully but writes no signal.json
  • The instruction prompt for the retry is a planned enhancement (see Migration Strategy, Phase 4)

Clean up workdir or archive based on debug mode

  • DefaultCleanupStrategy chooses archive vs remove based on debug flag
  • Respects agent status (never cleans up waiting agents)

Unified lifecycle orchestration

  • All scattered logic consolidated into single controller
  • Factory pattern for easy dependency injection
  • Backwards-compatible integration with existing manager

Testing

Test coverage by component:

  • signal-manager.test.ts: 18 tests covering atomic operations
  • retry-policy.test.ts: 12 tests covering retry logic
  • error-analyzer.test.ts: 21 tests covering error classification

Run tests: npm test src/agent/lifecycle/

Migration Strategy

  1. Phase 1: Use new lifecycle methods for new features
  2. Phase 2: Gradually migrate existing spawn/resume calls
  3. Phase 3: Remove legacy methods after full migration
  4. Phase 4: Add missing signal instruction prompt enhancement

Benefits

  • Eliminates race conditions in completion detection
  • Comprehensive error handling with intelligent retry
  • Account exhaustion failover with proper DB tracking
  • Debug-friendly cleanup with archival support
  • Unified orchestration replacing scattered logic
  • Backwards compatible integration
  • Covered by 51 tests across signal-manager, retry-policy, and error-analyzer

This refactor replaces the fragile, scattered agent lifecycle logic with a unified, tested foundation for agent orchestration.