From 777d1c504378cc36415b79a8c8bd3b0184234216 Mon Sep 17 00:00:00 2001 From: Lukas May Date: Fri, 30 Jan 2026 12:39:28 +0100 Subject: [PATCH] docs: research multi-agent orchestration ecosystem Researched stack, features, architecture, and pitfalls for Codewalk District. Key findings: - Stack: Commander.js + tRPC + Drizzle/better-sqlite3 + execa - Architecture: Hexagonal monolith with SQLite task queue - Critical pitfall: Process tree cleanup, SQLite WAL handling Ready for roadmap creation. --- .planning/research/ARCHITECTURE.md | 1207 ++++++++++++++++++++++++++++ .planning/research/FEATURES.md | 285 +++++++ .planning/research/PITFALLS.md | 437 ++++++++++ .planning/research/STACK.md | 327 ++++++++ .planning/research/SUMMARY.md | 158 ++++ 5 files changed, 2414 insertions(+) create mode 100644 .planning/research/ARCHITECTURE.md create mode 100644 .planning/research/FEATURES.md create mode 100644 .planning/research/PITFALLS.md create mode 100644 .planning/research/STACK.md create mode 100644 .planning/research/SUMMARY.md diff --git a/.planning/research/ARCHITECTURE.md b/.planning/research/ARCHITECTURE.md new file mode 100644 index 0000000..0f7e65f --- /dev/null +++ b/.planning/research/ARCHITECTURE.md @@ -0,0 +1,1207 @@ +# Task Orchestration Architecture Research + +- **Domain**: Multi-agent orchestration / Developer tooling +- **Researched**: 2026-01-30 +- **Confidence**: HIGH + +--- + +## System Overview + +``` + Codewalk District Architecture + ================================================================================ + + +------------------+ + | CLI Entry | + | (cw binary) | + +--------+---------+ + | + +------------------+------------------+ + | | + +--------v--------+ +--------v--------+ + | Server Mode | | Client Mode | + | (Background) | | (Commands) | + +--------+--------+ +--------+--------+ + | | + +------------------+------------------+ + | + +--------v--------+ + | tRPC Router | + | (API Layer) | + +--------+--------+ + | + 
+-------------+----------------+----------------+-------------+ + | | | | | ++-------v------+ +----v-----+ +--------v-------+ +------v------+ +----v-----+ +| Task | | Agent | | Session | | Workspace | | Config | +| Scheduler | | Pool | | Manager | | Manager | | Store | ++--------------+ +----------+ +----------------+ +-------------+ +----------+ + | | | | | + +-------------+----------------+----------------+-------------+ + | + +--------v--------+ + | Event Bus | + | (EventEmitter) | + +--------+--------+ + | + +-------------+----------------+----------------+-------------+ + | | | | | ++-------v------+ +----v-----+ +--------v-------+ +------v------+ +----v-----+ +| SQLite | | Process | | MCP STDIO | | File | | Event | +| Database | | Spawner | | Transport | | Watcher | | Store | ++--------------+ +----------+ +----------------+ +-------------+ +----------+ + + Infrastructure Layer + ================================================================================ +``` + +--- + +## Component Responsibilities + +| Component | Responsibility | Pattern | Key Interfaces | +|-----------|---------------|---------|----------------| +| **CLI Entry** | Single binary entry point for all modes | Command Pattern | `run()`, `serve()`, `spawn()` | +| **tRPC Router** | Type-safe API layer between client/server | RPC Pattern | Procedures (Query/Mutation/Subscription) | +| **Task Scheduler** | Job queue with priority, retry, delay | Task Queue | `enqueue()`, `dequeue()`, `complete()`, `fail()` | +| **Agent Pool** | Manage Claude Code agent lifecycle | Pool Pattern | `acquire()`, `release()`, `spawn()`, `terminate()` | +| **Session Manager** | Track active sessions and state | State Machine | `create()`, `pause()`, `resume()`, `close()` | +| **Workspace Manager** | File system operations, project context | Repository Pattern | `scan()`, `watch()`, `sync()` | +| **Event Bus** | Decoupled module communication | Pub/Sub | `emit()`, `on()`, `off()` | +| **SQLite Database** | Persistent 
storage for tasks, state | Repository Pattern | Repositories per entity | +| **Process Spawner** | Child process lifecycle management | Supervisor Pattern | `spawn()`, `monitor()`, `restart()` | +| **MCP Transport** | STDIO-based agent communication | Message Protocol | `send()`, `receive()`, `handshake()` | +| **File Watcher** | Detect file system changes | Observer Pattern | `watch()`, `onChange()`, `onAdd()` | +| **Event Store** | Persist domain events (optional CQRS) | Event Sourcing | `append()`, `read()`, `replay()` | + +--- + +## Recommended Project Structure + +``` +src/ +├── bin/ +│ └── cw.ts # CLI entry point +│ +├── application/ # Use Cases (Orchestration) +│ ├── commands/ # Write operations +│ │ ├── spawn-agent.ts +│ │ ├── enqueue-task.ts +│ │ └── create-session.ts +│ ├── queries/ # Read operations +│ │ ├── list-agents.ts +│ │ ├── get-task-status.ts +│ │ └── session-summary.ts +│ └── services/ # Application services +│ ├── task-orchestrator.ts +│ └── session-coordinator.ts +│ +├── domain/ # Core Business Logic +│ ├── agent/ +│ │ ├── agent.ts # Aggregate root +│ │ ├── agent-pool.ts +│ │ └── agent.repository.ts # Port (interface) +│ ├── task/ +│ │ ├── task.ts # Aggregate root +│ │ ├── task-queue.ts +│ │ ├── task.repository.ts # Port (interface) +│ │ └── task.events.ts # Domain events +│ ├── session/ +│ │ ├── session.ts +│ │ └── session.repository.ts +│ └── workspace/ +│ ├── workspace.ts +│ └── workspace.repository.ts +│ +├── infrastructure/ # Adapters (Implementations) +│ ├── persistence/ +│ │ ├── sqlite/ +│ │ │ ├── connection.ts +│ │ │ ├── migrations/ +│ │ │ ├── task.repository.sqlite.ts +│ │ │ ├── agent.repository.sqlite.ts +│ │ │ └── session.repository.sqlite.ts +│ │ └── repositories.ts # Repository factory +│ ├── process/ +│ │ ├── process-spawner.ts +│ │ ├── process-supervisor.ts +│ │ └── stdio-transport.ts +│ ├── mcp/ +│ │ ├── mcp-client.ts +│ │ └── mcp-server.ts +│ ├── filesystem/ +│ │ ├── file-watcher.ts +│ │ └── workspace-scanner.ts +│ └── events/ 
+│ ├── event-bus.ts # In-memory EventEmitter +│ └── event-store.ts # SQLite-backed persistence +│ +├── interfaces/ # Entry Points +│ ├── api/ +│ │ ├── router.ts # tRPC root router +│ │ ├── procedures/ +│ │ │ ├── agent.procedures.ts +│ │ │ ├── task.procedures.ts +│ │ │ └── session.procedures.ts +│ │ └── context.ts # tRPC context +│ ├── cli/ +│ │ ├── commands/ +│ │ │ ├── spawn.ts +│ │ │ ├── status.ts +│ │ │ └── serve.ts +│ │ └── parser.ts +│ └── server/ +│ └── http-server.ts +│ +├── shared/ # Cross-cutting concerns +│ ├── types/ +│ ├── errors/ +│ ├── utils/ +│ └── constants/ +│ +└── config/ + ├── index.ts + └── schema.ts # Zod validation +``` + +--- + +## Architectural Patterns + +### 1. Hexagonal Architecture (Ports & Adapters) + +The core principle: business logic at the center, external concerns at the edges. + +```typescript +// domain/task/task.repository.ts (PORT - Interface) +export interface TaskRepository { + findById(id: TaskId): Promise<Task | null>; + findPending(): Promise<Task[]>; + save(task: Task): Promise<void>; + delete(id: TaskId): Promise<void>; +} + +// infrastructure/persistence/sqlite/task.repository.sqlite.ts (ADAPTER) +export class SqliteTaskRepository implements TaskRepository { + constructor(private db: Database) {} + + async findById(id: TaskId): Promise<Task | null> { + const row = this.db.prepare( + 'SELECT * FROM tasks WHERE id = ?' + ).get(id.value); + return row ? TaskMapper.toDomain(row) : null; + } + + async findPending(): Promise<Task[]> { + const rows = this.db.prepare( + 'SELECT * FROM tasks WHERE status = ? ORDER BY priority DESC, created_at ASC' + ).all('pending'); + return rows.map(TaskMapper.toDomain); + } + + async save(task: Task): Promise<void> { + const data = TaskMapper.toPersistence(task); + this.db.prepare(` + INSERT OR REPLACE INTO tasks (id, type, payload, status, priority, created_at, updated_at) + VALUES (@id, @type, @payload, @status, @priority, @created_at, @updated_at) + `).run(data); + } + + // Required by the TaskRepository port + async delete(id: TaskId): Promise<void> { + this.db.prepare('DELETE FROM tasks WHERE id = ?').run(id.value); + } +} +``` + +### 2. 
Task Queue with SQLite + +A lightweight, persistent task queue without external dependencies. + +```typescript +// domain/task/task-queue.ts +export interface TaskQueuePort { + enqueue(task: Task): Promise<void>; + dequeue(): Promise<Task | null>; + acknowledge(taskId: TaskId): Promise<void>; + fail(taskId: TaskId, error: Error): Promise<void>; + requeue(taskId: TaskId, delay?: number): Promise<void>; +} + +// infrastructure/persistence/sqlite/task-queue.sqlite.ts +export class SqliteTaskQueue implements TaskQueuePort { + private processingTimeout = 30000; // 30s, in milliseconds + + constructor(private db: Database) {} + + async enqueue(task: Task): Promise<void> { + this.db.prepare(` + INSERT INTO task_queue (id, payload, status, priority, run_at, created_at) + VALUES (?, ?, 'pending', ?, ?, ?) + `).run( + task.id.value, + JSON.stringify(task.payload), + task.priority, + task.runAt?.toISOString() ?? new Date().toISOString(), + new Date().toISOString() + ); + } + + async dequeue(): Promise<Task | null> { + // Atomic claim: SELECT + UPDATE in transaction + const task = this.db.transaction(() => { + // Reclaim stale tasks (processing timeout) + this.db.prepare(` + UPDATE task_queue + SET status = 'pending', claimed_at = NULL + WHERE status = 'processing' + AND claimed_at < datetime('now', '-' || ? || ' seconds') + `).run(this.processingTimeout / 1000); + + // Claim next available task. + // run_at is stored as an ISO 8601 string ('...T...Z'), so normalize it with + // datetime() before comparing; a raw text comparison against datetime('now') + // would treat same-day tasks as not yet due ('T' sorts after ' '). + const row = this.db.prepare(` + SELECT * FROM task_queue + WHERE status = 'pending' AND datetime(run_at) <= datetime('now') + ORDER BY priority DESC, created_at ASC + LIMIT 1 + `).get(); + + if (!row) return null; + + this.db.prepare(` + UPDATE task_queue + SET status = 'processing', claimed_at = datetime('now') + WHERE id = ? + `).run(row.id); + + return row; + })(); + + return task ? TaskMapper.toDomain(task) : null; + } + + async acknowledge(taskId: TaskId): Promise<void> { + this.db.prepare(` + UPDATE task_queue + SET status = 'completed', completed_at = datetime('now') + WHERE id = ? 
+ `).run(taskId.value); + } + + async fail(taskId: TaskId, error: Error): Promise<void> { + this.db.prepare(` + UPDATE task_queue + SET status = 'failed', + error = ?, + retry_count = retry_count + 1, + failed_at = datetime('now') + WHERE id = ? + `).run(error.message, taskId.value); + } +} +``` + +### 3. Process Supervision Pattern + +Managing Claude Code agent processes with automatic restart. + +```typescript +// infrastructure/process/process-supervisor.ts +import { spawn, ChildProcess } from 'child_process'; +import { EventEmitter } from 'events'; + +export interface SupervisorConfig { + command: string; + args: string[]; + maxRestarts: number; + restartDelay: number; + env?: Record<string, string>; +} + +export class ProcessSupervisor extends EventEmitter { + private process: ChildProcess | null = null; + private restartCount = 0; + private isRunning = false; + + constructor( + private id: string, + private config: SupervisorConfig + ) { + super(); + } + + async start(): Promise<void> { + if (this.isRunning) return; + this.isRunning = true; + await this.spawn(); + } + + private async spawn(): Promise<void> { + this.process = spawn(this.config.command, this.config.args, { + stdio: ['pipe', 'pipe', 'pipe'], + env: { ...process.env, ...this.config.env }, + }); + + this.process.on('exit', (code, signal) => { + this.emit('exit', { code, signal }); + + if (this.isRunning && code !== 0) { + this.handleCrash(); + } + }); + + this.process.stdout?.on('data', (data) => { + this.emit('stdout', data.toString()); + }); + + this.process.stderr?.on('data', (data) => { + this.emit('stderr', data.toString()); + }); + + this.emit('started', { pid: this.process.pid }); + } + + private async handleCrash(): Promise<void> { + if (this.restartCount >= this.config.maxRestarts) { + this.emit('max-restarts', { restartCount: this.restartCount }); + this.isRunning = false; + return; + } + + this.restartCount++; + this.emit('restarting', { + attempt: this.restartCount, + delay: this.config.restartDelay + }); + + await 
this.delay(this.config.restartDelay); + + if (this.isRunning) { + await this.spawn(); + } + } + + async stop(): Promise<void> { + this.isRunning = false; + if (this.process) { + this.process.kill('SIGTERM'); + // Give the process time to shut down gracefully + await this.delay(1000); + // `killed` only records that a signal was sent, so check for an actual exit + if (this.process.exitCode === null && this.process.signalCode === null) { + this.process.kill('SIGKILL'); + } + } + } + + private delay(ms: number): Promise<void> { + return new Promise(resolve => setTimeout(resolve, ms)); + } +} +``` + +### 4. Worker Pool Pattern + +Managing a pool of Claude Code agents with Piscina-inspired design. + +```typescript +// domain/agent/agent-pool.ts +export interface AgentPoolConfig { + minAgents: number; + maxAgents: number; + idleTimeout: number; + taskTimeout: number; +} + +export class AgentPool { + private agents: Map<Agent['id'], Agent> = new Map(); + private available: Agent[] = []; + private pending: Array<{ + resolve: (agent: Agent) => void; + reject: (error: Error) => void; + }> = []; + + constructor( + private config: AgentPoolConfig, + private spawner: AgentSpawner, + private eventBus: EventBus + ) {} + + async acquire(): Promise<Agent> { + // Try to get an available agent + if (this.available.length > 0) { + const agent = this.available.pop()!; + agent.markBusy(); + return agent; + } + + // Spawn new agent if under limit + if (this.agents.size < this.config.maxAgents) { + const agent = await this.spawnAgent(); + agent.markBusy(); + return agent; + } + + // Wait for an agent to become available + return new Promise<Agent>((resolve, reject) => { + // Keep a reference to the queued entry so the timeout can remove exactly it + const entry = { + resolve: (agent: Agent) => { + clearTimeout(timeout); + resolve(agent); + }, + reject, + }; + + const timeout = setTimeout(() => { + const idx = this.pending.indexOf(entry); + if (idx !== -1) this.pending.splice(idx, 1); + reject(new Error('Agent acquisition timeout')); + }, this.config.taskTimeout); + + this.pending.push(entry); + }); + } + + release(agent: Agent): void { + agent.markIdle(); + + // Fulfill pending request if any + if (this.pending.length > 0) { + const { resolve } = 
this.pending.shift()!; + agent.markBusy(); + resolve(agent); + return; + } + + // Return to pool + this.available.push(agent); + this.eventBus.emit('agent:released', { agentId: agent.id }); + + // Schedule idle cleanup + this.scheduleIdleCleanup(agent); + } + + private async spawnAgent(): Promise<Agent> { + const agent = await this.spawner.spawn(); + this.agents.set(agent.id, agent); + this.eventBus.emit('agent:spawned', { agentId: agent.id }); + return agent; + } + + private scheduleIdleCleanup(agent: Agent): void { + setTimeout(() => { + if (agent.isIdle && this.agents.size > this.config.minAgents) { + void this.terminateAgent(agent); + } + }, this.config.idleTimeout); + } + + private async terminateAgent(agent: Agent): Promise<void> { + this.agents.delete(agent.id); + const idx = this.available.indexOf(agent); + if (idx !== -1) this.available.splice(idx, 1); + await agent.terminate(); + this.eventBus.emit('agent:terminated', { agentId: agent.id }); + } +} +``` + +### 5. Event Bus for Module Communication + +```typescript +// infrastructure/events/event-bus.ts +import { EventEmitter } from 'events'; + +export type DomainEvent = { + type: string; + payload: unknown; + timestamp: Date; + correlationId?: string; +}; + +export interface EventBusPort { + emit(event: DomainEvent): void; + on(eventType: string, handler: (event: DomainEvent) => void): void; + off(eventType: string, handler: (event: DomainEvent) => void): void; + once(eventType: string, handler: (event: DomainEvent) => void): void; +} + +export class InMemoryEventBus implements EventBusPort { + private emitter = new EventEmitter(); + private eventStore?: EventStore; + + constructor(eventStore?: EventStore) { + this.eventStore = eventStore; + this.emitter.setMaxListeners(100); + } + + emit(event: DomainEvent): void { + const enrichedEvent = { + ...event, + timestamp: event.timestamp ?? 
new Date(), + }; + + // Persist event if store is configured + if (this.eventStore) { + this.eventStore.append(enrichedEvent); + } + + this.emitter.emit(event.type, enrichedEvent); + this.emitter.emit('*', enrichedEvent); // Wildcard listener + } + + on(eventType: string, handler: (event: DomainEvent) => void): void { + this.emitter.on(eventType, handler); + } + + off(eventType: string, handler: (event: DomainEvent) => void): void { + this.emitter.off(eventType, handler); + } + + once(eventType: string, handler: (event: DomainEvent) => void): void { + this.emitter.once(eventType, handler); + } +} +``` + +### 6. MCP STDIO Transport + +```typescript +// infrastructure/mcp/stdio-transport.ts +import { ChildProcess } from 'child_process'; +import { createInterface, Interface } from 'readline'; + +export interface MCPMessage { + jsonrpc: '2.0'; + id?: number | string; + method?: string; + params?: unknown; + result?: unknown; + error?: { code: number; message: string; data?: unknown }; +} + +export class StdioTransport { + private readline: Interface; + private messageId = 0; + private pendingRequests = new Map< + number | string, + { resolve: (result: unknown) => void; reject: (error: Error) => void } + >(); + + constructor(private process: ChildProcess) { + this.readline = createInterface({ + input: process.stdout!, + crlfDelay: Infinity, + }); + + this.readline.on('line', (line) => this.handleLine(line)); + } + + private handleLine(line: string): void { + try { + const message: MCPMessage = JSON.parse(line); + + if (message.id !== undefined && (message.result !== undefined || message.error)) { + // Response to our request + const pending = this.pendingRequests.get(message.id); + if (pending) { + this.pendingRequests.delete(message.id); + if (message.error) { + pending.reject(new Error(message.error.message)); + } else { + pending.resolve(message.result); + } + } + } else if (message.method) { + // Notification or request from server + 
this.handleServerMessage(message); + } + } catch (error) { + console.error('Failed to parse MCP message:', error); + } + } + + async request(method: string, params?: unknown): Promise<unknown> { + const id = ++this.messageId; + const message: MCPMessage = { + jsonrpc: '2.0', + id, + method, + params, + }; + + return new Promise<unknown>((resolve, reject) => { + this.pendingRequests.set(id, { resolve, reject }); + this.send(message); + }); + } + + notify(method: string, params?: unknown): void { + const message: MCPMessage = { + jsonrpc: '2.0', + method, + params, + }; + this.send(message); + } + + private send(message: MCPMessage): void { + this.process.stdin!.write(JSON.stringify(message) + '\n'); + } + + // protected (not private) so that a subclass can override it + protected handleServerMessage(message: MCPMessage): void { + // Handle notifications from the agent + // Override in subclass or use event emitter + } + + close(): void { + this.readline.close(); + for (const pending of this.pendingRequests.values()) { + pending.reject(new Error('Transport closed')); + } + this.pendingRequests.clear(); + } +} +``` + +### 7. 
tRPC Router Setup + +```typescript +// interfaces/api/router.ts +import { initTRPC } from '@trpc/server'; +import { z } from 'zod'; +import type { Context } from './context'; + +const t = initTRPC.context<Context>().create(); + +export const router = t.router; +export const publicProcedure = t.procedure; + +// interfaces/api/procedures/task.procedures.ts +export const taskRouter = router({ + enqueue: publicProcedure + .input(z.object({ + type: z.string(), + payload: z.unknown(), + priority: z.number().optional().default(0), + // JSON carries dates as strings unless a transformer is configured, so coerce + runAt: z.coerce.date().optional(), + })) + .mutation(async ({ ctx, input }) => { + const command = new EnqueueTaskCommand(input); + return ctx.commandBus.execute(command); + }), + + status: publicProcedure + .input(z.object({ taskId: z.string() })) + .query(async ({ ctx, input }) => { + const query = new GetTaskStatusQuery(input.taskId); + return ctx.queryBus.execute(query); + }), + + list: publicProcedure + .input(z.object({ + status: z.enum(['pending', 'processing', 'completed', 'failed']).optional(), + limit: z.number().optional().default(50), + })) + .query(async ({ ctx, input }) => { + const query = new ListTasksQuery(input); + return ctx.queryBus.execute(query); + }), + + // Real-time task updates via subscription + onUpdate: publicProcedure + .input(z.object({ taskId: z.string().optional() })) + .subscription(async function* ({ ctx, input }) { + // Assumes ctx.eventBus exposes an async-iterable subscribe(); EventBusPort does not define one + const eventBus = ctx.eventBus; + + for await (const event of eventBus.subscribe('task:*')) { + if (!input.taskId || event.payload.taskId === input.taskId) { + yield event; + } + } + }), +}); + +// Root router +export const appRouter = router({ + task: taskRouter, + agent: agentRouter, + session: sessionRouter, + workspace: workspaceRouter, +}); + +export type AppRouter = typeof appRouter; +``` + +### 8. 
File Watcher with Chokidar + +```typescript +// infrastructure/filesystem/file-watcher.ts +import chokidar, { FSWatcher } from 'chokidar'; +import { EventEmitter } from 'events'; + +export interface FileWatcherConfig { + paths: string[]; + ignored?: string[]; + persistent?: boolean; + usePolling?: boolean; + interval?: number; +} + +export class FileWatcher extends EventEmitter { + private watcher: FSWatcher | null = null; + + constructor(private config: FileWatcherConfig) { + super(); + } + + start(): void { + this.watcher = chokidar.watch(this.config.paths, { + ignored: this.config.ignored ?? [ + '**/node_modules/**', + '**/.git/**', + '**/dist/**', + ], + persistent: this.config.persistent ?? true, + usePolling: this.config.usePolling ?? false, + interval: this.config.interval ?? 100, + ignoreInitial: true, + awaitWriteFinish: { + stabilityThreshold: 100, + pollInterval: 50, + }, + }); + + this.watcher + .on('add', (path) => this.emit('add', { path, type: 'file' })) + .on('addDir', (path) => this.emit('add', { path, type: 'directory' })) + .on('change', (path) => this.emit('change', { path })) + .on('unlink', (path) => this.emit('remove', { path, type: 'file' })) + .on('unlinkDir', (path) => this.emit('remove', { path, type: 'directory' })) + .on('error', (error) => this.emit('error', error)) + .on('ready', () => this.emit('ready')); + } + + async stop(): Promise<void> { + if (this.watcher) { + await this.watcher.close(); + this.watcher = null; + } + } + + addPath(path: string): void { + this.watcher?.add(path); + } + + removePath(path: string): void { + this.watcher?.unwatch(path); + } +} +``` + +--- + +## Data Flow Diagrams + +### Task Execution Flow + +``` +User Command Database Agent Pool + | | | + | 1. cw run "task" | | + v | | ++----------+ | | +| CLI/API | | | ++----+-----+ | | + | | | + | 2. EnqueueTaskCommand | | + v | | ++----------+ | | +| Command | | | +| Handler | | | ++----+-----+ | | + | | | + | 3. 
taskRepo.save() | | + +------------------------->| | + | | (persisted) | + | 4. emit('task:enqueued') | | + v | | ++----------+ | | +| Event | | | +| Bus | | | ++----+-----+ | | + | | | + | 5. Scheduler picks up | | + v | | ++----------+ | | +| Task | 6. dequeue() | | +| Scheduler|<-------------------+ | ++----+-----+ | | + | | | + | 7. pool.acquire() | | + +-------------------------------------------------->| + | | | + | | 8. agent.execute() | + | | +------------------->| + | | | | + | | | 9. MCP Protocol | + | | | (STDIO) | + | | | | + | | | 10. result | + | | |<-------------------| + | | | | + | 11. acknowledge(taskId) | | | + +------------------------->| | | + | | | | + | 12. pool.release() | | | + +-------------------------------------------------->| + | | | + | 13. emit('task:completed') | + v | | +``` + +### Event Flow for Multi-Agent Coordination + +``` + +-------------+ + | Event Bus | + +------+------+ + | + +-----------------------+-----------------------+ + | | | + v v v ++---------------+ +---------------+ +---------------+ +| Task Module | | Agent Module | | Session Module| +| Subscribers: | | Subscribers: | | Subscribers: | +| - task:* | | - agent:* | | - session:* | ++-------+-------+ +-------+-------+ +-------+-------+ + | | | + v v v ++---------------+ +---------------+ +---------------+ +| task:enqueued | | agent:spawned | | session:start | +| task:started | | agent:busy | | session:pause | +| task:completed| | agent:idle | | session:resume| +| task:failed | | agent:error | | session:close | ++---------------+ +---------------+ +---------------+ +``` + +--- + +## Scaling Considerations + +### For SQLite + Local Processes (Single Machine) + +| Concern | Solution | Limit | +|---------|----------|-------| +| **Concurrent agents** | Agent pool with configurable max | ~10-20 agents per machine (memory bound) | +| **Task throughput** | SQLite WAL mode + connection pool | ~10,000-15,000 tasks/sec | +| **Write contention** | Single writer pattern, queue 
writes | Serialize via transaction | +| **Memory pressure** | Streaming results, lazy loading | Monitor with V8 heap stats | +| **File descriptors** | Limit open agent processes | `ulimit -n` considerations | + +### SQLite Optimization for Task Queue + +```typescript +// infrastructure/persistence/sqlite/connection.ts +import Database from 'better-sqlite3'; + +export function createDatabase(path: string): Database.Database { + const db = new Database(path); + + // Enable WAL mode for better concurrent read/write + db.pragma('journal_mode = WAL'); + + // Faster writes, acceptable for local use + db.pragma('synchronous = NORMAL'); + + // Increase cache size (negative = KB) + db.pragma('cache_size = -64000'); // 64MB + + // Enable foreign keys + db.pragma('foreign_keys = ON'); + + // Busy timeout for lock contention + db.pragma('busy_timeout = 5000'); + + return db; +} +``` + +### When NOT to Scale (Anti-Pattern Avoidance) + +``` +DON'T: +- Run multiple `cw` servers on same SQLite file (write contention) +- Spawn unlimited agents (memory exhaustion) +- Use polling intervals < 100ms (CPU waste) +- Store large blobs in task payloads (use file references) + +DO: +- Single server process, multiple agent child processes +- Bounded agent pool (min: 1, max: CPU cores) +- Use SQLite WAL mode +- Store task results as file references, not inline +``` + +--- + +## Anti-Patterns to Avoid + +### 1. Distributed Queue Without Need + +```typescript +// BAD: Using Redis/BullMQ when SQLite suffices +const queue = new Queue('tasks', { connection: redisConfig }); + +// GOOD: SQLite-backed queue for local orchestration +const queue = new SqliteTaskQueue(db); +``` + +### 2. Tight Module Coupling + +```typescript +// BAD: Direct cross-module imports +import { AgentRepository } from '../agent/agent.repository.sqlite'; + +class TaskService { + constructor() { + this.agentRepo = new AgentRepository(db); // Direct coupling! 
+ } +} + +// GOOD: Dependency injection via ports +class TaskService { + constructor(private agentRepo: AgentRepositoryPort) {} +} +``` + +### 3. Synchronous Process Management + +```typescript +// BAD: Blocking on child process +const result = execSync('claude-code run task'); + +// GOOD: Async with proper supervision +const supervisor = new ProcessSupervisor(id, config); +supervisor.on('exit', handleExit); +await supervisor.start(); +``` + +### 4. Unbounded Event Listeners + +```typescript +// BAD: Memory leak from unremoved listeners +eventBus.on('task:completed', handler); +// ... never removed + +// GOOD: Proper cleanup (on() returns void, so remove with off() and the same handler) +eventBus.on('task:completed', handler); +// Later... +eventBus.off('task:completed', handler); +``` + +### 5. Missing Transaction Boundaries + +```typescript +// BAD: Race condition in dequeue +const task = await taskRepo.findPending(); // Another process could grab this +await taskRepo.updateStatus(task.id, 'processing'); + +// GOOD: Atomic claim in transaction +const task = db.transaction(() => { + const row = db.prepare('SELECT ...').get(); + if (!row) return null; // Nothing to claim + // SQL string literals take single quotes; double quotes denote identifiers + db.prepare("UPDATE ... SET status = 'processing'").run(row.id); + return row; +})(); +``` + +### 6. Polling Without Backoff + +```typescript +// BAD: Constant polling +setInterval(() => checkForTasks(), 100); + +// GOOD: Exponential backoff or event-driven +let delay = 100; +async function poll() { + const task = await queue.dequeue(); + if (task) { + delay = 100; // Reset on success + await process(task); + } else { + delay = Math.min(delay * 2, 5000); // Backoff + } + setTimeout(poll, delay); +} +``` + +--- + +## Integration Points + +### 1. 
CLI Entry Point + +```typescript +#!/usr/bin/env node +// bin/cw.ts (the shebang must be the first line of the file) +import { Command } from 'commander'; +import { createContainer } from '../config/container'; + +const program = new Command(); +const container = createContainer(); + +program + .name('cw') + .description('Codewalk District - Multi-agent workspace orchestrator') + .version('0.1.0'); + +program + .command('serve') + .description('Start the orchestration server') + .option('-p, --port <port>', 'Server port', '3000') + .action(async (options) => { + const server = container.resolve('Server'); + await server.start(Number(options.port)); + }); + +program + .command('spawn') + .description('Spawn a Claude Code agent') + .argument('<task>', 'Task description') + .action(async (task) => { + const client = container.resolve('TRPCClient'); + const result = await client.task.enqueue.mutate({ type: 'run', payload: { task } }); + console.log('Task enqueued:', result.id); + }); + +program.parse(); +``` + +### 2. Server Mode Integration + +```typescript +// interfaces/server/http-server.ts +import { createHTTPServer } from '@trpc/server/adapters/standalone'; +import { appRouter } from '../api/router'; +import { createContext } from '../api/context'; + +export class Server { + private httpServer!: ReturnType<typeof createHTTPServer>; + + constructor(private container: Container) {} + + async start(port: number): Promise<void> { + this.httpServer = createHTTPServer({ + router: appRouter, + createContext: () => createContext(this.container), + }); + + this.httpServer.listen(port); + console.log(`Server listening on port ${port}`); + + // Start background services + await this.container.resolve('TaskScheduler').start(); + await this.container.resolve('AgentPool').initialize(); + } + + async stop(): Promise<void> { + await this.container.resolve('TaskScheduler').stop(); + await this.container.resolve('AgentPool').shutdown(); + this.httpServer.server.close(); + } +} +``` + +### 3. 
MCP Server for External Tools + +```typescript +// infrastructure/mcp/mcp-server.ts +export class MCPServer { + private tools: Map<string, ToolHandler> = new Map(); + + registerTool(name: string, handler: ToolHandler): void { + this.tools.set(name, handler); + } + + async handleRequest(request: MCPMessage): Promise<MCPMessage> { + if (request.method === 'tools/list') { + return this.listTools(); + } + + if (request.method === 'tools/call') { + const { name, arguments: args } = request.params as ToolCallParams; + const handler = this.tools.get(name); + + if (!handler) { + return this.error(request.id, -32601, `Tool not found: ${name}`); + } + + const result = await handler(args); + return this.success(request.id, result); + } + + return this.error(request.id, -32601, 'Method not found'); + } +} +``` + +--- + +## Sources + +### Task Queue & Job Processing +- [BullMQ Documentation](https://docs.bullmq.io) +- [BullMQ GitHub](https://github.com/taskforcesh/bullmq) +- [Better Stack - Job Scheduling with BullMQ](https://betterstack.com/community/guides/scaling-nodejs/bullmq-scheduled-tasks/) +- [DigitalOcean - Async Tasks with BullMQ](https://www.digitalocean.com/community/tutorials/how-to-handle-asynchronous-tasks-with-node-js-and-bullmq) +- [SQLite Background Job System](https://jasongorman.uk/writing/sqlite-background-job-system/) +- [plainjob - SQLite Job Queue](https://github.com/justplainstuff/plainjob) + +### Process Management & Supervision +- [Node.js Child Process Documentation](https://nodejs.org/api/child_process.html) +- [PM2 Documentation](https://pm2.keymetrics.io/docs/usage/quick-start/) +- [PM2 Cluster Mode](https://pm2.keymetrics.io/docs/usage/cluster-mode/) +- [TheLinuxCode - Mastering Child Processes](https://thelinuxcode.com/mastering-node-js-child-processes-an-in-depth-practical-guide/) + +### Worker Pools +- [Piscina GitHub](https://github.com/piscinajs/piscina) +- [Piscina Documentation](https://piscinajs.dev/) +- [NearForm - Learning 
Piscina](https://www.nearform.com/blog/learning-to-swim-with-piscina-the-node-js-worker-pool/) +- [Workerpool npm](https://www.npmjs.com/package/workerpool) +- [Advanced Web - Worker Pools in Node.js](https://advancedweb.hu/using-worker-pools-in-nodejs/) + +### Hexagonal Architecture +- [Software Patterns Lexicon - Ports and Adapters](https://softwarepatternslexicon.com/patterns-ts/7/7/2/) +- [Better Programming - Ports and Adapters with TypeScript](https://betterprogramming.pub/how-to-ports-and-adapter-with-typescript-32a50a0fc9eb) +- [Alex Rusin - Ports & Adapters Guide](https://blog.alexrusin.com/future-proof-your-code-a-guide-to-ports-adapters-hexagonal-architecture/) +- [DEV - Hexagonal Architecture Examples](https://dev.to/dyarleniber/hexagonal-architecture-and-clean-architecture-with-examples-48oi) + +### Event-Driven Architecture +- [Medium - EDA with Node.js](https://medium.com/@erickzanetti/event-driven-architecture-eda-with-node-js-a-modern-approach-and-challenges-82e7d9932b34) +- [FreeCodeCamp - Event-Based Architectures](https://www.freecodecamp.org/news/event-based-architectures-in-javascript-a-handbook-for-devs/) +- [GeeksforGeeks - Node.js Event Architecture](https://www.geeksforgeeks.org/explain-the-event-driven-architecture-of-node-js/) + +### File System Watching +- [Chokidar GitHub](https://github.com/paulmillr/chokidar) +- [Chokidar npm](https://www.npmjs.com/package/chokidar) + +### Workflow Engines & CI/CD +- [Temporal Documentation](https://docs.temporal.io/evaluate/use-cases-design-patterns) +- [Temporal Blog - Workflow Engine Principles](https://temporal.io/blog/workflow-engine-principles) +- [Temporal TypeScript SDK](https://github.com/temporalio/sdk-typescript) +- [GitHub Actions Runner Architecture](https://depot.dev/blog/github-actions-runner-architecture-part-1-the-listener) +- [nektos/act - Local GitHub Actions](https://github.com/nektos/act) + +### tRPC +- [tRPC Documentation](https://trpc.io/docs/) +- [Better Stack - From REST to 
tRPC](https://betterstack.com/community/guides/scaling-nodejs/trpc-explained/) +- [Medium - Understanding tRPC](https://medium.com/@ignatovich.dm/understanding-trpc-building-type-safe-apis-in-typescript-45258c6c3b73) + +### Model Context Protocol +- [MCP Architecture Overview](https://modelcontextprotocol.io/docs/learn/architecture) +- [IBM - What is MCP](https://www.ibm.com/think/topics/model-context-protocol) +- [Google Cloud - MCP Guide](https://cloud.google.com/discover/what-is-model-context-protocol) +- [Composio - MCP Explained](https://composio.dev/blog/what-is-model-context-protocol-mcp-explained) + +### Modular Monolith +- [Software Architecture Guild - Modular Monolith](https://software-architecture-guild.com/guide/architecture/styles/modular-monolith/) +- [DEV - Structuring Modular Monoliths](https://dev.to/xoubaman/modular-monolith-3fg1) +- [Medium - Scalable Express API with Modular Monolith](https://medium.com/@mwwtstq/building-a-scalable-express-api-using-clean-architecture-and-a-modular-monolith-with-typescript-c855614b05dc) +- [GitHub - Modular Monolith Node.js](https://github.com/mgce/modular-monolith-nodejs) + +### CQRS & Event Sourcing +- [Event-Driven.io - Emmett](https://event-driven.io/en/introducing_emmett/) +- [Medium - CQRS From Scratch with TypeScript](https://medium.com/swlh/cqrs-from-scratch-with-typescript-e2ccf7fc2b64) +- [DEV - Event Sourcing in TypeScript](https://dev.to/dariowoollover/an-opinionated-guide-to-event-sourcing-in-typescript-kickoff-42d6) diff --git a/.planning/research/FEATURES.md b/.planning/research/FEATURES.md new file mode 100644 index 0000000..e7dd292 --- /dev/null +++ b/.planning/research/FEATURES.md @@ -0,0 +1,285 @@ +# Feature Research: Codewalk District + +## Metadata +- **Domain**: Multi-agent orchestration / Developer tooling +- **Researched**: 2026-01-30 +- **Confidence**: HIGH +- **Focus**: Solo developer orchestrating multiple Claude Code agents + +--- + +## Executive Summary + +The multi-agent AI tooling 
space is maturing rapidly. Tools like Claude Squad, Claude Flow, and par have established baseline expectations. Your differentiator isn't the *ability* to run multiple agents—that's table stakes now. The differentiator is in *coordination quality*: preventing conflicts before they happen, making review manageable, and keeping the developer in control without drowning in context switches. + +--- + +## Table Stakes (Must Have) + +| Feature | Why Expected | Complexity | Notes | +|---------|--------------|------------|-------| +| **Git worktree isolation** | Prevents agents from stepping on each other. Every serious parallel workflow tool uses this. | Low | Claude Squad, par, incident.io all use this pattern | +| **tmux session management** | Persistent, detachable sessions. Users expect to close terminal and come back. | Low | Standard pattern; unreliable without delays for send-keys | +| **Background execution** | "Yolo mode" / auto-accept. Users want agents working while they're away. | Low | Claude Squad pioneered this; now expected | +| **Task status visibility** | Users need to see what each agent is doing at a glance | Medium | X of Y pattern or spinner per agent | +| **Agent lifecycle management** | Start, stop, pause, resume agents | Low | Basic CRUD for agent sessions | +| **Branch isolation** | Each agent works on its own branch | Low | Natural consequence of worktree approach | +| **Session persistence** | Resume work after disconnect/reboot | Medium | tmux provides this; need to restore state | +| **Basic logging** | Capture stdout/stderr per agent | Low | Essential for debugging and review | +| **CLI interface** | Terminal-native; developers live here | Low | No web UI needed for v1 | +| **Git diff preview** | See changes before merging | Low | Standard git tooling | + +--- + +## Differentiators (Competitive Advantage) + +| Feature | Value Proposition | Complexity | Notes | +|---------|-------------------|------------|-------| +| **Intelligent task 
breakdown** | Turn "implement auth" into parallelizable subtasks. Most tools leave this to the user. | High | CrewAI/LangGraph do this for agents; nobody does it well for Claude Code orchestration | +| **Conflict prevention via file-level coordination** | Prevent merge hell by knowing which files each agent is touching | High | Agent-MCP has "file-level locking"; huge pain point | +| **Review-first workflow** | Bottleneck is review, not generation. Tools that acknowledge this win. | Medium | Most tools assume shipping is the goal; quality-first is rare | +| **Dependency-aware task ordering** | Don't start task B until task A creates the interface it depends on | High | Graph-based execution like LangGraph | +| **Context summarization between agents** | Agent B can learn from Agent A's work without reading everything | High | Reduces token waste; improves coherence | +| **Token/cost tracking per agent** | Know what each agent costs; budget accordingly | Medium | PM2 has resource monitoring; AI tools rarely have cost visibility | +| **Merge conflict prediction** | Warn before conflicts happen, not after | High | Harmony AI does post-hoc resolution; prediction is better | +| **Interactive task tree** | Visual hierarchy of tasks, dependencies, status | Medium | ClickUp-style breakdown but terminal-native | +| **Automatic PR generation** | Merge agent's work, create PR with description | Medium | Tedious step that can be automated | +| **Cross-agent communication** | Agent A can ask Agent B a question mid-task | Very High | CrewAI has this; complex to implement well | + +--- + +## Anti-Features (Avoid Building) + +| Feature | Why Requested | Why Problematic | Alternative | +|---------|---------------|-----------------|-------------| +| **Web dashboard** | "Modern" / visual appeal | Adds complexity; developers prefer terminal; maintenance burden | Rich terminal UI (blessed, ink, etc.) 
| +| **Agent-to-agent direct chat** | "Collaboration" | Unpredictable behavior; context explosion; debugging nightmare | Structured handoffs via task completion | +| **Automatic merge without review** | "Faster" | Quality disaster; merge conflicts; broken code | Preview + one-command merge | +| **Unlimited parallel agents** | "Maximum throughput" | Review bottleneck; resource exhaustion; diminishing returns | Soft cap with warning (3-5 agents) | +| **Plugin system (v1)** | "Extensibility" | Premature; need to learn what users actually want first | Hardcode common patterns; plugins in v2 | +| **Cloud sync** | "Work anywhere" | Privacy concerns; complexity; not the core problem | Local-first; git is your sync | +| **AI task decomposition (v1)** | "Smarter" | Unpredictable; users need to understand their tasks first | Manual breakdown with templates | +| **Real-time collaboration** | "Team feature" | Solo developer focus; complexity explosion | git push/pull is collaboration | + +--- + +## Feature Dependencies + +``` +[Git worktree isolation] + | + v +[tmux session management] -----> [Session persistence] + | + v +[Agent lifecycle management] + | + +----+----+ + | | + v v +[Task status [Background + visibility] execution] + | + v +[Basic logging] + | + v +[Git diff preview] + | + v +[Interactive task tree] -----> [Dependency-aware ordering] + | + v +[Conflict prevention] -----> [Merge conflict prediction] + | + v +[Automatic PR generation] +``` + +--- + +## MVP Definition + +### v1.0 - "It Works" +Core value: Run multiple Claude Code agents without losing track. + +- Git worktree creation/cleanup per agent +- tmux session per agent +- Start/stop/list agents CLI +- Basic status display (which agents running, which branch) +- Background execution mode +- Session resume after terminal close +- Simple logging (stdout capture) +- `cs new ` / `cs list` / `cs stop ` / `cs status` + +**Success metric**: Can run 3 agents in parallel on different features without conflicts. 
+ +### v1.x - "It's Pleasant" +Core value: Workflow feels natural, not janky. + +- Rich terminal UI (status dashboard) +- Task description per agent +- Git diff preview per agent +- One-command merge to main +- Token/cost tracking per agent +- PR generation from merged work +- Task templates for common patterns +- Configurable agent limits with warnings + +**Success metric**: User can manage 5 parallel agents without context-switch fatigue. + +### v2.0 - "It's Smart" +Core value: The tool helps you work better, not just faster. + +- Intelligent task breakdown suggestions +- File-level coordination (who's touching what) +- Dependency-aware task ordering +- Conflict prediction before merge +- Cross-agent context summarization +- Review workflow integration +- Historical task/cost analytics + +**Success metric**: 50% fewer merge conflicts; review time per agent drops. + +--- + +## Feature Prioritization Matrix + +| Feature | User Impact | Implementation Effort | Priority | +|---------|-------------|----------------------|----------| +| Git worktree isolation | Critical | Low | **P0** | +| tmux session management | Critical | Low | **P0** | +| Agent lifecycle (start/stop/list) | Critical | Low | **P0** | +| Basic status display | High | Low | **P0** | +| Background execution | High | Low | **P0** | +| Session persistence | High | Medium | **P1** | +| Basic logging | High | Low | **P1** | +| Git diff preview | Medium | Low | **P1** | +| Rich terminal UI | Medium | Medium | **P2** | +| Token tracking | Medium | Medium | **P2** | +| PR generation | Medium | Medium | **P2** | +| File-level coordination | High | High | **P2** | +| Conflict prediction | High | High | **P3** | +| Task dependency ordering | Medium | High | **P3** | +| AI task breakdown | Medium | Very High | **P4** | + +--- + +## Competitor Feature Analysis + +### Claude Squad +**Strengths**: +- First-mover; established pattern +- Clean tmux + worktree architecture +- Background daemon for auto-accept +- 
Supports multiple agent types (Claude, Aider, Codex) + +**Weaknesses**: +- No task breakdown/planning +- No conflict prevention +- No cost tracking +- Basic status visibility + +**Opportunity**: They solved "run multiple agents." You solve "coordinate multiple agents." + +### par (Parallel Worktree & Session Manager) +**Strengths**: +- Global CLI tool +- Simple mental model +- Designed for AI assistants + +**Weaknesses**: +- Minimal coordination features +- No agent-specific intelligence +- Basic session management + +**Opportunity**: par is infrastructure; you're workflow. + +### Claude Flow +**Strengths**: +- Enterprise positioning +- MCP integration +- "Swarm intelligence" marketing + +**Weaknesses**: +- Over-engineered for solo developers +- Complex configuration +- Unclear value prop for simple use cases + +**Opportunity**: Simplicity wins for solo developers. + +### Native Claude Code Subagents +**Strengths**: +- Built-in; no extra tools +- Resumable sessions +- Tool access control + +**Weaknesses**: +- Single main context (subagents are delegated) +- No true parallelism +- No isolation between subagents + +**Opportunity**: True parallel execution that native subagents can't provide. + +--- + +## Key Insights + +1. **Review is the bottleneck**: Tools optimize for generation speed. Smart tools optimize for review speed. Two well-reviewed PRs beat five half-reviewed ones. + +2. **Conflict prevention > conflict resolution**: Everyone's building AI merge conflict resolvers. Nobody's preventing conflicts in the first place. File-level coordination is the gap. + +3. **Token costs are invisible**: Developers have no idea what parallel agents cost. First tool to make costs visible wins trust. + +4. **3-5 agents is the sweet spot**: Research consistently shows diminishing returns beyond 5 parallel agents. The bottleneck is human review capacity. + +5. 
**Graph-based coordination is coming**: LangGraph's architecture (nodes + edges + state) is the right mental model for multi-agent coordination. CrewAI's role-based model is simpler but less flexible. + +6. **Terminal-native wins**: Developers live in the terminal. Web dashboards are a distraction for this use case. + +7. **Local-first, always**: Privacy matters. No cloud sync needed when git is your collaboration layer. + +--- + +## Sources + +### Multi-Agent Orchestration +- [Top AI Agent Orchestration Frameworks for Developers 2025](https://www.kubiya.ai/blog/ai-agent-orchestration-frameworks) +- [AI Agent Orchestration Frameworks - n8n Blog](https://blog.n8n.io/ai-agent-orchestration-frameworks/) +- [CrewAI - The Leading Multi-Agent Platform](https://www.crewai.com/) +- [LangGraph](https://www.langchain.com/langgraph) +- [LangGraph vs CrewAI - ZenML Blog](https://www.zenml.io/blog/langgraph-vs-crewai) +- [CrewAI vs LangGraph vs AutoGen - DataCamp](https://www.datacamp.com/tutorial/crewai-vs-langgraph-vs-autogen) + +### Claude Code Multi-Agent Tools +- [Claude Code Subagents Documentation](https://code.claude.com/docs/en/sub-agents) +- [Claude Squad - GitHub](https://github.com/smtg-ai/claude-squad) +- [Claude Flow - GitHub](https://github.com/ruvnet/claude-flow) +- [par - Parallel Worktree & Session Manager](https://github.com/coplane/par) +- [Advanced Claude Code Techniques - Medium](https://medium.com/@salwan.mohamed/advanced-claude-code-techniques-multi-agent-workflows-and-parallel-development-for-devops-89377460252c) + +### Git Worktree Workflows +- [How Git Worktrees Changed My AI Agent Workflow - Nx Blog](https://nx.dev/blog/git-worktrees-ai-agents) +- [Parallel Workflows: Git Worktrees and Multiple AI Agents - Medium](https://medium.com/@dennis.somerville/parallel-workflows-git-worktrees-and-the-art-of-managing-multiple-ai-agents-6fa3dc5eec1d) +- [How we're shipping faster with Claude Code and Git Worktrees - 
incident.io](https://incident.io/blog/shipping-faster-with-claude-code-and-git-worktrees) +- [gwq - Git Worktree Manager](https://github.com/d-kuro/gwq) + +### AI Coding Assistants +- [Aider vs Cursor - Sider](https://sider.ai/blog/ai-tools/ai-aider-vs-cursor-which-ai-coding-assistant-wins-for-2025) +- [Best AI Coding Assistants 2026 - Shakudo](https://www.shakudo.io/blog/best-ai-coding-assistants) +- [Top 9 Cursor Alternatives - Cline](https://cline.bot/blog/top-9-cursor-alternatives-in-2025-best-open-source-ai-dev-tools-for-developers) + +### Process Management +- [PM2 vs Supervisord - StackShare](https://stackshare.io/stackups/pm2-vs-supervisord) +- [PM2 - Production Process Manager](https://pm2.keymetrics.io/) +- [Cronicle - Multi-Server Task Scheduler](https://github.com/jhuckaby/Cronicle) + +### CLI UX & Monitoring +- [CLI UX Best Practices: Progress Displays - Evil Martians](https://evilmartians.com/chronicles/cli-ux-best-practices-3-patterns-for-improving-progress-displays) +- [WTF - Terminal Dashboard](https://wtfutil.com/) +- [Sampler - Console Dashboards](https://dzone.com/articles/build-beautiful-console-dashboards-with-sampler) + +### Conflict Resolution +- [The Role of AI in Merge Conflict Resolution - Graphite](https://graphite.com/guides/ai-code-merge-conflict-resolution) +- [Agent-MCP - Multi-Agent Framework](https://github.com/rinadelph/Agent-MCP) +- [Multi-Agent Parallel Execution - Skywork](https://skywork.ai/blog/agent/multi-agent-parallel-execution-running-multiple-ai-agents-simultaneously/) diff --git a/.planning/research/PITFALLS.md b/.planning/research/PITFALLS.md new file mode 100644 index 0000000..126534b --- /dev/null +++ b/.planning/research/PITFALLS.md @@ -0,0 +1,437 @@ +# Codewalk District: Critical Pitfalls Reference + +- **Domain:** Multi-agent orchestration / Developer tooling +- **Researched:** 2026-01-30 +- **Confidence:** HIGH + +This document catalogs preventable mistakes for CLI tools that manage background processes, SQLite 
databases, git worktrees, and file system watchers. It focuses on what actually breaks in production, not on theoretical edge cases.
+
+---
+
+## Critical Pitfalls
+
+### 1. Zombie and Orphan Processes
+
+**What goes wrong:** Child processes (Claude Code agents) continue running after the parent exits, consuming resources indefinitely. `child.kill()` only sends a signal to the direct child: it does not terminate that child's own descendants, and an exited child stays in the process table as a zombie until its parent reaps it. Orphaned processes accumulate with every request.
+
+**Why it happens:** An exited process is only removed from the process table once its parent collects its status via `wait`/`waitpid`. When only the parent process is killed, its children are orphaned, re-parented to PID 1, and keep running. `child.kill()` signals a single process, so any grandchildren (for example, commands started by an agent) survive it.
+
+**How to avoid:**
+- Use the `terminate` npm package to kill process trees (parent + all children)
+- Detect parent exit in child processes, e.g. when spawned with an IPC channel: `process.on('disconnect', () => process.exit(0))`
+- For background daemons: use `detached: true` with `child.unref()` intentionally
+- Track all spawned PIDs and clean them up on shutdown
+
+**Warning signs:**
+- Memory usage grows over time
+- `ps aux | grep node` shows unexpected processes
+- System becomes sluggish after extended use
+
+**Phase to address:** Core Architecture (before spawning any agents)
+
+**Sources:**
+- [5 Tips for Cleaning Orphaned Node.js Processes](https://medium.com/@arunangshudas/5-tips-for-cleaning-orphaned-node-js-processes-196ceaa6d85e)
+- [Node.js Issue #14445: Zombie Processes](https://github.com/nodejs/node/issues/14445)
+- [Killing Process Families with Node](https://medium.com/@almenon214/killing-processes-with-node-772ffdd19aad)
+
+---
+
+### 2. SQLite WAL File Separation
+
+**What goes wrong:** Database becomes corrupted or loses committed transactions when the `-wal` and `-shm` files are separated from the main `.db` file. This happens during backups, file moves, or container restarts.
+
+**Why it happens:** SQLite WAL mode uses three files: `.db`, `.db-wal` (write-ahead log), and `.db-shm` (shared memory). The WAL file contains committed transactions that have not yet been checkpointed into the main database. Copying only the `.db` file therefore loses those recent commits or corrupts the database.
+
+**How to avoid:**
+- Never use `fs.copyFile()` on an active SQLite database
+- Use SQLite's backup API (`sqlite3_backup_*`) or `sqlite3_rsync` (v3.47.0+)
+- If you must copy: start an `IMMEDIATE` transaction first to block writers
+- Always copy all three files atomically
+- Run `PRAGMA wal_checkpoint(TRUNCATE)` before backups
+
+**Warning signs:**
+- "database disk image is malformed" errors
+- Missing recent data after restart
+- `-wal` file grows without bound
+
+**Phase to address:** Data Layer (first database operation)
+
+**Sources:**
+- [SQLite Corruption with fs.copyFile() in WAL Mode](https://scottspence.com/posts/sqlite-corruption-fs-copyfile-issue)
+- [How To Corrupt An SQLite Database File](https://www.sqlite.org/howtocorrupt.html)
+- [High Performance SQLite: How to Corrupt SQLite](https://highperformancesqlite.com/watch/corrupt)
+
+---
+
+### 3. SQLITE_BUSY Despite busy_timeout
+
+**What goes wrong:** You set `busy_timeout` but still get "database is locked" errors. Transactions fail unexpectedly under concurrent load.
+
+**Why it happens:** `busy_timeout` doesn't help when a `DEFERRED` transaction tries to upgrade from read to write mid-transaction while another connection holds the lock. SQLite returns `SQLITE_BUSY` immediately without invoking the busy handler, because waiting cannot resolve the conflict: the transaction has to be rolled back and started again.
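
Because the failure described above surfaces immediately, a timeout alone cannot absorb it; the whole transaction has to be run again. The following is a minimal sketch of application-level retry with exponential backoff, not a definitive implementation. `isBusyError`, `backoffDelays`, and `withBusyRetry` are hypothetical helper names, and the check on `err.code` assumes the SQLite driver exposes result codes as strings such as `SQLITE_BUSY`.

```typescript
// Hypothetical helpers; none of these are part of a SQLite driver.

/** True when an error looks like SQLite's "database is locked" condition. */
function isBusyError(err: unknown): boolean {
  // Assumption: the driver attaches the SQLite result code as `err.code`.
  const code = (err as { code?: unknown } | null)?.code;
  return typeof code === "string" && code.startsWith("SQLITE_BUSY");
}

/** Delay before each retry: baseMs * 2^attempt, capped at capMs. */
function backoffDelays(retries: number, baseMs = 10, capMs = 1000): number[] {
  return Array.from({ length: retries }, (_, i) => Math.min(capMs, baseMs * Math.pow(2, i)));
}

interface RetryOptions {
  retries?: number;
  baseMs?: number;
  capMs?: number;
  /** Injectable for tests; defaults to a timer-based sleep. */
  sleep?: (ms: number) => Promise<void>;
}

const timerSleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

/**
 * Runs `tx` and re-runs it when it fails with SQLITE_BUSY.
 * `tx` is expected to open its own transaction and roll back on failure,
 * so every attempt starts from a clean state.
 */
async function withBusyRetry<T>(
  tx: () => T | Promise<T>,
  opts: RetryOptions = {}
): Promise<T> {
  const { retries = 5, baseMs = 10, capMs = 1000, sleep = timerSleep } = opts;
  const waits = backoffDelays(retries, baseMs, capMs);
  for (let attempt = 0; ; attempt++) {
    try {
      return await tx();
    } catch (err) {
      // Give up on non-busy errors or once all retries are used.
      if (!isBusyError(err) || attempt >= waits.length) throw err;
      await sleep(waits[attempt]);
    }
  }
}
```

A caller would wrap each write transaction, for example `await withBusyRetry(() => saveTaskResult(task))`, where `saveTaskResult` is a placeholder for the project's own write function.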
+
+**How to avoid:**
+- Use `BEGIN IMMEDIATE` for any transaction that will write
+- Never upgrade from read-only to read-write mid-transaction
+- Implement retry logic with exponential backoff at the application level
+- Consider using a write queue/serialization for all writes
+
+**Warning signs:**
+- Sporadic "database is locked" errors under load
+- Errors only appear with concurrent operations
+- `busy_timeout` appears ineffective
+
+**Phase to address:** Data Layer (transaction design)
+
+**Sources:**
+- [SQLITE_BUSY Errors Despite Setting Timeout](https://berthub.eu/articles/posts/a-brief-post-on-sqlite3-database-locked-despite-timeout/)
+- [Improving SQLite Concurrency](https://fractaledmind.com/2023/12/11/sqlite-on-rails-improving-concurrency/)
+- [SQLite Forum: SQLITE_BUSY and Deadlock](https://sqlite.org/forum/forumpost/4350638e78869137)
+
+---
+
+### 4. Checkpoint Starvation in SQLite
+
+**What goes wrong:** The WAL file grows indefinitely, disk space runs out, and query performance degrades progressively.
+
+**Why it happens:** A checkpoint can only run to completion and reset the WAL when no reader is still using it. If you have constant read traffic (e.g., polling for agent status), there is never such a gap and the WAL file cannot be recycled. Long-running read transactions have the same effect.
+
+**How to avoid:**
+- Run a checkpoint periodically, e.g. `db.pragma('wal_checkpoint(RESTART)')` in better-sqlite3
+- Use `PRAGMA wal_autocheckpoint=N` with an appropriate N
+- Avoid long-running read transactions
+- Monitor WAL file size and alert when it exceeds a threshold
+- Consider read connection pooling with short-lived connections
+
+**Warning signs:**
+- `-wal` file growing continuously
+- Slowing read performance over time
+- Disk space warnings
+
+**Phase to address:** Data Layer (connection management)
+
+**Sources:**
+- [Write-Ahead Logging](https://sqlite.org/wal.html)
+- [Improving Concurrency with better-sqlite3](https://wchargin.com/better-sqlite3/performance.html)
+
+---
+
+### 5.
Graceful Shutdown Failures + +**What goes wrong:** Agents are killed mid-operation, database writes are lost, file handles leak, and users lose work. Docker/Kubernetes sends SIGKILL after timeout. + +**Why it happens:** Node.js doesn't handle shutdown gracefully by default. `server.close()` doesn't close keep-alive connections. `process.exit()` abandons pending async operations. Docker sends SIGKILL after 10 seconds if SIGTERM isn't handled. + +**How to avoid:** +- Listen for SIGTERM and SIGINT: `process.on('SIGTERM', shutdown)` +- Set `process.exitCode` instead of calling `process.exit()` directly +- Track in-flight operations and wait for completion +- Implement timeout: graceful shutdown attempt, then force after N seconds +- For child processes: propagate signals to all children before exiting + +**Warning signs:** +- Data loss after restarts +- "Port already in use" after crash +- Orphaned temporary files +- Corrupted state files + +**Phase to address:** Core Architecture (process lifecycle) + +**Sources:** +- [Graceful Shutdown with Node.js and Kubernetes](https://blog.risingstack.com/graceful-shutdown-node-js-kubernetes/) +- [Implementing NodeJS HTTP Graceful Shutdown](https://blog.dashlane.com/implementing-nodejs-http-graceful-shutdown/) +- [Graceful Shutdown in NodeJS](https://hackernoon.com/graceful-shutdown-in-nodejs-2f8f59d1c357) + +--- + +### 6. Git Worktree Orphaning + +**What goes wrong:** Worktrees become "orphaned" - git thinks they exist but they don't, or they exist but git doesn't know about them. Operations fail silently or with cryptic errors. + +**Why it happens:** Worktrees moved/deleted without using `git worktree move/remove`. Parent repository moved, breaking the worktree links. Worktree on unmounted network/portable storage gets auto-pruned. 
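
Stale registrations of the kind described above can be detected by comparing git's own records with the file system. The sketch below assumes the output format of `git worktree list --porcelain` (one `worktree <path>` line per entry, followed by attribute lines, with entries separated by blank lines). `parseWorktreeList` and `findMissingWorktrees` are hypothetical helpers, and the `exists` callback is injected so the logic can be exercised without touching the disk.

```typescript
// Hypothetical helpers for spotting worktrees that git still lists
// but whose directories are gone.

interface WorktreeEntry {
  path: string;
  branch?: string; // e.g. "refs/heads/agent-1"; absent when HEAD is detached
  locked: boolean;
  prunable: boolean;
}

/** Parses the text printed by `git worktree list --porcelain`. */
function parseWorktreeList(porcelain: string): WorktreeEntry[] {
  const entries: WorktreeEntry[] = [];
  let current: WorktreeEntry | undefined;
  for (const line of porcelain.split(/\r?\n/)) {
    if (line.startsWith("worktree ")) {
      // A "worktree" line starts a new entry.
      current = { path: line.slice("worktree ".length), locked: false, prunable: false };
      entries.push(current);
    } else if (!current) {
      continue; // ignore anything before the first entry
    } else if (line.startsWith("branch ")) {
      current.branch = line.slice("branch ".length);
    } else if (line === "locked" || line.startsWith("locked ")) {
      current.locked = true;
    } else if (line === "prunable" || line.startsWith("prunable ")) {
      current.prunable = true;
    }
  }
  return entries;
}

/** Returns the entries whose directory no longer exists. */
function findMissingWorktrees(
  entries: WorktreeEntry[],
  exists: (path: string) => boolean
): WorktreeEntry[] {
  return entries.filter((entry) => !exists(entry.path));
}
```

In real use, `exists` would typically be `fs.existsSync` and the input would come from running `git worktree list --porcelain` in the main repository. Locked entries that are missing may simply be on unmounted storage, so they should not be pruned blindly.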
+ +**How to avoid:** +- Always use `git worktree remove` to delete worktrees +- Run `git worktree prune` periodically to clean stale entries +- Use `git worktree lock` for worktrees on removable storage +- After moving the main repo, run `git worktree repair` +- Never `rm -rf` a worktree directory directly + +**Warning signs:** +- "fatal: not a git repository" in worktree +- `git worktree list` shows paths that don't exist +- Operations succeed in main repo but fail in worktree + +**Phase to address:** Git Integration (worktree lifecycle) + +**Sources:** +- [Git Worktree Documentation](https://git-scm.com/docs/git-worktree) +- [Worktree Cleanup Fails Issue](https://github.com/obra/superpowers/issues/238) + +--- + +### 7. File Watcher Race Conditions + +**What goes wrong:** Files created between directory scan and watcher setup are never detected. Events fire before files are fully written. Multiple watcher instances conflict. + +**Why it happens:** Chokidar scans directory contents, then sets up fs.watch - any file created between these operations is missed. Large files emit 'add' before write completes. Multiple chokidar instances (e.g., your app + Vite) interfere with each other. 
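
One way to close the gap between the initial scan and watcher setup described above is to scan a second time once the watcher is attached and reconcile the two results. This is a sketch of that idea under stated assumptions, not something chokidar does itself: `reconcileSnapshots` is a hypothetical helper, snapshots are plain maps from path to modification time, and `alreadyReported` holds paths the watcher has already emitted events for, so they are not processed twice.

```typescript
// Hypothetical helper; not part of chokidar.
type FileSnapshot = Map<string, number>; // path -> modification time (ms)

interface Reconciliation {
  added: string[];   // present only in the second scan
  changed: string[]; // present in both scans, modified in between
  removed: string[]; // present only in the first scan
}

function reconcileSnapshots(
  firstScan: FileSnapshot,
  secondScan: FileSnapshot,
  alreadyReported: Set<string> = new Set<string>()
): Reconciliation {
  const result: Reconciliation = { added: [], changed: [], removed: [] };

  secondScan.forEach((mtime, path) => {
    if (alreadyReported.has(path)) return; // the watcher already handled it
    const previous = firstScan.get(path);
    if (previous === undefined) result.added.push(path);
    else if (previous !== mtime) result.changed.push(path);
  });

  firstScan.forEach((_mtime, path) => {
    if (!secondScan.has(path) && !alreadyReported.has(path)) result.removed.push(path);
  });

  // Sort for deterministic output regardless of scan order.
  result.added.sort();
  result.changed.sort();
  result.removed.sort();
  return result;
}
```

The paths in `added` are exactly the files that would otherwise be missed; feeding them through the same handler as regular watcher events keeps the rest of the pipeline unchanged.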
+ +**How to avoid:** +- Use `awaitWriteFinish: true` for large files (polls until size stabilizes) +- Call `.close()` with `await` - it's async and returns a Promise +- Avoid multiple watcher instances on same directory +- Increase `fs.inotify.max_user_watches` on Linux +- Add delay/debounce for rapid file changes + +**Warning signs:** +- Newly created files sometimes not detected +- Events fire twice for same file +- "ENOSPC: System limit for file watchers reached" +- HMR/hot reload stops working + +**Phase to address:** File System UI (watcher setup) + +**Sources:** +- [Chokidar GitHub](https://github.com/paulmillr/chokidar) +- [Race Condition When Watching Dirs](https://github.com/paulmillr/chokidar/issues/1112) +- [Vite Issue #12495](https://github.com/vitejs/vite/issues/12495) + +--- + +### 8. process.exit() Kills Async Operations + +**What goes wrong:** Log messages never reach their destination. Database writes are lost. Cleanup callbacks don't complete. + +**Why it happens:** `process.exit()` forces immediate termination even with pending I/O. Writes to stdout are sometimes async in Node.js. Exit event listeners can only run synchronous code. + +**How to avoid:** +- Set `process.exitCode = N` instead of calling `process.exit(N)` +- Let the event loop drain naturally +- Use async-cleanup or node-cleanup packages for async cleanup +- Keep cleanup fast (< 2 seconds) to avoid SIGKILL from OS + +**Warning signs:** +- Log messages truncated or missing +- Final state not persisted +- "Cleanup completed" message never appears + +**Phase to address:** Core Architecture (exit handling) + +**Sources:** +- [Node.js process.exit() Documentation](https://nodejs.org/api/process.html) +- [Exiting Node.js Asynchronously](https://blog.soulserv.net/exiting-node-js-asynchronously/) +- [async-cleanup Package](https://github.com/trevorr/async-cleanup) + +--- + +### 9. Cross-Platform Path Handling + +**What goes wrong:** Paths work on macOS but break on Windows. 
Case-sensitivity issues cause deployment failures. Scripts fail with "command not found." + +**Why it happens:** Windows uses backslash, Unix uses forward slash. Windows is case-insensitive, Linux is case-sensitive. `spawn` without `shell: true` can't run `.cmd`/`.bat` files on Windows. Environment variable syntax differs (`SET` vs `export`). + +**How to avoid:** +- Always use `path.join()` and `path.normalize()` - never string concatenation +- Use `path.sep` for platform-specific separators +- Test with actual case used in `require()` statements +- Use `cross-env` for environment variables in npm scripts +- Use `shell: true` or explicitly handle `.cmd` files on Windows +- Use `rimraf` instead of `rm -rf` + +**Warning signs:** +- "File not found" only on certain platforms +- Import works locally but fails in CI +- npm scripts fail on Windows + +**Phase to address:** Core Architecture (all file operations) + +**Sources:** +- [Writing Cross-Platform Node.js](https://shapeshed.com/writing-cross-platform-node/) +- [Cross-Platform Node Guide](https://github.com/ehmicky/cross-platform-node-guide) +- [Awesome Cross-Platform Node.js](https://github.com/bcoe/awesome-cross-platform-nodejs) + +--- + +### 10. Multi-Agent State Race Conditions + +**What goes wrong:** Two agents update the same resource simultaneously. Plan ownership is ambiguous. State gets duplicated or lost between agents. Deadlocks occur when agents wait on each other. + +**Why it happens:** Race conditions increase quadratically with agent count: N agents = N(N-1)/2 potential concurrent interactions. Tool contention when one agent reads while another writes. No clear source of truth for shared state. 
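
The lost-update case described above, where one agent reads while another writes, can be made detectable with version numbers. The following is a minimal in-memory sketch of optimistic locking; `VersionedStore` and `pairCount` are hypothetical names. In a real implementation the version would more likely live in the SQLite row and be checked in the `WHERE` clause of the `UPDATE`.

```typescript
// Hypothetical in-memory illustration of optimistic locking.

interface Versioned<T> {
  value: T;
  version: number;
}

class VersionedStore<T> {
  private records = new Map<string, Versioned<T>>();

  /** Returns a copy of the record so callers cannot mutate the store directly. */
  read(key: string): Versioned<T> | undefined {
    const record = this.records.get(key);
    return record ? { ...record } : undefined;
  }

  /**
   * Writes only if the caller saw the latest version.
   * Returns false on conflict; the caller should re-read and retry.
   * A key that has never been written has version 0.
   */
  write(key: string, value: T, expectedVersion: number): boolean {
    const currentVersion = this.records.get(key)?.version ?? 0;
    if (currentVersion !== expectedVersion) return false;
    this.records.set(key, { value, version: currentVersion + 1 });
    return true;
  }
}

/** Number of agent pairs that can interact concurrently: N(N-1)/2. */
function pairCount(agents: number): number {
  return (agents * (agents - 1)) / 2;
}
```

With this shape, the second of two agents that read the same version gets `false` back from `write` instead of silently overwriting the first agent's change.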
+ +**How to avoid:** +- Define clear resource ownership - one agent owns each resource +- Use optimistic locking with version numbers +- Implement file/resource locking for shared assets +- Design for eventual consistency where possible +- Use event-driven patterns with message queues +- Add comprehensive workflow checkpointing for rollback + +**Warning signs:** +- Inconsistent state after concurrent operations +- Operations that "should have worked" show wrong results +- Agents appear stuck waiting + +**Phase to address:** Agent Orchestration (coordination protocol) + +**Sources:** +- [AI Agent Orchestration for Production Systems](https://redis.io/blog/ai-agent-orchestration/) +- [Multi-Agent System Reliability](https://www.getmaxim.ai/articles/multi-agent-system-reliability-failure-patterns-root-causes-and-production-validation-strategies/) +- [Multi-Agent Coordination Strategies](https://galileo.ai/blog/multi-agent-coordination-strategies) + +--- + +## Technical Debt Patterns + +| Pattern | Symptom | Fix Cost (Early) | Fix Cost (Late) | +|---------|---------|------------------|-----------------| +| No process tree tracking | Orphan accumulation | 2h | 2d (debugging + cleanup) | +| String path concatenation | Cross-platform breaks | 1h | 8h (find all instances) | +| Missing graceful shutdown | Data loss, port conflicts | 4h | 2d (state recovery) | +| Synchronous DB operations | UI freezes, timeouts | 4h | 1w (async refactor) | +| No WAL checkpoint strategy | Disk full, slow queries | 2h | 1d (emergency cleanup) | +| Implicit transactions | SQLITE_BUSY under load | 2h | 3d (transaction audit) | +| Direct worktree deletion | Orphan worktrees | 1h | 4h (manual repair) | +| Single watcher instance assumed | Silent event drops | 2h | 1d (debugging) | + +--- + +## Integration Gotchas + +| Integration | Gotcha | Mitigation | +|-------------|--------|------------| +| SQLite + Network FS | WAL doesn't work over NFS/SMB - silent corruption | Only use local filesystems | 
+| SQLite + Docker volumes | Shared memory files can't be shared across containers | One container per database |
+| Git worktree + Docker | Worktree paths break when container paths differ | Use consistent mount paths |
+| chokidar + Vite | Multiple watchers conflict, HMR stops | Single watcher, or namespace paths |
+| Node.js + Docker | As PID 1, Node.js gets no default signal handling and doesn't forward signals | Use `--init` or tini |
+| npm scripts + signals | npm doesn't forward SIGTERM to children properly | Run node directly, not via npm |
+| spawn + Windows | `.cmd` files need `shell: true` | Check platform, use shell on Windows |
+| better-sqlite3 + threads | Synchronous API blocks the event loop while waiting on `busy_timeout` | Use worker threads for writes |
+
+---
+
+## Performance Traps
+
+| Trap | Impact | Solution |
+|------|--------|----------|
+| Polling SQLite for agent status | CPU burn, battery drain | Use file watchers or IPC |
+| Not batching SQLite writes | Up to ~1000x slower inserts | Wrap in transaction |
+| Watching entire repo | inotify limit exhaustion | Watch specific directories |
+| Synchronous file operations | Event loop blocked | Use `fs.promises`, worker threads |
+| Spawning shell for every command | 10-100ms overhead per spawn | Use `shell: false` where possible |
+| No connection reuse | Connection overhead per query | Reuse database connection |
+| Logging to file synchronously | I/O blocks main thread | Async logging, buffer writes |
+
+---
+
+## Security Mistakes
+
+| Mistake | Risk | Mitigation |
+|---------|------|------------|
+| Storing credentials in SQLite | Database theft exposes secrets | Use OS keychain, env vars |
+| Executing user input in shell | Command injection | Never combine `shell: true` with user input; pass arguments as an array |
+| World-readable socket files | Unauthorized access | Set restrictive permissions |
+| No validation on IPC messages | Malicious command execution | Schema validation, allowlists |
+| Logging sensitive data | Credential exposure in logs | Redact tokens, keys, passwords |
+| Worktree in /tmp | Other
users can access | Use user-specific directories | + +--- + +## UX Pitfalls + +| Pitfall | User Experience | Fix | +|---------|-----------------|-----| +| Silent failures | User thinks it worked | Always report errors clearly | +| No progress indication | User thinks it's frozen | Show spinners, progress bars | +| Cryptic error messages | User can't fix the problem | Include actionable guidance | +| Requiring too many flags | High learning curve | Interactive mode for discovery | +| Inconsistent command names | User guesses wrong | Follow conventions (git-style) | +| No examples in help | User doesn't know how to start | Show common use cases first | +| Breaking changes in config | User's setup stops working | Version configs, migrate automatically | +| Long startup time | User abandons tool | Lazy loading, fast path for common cases | + +--- + +## "Looks Done But Isn't" Checklist + +- [ ] **Process cleanup**: Verified all child processes terminate on parent exit +- [ ] **Signal handling**: SIGTERM, SIGINT handled; cleanup completes before SIGKILL timeout +- [ ] **Database integrity**: WAL files included in backups; checkpoint strategy defined +- [ ] **Concurrent writes**: Tested with multiple agents writing simultaneously +- [ ] **Worktree cleanup**: Verified `git worktree remove` used, not `rm -rf` +- [ ] **File watcher coverage**: Tested files created during watcher startup +- [ ] **Cross-platform**: Tested on Windows (paths, signals, shell commands) +- [ ] **Error messages**: All errors have actionable guidance +- [ ] **Graceful degradation**: System handles missing files, network failures +- [ ] **Resource limits**: Tested with max expected agents/worktrees/watchers +- [ ] **Restart recovery**: System recovers correct state after crash/kill +- [ ] **Long-running stability**: No memory leaks, file handle leaks over hours + +--- + +## Recovery Strategies + +| Failure | Detection | Recovery | +|---------|-----------|----------| +| Orphan processes | `ps` shows 
unexpected node processes | Kill process tree, track PIDs |
+| Corrupted SQLite | "malformed" error on query | Restore from backup, rebuild |
+| Orphan worktrees | `git worktree list` shows missing paths | `git worktree prune` |
+| Stuck file watcher | Events stop firing | Restart watcher, check limits |
+| Checkpoint starvation | WAL file > 100MB | Force checkpoint, close long readers |
+| Zombie agents | Agent shows running but unresponsive | Track heartbeats, timeout + kill |
+| State inconsistency | Agents report different state | Single source of truth, reconcile |
+| Port in use | "EADDRINUSE" on restart | Track PID file, kill stale process |
+
+---
+
+## Pitfall-to-Phase Mapping
+
+| Phase | Must Address |
+|-------|--------------|
+| **Core Architecture** | Process lifecycle, signal handling, exit handling, cross-platform paths |
+| **Data Layer** | SQLite WAL handling, transactions, checkpoint strategy, backups |
+| **Git Integration** | Worktree lifecycle, repair/prune, lock handling |
+| **Agent Orchestration** | Process tree tracking, state ownership, race condition prevention |
+| **File System UI** | Watcher setup, race conditions, resource limits |
+| **CLI Interface** | Error messages, interactive mode, help examples |
+
+---
+
+## Key Sources
+
+### Process Management
+- [5 Tips for Cleaning Orphaned Node.js Processes](https://medium.com/@arunangshudas/5-tips-for-cleaning-orphaned-node-js-processes-196ceaa6d85e)
+- [Killing Process Families with Node](https://medium.com/@almenon214/killing-processes-with-node-772ffdd19aad)
+- [Die, Child Process, Die!](https://www.exratione.com/2013/05/die-child-process-die/)
+- [Node.js Child Process Documentation](https://nodejs.org/api/child_process.html)
+
+### SQLite
+- [How To Corrupt An SQLite Database File](https://www.sqlite.org/howtocorrupt.html)
+- [SQLite Write-Ahead Logging](https://sqlite.org/wal.html)
+- [SQLITE_BUSY Errors Despite
Timeout](https://berthub.eu/articles/posts/a-brief-post-on-sqlite3-database-locked-despite-timeout/)
+- [SQLite Myths: Concurrent Writes](https://fractaledmind.com/2023/10/13/sqlite-myths-concurrent-writes-can-corrupt-the-database/)
+- [Improving SQLite Concurrency](https://fractaledmind.com/2023/12/11/sqlite-on-rails-improving-concurrency/)
+
+### Git Worktrees
+- [Git Worktree Documentation](https://git-scm.com/docs/git-worktree)
+- [Treehouse Worktree MCP Server](https://glama.ai/mcp/servers/@mark-hingston/treehouse-worktree)
+
+### Graceful Shutdown
+- [Graceful Shutdown with Node.js and Kubernetes](https://blog.risingstack.com/graceful-shutdown-node-js-kubernetes/)
+- [Implementing NodeJS HTTP Graceful Shutdown](https://blog.dashlane.com/implementing-nodejs-http-graceful-shutdown/)
+- [Graceful Shutdown in NodeJS](https://hackernoon.com/graceful-shutdown-in-nodejs-2f8f59d1c357)
+
+### File Watching
+- [Chokidar Repository](https://github.com/paulmillr/chokidar)
+- [Chokidar Race Condition Issue](https://github.com/paulmillr/chokidar/issues/1112)
+
+### Cross-Platform
+- [Writing Cross-Platform Node.js](https://shapeshed.com/writing-cross-platform-node/)
+- [Cross-Platform Node Guide](https://github.com/ehmicky/cross-platform-node-guide)
+
+### Multi-Agent Systems
+- [AI Agent Orchestration for Production](https://redis.io/blog/ai-agent-orchestration/)
+- [Multi-Agent System Reliability](https://www.getmaxim.ai/articles/multi-agent-system-reliability-failure-patterns-root-causes-and-production-validation-strategies/)
+- [Multi-Agent Coordination Strategies](https://galileo.ai/blog/multi-agent-coordination-strategies)
+
+### CLI UX
+- [UX Patterns for CLI Tools](https://lucasfcosta.com/2022/06/01/ux-patterns-cli-tools.html)
diff --git a/.planning/research/STACK.md b/.planning/research/STACK.md
new file mode 100644
index 0000000..1dbd35a
--- /dev/null
+++ b/.planning/research/STACK.md
@@ -0,0 +1,327 @@
+# Stack Research: TypeScript CLI with Embedded Server and
SQLite
+
+**Domain:** Multi-agent orchestration / Developer tooling
+**Researched:** 2026-01-30
+**Confidence:** HIGH (based on official docs, npm stats, GitHub activity, authoritative blog posts)
+
+---
+
+## Core Technologies
+
+| Name | Version | Purpose | Why Recommended |
+|------|---------|---------|-----------------|
+| **Node.js** | v22+ LTS | Runtime | Native ESM support, `node:sqlite` experimental, stable LTS |
+| **TypeScript** | 5.7+ | Type safety | Strict mode required for tRPC/Zod inference |
+| **Commander.js** | 14.x | CLI framework | Lightweight, TypeScript types included, mature (9+ years), subcommand support |
+| **tRPC** | 11.x | API layer | Type-safe RPC, standalone server adapter, SSE subscriptions, no codegen |
+| **Drizzle ORM** | 0.44+ | Database ORM | Lightweight, TypeScript-first, SQL-centric; low overhead over raw better-sqlite3 when queries are prepared |
+| **better-sqlite3** | 11.x | SQLite driver | Synchronous API, fastest SQLite driver for Node.js, Drizzle's recommended driver |
+| **Zod** | 3.24+ | Validation | tRPC's default validator, runtime + compile-time safety, Standard Schema compliant |
+
+---
+
+## Supporting Libraries
+
+| Name | Version | Purpose | Notes |
+|------|---------|---------|-------|
+| **execa** | 9.6+ | Process spawning | Promise-based, auto-cleanup, cross-platform, graceful shutdown |
+| **simple-git** | 3.27+ | Git operations | Wrapper around git CLI, worktree config support |
+| **cors** | 2.8+ | CORS handling | For tRPC standalone server in dev mode |
+| **nanoid** | 5.x | ID generation | URL-safe, small, fast |
+| **consola** | 3.x | CLI logging | Pretty console output, log levels |
+| **chalk** | 5.x | Terminal styling | ESM-native, color output |
+
+---
+
+## Development Tools
+
+| Tool | Version | Purpose |
+|------|---------|---------|
+| **tsup** | 8.5+ | Build/bundle | esbuild-powered, zero-config, ESM+CJS output |
+| **tsx** | 4.x | Dev execution | TypeScript execution without compilation, shebang
support |
+| **vitest** | 3.x | Testing | Vite-powered, native ESM/TS, fast, parallel |
+| **drizzle-kit** | 0.31+ | Migrations | Schema migrations, introspection |
+| **@types/better-sqlite3** | latest | Types | TypeScript definitions |
+| **@types/node** | 22.x | Types | Node.js type definitions |
+
+---
+
+## Installation Commands
+
+```bash
+# Core dependencies
+npm install commander @trpc/server drizzle-orm better-sqlite3 zod
+
+# Supporting libraries
+npm install execa simple-git nanoid consola chalk cors
+
+# Development dependencies
+npm install -D typescript tsup tsx vitest drizzle-kit @types/better-sqlite3 @types/node
+```
+
+---
+
+## Project Structure (Hexagonal Architecture)
+
+```
+src/
+├── cli/                    # CLI entry points (commander)
+│   ├── index.ts            # Main CLI binary
+│   └── commands/           # Subcommand handlers
+├── server/                 # tRPC server
+│   ├── router.ts           # tRPC router definitions
+│   ├── context.ts          # Request context
+│   └── adapters/           # HTTP adapter config
+├── core/                   # Domain logic (hexagonal core)
+│   ├── domain/             # Entities, value objects
+│   ├── ports/              # Interfaces (inbound/outbound)
+│   └── services/           # Application services
+├── infrastructure/         # Adapters
+│   ├── db/                 # Drizzle schema, repositories
+│   ├── git/                # Git worktree operations
+│   └── agents/             # Claude Code process spawning
+└── shared/                 # Shared types, utils
+    ├── schema.ts           # Zod schemas (shared contracts)
+    └── types.ts            # Inferred types
+```
+
+---
+
+## Key Patterns
+
+### tRPC Standalone Server Setup
+
+```typescript
+import { initTRPC } from '@trpc/server';
+import { createHTTPServer } from '@trpc/server/adapters/standalone';
+import cors from 'cors';
+
+// Placeholder context; the real context shape is not defined in this research
+const createContext = () => ({});
+type Context = Awaited<ReturnType<typeof createContext>>;
+
+const t = initTRPC.context<Context>().create();
+
+const appRouter = t.router({
+  // procedures
+});
+
+// Export only the router type so the CLI client can infer procedure types
+export type AppRouter = typeof appRouter;
+
+const server = createHTTPServer({
+  middleware: cors(),
+  router: appRouter,
+  createContext,
+});
+
+server.listen(3000);
+```
+
+### Drizzle + better-sqlite3 Setup
+
+```typescript
+import { drizzle } from 'drizzle-orm/better-sqlite3';
+import Database from
'better-sqlite3';
+import { eq, sql } from 'drizzle-orm';
+import { tasks } from './schema'; // hypothetical path to the Drizzle table definitions
+
+const sqlite = new Database('codewalk.db');
+const db = drizzle(sqlite);
+
+// Use prepared statements for performance
+const prepared = db.select().from(tasks).where(eq(tasks.id, sql.placeholder('id'))).prepare();
+// taskId is assumed to be in scope (the ID of the task to look up)
+const result = prepared.execute({ id: taskId });
+```
+
+### Process Spawning with execa
+
+```typescript
+import { execa } from 'execa';
+
+// Spawn Claude Code agent
+const subprocess = execa('claude', ['--worktree', worktreePath], {
+  cwd: worktreePath,
+  stdio: ['pipe', 'pipe', 'pipe'],
+});
+
+// Graceful cleanup
+process.on('SIGTERM', () => subprocess.kill());
+```
+
+### Commander CLI with Subcommands
+
+```typescript
+import { Command } from 'commander';
+
+const program = new Command();
+
+program
+  .name('cw')
+  .description('Codewalk District - Multi-agent orchestration')
+  .version('0.1.0');
+
+program
+  .command('serve')
+  .description('Start the orchestration server')
+  .option('-p, --port <port>', 'Port number', '3000')
+  .action(async (options) => {
+    // Start tRPC server
+  });
+
+// createTask and listTasks are handler functions defined elsewhere
+program
+  .command('task')
+  .description('Manage tasks')
+  .addCommand(new Command('create').action(createTask))
+  .addCommand(new Command('list').action(listTasks));
+
+// Parse argv; parseAsync waits for async action handlers
+await program.parseAsync(process.argv);
+```
+
+---
+
+## Alternatives Considered
+
+### CLI Frameworks
+| Alternative | Why Not Chosen |
+|-------------|----------------|
+| **oclif** | Overkill for single developer, steeper learning curve, heavier |
+| **yargs** | Less TypeScript-native, callback-heavy syntax |
+| **clipanion** | Less ecosystem support than commander |
+| **Ink** | React overhead unnecessary for this use case |
+
+### Database
+| Alternative | Why Not Chosen |
+|-------------|----------------|
+| **Prisma** | Heavy, slow cold starts, codegen complexity |
+| **TypeORM** | Complex queries lose type safety |
+| **node:sqlite** | Experimental, not production-ready |
+| **libsql** | Unnecessary Turso features for local-only use |
+
+### HTTP Server
+| Alternative | Why Not Chosen |
+|-------------|----------------|
+| **Fastify** | tRPC standalone is simpler for this use case |
+| **Express** | Slower, less modern |
+| **Hono** | Better for edge, unnecessary complexity for Node CLI |
+
+### Process Management
+| Alternative | Why Not Chosen |
+|-------------|----------------|
+| **child_process** | Lower-level, manual cleanup required |
+| **shelljs** | Less maintained, synchronous-focused |
+| **cross-spawn** | execa wraps this with better DX |
+
+---
+
+## What NOT to Use
+
+| Technology | Reason |
+|------------|--------|
+| **Express** | Slower than alternatives, less type-safe |
+| **node-sqlite3** | Async API slower than better-sqlite3's sync API |
+| **Prisma** | Too heavy for embedded CLI database |
+| **GraphQL** | Overkill for internal API, tRPC is simpler |
+| **Sequelize** | TypeScript support inferior to Drizzle |
+| **npm link** | Use `tsx` for development instead |
+| **ts-node** | tsx is faster, better ESM support |
+| **Jest** | Vitest is faster, native ESM/TS support |
+
+---
+
+## Version Compatibility Notes
+
+- **Node.js 22+** required for stable ESM, modern `child_process` features
+- **TypeScript 5.7+** for improved inference (`satisfies` itself has been available since 4.9)
+- **tRPC v11** requires `"strict": true` in tsconfig
+- **Drizzle 0.44+** for latest SQLite blob handling fixes
+- **execa 9+** is ESM-only (no CommonJS)
+- **chalk 5+** is ESM-only
+- **better-sqlite3** requires native compilation (node-gyp)
+
+### tsconfig.json Recommendations
+
+```json
+{
+  "compilerOptions": {
+    "target": "ES2022",
+    "module": "NodeNext",
+    "moduleResolution": "NodeNext",
+    "strict": true,
+    "esModuleInterop": true,
+    "skipLibCheck": true,
+    "outDir": "dist",
+    "declaration": true,
+    "composite": true
+  }
+}
+```
+
+### package.json Type Configuration
+
+```json
+{
+  "type": "module",
+  "exports": {
+    ".": {
+      "types": "./dist/index.d.ts",
+      "import": "./dist/index.js"
+    }
+  },
+  "bin": {
+    "cw": "./dist/cli/index.js"
+  }
+}
+```
+
+---
+
+## Sources
+
+### CLI
Frameworks
+- [Commander.js GitHub](https://github.com/tj/commander.js) - Official repo, v14 docs
+- [Commander.js npm](https://www.npmjs.com/package/commander) - Download stats, version history
+- [Building TypeScript CLI with Commander](https://blog.logrocket.com/building-typescript-cli-node-js-commander/) - LogRocket guide
+- [CLI Framework Comparison](https://npm-compare.com/commander,oclif,vorpal,yargs) - npm-compare analysis
+- [Bloomberg Stricli Alternatives](https://bloomberg.github.io/stricli/docs/getting-started/alternatives) - Framework comparison
+
+### tRPC
+- [tRPC Standalone Adapter](https://trpc.io/docs/server/adapters/standalone) - Official docs (HIGH confidence)
+- [tRPC v11 Announcement](https://trpc.io/blog/announcing-trpc-v11) - Official blog (HIGH confidence)
+- [tRPC Migration Guide](https://trpc.io/docs/migrate-from-v10-to-v11) - v10 to v11 migration
+- [tRPC npm](https://www.npmjs.com/package/@trpc/server) - Package info
+- [tRPC + Zod Best Practices](https://betterstack.com/community/guides/scaling-nodejs/trpc-explained/) - Better Stack guide
+
+### SQLite / Drizzle
+- [better-sqlite3 GitHub](https://github.com/WiseLibs/better-sqlite3) - Official repo (HIGH confidence)
+- [Drizzle ORM SQLite](https://orm.drizzle.team/docs/quick-sqlite/better-sqlite3) - Official docs (HIGH confidence)
+- [Drizzle Releases](https://github.com/drizzle-team/drizzle-orm/releases) - Version history
+- [Drizzle Getting Started](https://betterstack.com/community/guides/scaling-nodejs/drizzle-orm/) - Better Stack guide
+- [TypeScript ORMs 2025](https://www.bytebase.com/blog/top-typescript-orm/) - Bytebase comparison
+
+### Process Spawning
+- [execa GitHub](https://github.com/sindresorhus/execa) - Official repo (HIGH confidence)
+- [execa npm v9](https://www.npmjs.com/package/execa/v/9.0.0) - Package info
+- [execa Guide](https://betterstack.com/community/guides/scaling-nodejs/execa-cli/) - Better Stack guide
+- [Node.js
child_process](https://nodejs.org/api/child_process.html) - Official Node docs (HIGH confidence)
+
+### Git Libraries
+- [simple-git npm](https://www.npmjs.com/package/simple-git) - Package info
+- [isomorphic-git](https://isomorphic-git.org/) - Alternative for browser use
+- [Git library comparison](https://npm-compare.com/isomorphic-git,nodegit,simple-git) - npm-compare
+
+### Build Tools
+- [tsup GitHub](https://github.com/egoist/tsup) - Official repo (HIGH confidence)
+- [tsup Guide](https://generalistprogrammer.com/tutorials/tsup-npm-package-guide) - Usage guide
+- [Vitest](https://vitest.dev/) - Official docs (HIGH confidence)
+- [TypeScript ESM/CJS 2025](https://lirantal.com/blog/typescript-in-2025-with-esm-and-cjs-npm-publishing) - Liran Tal analysis
+
+### Validation
+- [Zod](https://zod.dev/) - Official docs (HIGH confidence)
+- [tRPC Validators](https://trpc.io/docs/server/validators) - Input validation docs
+
+---
+
+## Confidence Assessment
+
+| Area | Confidence | Reasoning |
+|------|------------|-----------|
+| CLI Framework | HIGH | Commander is battle-tested (13M weekly downloads), clear winner for simplicity |
+| tRPC | HIGH | v11 officially released, standalone adapter well-documented |
+| SQLite Stack | HIGH | better-sqlite3 + Drizzle is the recommended combination in official docs |
+| Process Spawning | HIGH | execa is the de facto standard (150M+ weekly downloads) |
+| Build Tooling | HIGH | tsup/tsx/vitest is the modern consensus for TypeScript libraries |
+| Git Operations | MEDIUM | simple-git works but worktree support may need CLI fallback |
+
+---
+
+*Last updated: 2026-01-30*
diff --git a/.planning/research/SUMMARY.md b/.planning/research/SUMMARY.md
new file mode 100644
index 0000000..7f8f187
--- /dev/null
+++ b/.planning/research/SUMMARY.md
@@ -0,0 +1,158 @@
+# Project Research Summary
+
+**Project:** Codewalk District
+**Domain:** Multi-agent orchestration / Developer tooling
+**Researched:** 2026-01-30
+**Confidence:** HIGH
+
+##
Executive Summary
+
+Codewalk District enters a space where the basic problem (running multiple Claude Code agents in parallel) is already solved by tools like Claude Squad, par, and Claude Flow. The differentiation opportunity isn't in parallel execution, which is table stakes now. The gap is in **coordination quality**: preventing conflicts before they happen, making review manageable, and keeping the developer in control without drowning in context switches.
+
+The recommended approach is a TypeScript CLI with embedded tRPC server, SQLite persistence, and hexagonal architecture. Skip Redis/BullMQ: SQLite with WAL mode handles 10-15k tasks/second locally. Use Commander.js (not oclif), Drizzle ORM with better-sqlite3, and execa for process spawning. The modular monolith structure with clear port/adapter boundaries sets up future evolution without rewrites.
+
+The biggest risks are process management failures (zombie/orphan agents), SQLite WAL corruption during backups, and race conditions with concurrent agents. These are addressed in the first phases with explicit process tree tracking, proper WAL checkpoint strategy, and atomic transaction design using `BEGIN IMMEDIATE`.
+
+## Key Findings
+
+### Recommended Stack
+
+Commander.js + tRPC + Drizzle/better-sqlite3 + execa. All dependencies have HIGH confidence based on official docs and ecosystem maturity.
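
As a rough illustration of the process tree tracking named in the executive summary, the sketch below keeps a registry of agent PIDs and signals each agent's whole process group. It assumes POSIX semantics, where a negative PID addresses a process group, which in turn requires spawning each agent as a group leader (for example with `detached: true`); Windows is not covered. `AgentRegistry` and `KillFn` are hypothetical names introduced here and are not part of the researched stack.

```typescript
// Hypothetical sketch: AgentRegistry and KillFn are illustrative names only.
// A KillFn sends `signal` to `pid`; in production it would wrap process.kill.
type KillFn = (pid: number, signal: string) => void;

class AgentRegistry {
  private readonly leaders = new Set<number>();
  private readonly kill: KillFn;

  constructor(kill: KillFn) {
    this.kill = kill;
  }

  // Record the PID of an agent that was spawned as a process-group leader.
  track(pid: number): void {
    this.leaders.add(pid);
  }

  // Forget an agent that exited on its own.
  untrack(pid: number): void {
    this.leaders.delete(pid);
  }

  // Signal every tracked process group; returns the PIDs that were signalled.
  signalAll(signal: string): number[] {
    const signalled: number[] = [];
    for (const pid of [...this.leaders]) {
      try {
        // Assumption: POSIX, where a negative PID addresses the whole group.
        this.kill(-pid, signal);
        signalled.push(pid);
      } catch {
        // The group is already gone (e.g. ESRCH), so stop tracking it.
        this.leaders.delete(pid);
      }
    }
    return signalled;
  }
}
```

In a shutdown handler this might be constructed as `new AgentRegistry((pid, signal) => { process.kill(pid, signal); })`, sending `SIGTERM` first and `SIGKILL` after a timeout to any group that is still tracked. Injecting the kill function keeps the registry testable without spawning real processes.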
+
+**Core technologies:**
+- **Commander.js 14.x**: CLI framework; lighter than oclif, TypeScript types built-in
+- **tRPC v11**: API layer; type-safe, standalone server adapter, SSE subscriptions
+- **Drizzle ORM 0.44+ / better-sqlite3 11.x**: Database; low overhead over raw better-sqlite3 when queries are prepared
+- **Zod 3.24+**: Validation; shared contracts between CLI and server
+- **execa 9.6+**: Process spawning; auto-cleanup, cross-platform, graceful shutdown
+
+### Expected Features
+
+**Must have (table stakes):**
+- Git worktree isolation per agent
+- tmux session management
+- Background/yolo execution mode
+- Start/stop/list agents CLI
+- Session persistence across restarts
+
+**Should have (differentiators):**
+- File-level coordination (which agent touches which files)
+- Cost/token tracking per agent
+- Conflict prediction before merge
+
+**Defer (v2+):**
+- AI task decomposition
+- Cross-agent communication
+- Web dashboard
+
+### Architecture Approach
+
+Hexagonal architecture with modular monolith. Ports are interfaces in the domain layer; adapters are implementations in the infrastructure layer. Event bus (Node.js EventEmitter) decouples modules. SQLite task queue with atomic claim via transactions.
+
+**Major components:**
+1. **Task Scheduler**: Job queue with priority, retry, delay; SQLite-backed
+2. **Agent Pool**: Piscina-style pool with bounded concurrency (min: 1, max: CPU cores)
+3. **Process Supervisor**: Child process lifecycle with automatic restart
+4. **MCP Transport**: STDIO-based JSON-RPC 2.0 for agent communication
+5. **Event Bus**: In-memory pub/sub for module decoupling
+
+### Critical Pitfalls
+
+1. **Zombie/orphan processes**: `child.kill()` doesn't clean up process trees. Use `terminate` package to kill all children. Track all PIDs.
+2. **SQLite WAL corruption**: Never `fs.copyFile()` an active database. Use SQLite backup API or checkpoint before copy.
+3.
**SQLITE_BUSY despite busy_timeout**: Use `BEGIN IMMEDIATE` for any transaction that will write. busy_timeout doesn't help when a deferred transaction tries to upgrade from read to write mid-transaction.
+4. **Graceful shutdown failures**: Node.js doesn't handle shutdown by default. Listen for SIGTERM, use `process.exitCode` not `process.exit()`.
+5. **Git worktree orphaning**: Always use `git worktree remove`, never `rm -rf`. Run `git worktree prune` periodically.
+
+## Implications for Roadmap
+
+Based on research, suggested phase structure:
+
+### Phase 1: Core Architecture
+**Rationale:** Foundation prevents cascading failures. Process lifecycle, signal handling, and cross-platform paths must be right from day one.
+**Delivers:** CLI binary, server mode, graceful shutdown, process tree tracking
+**Addresses:** Zombie/orphan processes, graceful shutdown failures, cross-platform paths
+**Avoids:** Process cleanup technical debt
+
+### Phase 2: Data Layer
+**Rationale:** SQLite patterns affect everything downstream. Get WAL, transactions, and checkpointing right before building features.
+**Delivers:** SQLite database, task queue, state persistence
+**Uses:** Drizzle ORM, better-sqlite3, WAL mode
+**Addresses:** SQLite corruption, SQLITE_BUSY errors, checkpoint starvation
+
+### Phase 3: Git Integration
+**Rationale:** Worktree management is core to the value proposition. File isolation is table stakes.
+**Delivers:** Worktree creation/cleanup, branch isolation
+**Addresses:** Worktree orphaning
+**Avoids:** Manual cleanup chaos
+
+### Phase 4: Agent Orchestration
+**Rationale:** Now that processes, data, and git work, add the agent pool and task dispatch.
+**Delivers:** Agent spawn/stop, task dispatch, agent pool
+**Uses:** execa, process supervisor, MCP transport
+**Addresses:** Multi-agent race conditions, state ownership
+
+### Phase 5: File System UI
+**Rationale:** Bidirectional sync after core features are stable.
+**Delivers:** File watcher, SQLite ↔ filesystem sync
+**Addresses:** File watcher race conditions
+
+### Phase 6: CLI Polish
+**Rationale:** UX improvements after features work.
+**Delivers:** Rich terminal UI, token tracking, error messages
+**Addresses:** UX pitfalls
+
+### Phase Ordering Rationale
+
+- **Core → Data → Git → Agent**: Each phase depends on the previous. Can't spawn agents without process management. Can't persist agent state without database. Can't isolate agents without worktrees.
+- **File System UI late**: It's an interface, not core logic. Build when you know what state needs syncing.
+- **CLI Polish last**: Optimize what works. Don't polish what might change.
+
+### Research Flags
+
+Phases likely needing deeper research during planning:
+- **Phase 4 (Agent Orchestration)**: MCP protocol integration needs validation with actual Claude Code behavior
+- **Phase 5 (File System UI)**: Bidirectional sync patterns may need prototyping
+
+Phases with standard patterns (skip research-phase):
+- **Phase 1 (Core Architecture)**: Well-documented Node.js patterns
+- **Phase 2 (Data Layer)**: SQLite + Drizzle has extensive docs
+- **Phase 3 (Git Integration)**: Git worktree is well-documented
+
+## Confidence Assessment
+
+| Area | Confidence | Notes |
+|------|------------|-------|
+| Stack | HIGH | Commander, tRPC, Drizzle all have official docs, large user base |
+| Features | HIGH | Competitor analysis clear; Claude Squad establishes baseline |
+| Architecture | HIGH | Hexagonal + SQLite task queue is well-documented pattern |
+| Pitfalls | HIGH | Based on post-mortems, official SQLite docs, Node.js issues |
+
+**Overall confidence:** HIGH
+
+### Gaps to Address
+
+- **MCP protocol with Claude Code**: Need to validate STDIO transport works as expected. May need to spawn agents differently.
+- **simple-git worktree support**: Library covers basics but may need CLI fallback for some operations.
+- **Token tracking**: Need to find how Claude Code exposes usage metrics (if at all).
+
+## Sources
+
+### Primary (HIGH confidence)
+- [Commander.js GitHub](https://github.com/tj/commander.js) - CLI framework
+- [tRPC Standalone Adapter](https://trpc.io/docs/server/adapters/standalone) - Server setup
+- [Drizzle ORM SQLite](https://orm.drizzle.team/docs/quick-sqlite/better-sqlite3) - Database
+- [SQLite WAL Documentation](https://sqlite.org/wal.html) - Persistence
+- [Git Worktree Documentation](https://git-scm.com/docs/git-worktree) - Worktrees
+
+### Secondary (MEDIUM confidence)
+- [Claude Squad GitHub](https://github.com/smtg-ai/claude-squad) - Competitor analysis
+- [Agent-MCP GitHub](https://github.com/rinadelph/Agent-MCP) - File-level coordination patterns
+- [Piscina GitHub](https://github.com/piscinajs/piscina) - Worker pool patterns
+
+### Tertiary (LOW confidence)
+- Token tracking approach: no verified source for Claude Code usage API
+
+---
+*Research completed: 2026-01-30*
+*Ready for roadmap: yes*