Codewalkers/.planning/research/PITFALLS.md

# Codewalkers: Critical Pitfalls Reference

- **Domain:** Multi-agent orchestration / Developer tooling
- **Researched:** 2026-01-30
- **Confidence:** HIGH

This document catalogs preventable mistakes for CLI tools that manage background processes, SQLite databases, git worktrees, and file system watchers. It focuses on what actually breaks in production, not theoretical edge cases.

---

## Critical Pitfalls

### 1. Zombie and Orphan Processes

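**What goes wrong:** Child processes (Claude Code agents) continue running after the parent exits, consuming resources indefinitely. `child.kill()` only sends a signal (SIGTERM by default) to the direct child; it does not wait for the process to exit and does not touch anything that child spawned. Orphaned processes accumulate with every request.

**Why it happens:** A signal sent with `child.kill()` reaches a single PID, not the process tree. When only the parent is killed, its children are orphaned and reparented (typically to PID 1), where they keep running. A child that has exited also stays in the process table as a zombie until its parent reaps it with `wait`/`waitpid`. Node.js reaps the children it spawned itself, but not descendants that get reparented to it, for example when it runs as PID 1 in a container.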

**How to avoid:**
- Use the `terminate` npm package to kill process trees (parent + all children)
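- In child processes you control, exit when the parent goes away, e.g. `process.on('disconnect', () => process.exit(0))` when spawned with an IPC channel (an `'exit'` listener only observes the child's own exit)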
- For background daemons: use `detached: true` with `child.unref()` intentionally
- Track all spawned PIDs and clean them up on shutdown
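
The PID-tracking advice above can be sketched roughly as follows. This is a minimal sketch, not the project's implementation: `ProcessRegistry`, `spawnTracked`, and `killAll` are hypothetical names, and it assumes POSIX process groups (via `detached: true`) instead of the `terminate` package. On Windows it falls back to signalling the direct child only.

```typescript
import { spawn, ChildProcess } from "node:child_process";

// Hypothetical helper (not from the source): remembers every spawned agent
// so shutdown code can signal all of them, including their descendants.
class ProcessRegistry {
  private readonly children = new Map<number, ChildProcess>();

  spawnTracked(command: string, args: string[]): ChildProcess {
    // detached: true puts the child in its own process group on POSIX,
    // which lets us signal the whole group later via a negative PID.
    const child = spawn(command, args, { detached: true, stdio: "ignore" });
    child.on("error", () => {
      // Spawn failed (e.g. command not found); there is nothing to track.
    });
    const pid = child.pid;
    if (pid !== undefined) {
      this.children.set(pid, child);
      child.once("exit", () => this.children.delete(pid));
    }
    return child;
  }

  get size(): number {
    return this.children.size;
  }

  // Signals every tracked process; returns how many were signalled.
  killAll(signal: NodeJS.Signals = "SIGTERM"): number {
    let signalled = 0;
    this.children.forEach((child, pid) => {
      try {
        if (process.platform === "win32") {
          child.kill(signal); // no process groups here: direct child only
        } else {
          process.kill(-pid, signal); // negative PID = whole process group
        }
        signalled++;
      } catch (err) {
        // Typically ESRCH: the process already exited.
      }
    });
    return signalled;
  }
}
```

A shutdown handler would call `killAll()` before the parent exits. Because the children are detached, nothing cleans them up if the parent is killed with SIGKILL, so a startup sweep for stale PIDs is still worth considering.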

**Warning signs:**
- Memory usage grows over time
- `ps aux | grep node` shows unexpected processes
- System becomes sluggish after extended use

**Phase to address:** Core Architecture (before spawning any agents)

**Sources:**
- [5 Tips for Cleaning Orphaned Node.js Processes](https://medium.com/@arunangshudas/5-tips-for-cleaning-orphaned-node-js-processes-196ceaa6d85e)
- [Node.js Issue #14445: Zombie Processes](https://github.com/nodejs/node/issues/14445)
- [Killing Process Families with Node](https://medium.com/@almenon214/killing-processes-with-node-772ffdd19aad)

---

### 2. SQLite WAL File Separation

**What goes wrong:** Database becomes corrupted or loses committed transactions when the `-wal` and `-shm` files are separated from the main `.db` file. This happens during backups, file moves, or container restarts.

**Why it happens:** SQLite WAL mode uses three files: `.db`, `.db-wal` (write-ahead log), and `.db-shm` (shared memory). The WAL file contains committed transactions that have not yet been checkpointed into the main database. Copying only the `.db` file loses that committed data or corrupts the database.

**How to avoid:**
- Never use `fs.copyFile()` on an active SQLite database
- Use SQLite's backup API (`sqlite3_backup_*`) or `sqlite3_rsync` (v3.47.0+)
- If you must copy: start an `IMMEDIATE` transaction first to block writers
- Keep the `.db`, `-wal`, and `-shm` files together; never move or copy the `.db` file on its own
- Run `PRAGMA wal_checkpoint(TRUNCATE)` before backups
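
As a complement to the rules above, a tool can refuse a raw file copy when a WAL file is present. The following is a minimal sketch under stated assumptions: `isSafeToCopyAlone` and `SizeOf` are hypothetical names, and the check is deliberately conservative (a non-empty `-wal` file may already be fully checkpointed). It is also only a point-in-time check, so a writer could still start right after it returns.

```typescript
import { statSync } from "node:fs";

// Returns the file size in bytes, or undefined if the file does not exist.
type SizeOf = (path: string) => number | undefined;

const fileSize: SizeOf = (path) =>
  statSync(path, { throwIfNoEntry: false })?.size;

// The files that belong to one WAL-mode database, as described above.
function sqliteFileSet(dbPath: string): string[] {
  return [dbPath, `${dbPath}-wal`, `${dbPath}-shm`];
}

// Conservative guard: a non-empty -wal file may hold committed transactions
// that are not in the main file yet, so copying the .db alone is unsafe.
// `sizeOf` is injectable so the logic can be exercised without real files.
function isSafeToCopyAlone(dbPath: string, sizeOf: SizeOf = fileSize): boolean {
  const walSize = sizeOf(`${dbPath}-wal`);
  return walSize === undefined || walSize === 0;
}
```

When the guard returns `false`, the safer options are the ones listed above: a checkpoint followed by a backup through SQLite itself, not a file copy.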

**Warning signs:**
- "database disk image is malformed" errors
- Missing recent data after restart
- `-wal` file grows unboundedly

**Phase to address:** Data Layer (first database operation)

**Sources:**
- [SQLite Corruption with fs.copyFile() in WAL Mode](https://scottspence.com/posts/sqlite-corruption-fs-copyfile-issue)
- [How To Corrupt An SQLite Database File](https://www.sqlite.org/howtocorrupt.html)
- [High Performance SQLite: How to Corrupt SQLite](https://highperformancesqlite.com/watch/corrupt)

---

### 3. SQLITE_BUSY Despite busy_timeout

**What goes wrong:** You set `busy_timeout` but still get "database is locked" errors. Transactions fail unexpectedly under concurrent load.

**Why it happens:** `busy_timeout` doesn't help when a `DEFERRED` transaction starts with a read and then tries to upgrade to a write while another connection holds the write lock. SQLite returns `SQLITE_BUSY` immediately without invoking the busy handler, because waiting cannot succeed: the transaction's read snapshot is already stale (or waiting would deadlock), so the whole transaction has to be rolled back and retried.

**How to avoid:**
- Use `BEGIN IMMEDIATE` for any transaction that will write
- Never upgrade from read-only to read-write mid-transaction
- Implement retry logic with exponential backoff at the application level
- Consider using a write queue/serialization for all writes
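
The application-level retry mentioned above might look like the sketch below. It is illustrative only: `withBusyRetry`, `backoffDelayMs`, and `isBusyError` are hypothetical names, the default delays are arbitrary, and the check on `err.code === "SQLITE_BUSY"` assumes a driver that exposes SQLite result codes that way (better-sqlite3 does, to my knowledge). The callback is expected to run one complete transaction, ideally opened with `BEGIN IMMEDIATE`.

```typescript
// Delay before retry number `attempt` (0-based): base * 2^attempt, capped.
function backoffDelayMs(attempt: number, baseMs = 25, capMs = 1000): number {
  return Math.min(capMs, baseMs * Math.pow(2, attempt));
}

// Assumption: the driver reports lock contention as an error with this code.
function isBusyError(err: unknown): boolean {
  return (
    typeof err === "object" &&
    err !== null &&
    (err as { code?: unknown }).code === "SQLITE_BUSY"
  );
}

// Re-runs the whole transaction on SQLITE_BUSY; any other error is rethrown.
async function withBusyRetry<T>(
  runTransaction: () => T | Promise<T>,
  maxAttempts = 5,
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await runTransaction();
    } catch (err) {
      if (!isBusyError(err) || attempt + 1 >= maxAttempts) throw err;
      const delay = backoffDelayMs(attempt);
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
}
```

Retrying the full transaction (not just the failed statement) matters here, since a transaction that hit `SQLITE_BUSY` on upgrade cannot make progress from its stale snapshot.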

**Warning signs:**
- Sporadic "database is locked" errors under load
- Errors only appear with concurrent operations
- busy_timeout appears ineffective

**Phase to address:** Data Layer (transaction design)

**Sources:**
- [SQLITE_BUSY Errors Despite Setting Timeout](https://berthub.eu/articles/posts/a-brief-post-on-sqlite3-database-locked-despite-timeout/)
- [Improving SQLite Concurrency](https://fractaledmind.com/2023/12/11/sqlite-on-rails-improving-concurrency/)
- [SQLite Forum: SQLITE_BUSY and Deadlock](https://sqlite.org/forum/forumpost/4350638e78869137)

---

### 4. Checkpoint Starvation in SQLite

**What goes wrong:** The WAL file grows indefinitely, disk space runs out, and query performance degrades progressively.

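**Why it happens:** A checkpoint can only copy WAL frames up to the point the oldest active reader is still using, and the WAL file can only be reset once no reader needs it. If you have constant read traffic (e.g., polling for agent status) with overlapping read transactions, there is never a gap in which the WAL can be recycled. Long-running read transactions have the same effect.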

**How to avoid:**
- Run a checkpoint periodically, e.g. `db.pragma('wal_checkpoint(RESTART)')` in better-sqlite3
- Use `PRAGMA wal_autocheckpoint=N` with appropriate N
- Avoid long-running read transactions
- Monitor WAL file size and alert when it exceeds threshold
- Consider read connection pooling with short-lived connections
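
The monitoring advice above can be sketched as a small size check that triggers a checkpoint. This is a sketch under assumptions: `checkpointIfLarge` and `startWalMonitor` are hypothetical names, and the `checkpoint` callback is injected so the example has no database dependency (with better-sqlite3 it might wrap a `wal_checkpoint` pragma call).

```typescript
import { statSync } from "node:fs";

// Hypothetical callback that forces a checkpoint on the real connection.
type Checkpoint = () => void;

// Size of the -wal file in bytes; 0 if it does not exist.
function walSizeBytes(dbPath: string): number {
  const stats = statSync(`${dbPath}-wal`, { throwIfNoEntry: false });
  return stats === undefined ? 0 : stats.size;
}

// Forces a checkpoint when the WAL exceeds the threshold.
// Returns true if a checkpoint was requested.
function checkpointIfLarge(
  dbPath: string,
  thresholdBytes: number,
  checkpoint: Checkpoint,
  sizeOf: (dbPath: string) => number = walSizeBytes,
): boolean {
  if (sizeOf(dbPath) <= thresholdBytes) return false;
  checkpoint();
  return true;
}

// Polls the WAL size on an interval without keeping the process alive.
function startWalMonitor(
  dbPath: string,
  thresholdBytes: number,
  checkpoint: Checkpoint,
  intervalMs = 5000,
): ReturnType<typeof setInterval> {
  const timer = setInterval(
    () => checkpointIfLarge(dbPath, thresholdBytes, checkpoint),
    intervalMs,
  );
  timer.unref();
  return timer;
}
```

A forced checkpoint still cannot get past a long-running reader, so this only helps in combination with short read transactions.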

**Warning signs:**
- `-wal` file growing continuously
- Slowing read performance over time
- Disk space warnings

**Phase to address:** Data Layer (connection management)

**Sources:**
- [Write-Ahead Logging](https://sqlite.org/wal.html)
- [Improving Concurrency with better-sqlite3](https://wchargin.com/better-sqlite3/performance.html)

---

### 5. Graceful Shutdown Failures

**What goes wrong:** Agents are killed mid-operation, database writes are lost, file handles leak, and users lose work. Docker/Kubernetes sends SIGKILL after timeout.

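**Why it happens:** Node.js doesn't handle shutdown gracefully by default. `server.close()` stops accepting new connections but waits for open ones, and on older Node.js versions idle keep-alive connections keep it waiting. `process.exit()` abandons pending async operations. `docker stop` sends SIGKILL if the container has not exited within 10 seconds (the default) of SIGTERM; Kubernetes does the same after its termination grace period (30 seconds by default).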

**How to avoid:**
- Listen for SIGTERM and SIGINT: `process.on('SIGTERM', shutdown)`
- Set `process.exitCode` instead of calling `process.exit()` directly
- Track in-flight operations and wait for completion
- Implement timeout: graceful shutdown attempt, then force after N seconds
- For child processes: propagate signals to all children before exiting
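
One way to combine in-flight tracking with a shutdown deadline is sketched below. `ShutdownManager`, `track`, `drain`, and `install` are hypothetical names and the 5 second default is an arbitrary assumption; it should stay below the supervisor's SIGKILL timeout. Signal propagation to child processes is left out and would happen before draining.

```typescript
type ShutdownResult = "clean" | "timed-out";

// Hypothetical helper (not from the source): tracks in-flight work and waits
// for it on shutdown, giving up after a deadline.
class ShutdownManager {
  private readonly inFlight = new Set<Promise<unknown>>();

  // Wrap every operation that must not be cut off mid-way.
  track<T>(op: Promise<T>): Promise<T> {
    this.inFlight.add(op);
    const forget = (): void => {
      this.inFlight.delete(op);
    };
    op.then(forget, forget);
    return op;
  }

  get pending(): number {
    return this.inFlight.size;
  }

  // Resolves "clean" once all tracked work has settled,
  // or "timed-out" after timeoutMs, whichever comes first.
  async drain(timeoutMs: number): Promise<ShutdownResult> {
    let timer: ReturnType<typeof setTimeout> | undefined;
    const deadline = new Promise<ShutdownResult>((resolve) => {
      timer = setTimeout(() => resolve("timed-out"), timeoutMs);
    });
    const settled = Promise.all(
      Array.from(this.inFlight, (op) => op.catch(() => undefined)),
    ).then((): ShutdownResult => "clean");
    const result = await Promise.race([settled, deadline]);
    clearTimeout(timer);
    return result;
  }

  // Registers one-shot SIGTERM/SIGINT handlers.
  install(timeoutMs = 5000): void {
    const onSignal = (): void => {
      void this.drain(timeoutMs).then((result) => {
        if (result === "clean") {
          process.exitCode = 0; // let the event loop drain naturally
        } else {
          process.exit(1); // forced: work did not finish before the deadline
        }
      });
    };
    process.once("SIGTERM", onSignal);
    process.once("SIGINT", onSignal);
  }
}
```

In the clean case the process only exits once servers, timers, and watchers have been closed, so the signal handler in a real tool would also close those resources.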

**Warning signs:**
- Data loss after restarts
- "Port already in use" after crash
- Orphaned temporary files
- Corrupted state files

**Phase to address:** Core Architecture (process lifecycle)

**Sources:**
- [Graceful Shutdown with Node.js and Kubernetes](https://blog.risingstack.com/graceful-shutdown-node-js-kubernetes/)
- [Implementing NodeJS HTTP Graceful Shutdown](https://blog.dashlane.com/implementing-nodejs-http-graceful-shutdown/)
- [Graceful Shutdown in NodeJS](https://hackernoon.com/graceful-shutdown-in-nodejs-2f8f59d1c357)

---

### 6. Git Worktree Orphaning

**What goes wrong:** Worktrees become "orphaned" - git thinks they exist but they don't, or they exist but git doesn't know about them. Operations fail silently or with cryptic errors.

**Why it happens:** Worktrees are moved or deleted without using `git worktree move/remove`. The parent repository is moved, breaking the worktree links. A worktree on unmounted network or portable storage gets auto-pruned.

**How to avoid:**
- Always use `git worktree remove` to delete worktrees
- Run `git worktree prune` periodically to clean stale entries
- Use `git worktree lock` for worktrees on removable storage
- After moving the main repo, run `git worktree repair`
- Never `rm -rf` a worktree directory directly
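
To detect orphaned entries programmatically, the output of `git worktree list --porcelain` can be parsed and compared with the file system. The sketch below is illustrative: `parseWorktreeList` and `findStaleWorktrees` are hypothetical names, and the parser only reads the `worktree`, `locked`, and `prunable` lines of the porcelain format.

```typescript
import { existsSync } from "node:fs";

interface WorktreeEntry {
  path: string;
  locked: boolean;
  prunable: boolean;
}

// Parses `git worktree list --porcelain` output. Each worktree starts with a
// "worktree <path>" line, followed by attribute lines such as "locked".
function parseWorktreeList(porcelain: string): WorktreeEntry[] {
  const entries: WorktreeEntry[] = [];
  let current: WorktreeEntry | undefined;
  for (const line of porcelain.split(/\r?\n/)) {
    const spaceAt = line.indexOf(" ");
    const key = spaceAt === -1 ? line : line.slice(0, spaceAt);
    const value = spaceAt === -1 ? "" : line.slice(spaceAt + 1);
    if (key === "worktree") {
      current = { path: value, locked: false, prunable: false };
      entries.push(current);
    } else if (current !== undefined && key === "locked") {
      current.locked = true;
    } else if (current !== undefined && key === "prunable") {
      current.prunable = true;
    }
  }
  return entries;
}

// Entries whose directory is gone. Locked worktrees are skipped because they
// may live on storage that is just not mounted right now.
function findStaleWorktrees(
  entries: WorktreeEntry[],
  pathExists: (path: string) => boolean = existsSync,
): WorktreeEntry[] {
  return entries.filter((entry) => !entry.locked && !pathExists(entry.path));
}
```

The result can be logged before running `git worktree prune`, so cleanup is a visible step and not a silent side effect.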

**Warning signs:**
- "fatal: not a git repository" in worktree
- `git worktree list` shows paths that don't exist
- Operations succeed in main repo but fail in worktree

**Phase to address:** Git Integration (worktree lifecycle)

**Sources:**
- [Git Worktree Documentation](https://git-scm.com/docs/git-worktree)
- [Worktree Cleanup Fails Issue](https://github.com/obra/superpowers/issues/238)

---

### 7. File Watcher Race Conditions

**What goes wrong:** Files created between directory scan and watcher setup are never detected. Events fire before files are fully written. Multiple watcher instances conflict.

**Why it happens:** Chokidar scans directory contents, then sets up fs.watch - any file created between these operations is missed. Large files emit 'add' before write completes. Multiple chokidar instances (e.g., your app + Vite) interfere with each other.

**How to avoid:**
- Use `awaitWriteFinish: true` for large files (polls until size stabilizes)
- Call `.close()` with `await` - it's async and returns a Promise
- Avoid multiple watcher instances on same directory
- Increase `fs.inotify.max_user_watches` on Linux
- Add delay/debounce for rapid file changes
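
The scan-then-watch gap described above can be closed by rescanning the directory once the watcher is running and deduplicating against events already received. This is a minimal sketch, not chokidar's own mechanism: `AddedFileTracker` is a hypothetical name, and it assumes the watcher's add handler and the rescan both feed the same tracker.

```typescript
import { readdirSync } from "node:fs";
import { join } from "node:path";

// Deduplicates "file added" notifications coming from two sources:
// watcher events and a directory rescan done after the watcher is ready.
class AddedFileTracker {
  private readonly seen = new Set<string>();
  private readonly onAdd: (path: string) => void;

  constructor(onAdd: (path: string) => void) {
    this.onAdd = onAdd;
  }

  // Call from the watcher's add handler; fires onAdd once per path.
  notify(path: string): boolean {
    if (this.seen.has(path)) return false;
    this.seen.add(path);
    this.onAdd(path);
    return true;
  }

  // Call from the watcher's unlink handler so a re-created file is reported again.
  forget(path: string): void {
    this.seen.delete(path);
  }

  // Run once after the watcher is set up to catch files created in the gap.
  // Returns the number of files the watcher had missed.
  rescan(
    dir: string,
    list: (dir: string) => string[] = (d) => readdirSync(d),
  ): number {
    let found = 0;
    for (const name of list(dir)) {
      if (this.notify(join(dir, name))) found++;
    }
    return found;
  }
}
```

The rescan is not recursive here; a real tool watching nested directories would walk the tree or restrict watching to a flat directory.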

**Warning signs:**
- Newly created files sometimes not detected
- Events fire twice for same file
- "ENOSPC: System limit for file watchers reached"
- HMR/hot reload stops working

**Phase to address:** File System UI (watcher setup)

**Sources:**
- [Chokidar GitHub](https://github.com/paulmillr/chokidar)
- [Race Condition When Watching Dirs](https://github.com/paulmillr/chokidar/issues/1112)
- [Vite Issue #12495](https://github.com/vitejs/vite/issues/12495)

---

### 8. process.exit() Kills Async Operations

**What goes wrong:** Log messages never reach their destination. Database writes are lost. Cleanup callbacks don't complete.

**Why it happens:** `process.exit()` forces immediate termination even with pending I/O. Writes to stdout are sometimes async in Node.js. Exit event listeners can only run synchronous code.

**How to avoid:**
- Set `process.exitCode = N` instead of calling `process.exit(N)`
- Let the event loop drain naturally
- Use async-cleanup or node-cleanup packages for async cleanup
- Keep cleanup fast (< 2 seconds) so it finishes before a supervisor (Docker, Kubernetes, systemd) escalates to SIGKILL

**Warning signs:**
- Log messages truncated or missing
- Final state not persisted
- "Cleanup completed" message never appears

**Phase to address:** Core Architecture (exit handling)

**Sources:**
- [Node.js process.exit() Documentation](https://nodejs.org/api/process.html)
- [Exiting Node.js Asynchronously](https://blog.soulserv.net/exiting-node-js-asynchronously/)
- [async-cleanup Package](https://github.com/trevorr/async-cleanup)

---

### 9. Cross-Platform Path Handling

**What goes wrong:** Paths work on macOS but break on Windows. Case-sensitivity issues cause deployment failures. Scripts fail with "command not found."

**Why it happens:** Windows uses backslashes as path separators, Unix uses forward slashes. Windows and default macOS filesystems are case-insensitive; Linux is case-sensitive. `spawn` without `shell: true` can't run `.cmd`/`.bat` files on Windows. Environment variable syntax differs (`SET` vs `export`).

**How to avoid:**
- Always use `path.join()` and `path.normalize()` - never string concatenation
- Use `path.sep` for platform-specific separators
- Match the exact on-disk case in `require()`/`import` paths, and test on a case-sensitive filesystem
- Use `cross-env` for environment variables in npm scripts
- Use `shell: true` or explicitly handle `.cmd` files on Windows
- Use `rimraf` instead of `rm -rf`
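
A short sketch of the path and shell advice above. `agentStatePath` and `needsShell` are hypothetical helpers and the directory layout is made up for illustration; `posix` and `win32` are Node's platform-specific path implementations, used here so both behaviours can be seen on one machine.

```typescript
import { join, posix, win32 } from "node:path";

interface PathJoiner {
  join(...parts: string[]): string;
}

// Builds a path with the platform's own separator. `impl` defaults to the
// current platform and is injectable for demonstration.
function agentStatePath(
  root: string,
  agentId: string,
  impl: PathJoiner = { join },
): string {
  return impl.join(root, "agents", agentId, "state.db");
}

// On Windows, .cmd/.bat files are not executables and need a shell to run.
// Only enable the shell for trusted commands, never for user input.
function needsShell(
  command: string,
  platform: NodeJS.Platform = process.platform,
): boolean {
  return platform === "win32" && /\.(cmd|bat)$/i.test(command);
}

// agentStatePath("/var/lib/cw", "a1", posix) -> "/var/lib/cw/agents/a1/state.db"
// agentStatePath("C:\\cw", "a1", win32)      -> "C:\\cw\\agents\\a1\\state.db"
```

The result of `needsShell` would be passed as the `shell` option to `spawn`, so the shell is only used where the platform requires it.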

**Warning signs:**
- "File not found" only on certain platforms
- Import works locally but fails in CI
- npm scripts fail on Windows

**Phase to address:** Core Architecture (all file operations)

**Sources:**
- [Writing Cross-Platform Node.js](https://shapeshed.com/writing-cross-platform-node/)
- [Cross-Platform Node Guide](https://github.com/ehmicky/cross-platform-node-guide)
- [Awesome Cross-Platform Node.js](https://github.com/bcoe/awesome-cross-platform-nodejs)

---

### 10. Multi-Agent State Race Conditions

**What goes wrong:** Two agents update the same resource simultaneously. Plan ownership is ambiguous. State gets duplicated or lost between agents. Deadlocks occur when agents wait on each other.

**Why it happens:** Potential pairwise interactions grow quadratically with agent count: N agents have N(N-1)/2 possible pairs, so 10 agents already have 45. Tool contention arises when one agent reads while another writes. There is no clear source of truth for shared state.

**How to avoid:**
- Define clear resource ownership - one agent owns each resource
- Use optimistic locking with version numbers
- Implement file/resource locking for shared assets
- Design for eventual consistency where possible
- Use event-driven patterns with message queues
- Add comprehensive workflow checkpointing for rollback
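
Optimistic locking with version numbers can be illustrated with an in-memory store. This is a sketch only: `VersionedStore` is a hypothetical name, and a real implementation would more likely live in the database, for example as an `UPDATE ... WHERE version = ?` that checks how many rows changed.

```typescript
interface Versioned<T> {
  value: T;
  version: number;
}

// Hypothetical in-memory store: every successful write bumps the version,
// and a write is rejected if the caller's version is out of date.
class VersionedStore<T> {
  private readonly records = new Map<string, Versioned<T>>();

  // Returns a copy so callers cannot mutate the stored record.
  read(key: string): Versioned<T> | undefined {
    const record = this.records.get(key);
    return record === undefined ? undefined : { ...record };
  }

  // Succeeds only if nobody wrote since `expectedVersion` was read.
  // Pass expectedVersion 0 to create a record that must not exist yet.
  write(key: string, value: T, expectedVersion: number): boolean {
    const currentVersion = this.records.get(key)?.version ?? 0;
    if (currentVersion !== expectedVersion) return false;
    this.records.set(key, { value, version: currentVersion + 1 });
    return true;
  }
}

// Two agents read the same plan, then both try to claim it.
const plans = new VersionedStore<string>();
plans.write("plan-1", "unassigned", 0);
const seenByA = plans.read("plan-1");
const seenByB = plans.read("plan-1");
const aClaimed = plans.write("plan-1", "owned by A", seenByA?.version ?? -1);
const bClaimed = plans.write("plan-1", "owned by B", seenByB?.version ?? -1);
// aClaimed is true; bClaimed is false because B's version is stale.
```

The losing agent has to re-read the record and decide again, which makes the conflict explicit where a blind write would have silently overwritten the other agent's claim.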

**Warning signs:**
- Inconsistent state after concurrent operations
- Operations that "should have worked" show wrong results
- Agents appear stuck waiting

**Phase to address:** Agent Orchestration (coordination protocol)

**Sources:**
- [AI Agent Orchestration for Production Systems](https://redis.io/blog/ai-agent-orchestration/)
- [Multi-Agent System Reliability](https://www.getmaxim.ai/articles/multi-agent-system-reliability-failure-patterns-root-causes-and-production-validation-strategies/)
- [Multi-Agent Coordination Strategies](https://galileo.ai/blog/multi-agent-coordination-strategies)

---

## Technical Debt Patterns

| Pattern | Symptom | Fix Cost (Early) | Fix Cost (Late) |
|---------|---------|------------------|-----------------|
| No process tree tracking | Orphan accumulation | 2h | 2d (debugging + cleanup) |
| String path concatenation | Cross-platform breaks | 1h | 8h (find all instances) |
| Missing graceful shutdown | Data loss, port conflicts | 4h | 2d (state recovery) |
| Synchronous DB operations | UI freezes, timeouts | 4h | 1w (async refactor) |
| No WAL checkpoint strategy | Disk full, slow queries | 2h | 1d (emergency cleanup) |
| Implicit transactions | SQLITE_BUSY under load | 2h | 3d (transaction audit) |
| Direct worktree deletion | Orphan worktrees | 1h | 4h (manual repair) |
| Single watcher instance assumed | Silent event drops | 2h | 1d (debugging) |

---

## Integration Gotchas

| Integration | Gotcha | Mitigation |
|-------------|--------|------------|
| SQLite + Network FS | WAL doesn't work over NFS/SMB - silent corruption | Only use local filesystems |
| SQLite + Docker volumes | Shared memory files can't be shared across containers | One container per database |
| Git worktree + Docker | Worktree paths break when container paths differ | Use consistent mount paths |
| chokidar + Vite | Multiple watchers conflict, HMR stops | Single watcher, or namespace paths |
| Node.js + Docker | PID 1 doesn't forward signals | Use `--init` or tini |
| npm scripts + signals | npm doesn't forward SIGTERM to children properly | Run node directly, not via npm |
| spawn + Windows | `.cmd` files need `shell: true` | Check platform, use shell on Windows |
| better-sqlite3 + threads | Synchronous API blocks the event loop while waiting on busy_timeout | Use worker threads for writes |

---

## Performance Traps

| Trap | Impact | Solution |
|------|--------|----------|
| Polling SQLite for agent status | CPU burn, battery drain | Use file watchers or IPC |
| Not batching SQLite writes | Every statement commits on its own; bulk inserts can be orders of magnitude slower | Wrap in transaction |
| Watching entire repo | inotify limit exhaustion | Watch specific directories |
| Synchronous file operations | Event loop blocked | Use fs.promises, worker threads |
| Spawning shell for every command | 10-100ms overhead per spawn | Use `shell: false` where possible |
| No connection reuse | Connection overhead per query | Reuse database connection |
| Logging to file synchronously | I/O blocks main thread | Async logging, buffer writes |

---

## Security Mistakes

| Mistake | Risk | Mitigation |
|---------|------|------------|
| Storing credentials in SQLite | Database theft exposes secrets | Use OS keychain, env vars |
| Executing user input in shell | Command injection | Never `shell: true` with user input |
| World-readable socket files | Unauthorized access | Set restrictive permissions |
| No validation on IPC messages | Malicious command execution | Schema validation, allowlists |
| Logging sensitive data | Credential exposure in logs | Redact tokens, keys, passwords |
| Worktree in /tmp | Other users can access | Use user-specific directories |
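
The command injection row above comes down to passing arguments as an array without a shell. A minimal sketch, assuming a hypothetical wrapper named `runWithoutShell` around Node's `execFileSync` (which does not use a shell unless asked to):

```typescript
import { execFileSync } from "node:child_process";

// Runs a command without a shell: each argument reaches the program verbatim,
// so shell metacharacters in user input are never interpreted.
function runWithoutShell(command: string, args: string[]): string {
  return execFileSync(command, args, { encoding: "utf8" });
}

// Hostile input stays a single, inert argument. Here Node itself is used as
// the target program and simply echoes its first argument back.
const hostileInput = "$(echo pwned); echo injected";
const echoed = runWithoutShell(process.execPath, [
  "-p",
  "process.argv[1]",
  hostileInput,
]);
// `echoed` is the hostile string followed by a newline; nothing was executed.
```

The same input interpolated into a command string and run with `shell: true` would be executed by the shell, which is the failure mode the table warns about.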

---

## UX Pitfalls

| Pitfall | User Experience | Fix |
|---------|-----------------|-----|
| Silent failures | User thinks it worked | Always report errors clearly |
| No progress indication | User thinks it's frozen | Show spinners, progress bars |
| Cryptic error messages | User can't fix the problem | Include actionable guidance |
| Requiring too many flags | High learning curve | Interactive mode for discovery |
| Inconsistent command names | User guesses wrong | Follow conventions (git-style) |
| No examples in help | User doesn't know how to start | Show common use cases first |
| Breaking changes in config | User's setup stops working | Version configs, migrate automatically |
| Long startup time | User abandons tool | Lazy loading, fast path for common cases |

---

## "Looks Done But Isn't" Checklist

- [ ] **Process cleanup**: Verified all child processes terminate on parent exit
- [ ] **Signal handling**: SIGTERM, SIGINT handled; cleanup completes before SIGKILL timeout
- [ ] **Database integrity**: WAL files included in backups; checkpoint strategy defined
- [ ] **Concurrent writes**: Tested with multiple agents writing simultaneously
- [ ] **Worktree cleanup**: Verified `git worktree remove` used, not `rm -rf`
- [ ] **File watcher coverage**: Tested files created during watcher startup
- [ ] **Cross-platform**: Tested on Windows (paths, signals, shell commands)
- [ ] **Error messages**: All errors have actionable guidance
- [ ] **Graceful degradation**: System handles missing files, network failures
- [ ] **Resource limits**: Tested with max expected agents/worktrees/watchers
- [ ] **Restart recovery**: System recovers correct state after crash/kill
- [ ] **Long-running stability**: No memory leaks, file handle leaks over hours

---

## Recovery Strategies

| Failure | Detection | Recovery |
|---------|-----------|----------|
| Orphan processes | `ps` shows unexpected node processes | Kill process tree, track PIDs |
| Corrupted SQLite | "malformed" error on query | Restore from backup, rebuild |
| Orphan worktrees | `git worktree list` shows missing paths | `git worktree prune` |
| Stuck file watcher | Events stop firing | Restart watcher, check limits |
| Checkpoint starvation | WAL file > 100MB | Force checkpoint, close long readers |
| Zombie agents | Agent shows running but unresponsive | Track heartbeats, timeout + kill |
| State inconsistency | Agents report different state | Single source of truth, reconcile |
| Port in use | "EADDRINUSE" on restart | Track PID file, kill stale process |
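
The heartbeat-based detection of zombie agents in the table above could be sketched like this. `HeartbeatMonitor` is a hypothetical name, the timeout is an assumption to tune per deployment, and the clock is injectable so the logic can be checked without waiting.

```typescript
// Hypothetical helper: agents call beat() regularly; the orchestrator polls
// unresponsive() and kills or restarts whatever it returns.
class HeartbeatMonitor {
  private readonly lastSeen = new Map<string, number>();
  private readonly timeoutMs: number;
  private readonly now: () => number;

  constructor(timeoutMs: number, now: () => number = Date.now) {
    this.timeoutMs = timeoutMs;
    this.now = now;
  }

  // Records that the agent is alive right now.
  beat(agentId: string): void {
    this.lastSeen.set(agentId, this.now());
  }

  // Stops tracking an agent that exited normally.
  forget(agentId: string): void {
    this.lastSeen.delete(agentId);
  }

  // Agents whose last heartbeat is older than the timeout.
  unresponsive(): string[] {
    const cutoff = this.now() - this.timeoutMs;
    const stale: string[] = [];
    this.lastSeen.forEach((seenAt, agentId) => {
      if (seenAt < cutoff) stale.push(agentId);
    });
    return stale;
  }
}
```

An agent that shows as running in the database but appears in `unresponsive()` is a candidate for the timeout-and-kill recovery listed in the table.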

---

## Pitfall-to-Phase Mapping

| Phase | Must Address |
|-------|--------------|
| **Core Architecture** | Process lifecycle, signal handling, exit handling, cross-platform paths |
| **Data Layer** | SQLite WAL handling, transactions, checkpoint strategy, backups |
| **Git Integration** | Worktree lifecycle, repair/prune, lock handling |
| **Agent Orchestration** | Process tree tracking, state ownership, race condition prevention |
| **File System UI** | Watcher setup, race conditions, resource limits |
| **CLI Interface** | Error messages, interactive mode, help examples |

---

## Key Sources

### Process Management
- [5 Tips for Cleaning Orphaned Node.js Processes](https://medium.com/@arunangshudas/5-tips-for-cleaning-orphaned-node-js-processes-196ceaa6d85e)
- [Killing Process Families with Node](https://medium.com/@almenon214/killing-processes-with-node-772ffdd19aad)
- [Die, Child Process, Die!](https://www.exratione.com/2013/05/die-child-process-die/)
- [Node.js Child Process Documentation](https://nodejs.org/api/child_process.html)

### SQLite
- [How To Corrupt An SQLite Database File](https://www.sqlite.org/howtocorrupt.html)
- [SQLite Write-Ahead Logging](https://sqlite.org/wal.html)
- [SQLITE_BUSY Errors Despite Timeout](https://berthub.eu/articles/posts/a-brief-post-on-sqlite3-database-locked-despite-timeout/)
- [SQLite Myths: Concurrent Writes](https://fractaledmind.com/2023/10/13/sqlite-myths-concurrent-writes-can-corrupt-the-database/)
- [Improving SQLite Concurrency](https://fractaledmind.com/2023/12/11/sqlite-on-rails-improving-concurrency/)

### Git Worktrees
- [Git Worktree Documentation](https://git-scm.com/docs/git-worktree)
- [Treehouse Worktree MCP Server](https://glama.ai/mcp/servers/@mark-hingston/treehouse-worktree)

### Graceful Shutdown
- [Graceful Shutdown with Node.js and Kubernetes](https://blog.risingstack.com/graceful-shutdown-node-js-kubernetes/)
- [Implementing NodeJS HTTP Graceful Shutdown](https://blog.dashlane.com/implementing-nodejs-http-graceful-shutdown/)
- [Graceful Shutdown in NodeJS](https://hackernoon.com/graceful-shutdown-in-nodejs-2f8f59d1c357)

### File Watching
- [Chokidar Repository](https://github.com/paulmillr/chokidar)
- [Chokidar Race Condition Issue](https://github.com/paulmillr/chokidar/issues/1112)

### Cross-Platform
- [Writing Cross-Platform Node.js](https://shapeshed.com/writing-cross-platform-node/)
- [Cross-Platform Node Guide](https://github.com/ehmicky/cross-platform-node-guide)

### Multi-Agent Systems
- [AI Agent Orchestration for Production](https://redis.io/blog/ai-agent-orchestration/)
- [Multi-Agent System Reliability](https://www.getmaxim.ai/articles/multi-agent-system-reliability-failure-patterns-root-causes-and-production-validation-strategies/)
- [Multi-Agent Coordination Strategies](https://galileo.ai/blog/multi-agent-coordination-strategies)

### CLI UX
- [UX Patterns for CLI Tools](https://lucasfcosta.com/2022/06/01/ux-patterns-cli-tools.html)