Codewalkers: Critical Pitfalls Reference
- Domain: Multi-agent orchestration / Developer tooling
- Researched: 2026-01-30
- Confidence: HIGH
This document catalogs preventable mistakes for CLI tools that manage background processes, SQLite databases, git worktrees, and file system watchers. Focused on what actually breaks in production, not theoretical edge cases.
Critical Pitfalls
1. Zombie and Orphan Processes
What goes wrong: Child processes (Claude Code agents) keep running after the parent exits and consume resources indefinitely. child.kill() only sends a signal to the direct child: it does not wait for that child to exit, and it does not touch processes the child spawned itself. Orphaned processes accumulate with every request.
Why it happens: When only the parent is killed, its children are orphaned and reparented to PID 1. A process that has exited stays in the kernel's process table as a zombie until its parent reaps it with wait/waitpid. Node.js reaps the children it spawned itself, but not orphans it inherits, which matters when Node runs as PID 1 in a container.
How to avoid:
- Use the `terminate` npm package to kill process trees (parent + all children)
- In child processes started with an IPC channel, exit when the parent goes away: `process.on('disconnect', () => process.exit(0))`
- For background daemons: use `detached: true` with `child.unref()` intentionally
- Track all spawned PIDs and clean them up on shutdown
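The PID-tracking advice can be sketched with Node's built-in child_process module alone. This is a minimal sketch, not a definitive implementation: `spawnTracked` and `killAll` are hypothetical helper names. It assumes POSIX process groups for tree-wide signalling and falls back to signalling only the direct child on Windows.

```javascript
const { spawn } = require("node:child_process");

// Registry of live children started through spawnTracked (hypothetical helper).
const tracked = new Set();

function spawnTracked(command, args = [], options = {}) {
  // detached: true makes the child the leader of its own process group on POSIX,
  // so the whole group can later be signalled with a negative PID.
  const child = spawn(command, args, { stdio: "ignore", ...options, detached: true });
  tracked.add(child);
  child.once("exit", () => tracked.delete(child));
  child.once("error", () => tracked.delete(child));
  return child;
}

// Signals every tracked child (and, on POSIX, its process group).
// Returns the number of children that were signalled.
function killAll(signal = "SIGTERM") {
  let signalled = 0;
  for (const child of tracked) {
    if (child.pid === undefined) continue; // spawn failed, nothing to signal
    try {
      if (process.platform === "win32") {
        child.kill(signal); // no process groups here: direct child only
      } else {
        process.kill(-child.pid, signal); // negative PID targets the group
      }
      signalled += 1;
    } catch (err) {
      if (err.code !== "ESRCH") throw err; // ESRCH: process already gone
    }
  }
  return signalled;
}

// process.kill is synchronous, so it is safe to call from an 'exit' listener.
process.once("exit", () => killAll());
```

The 'exit' listener does not run when the parent itself receives SIGKILL, so a supervisor or PID file is still needed for that case.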
Warning signs:
- Memory usage grows over time
- `ps aux | grep node` shows unexpected processes
- System becomes sluggish after extended use
Phase to address: Core Architecture (before spawning any agents)
Sources:
- 5 Tips for Cleaning Orphaned Node.js Processes
- Node.js Issue #14445: Zombie Processes
- Killing Process Families with Node
2. SQLite WAL File Separation
What goes wrong: Database becomes corrupted or loses committed transactions when the -wal and -shm files are separated from the main .db file. This happens during backups, file moves, or container restarts.
Why it happens: SQLite WAL mode uses three files: .db, .db-wal (write-ahead log), and .db-shm (shared memory). The WAL file contains committed transactions that have not yet been checkpointed into the main database. Copying only the .db file therefore loses those transactions and can leave the copy corrupted.
How to avoid:
- Never use `fs.copyFile()` on an active SQLite database
- Use SQLite's backup API (`sqlite3_backup_*`) or `sqlite3_rsync` (v3.47.0+)
- If you must copy: start an `IMMEDIATE` transaction first to block writers
- Always copy all three files atomically
- Run `PRAGMA wal_checkpoint(TRUNCATE)` before backups
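For the "if you must copy" case, the sketch below keeps the database file and its WAL companions together. `sqliteFileSet` and `copySqliteFileSet` are hypothetical helpers, not part of any SQLite binding. The copy is only as safe as the caller makes it: the sketch assumes the database is closed, or that writers are blocked for the duration of the copy.

```javascript
const fs = require("node:fs");
const path = require("node:path");

// Returns the main database file plus whichever WAL companions currently exist.
function sqliteFileSet(dbPath) {
  return [dbPath, `${dbPath}-wal`, `${dbPath}-shm`].filter((file) => fs.existsSync(file));
}

// Copies every file of the set into destDir and returns the destination paths.
// Assumption: the caller has closed the database or is blocking writers;
// this function does no locking of its own.
function copySqliteFileSet(dbPath, destDir) {
  fs.mkdirSync(destDir, { recursive: true });
  return sqliteFileSet(dbPath).map((source) => {
    const destination = path.join(destDir, path.basename(source));
    fs.copyFileSync(source, destination);
    return destination;
  });
}
```

Where the binding exposes SQLite's backup API, that remains the better choice. This helper only removes the most common mistake, which is copying the .db file on its own.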
Warning signs:
- "database disk image is malformed" errors
- Missing recent data after restart
- `-wal` file grows unboundedly
Phase to address: Data Layer (first database operation)
Sources:
- SQLite Corruption with fs.copyFile() in WAL Mode
- How To Corrupt An SQLite Database File
- High Performance SQLite: How to Corrupt SQLite
3. SQLITE_BUSY Despite busy_timeout
What goes wrong: You set busy_timeout but still get "database is locked" errors. Transactions fail unexpectedly under concurrent load.
Why it happens: busy_timeout doesn't help when a DEFERRED transaction tries to upgrade from read to write mid-transaction while another connection holds the lock. SQLite returns SQLITE_BUSY immediately without invoking the busy handler because retrying wouldn't help (you're mid-transaction).
How to avoid:
- Use `BEGIN IMMEDIATE` for any transaction that will write
- Never upgrade from read-only to read-write mid-transaction
- Implement retry logic with exponential backoff at the application level
- Consider using a write queue/serialization for all writes
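The first and third points combine naturally: take the write lock up front, and retry only that step with exponential backoff. The sketch below is written against an assumed minimal interface, a `db` object with a synchronous `exec(sql)` method that throws errors carrying `code === "SQLITE_BUSY"`. `writeWithRetry` and its options are hypothetical names.

```javascript
const { setTimeout: sleep } = require("node:timers/promises");

// Runs work(db) inside BEGIN IMMEDIATE ... COMMIT.
// Only acquiring the write lock is retried; errors raised by the work itself
// roll the transaction back and propagate to the caller.
async function writeWithRetry(db, work, { retries = 5, baseDelayMs = 20 } = {}) {
  for (let attempt = 0; ; attempt += 1) {
    try {
      db.exec("BEGIN IMMEDIATE");
    } catch (err) {
      if (err.code !== "SQLITE_BUSY" || attempt >= retries) throw err;
      await sleep(baseDelayMs * 2 ** attempt); // 20ms, 40ms, 80ms, ...
      continue;
    }
    try {
      const result = work(db);
      db.exec("COMMIT");
      return result;
    } catch (err) {
      try {
        db.exec("ROLLBACK");
      } catch {
        // The transaction may already be gone; keep the original error.
      }
      throw err;
    }
  }
}
```

Because the lock is taken before any statement runs, the work never has to upgrade from read to write mid-transaction, which is the case busy_timeout cannot cover.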
Warning signs:
- Sporadic "database is locked" errors under load
- Errors only appear with concurrent operations
- busy_timeout appears ineffective
Phase to address: Data Layer (transaction design)
Sources:
- SQLITE_BUSY Errors Despite Setting Timeout
- Improving SQLite Concurrency
- SQLite Forum: SQLITE_BUSY and Deadlock
4. Checkpoint Starvation in SQLite
What goes wrong: The WAL file grows indefinitely, disk space runs out, and query performance degrades progressively.
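Why it happens: A checkpoint can only copy WAL frames up to the snapshot of the oldest active reader, and the WAL can only be reset once no reader is still using it. With constant read traffic (e.g., polling for agent status) or long-running read transactions, there is never a gap in which the checkpoint can complete, so the WAL file is never recycled.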
How to avoid:
- Run a checkpoint periodically, e.g. `db.pragma('wal_checkpoint(RESTART)')` in better-sqlite3
- Use `PRAGMA wal_autocheckpoint=N` with appropriate N
- Avoid long-running read transactions
- Monitor WAL file size and alert when it exceeds threshold
- Consider read connection pooling with short-lived connections
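Monitoring the WAL size needs nothing beyond a stat call on the `-wal` file. The following sketch uses hypothetical helper names and the same assumed `db.exec(sql)` interface as any synchronous SQLite binding. Its 100 MB default mirrors the detection threshold in the Recovery Strategies table below.

```javascript
const fs = require("node:fs");

// Size of the WAL file in bytes, or 0 when there is no WAL file.
function walSizeBytes(dbPath) {
  try {
    return fs.statSync(`${dbPath}-wal`).size;
  } catch (err) {
    if (err.code === "ENOENT") return 0;
    throw err;
  }
}

// Requests a truncating checkpoint once the WAL exceeds the threshold.
// Returns true when a checkpoint was requested, false otherwise.
function checkpointIfLarge(db, dbPath, thresholdBytes = 100 * 1024 * 1024) {
  if (walSizeBytes(dbPath) <= thresholdBytes) return false;
  db.exec("PRAGMA wal_checkpoint(TRUNCATE)");
  return true;
}
```

A return value of true only means the checkpoint was requested. If a long-running reader is still active the WAL may not shrink, so it is worth comparing `walSizeBytes` before and after the call.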
Warning signs:
- `-wal` file growing continuously
- Slowing read performance over time
- Disk space warnings
Phase to address: Data Layer (connection management)
5. Graceful Shutdown Failures
What goes wrong: Agents are killed mid-operation, database writes are lost, file handles leak, and users lose work. Docker/Kubernetes sends SIGKILL after timeout.
Why it happens: Node.js doesn't handle shutdown gracefully by default. server.close() doesn't close keep-alive connections. process.exit() abandons pending async operations. Docker sends SIGKILL after 10 seconds if SIGTERM isn't handled.
How to avoid:
- Listen for SIGTERM and SIGINT: `process.on('SIGTERM', shutdown)`
- Set `process.exitCode` instead of calling `process.exit()` directly
- Track in-flight operations and wait for completion
- Implement timeout: graceful shutdown attempt, then force after N seconds
- For child processes: propagate signals to all children before exiting
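The "graceful attempt, then force" pattern can be sketched as a race between the cleanup work and a deadline. `shutdownWithDeadline` and `installSignalHandlers` are hypothetical names. The 5 second default is an assumption, chosen to stay under Docker's 10 second SIGKILL window mentioned above.

```javascript
// Runs every cleanup function and resolves to "clean" once all have settled,
// or to "timeout" if the deadline passes first.
function shutdownWithDeadline(cleanups, timeoutMs) {
  let timer;
  const deadline = new Promise((resolve) => {
    timer = setTimeout(() => resolve("timeout"), timeoutMs);
  });
  // Promise.resolve().then(fn) also captures cleanups that throw synchronously.
  const work = Promise.allSettled(cleanups.map((fn) => Promise.resolve().then(fn)))
    .then(() => "clean");
  return Promise.race([work, deadline]).finally(() => clearTimeout(timer));
}

// Wires the deadline logic to SIGTERM and SIGINT.
function installSignalHandlers(cleanups, timeoutMs = 5000) {
  let shuttingDown = false;
  const onSignal = async () => {
    if (shuttingDown) return; // ignore repeated signals
    shuttingDown = true;
    const outcome = await shutdownWithDeadline(cleanups, timeoutMs);
    if (outcome === "timeout") process.exit(1); // deadline passed: force exit
    process.exitCode = 0; // otherwise let the event loop drain naturally
  };
  process.on("SIGTERM", onSignal);
  process.on("SIGINT", onSignal);
}
```

A caller would pass one function per resource, for example one that closes the HTTP server, one that signals child agents, and one that closes the database.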
Warning signs:
- Data loss after restarts
- "Port already in use" after crash
- Orphaned temporary files
- Corrupted state files
Phase to address: Core Architecture (process lifecycle)
Sources:
- Graceful Shutdown with Node.js and Kubernetes
- Implementing NodeJS HTTP Graceful Shutdown
- Graceful Shutdown in NodeJS
6. Git Worktree Orphaning
What goes wrong: Worktrees become "orphaned" - git thinks they exist but they don't, or they exist but git doesn't know about them. Operations fail silently or with cryptic errors.
Why it happens: Worktrees are moved or deleted without using git worktree move/remove. The parent repository is moved, which breaks the links between it and its worktrees. A worktree on unmounted network or portable storage looks missing to git and gets auto-pruned.
How to avoid:
- Always use `git worktree remove` to delete worktrees
- Run `git worktree prune` periodically to clean stale entries
- Use `git worktree lock` for worktrees on removable storage
- After moving the main repo, run `git worktree repair`
- Never `rm -rf` a worktree directory directly
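Stale entries can be detected before they cause failures by parsing the output of `git worktree list --porcelain`. In the sketch below, `parseWorktreeList` and `listWorktrees` are hypothetical helpers. The `prunable` attribute is only printed by reasonably recent Git versions, so the sketch also records whether each path exists on disk.

```javascript
const fs = require("node:fs");
const { execFileSync } = require("node:child_process");

// Parses porcelain output: one "key value" attribute per line,
// with a blank line between worktree entries.
function parseWorktreeList(porcelain) {
  const entries = [];
  let current = null;
  for (const line of porcelain.split(/\r?\n/)) {
    if (line === "") {
      current = null;
      continue;
    }
    const space = line.indexOf(" ");
    const key = space === -1 ? line : line.slice(0, space);
    const value = space === -1 ? "" : line.slice(space + 1);
    if (key === "worktree") {
      current = { path: value, locked: false, prunable: false };
      entries.push(current);
    } else if (current && (key === "locked" || key === "prunable")) {
      current[key] = true;
    } else if (current && key === "branch") {
      current.branch = value;
    }
  }
  return entries;
}

// Runs git in repoDir and flags entries whose directory no longer exists.
function listWorktrees(repoDir) {
  const output = execFileSync("git", ["worktree", "list", "--porcelain"], {
    cwd: repoDir,
    encoding: "utf8",
  });
  return parseWorktreeList(output).map((entry) => ({
    ...entry,
    missing: !fs.existsSync(entry.path),
  }));
}
```

Entries that are missing but locked should be left alone, since locking is how a worktree on removable storage is protected from pruning.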
Warning signs:
- "fatal: not a git repository" in worktree
- `git worktree list` shows paths that don't exist
- Operations succeed in main repo but fail in worktree
Phase to address: Git Integration (worktree lifecycle)
7. File Watcher Race Conditions
What goes wrong: Files created between directory scan and watcher setup are never detected. Events fire before files are fully written. Multiple watcher instances conflict.
Why it happens: Chokidar scans directory contents, then sets up fs.watch - any file created between these operations is missed. Large files emit 'add' before write completes. Multiple chokidar instances (e.g., your app + Vite) interfere with each other.
How to avoid:
- Use `awaitWriteFinish: true` for large files (polls until size stabilizes)
- Call `.close()` with `await` - it's async and returns a Promise
- Avoid multiple watcher instances on same directory
- Increase `fs.inotify.max_user_watches` on Linux
- Add delay/debounce for rapid file changes
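The debounce advice does not depend on any particular watcher library. As a minimal sketch, `debounceByPath` (a hypothetical helper) collapses bursts of events for the same file into one handler call, while keeping events for different files independent.

```javascript
// Wraps handler so that repeated events for the same path within waitMs
// result in a single call, made waitMs after the last event for that path.
function debounceByPath(handler, waitMs = 100) {
  const timers = new Map(); // path -> pending timer
  return function onEvent(filePath) {
    clearTimeout(timers.get(filePath)); // no-op when nothing is pending
    timers.set(
      filePath,
      setTimeout(() => {
        timers.delete(filePath);
        handler(filePath);
      }, waitMs),
    );
  };
}
```

The returned function can be passed to a watcher's change and add events. The wait should be longer than the typical gap between partial writes of a single file.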
Warning signs:
- Newly created files sometimes not detected
- Events fire twice for same file
- "ENOSPC: System limit for file watchers reached"
- HMR/hot reload stops working
Phase to address: File System UI (watcher setup)
8. process.exit() Kills Async Operations
What goes wrong: Log messages never reach their destination. Database writes are lost. Cleanup callbacks don't complete.
Why it happens: process.exit() forces immediate termination even with pending I/O. Writes to stdout are sometimes async in Node.js. Exit event listeners can only run synchronous code.
How to avoid:
- Set `process.exitCode = N` instead of calling `process.exit(N)`
- Let the event loop drain naturally
- Use async-cleanup or node-cleanup packages for async cleanup
- Keep cleanup fast (< 2 seconds) to avoid SIGKILL from OS
Warning signs:
- Log messages truncated or missing
- Final state not persisted
- "Cleanup completed" message never appears
Phase to address: Core Architecture (exit handling)
9. Cross-Platform Path Handling
What goes wrong: Paths work on macOS but break on Windows. Case-sensitivity issues cause deployment failures. Scripts fail with "command not found."
Why it happens: Windows uses backslash, Unix uses forward slash. Windows is case-insensitive, Linux is case-sensitive. spawn without shell: true can't run .cmd/.bat files on Windows. Environment variable syntax differs (SET vs export).
How to avoid:
- Always use `path.join()` and `path.normalize()` - never string concatenation
- Use `path.sep` for platform-specific separators
- Test with the exact letter case used in `require()` statements: a mismatch still resolves on case-insensitive filesystems but fails on Linux
- Use `cross-env` for environment variables in npm scripts
- Use `shell: true` or explicitly handle `.cmd` files on Windows
- Use `rimraf` instead of `rm -rf`
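Two of these points are easy to get wrong in practice. The sketch below uses hypothetical helper names and an assumed `worktrees` directory layout. `path.posix` and `path.win32` are Node's built-in per-platform implementations, which make both behaviours checkable on any machine.

```javascript
const path = require("node:path");

// Builds a path below root without hard-coding a separator.
// pathImpl defaults to the current platform's implementation.
function worktreePath(root, name, pathImpl = path) {
  return pathImpl.join(root, "worktrees", name);
}

// Decides whether spawn needs a shell: only for .cmd/.bat files on Windows.
function spawnOptionsFor(command, platform = process.platform) {
  const isBatchFile = /\.(cmd|bat)$/i.test(command);
  return { shell: platform === "win32" && isBatchFile };
}

console.log(worktreePath("repo", "agent-a", path.posix)); // repo/worktrees/agent-a
console.log(worktreePath("repo", "agent-a", path.win32)); // repo\worktrees\agent-a
console.log(spawnOptionsFor("npm.cmd", "win32")); // { shell: true }
```

Enabling the shell only for batch files keeps the common case free of shell overhead and narrows the surface for command injection, which matters whenever arguments come from user input.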
Warning signs:
- "File not found" only on certain platforms
- Import works locally but fails in CI
- npm scripts fail on Windows
Phase to address: Core Architecture (all file operations)
10. Multi-Agent State Race Conditions
What goes wrong: Two agents update the same resource simultaneously. Plan ownership is ambiguous. State gets duplicated or lost between agents. Deadlocks occur when agents wait on each other.
Why it happens: Race conditions increase quadratically with agent count: N agents = N(N-1)/2 potential concurrent interactions. Tool contention when one agent reads while another writes. No clear source of truth for shared state.
How to avoid:
- Define clear resource ownership - one agent owns each resource
- Use optimistic locking with version numbers
- Implement file/resource locking for shared assets
- Design for eventual consistency where possible
- Use event-driven patterns with message queues
- Add comprehensive workflow checkpointing for rollback
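Optimistic locking with version numbers can be illustrated with an in-memory store. This is a sketch only: `VersionedStore` is a hypothetical class, and a real system would keep the version in the shared database so that the compare-and-write step is atomic across processes.

```javascript
// Each key carries a version that increases by one on every successful write.
class VersionedStore {
  #records = new Map();

  // Returns a copy of the record; unknown keys start at version 0.
  read(key) {
    const record = this.#records.get(key);
    return record ? { ...record } : { version: 0, value: undefined };
  }

  // Writes only if the caller has seen the latest version.
  // Returns true on success, false when another writer got there first.
  write(key, expectedVersion, value) {
    const currentVersion = this.#records.get(key)?.version ?? 0;
    if (currentVersion !== expectedVersion) return false;
    this.#records.set(key, { version: currentVersion + 1, value });
    return true;
  }
}
```

An agent whose write returns false re-reads the record, reapplies its change on top of the newer value, and tries again, so that it never silently overwrites another agent's update.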
Warning signs:
- Inconsistent state after concurrent operations
- Operations that "should have worked" show wrong results
- Agents appear stuck waiting
Phase to address: Agent Orchestration (coordination protocol)
Sources:
- AI Agent Orchestration for Production Systems
- Multi-Agent System Reliability
- Multi-Agent Coordination Strategies
Technical Debt Patterns
| Pattern | Symptom | Fix Cost (Early) | Fix Cost (Late) |
|---|---|---|---|
| No process tree tracking | Orphan accumulation | 2h | 2d (debugging + cleanup) |
| String path concatenation | Cross-platform breaks | 1h | 8h (find all instances) |
| Missing graceful shutdown | Data loss, port conflicts | 4h | 2d (state recovery) |
| Synchronous DB operations | UI freezes, timeouts | 4h | 1w (async refactor) |
| No WAL checkpoint strategy | Disk full, slow queries | 2h | 1d (emergency cleanup) |
| Implicit transactions | SQLITE_BUSY under load | 2h | 3d (transaction audit) |
| Direct worktree deletion | Orphan worktrees | 1h | 4h (manual repair) |
| Single watcher instance assumed | Silent event drops | 2h | 1d (debugging) |
Integration Gotchas
| Integration | Gotcha | Mitigation |
|---|---|---|
| SQLite + Network FS | WAL doesn't work over NFS/SMB - silent corruption | Only use local filesystems |
| SQLite + Docker volumes | Shared memory files can't be shared across containers | One container per database |
| Git worktree + Docker | Worktree paths break when container paths differ | Use consistent mount paths |
| chokidar + Vite | Multiple watchers conflict, HMR stops | Single watcher, or namespace paths |
| Node.js + Docker | PID 1 doesn't forward signals | Use --init or tini |
| npm scripts + signals | npm doesn't forward SIGTERM to children properly | Run node directly, not via npm |
| spawn + Windows | .cmd files need shell: true | Check platform, use shell on Windows |
| better-sqlite3 + threads | Synchronous API blocks the event loop while busy_timeout waits | Use worker threads for writes |
Performance Traps
| Trap | Impact | Solution |
|---|---|---|
| Polling SQLite for agent status | CPU burn, battery drain | Use file watchers or IPC |
| Not batching SQLite writes | 1000x slower inserts | Wrap in transaction |
| Watching entire repo | inotify limit exhaustion | Watch specific directories |
| Synchronous file operations | Event loop blocked | Use fs.promises, worker threads |
| Spawning shell for every command | 10-100ms overhead per spawn | Use shell: false where possible |
| No connection reuse | Connection overhead per query | Reuse database connection |
| Logging to file synchronously | I/O blocks main thread | Async logging, buffer writes |
Security Mistakes
| Mistake | Risk | Mitigation |
|---|---|---|
| Storing credentials in SQLite | Database theft exposes secrets | Use OS keychain, env vars |
| Executing user input in shell | Command injection | Never shell: true with user input |
| World-readable socket files | Unauthorized access | Set restrictive permissions |
| No validation on IPC messages | Malicious command execution | Schema validation, allowlists |
| Logging sensitive data | Credential exposure in logs | Redact tokens, keys, passwords |
| Worktree in /tmp | Other users can access | Use user-specific directories |
UX Pitfalls
| Pitfall | User Experience | Fix |
|---|---|---|
| Silent failures | User thinks it worked | Always report errors clearly |
| No progress indication | User thinks it's frozen | Show spinners, progress bars |
| Cryptic error messages | User can't fix the problem | Include actionable guidance |
| Requiring too many flags | High learning curve | Interactive mode for discovery |
| Inconsistent command names | User guesses wrong | Follow conventions (git-style) |
| No examples in help | User doesn't know how to start | Show common use cases first |
| Breaking changes in config | User's setup stops working | Version configs, migrate automatically |
| Long startup time | User abandons tool | Lazy loading, fast path for common cases |
"Looks Done But Isn't" Checklist
- Process cleanup: Verified all child processes terminate on parent exit
- Signal handling: SIGTERM, SIGINT handled; cleanup completes before SIGKILL timeout
- Database integrity: WAL files included in backups; checkpoint strategy defined
- Concurrent writes: Tested with multiple agents writing simultaneously
- Worktree cleanup: Verified `git worktree remove` used, not `rm -rf`
- File watcher coverage: Tested files created during watcher startup
- Cross-platform: Tested on Windows (paths, signals, shell commands)
- Error messages: All errors have actionable guidance
- Graceful degradation: System handles missing files, network failures
- Resource limits: Tested with max expected agents/worktrees/watchers
- Restart recovery: System recovers correct state after crash/kill
- Long-running stability: No memory leaks, file handle leaks over hours
Recovery Strategies
| Failure | Detection | Recovery |
|---|---|---|
| Orphan processes | ps shows unexpected node processes | Kill process tree, track PIDs |
| Corrupted SQLite | "malformed" error on query | Restore from backup, rebuild |
| Orphan worktrees | git worktree list shows missing paths | git worktree prune |
| Stuck file watcher | Events stop firing | Restart watcher, check limits |
| Checkpoint starvation | WAL file > 100MB | Force checkpoint, close long readers |
| Zombie agents | Agent shows running but unresponsive | Track heartbeats, timeout + kill |
| State inconsistency | Agents report different state | Single source of truth, reconcile |
| Port in use | "EADDRINUSE" on restart | Track PID file, kill stale process |
Pitfall-to-Phase Mapping
| Phase | Must Address |
|---|---|
| Core Architecture | Process lifecycle, signal handling, exit handling, cross-platform paths |
| Data Layer | SQLite WAL handling, transactions, checkpoint strategy, backups |
| Git Integration | Worktree lifecycle, repair/prune, lock handling |
| Agent Orchestration | Process tree tracking, state ownership, race condition prevention |
| File System UI | Watcher setup, race conditions, resource limits |
| CLI Interface | Error messages, interactive mode, help examples |
Key Sources
Process Management
- 5 Tips for Cleaning Orphaned Node.js Processes
- Killing Process Families with Node
- Die, Child Process, Die!
- Node.js Child Process Documentation
SQLite
- How To Corrupt An SQLite Database File
- SQLite Write-Ahead Logging
- SQLITE_BUSY Errors Despite Timeout
- SQLite Myths: Concurrent Writes
- Improving SQLite Concurrency
Graceful Shutdown
- Graceful Shutdown with Node.js and Kubernetes
- Implementing NodeJS HTTP Graceful Shutdown
- Graceful Shutdown in NodeJS
Multi-Agent Systems
- AI Agent Orchestration for Production
- Multi-Agent System Reliability
- Multi-Agent Coordination Strategies