Codewalkers: Critical Pitfalls Reference

  • Domain: Multi-agent orchestration / Developer tooling
  • Researched: 2026-01-30
  • Confidence: HIGH

This document catalogs preventable mistakes for CLI tools that manage background processes, SQLite databases, git worktrees, and file system watchers. It focuses on what actually breaks in production, not on theoretical edge cases.


Critical Pitfalls

1. Zombie and Orphan Processes

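What goes wrong: Child processes (Claude Code agents) keep running after the parent exits and consume resources indefinitely. child.kill() only sends a signal (SIGTERM by default) to the direct child; it does not wait for the child to exit and does not touch the child's own descendants. Orphaned processes accumulate with every request.

Why it happens: When the parent exits or is killed, its children are not terminated automatically; they are re-parented (to PID 1 on most systems) and keep running. child.kill() signals only the process it spawned, so grandchildren (for example, anything started through a shell) survive. A true zombie, an exited process whose kernel entry lingers until its parent calls wait/waitpid, is most likely when Node.js itself runs as PID 1 in a container and inherits orphans it never reaps.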

How to avoid:

  • Use the terminate npm package to kill process trees (parent + all children)
  • In children spawned with an IPC channel, exit when the parent goes away: process.on('disconnect', () => process.exit(0))
  • For background daemons: use detached: true with child.unref() intentionally
  • Track all spawned PIDs and clean them up on shutdown
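
The PID-tracking advice above can be sketched as a small registry. This is a minimal illustration, not a complete solution: ChildRegistry and its method names are hypothetical, and it only signals direct children, so killing whole process trees still needs a tree-kill approach such as the terminate package.

```javascript
// Hypothetical helper: remembers every live child so shutdown can signal them all.
class ChildRegistry {
  constructor() {
    this.children = new Map(); // pid -> ChildProcess-like object
  }

  // Register a child returned by child_process.spawn(); forget it once it exits.
  track(child) {
    if (child.pid === undefined) return child; // spawn failed, nothing to track
    this.children.set(child.pid, child);
    child.once('exit', () => this.children.delete(child.pid));
    return child;
  }

  // Send a signal to every tracked child; returns the PIDs that were signalled.
  killAll(signal = 'SIGTERM') {
    const signalled = [];
    for (const [pid, child] of this.children) {
      if (child.kill(signal)) signalled.push(pid);
    }
    return signalled;
  }
}
```

A caller would wrap every spawn, for example registry.track(spawn(command, args)), and call registry.killAll() from the shutdown path.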

Warning signs:

  • Memory usage grows over time
  • ps aux | grep node shows unexpected processes
  • System becomes sluggish after extended use

Phase to address: Core Architecture (before spawning any agents)


2. SQLite WAL File Separation

What goes wrong: Database becomes corrupted or loses committed transactions when the -wal and -shm files are separated from the main .db file. This happens during backups, file moves, or container restarts.

Why it happens: SQLite WAL mode uses three files: .db, .db-wal (write-ahead log), and .db-shm (shared memory). The WAL file holds committed transactions that have not yet been checkpointed into the main database. Copying only the .db file loses that committed data or corrupts the database.

How to avoid:

  • Never use fs.copyFile() on an active SQLite database
  • Use SQLite's backup API (sqlite3_backup_*) or sqlite3_rsync (v3.47.0+)
  • If you must copy: start an IMMEDIATE transaction first to block writers
  • Always copy all three files atomically
  • Run PRAGMA wal_checkpoint(TRUNCATE) before backups
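
A checkpoint-then-backup routine might look like the sketch below. It assumes db is a better-sqlite3 Database opened in WAL mode and relies on that library's pragma() and backup() methods; backupDatabase is a hypothetical name.

```javascript
// Assumes `db` is a better-sqlite3 Database opened in WAL mode.
async function backupDatabase(db, destPath) {
  // Fold the WAL back into the main file and truncate it first.
  db.pragma('wal_checkpoint(TRUNCATE)');
  // The online backup API copies a consistent snapshot, so no manual
  // copying of the -wal and -shm files is needed.
  await db.backup(destPath);
  return destPath;
}
```

Because the backup goes through SQLite itself, writers can keep working while it runs, which a plain file copy cannot guarantee.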

Warning signs:

  • "database disk image is malformed" errors
  • Missing recent data after restart
  • -wal file grows unboundedly

Phase to address: Data Layer (first database operation)


3. SQLITE_BUSY Despite busy_timeout

What goes wrong: You set busy_timeout but still get "database is locked" errors. Transactions fail unexpectedly under concurrent load.

Why it happens: busy_timeout doesn't help when a DEFERRED transaction tries to upgrade from read to write mid-transaction while another connection holds the write lock. SQLite returns SQLITE_BUSY immediately without invoking the busy handler, because waiting cannot resolve the conflict: the transaction would have to be rolled back and restarted.

How to avoid:

  • Use BEGIN IMMEDIATE for any transaction that will write
  • Never upgrade from read-only to read-write mid-transaction
  • Implement retry logic with exponential backoff at the application level
  • Consider using a write queue/serialization for all writes
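
The first and third points above can be combined as in this sketch. It assumes a driver whose db.exec(sql) runs a statement synchronously and whose busy errors carry err.code === 'SQLITE_BUSY' (as better-sqlite3 does); the function names, retry count, and delays are illustrative.

```javascript
// Run fn inside BEGIN IMMEDIATE so the write lock is taken up front.
function withImmediateTransaction(db, fn) {
  db.exec('BEGIN IMMEDIATE'); // if this throws, no transaction is open
  try {
    const result = fn(db);
    db.exec('COMMIT');
    return result;
  } catch (err) {
    db.exec('ROLLBACK');
    throw err;
  }
}

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Retry the whole operation on SQLITE_BUSY with exponential backoff.
async function retryOnBusy(operation, { retries = 5, baseDelayMs = 10 } = {}) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await operation();
    } catch (err) {
      if (err.code !== 'SQLITE_BUSY' || attempt >= retries) throw err;
      await sleep(baseDelayMs * 2 ** attempt);
    }
  }
}
```

Retrying the whole transaction, rather than a single statement inside it, matches the reason the busy handler is skipped: the failed transaction has to start over.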

Warning signs:

  • Sporadic "database is locked" errors under load
  • Errors only appear with concurrent operations
  • busy_timeout appears ineffective

Phase to address: Data Layer (transaction design)


4. Checkpoint Starvation in SQLite

What goes wrong: The WAL file grows indefinitely, disk space runs out, and query performance degrades progressively.

Why it happens: A checkpoint can only complete and let the WAL be reused when no reader is still using an older snapshot. With constant, overlapping read traffic (e.g., polling for agent status) there is never such a gap, so the WAL file cannot be recycled. Long-running read transactions have the same effect.

How to avoid:

  • Run a checkpoint periodically (in current better-sqlite3: db.pragma('wal_checkpoint(RESTART)'); older releases exposed db.checkpoint())
  • Use PRAGMA wal_autocheckpoint=N with appropriate N
  • Avoid long-running read transactions
  • Monitor WAL file size and alert when it exceeds threshold
  • Consider read connection pooling with short-lived connections
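
One way to act on the WAL-size advice is a periodic size check that forces a checkpoint. The sketch assumes db.pragma() as provided by better-sqlite3; the function name and the 64 MiB default threshold are illustrative.

```javascript
const fs = require('node:fs');

// Checkpoint when the -wal file exceeds maxBytes. Returns true if a checkpoint ran.
function checkpointIfWalTooLarge(db, dbPath, maxBytes = 64 * 1024 * 1024) {
  let size;
  try {
    size = fs.statSync(`${dbPath}-wal`).size;
  } catch (err) {
    if (err.code === 'ENOENT') return false; // no WAL file yet
    throw err;
  }
  if (size <= maxBytes) return false;
  db.pragma('wal_checkpoint(RESTART)');
  return true;
}
```

It could be scheduled with setInterval(() => checkpointIfWalTooLarge(db, dbPath), 5000).unref() so the timer does not keep the process alive. A forced checkpoint still cannot reset the WAL while a long-running reader holds an old snapshot, so short read transactions remain necessary.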

Warning signs:

  • -wal file growing continuously
  • Slowing read performance over time
  • Disk space warnings

Phase to address: Data Layer (connection management)


5. Graceful Shutdown Failures

What goes wrong: Agents are killed mid-operation, database writes are lost, file handles leak, and users lose work. Docker/Kubernetes sends SIGKILL after timeout.

Why it happens: Node.js does not shut down gracefully by default. On Node.js versions before 19, server.close() leaves idle keep-alive connections open. process.exit() abandons pending async operations. docker stop sends SIGKILL after a 10-second grace period by default if the process has not exited (Kubernetes defaults to 30 seconds).

How to avoid:

  • Listen for SIGTERM and SIGINT: process.on('SIGTERM', shutdown)
  • Set process.exitCode instead of calling process.exit() directly
  • Track in-flight operations and wait for completion
  • Implement timeout: graceful shutdown attempt, then force after N seconds
  • For child processes: propagate signals to all children before exiting
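
The steps above can be sketched as a bounded cleanup routine plus signal wiring. The function names and the 5-second default are assumptions for illustration; tasks is an array of functions that close servers, flush the database, signal children, and so on.

```javascript
// Run cleanup tasks in parallel, but give up after timeoutMs.
// Resolves to 'clean' or 'timeout'.
async function shutdownWithTimeout(tasks, timeoutMs = 5000) {
  let timer;
  const timeout = new Promise((resolve) => {
    timer = setTimeout(() => resolve('timeout'), timeoutMs);
  });
  const cleanup = Promise
    .allSettled(tasks.map((task) => Promise.resolve().then(task)))
    .then(() => 'clean');
  try {
    return await Promise.race([cleanup, timeout]);
  } finally {
    clearTimeout(timer);
  }
}

// Wire SIGTERM and SIGINT to one idempotent handler.
function installShutdownHandlers(tasks, timeoutMs) {
  let shuttingDown = false;
  const onSignal = async () => {
    if (shuttingDown) return;
    shuttingDown = true;
    const outcome = await shutdownWithTimeout(tasks, timeoutMs);
    // Force the exit only when cleanup overran; otherwise let the loop drain.
    if (outcome === 'timeout') process.exit(1);
    process.exitCode = 0;
  };
  process.on('SIGTERM', onSignal);
  process.on('SIGINT', onSignal);
}
```

The timeout should stay below the supervisor's grace period so the forced exit happens before an external SIGKILL does.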

Warning signs:

  • Data loss after restarts
  • "Port already in use" after crash
  • Orphaned temporary files
  • Corrupted state files

Phase to address: Core Architecture (process lifecycle)


6. Git Worktree Orphaning

What goes wrong: Worktrees become "orphaned" - git thinks they exist but they don't, or they exist but git doesn't know about them. Operations fail silently or with cryptic errors.

Why it happens: Worktrees are moved or deleted without git worktree move/remove. The parent repository is moved, which breaks the worktree links. A worktree on unmounted network or portable storage gets auto-pruned.

How to avoid:

  • Always use git worktree remove to delete worktrees
  • Run git worktree prune periodically to clean stale entries
  • Use git worktree lock for worktrees on removable storage
  • After moving the main repo, run git worktree repair
  • Never rm -rf a worktree directory directly
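
To detect stale entries programmatically, the output of git worktree list --porcelain can be parsed. The sketch below assumes the documented porcelain layout (one "worktree <path>" block per entry, separated by blank lines) and that the installed git is recent enough to emit "locked" and "prunable" annotations; parseWorktreeList is a hypothetical helper.

```javascript
// Parse `git worktree list --porcelain` output into plain records.
function parseWorktreeList(porcelain) {
  return porcelain
    .split(/\r?\n\r?\n/)
    .map((block) => block.trim())
    .filter(Boolean)
    .map((block) => {
      const entry = { path: null, branch: null, locked: false, prunable: false };
      for (const line of block.split(/\r?\n/)) {
        const [key, ...rest] = line.split(' ');
        const value = rest.join(' ');
        if (key === 'worktree') entry.path = value;
        else if (key === 'branch') entry.branch = value;
        else if (key === 'locked') entry.locked = true;
        else if (key === 'prunable') entry.prunable = true;
      }
      return entry;
    });
}
```

The input would typically come from execFileSync('git', ['worktree', 'list', '--porcelain'], { encoding: 'utf8' }); entries flagged prunable are candidates for git worktree prune.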

Warning signs:

  • "fatal: not a git repository" in worktree
  • git worktree list shows paths that don't exist
  • Operations succeed in main repo but fail in worktree

Phase to address: Git Integration (worktree lifecycle)


7. File Watcher Race Conditions

What goes wrong: Files created between directory scan and watcher setup are never detected. Events fire before files are fully written. Multiple watcher instances conflict.

Why it happens: Chokidar scans directory contents, then sets up fs.watch - any file created between these operations is missed. Large files emit 'add' before write completes. Multiple chokidar instances (e.g., your app + Vite) interfere with each other.

How to avoid:

  • Use awaitWriteFinish: true for large files (polls until size stabilizes)
  • Call .close() with await - it's async and returns a Promise
  • Avoid multiple watcher instances on same directory
  • Increase fs.inotify.max_user_watches on Linux
  • Add delay/debounce for rapid file changes
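
The debounce suggestion can be sketched without any watcher library. createPathDebouncer is a hypothetical helper and the 100 ms default is illustrative; it collapses a burst of events for the same path into a single call.

```javascript
// Collapse bursts of events for one path into a single handler call
// after delayMs of quiet.
function createPathDebouncer(handler, delayMs = 100) {
  const timers = new Map(); // path -> pending timer
  return function onEvent(filePath) {
    clearTimeout(timers.get(filePath)); // no-op when nothing is pending
    timers.set(
      filePath,
      setTimeout(() => {
        timers.delete(filePath);
        handler(filePath);
      }, delayMs),
    );
  };
}
```

With chokidar, the returned function can be passed straight to watcher.on('change', onEvent), since the event's first argument is the path.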

Warning signs:

  • Newly created files sometimes not detected
  • Events fire twice for same file
  • "ENOSPC: System limit for file watchers reached"
  • HMR/hot reload stops working

Phase to address: File System UI (watcher setup)


8. process.exit() Kills Async Operations

What goes wrong: Log messages never reach their destination. Database writes are lost. Cleanup callbacks don't complete.

Why it happens: process.exit() forces immediate termination even with pending I/O. Writes to process.stdout and process.stderr can be asynchronous, depending on the platform and what the stream is connected to. 'exit' event listeners can only run synchronous code.

How to avoid:

  • Set process.exitCode = N instead of calling process.exit(N)
  • Let the event loop drain naturally
  • Use async-cleanup or node-cleanup packages for async cleanup
  • Keep cleanup fast (< 2 seconds) so it finishes before the supervisor (Docker, systemd, etc.) escalates to SIGKILL

Warning signs:

  • Log messages truncated or missing
  • Final state not persisted
  • "Cleanup completed" message never appears

Phase to address: Core Architecture (exit handling)


9. Cross-Platform Path Handling

What goes wrong: Paths work on macOS but break on Windows. Case-sensitivity issues cause deployment failures. Scripts fail with "command not found."

Why it happens: Windows uses backslashes, Unix uses forward slashes. Windows and default macOS filesystems are case-insensitive, while Linux is case-sensitive, so a wrongly cased import works locally and fails in CI. spawn without shell: true can't run .cmd/.bat files on Windows. Environment variable syntax differs (SET vs export).

How to avoid:

  • Always use path.join() and path.normalize() - never string concatenation
  • Use path.sep for platform-specific separators
  • Match the exact on-disk case in require() and import paths
  • Use cross-env for environment variables in npm scripts
  • Use shell: true or explicitly handle .cmd files on Windows
  • Use rimraf instead of rm -rf
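
Two of these points lend themselves to small helpers. Both names and the 'worktrees' directory layout are hypothetical; the optional parameters exist only so the behaviour can be checked for another platform.

```javascript
const path = require('node:path');

// Join segments with the separator of the given path flavour (defaults to the host's).
function worktreePath(baseDir, agentId, pathImpl = path) {
  return pathImpl.join(baseDir, 'worktrees', agentId);
}

// .cmd/.bat files need a shell on Windows; everything else can skip it.
function spawnOptionsFor(command, platform = process.platform) {
  const needsShell = platform === 'win32' && /\.(cmd|bat)$/i.test(command);
  return { shell: needsShell };
}
```

For example, worktreePath('C:\\proj', 'agent-1', path.win32) yields 'C:\\proj\\worktrees\\agent-1', while the same call with path.posix and '/home/u/proj' yields '/home/u/proj/worktrees/agent-1'. When shell is true, arguments must never contain untrusted input.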

Warning signs:

  • "File not found" only on certain platforms
  • Import works locally but fails in CI
  • npm scripts fail on Windows

Phase to address: Core Architecture (all file operations)


10. Multi-Agent State Race Conditions

What goes wrong: Two agents update the same resource simultaneously. Plan ownership is ambiguous. State gets duplicated or lost between agents. Deadlocks occur when agents wait on each other.

Why it happens: The number of agent pairs that can conflict grows quadratically with agent count: N agents have N(N-1)/2 potential pairwise interactions (10 agents give 45). Tools contend when one agent reads while another writes. There is no clear source of truth for shared state.

How to avoid:

  • Define clear resource ownership - one agent owns each resource
  • Use optimistic locking with version numbers
  • Implement file/resource locking for shared assets
  • Design for eventual consistency where possible
  • Use event-driven patterns with message queues
  • Add comprehensive workflow checkpointing for rollback
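
Optimistic locking with version numbers can be illustrated with an in-memory store. VersionedStore is a hypothetical stand-in for whatever holds the shared state; with SQLite the same idea is usually an UPDATE guarded by "WHERE version = ?" followed by a check of the affected row count.

```javascript
// A write succeeds only if the caller saw the latest version.
class VersionedStore {
  constructor() {
    this.records = new Map(); // key -> { value, version }
  }

  read(key) {
    return this.records.get(key) ?? { value: undefined, version: 0 };
  }

  // Returns true on success, false if another writer got there first.
  write(key, value, expectedVersion) {
    const current = this.read(key);
    if (current.version !== expectedVersion) return false;
    this.records.set(key, { value, version: expectedVersion + 1 });
    return true;
  }
}
```

An agent whose write returns false re-reads the record, re-applies its change to the fresh value, and tries again, so a stale update is rejected instead of silently overwriting newer state.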

Warning signs:

  • Inconsistent state after concurrent operations
  • Operations that "should have worked" show wrong results
  • Agents appear stuck waiting

Phase to address: Agent Orchestration (coordination protocol)


Technical Debt Patterns

| Pattern | Symptom | Fix Cost (Early) | Fix Cost (Late) |
|---|---|---|---|
| No process tree tracking | Orphan accumulation | 2h | 2d (debugging + cleanup) |
| String path concatenation | Cross-platform breaks | 1h | 8h (find all instances) |
| Missing graceful shutdown | Data loss, port conflicts | 4h | 2d (state recovery) |
| Synchronous DB operations | UI freezes, timeouts | 4h | 1w (async refactor) |
| No WAL checkpoint strategy | Disk full, slow queries | 2h | 1d (emergency cleanup) |
| Implicit transactions | SQLITE_BUSY under load | 2h | 3d (transaction audit) |
| Direct worktree deletion | Orphan worktrees | 1h | 4h (manual repair) |
| Single watcher instance assumed | Silent event drops | 2h | 1d (debugging) |

Integration Gotchas

| Integration | Gotcha | Mitigation |
|---|---|---|
| SQLite + Network FS | WAL doesn't work over NFS/SMB - silent corruption | Only use local filesystems |
| SQLite + Docker volumes | Shared memory files can't be shared across containers | One container per database |
| Git worktree + Docker | Worktree paths break when container paths differ | Use consistent mount paths |
| chokidar + Vite | Multiple watchers conflict, HMR stops | Single watcher, or namespace paths |
| Node.js + Docker | Node as PID 1 gets no default signal handling and doesn't reap orphans | Use --init or tini |
| npm scripts + signals | npm doesn't forward SIGTERM to children properly | Run node directly, not via npm |
| spawn + Windows | .cmd files need shell: true | Check platform, use shell on Windows |
| better-sqlite3 + threads | Synchronous API blocks the event loop while waiting on busy_timeout | Use worker threads for writes |

Performance Traps

| Trap | Impact | Solution |
|---|---|---|
| Polling SQLite for agent status | CPU burn, battery drain | Use file watchers or IPC |
| Not batching SQLite writes | 1000x slower inserts | Wrap in transaction |
| Watching entire repo | inotify limit exhaustion | Watch specific directories |
| Synchronous file operations | Event loop blocked | Use fs.promises, worker threads |
| Spawning shell for every command | 10-100ms overhead per spawn | Use shell: false where possible |
| No connection reuse | Connection overhead per query | Reuse database connection |
| Logging to file synchronously | I/O blocks main thread | Async logging, buffer writes |

Security Mistakes

| Mistake | Risk | Mitigation |
|---|---|---|
| Storing credentials in SQLite | Database theft exposes secrets | Use OS keychain, env vars |
| Executing user input in shell | Command injection | Never shell: true with user input |
| World-readable socket files | Unauthorized access | Set restrictive permissions |
| No validation on IPC messages | Malicious command execution | Schema validation, allowlists |
| Logging sensitive data | Credential exposure in logs | Redact tokens, keys, passwords |
| Worktree in /tmp | Other users can access | Use user-specific directories |
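
For the command-injection row, the usual pattern is to pass arguments as an array and validate untrusted values against an allowlist. The sketch below is illustrative: worktreeAddArgs is a hypothetical helper and its pattern is a conservative assumption, not a complete git ref-name validator.

```javascript
// Build argv for `git worktree add` from untrusted input without a shell.
function worktreeAddArgs(worktreeDir, branchName) {
  // Must start with an alphanumeric, so values like "-f" cannot pose as options.
  const allowed = /^[A-Za-z0-9][A-Za-z0-9._\/-]*$/;
  if (!allowed.test(branchName)) {
    throw new Error(`Rejected branch name: ${JSON.stringify(branchName)}`);
  }
  if (worktreeDir.startsWith('-')) {
    throw new Error(`Rejected worktree path: ${JSON.stringify(worktreeDir)}`);
  }
  return ['worktree', 'add', worktreeDir, branchName];
}
```

The result is meant for execFile('git', worktreeAddArgs(dir, name), callback) or spawn without shell: true, so no shell ever parses the values.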

UX Pitfalls

| Pitfall | User Experience | Fix |
|---|---|---|
| Silent failures | User thinks it worked | Always report errors clearly |
| No progress indication | User thinks it's frozen | Show spinners, progress bars |
| Cryptic error messages | User can't fix the problem | Include actionable guidance |
| Requiring too many flags | High learning curve | Interactive mode for discovery |
| Inconsistent command names | User guesses wrong | Follow conventions (git-style) |
| No examples in help | User doesn't know how to start | Show common use cases first |
| Breaking changes in config | User's setup stops working | Version configs, migrate automatically |
| Long startup time | User abandons tool | Lazy loading, fast path for common cases |

"Looks Done But Isn't" Checklist

  • Process cleanup: Verified all child processes terminate on parent exit
  • Signal handling: SIGTERM, SIGINT handled; cleanup completes before SIGKILL timeout
  • Database integrity: WAL files included in backups; checkpoint strategy defined
  • Concurrent writes: Tested with multiple agents writing simultaneously
  • Worktree cleanup: Verified git worktree remove used, not rm -rf
  • File watcher coverage: Tested files created during watcher startup
  • Cross-platform: Tested on Windows (paths, signals, shell commands)
  • Error messages: All errors have actionable guidance
  • Graceful degradation: System handles missing files, network failures
  • Resource limits: Tested with max expected agents/worktrees/watchers
  • Restart recovery: System recovers correct state after crash/kill
  • Long-running stability: No memory leaks, file handle leaks over hours

Recovery Strategies

| Failure | Detection | Recovery |
|---|---|---|
| Orphan processes | ps shows unexpected node processes | Kill process tree, track PIDs |
| Corrupted SQLite | "malformed" error on query | Restore from backup, rebuild |
| Orphan worktrees | git worktree list shows missing paths | git worktree prune |
| Stuck file watcher | Events stop firing | Restart watcher, check limits |
| Checkpoint starvation | WAL file > 100MB | Force checkpoint, close long readers |
| Zombie agents | Agent shows running but unresponsive | Track heartbeats, timeout + kill |
| State inconsistency | Agents report different state | Single source of truth, reconcile |
| Port in use | "EADDRINUSE" on restart | Track PID file, kill stale process |
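
The heartbeat-based recovery for unresponsive agents can be sketched as a small monitor. HeartbeatMonitor and the 30-second default are assumptions for illustration; the optional now parameters make the logic easy to exercise with fixed timestamps.

```javascript
// Track last-seen times per agent and report those that have gone quiet.
class HeartbeatMonitor {
  constructor(timeoutMs = 30_000) {
    this.timeoutMs = timeoutMs;
    this.lastSeen = new Map(); // agentId -> timestamp of last heartbeat
  }

  beat(agentId, now = Date.now()) {
    this.lastSeen.set(agentId, now);
  }

  remove(agentId) {
    this.lastSeen.delete(agentId);
  }

  // Agents whose last heartbeat is older than timeoutMs.
  stale(now = Date.now()) {
    return [...this.lastSeen]
      .filter(([, seenAt]) => now - seenAt > this.timeoutMs)
      .map(([agentId]) => agentId);
  }
}
```

A supervisor would call stale() on a timer, kill the process tree of each returned agent, and then remove() it.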

Pitfall-to-Phase Mapping

| Phase | Must Address |
|---|---|
| Core Architecture | Process lifecycle, signal handling, exit handling, cross-platform paths |
| Data Layer | SQLite WAL handling, transactions, checkpoint strategy, backups |
| Git Integration | Worktree lifecycle, repair/prune, lock handling |
| Agent Orchestration | Process tree tracking, state ownership, race condition prevention |
| File System UI | Watcher setup, race conditions, resource limits |
| CLI Interface | Error messages, interactive mode, help examples |
