Lukas May 0ff65b0b02 feat: Rename application from "Codewalk District" to "Codewalkers"

Update all user-facing strings (HTML title, manifest, header logo,
browser title updater), code comments, and documentation references.
Folder name retained as-is.

2026-03-05 12:05:08 +01:00

Codewalkers: Critical Pitfalls Reference

Domain: Multi-agent orchestration / Developer tooling
Researched: 2026-01-30
Confidence: HIGH

This document catalogs preventable mistakes for CLI tools that manage background processes, SQLite databases, git worktrees, and file system watchers. It focuses on what actually breaks in production, not theoretical edge cases.

Critical Pitfalls

1. Zombie and Orphan Processes

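What goes wrong: Child processes (Claude Code agents) continue running after the parent exits, consuming resources indefinitely. child.kill() only sends a signal (SIGTERM by default) to the direct child; it does not wait for the process to exit and does not reach grandchildren. Orphaned processes accumulate with every request.

Why it happens: Killing a parent does not kill its descendants. The children are reparented to PID 1 (or the nearest subreaper) and keep running. If the direct child is a shell or wrapper, signalling it leaves the real worker underneath alive. Zombies are a separate problem: an exited process keeps its kernel process-table entry until its parent calls wait/waitpid. Node.js reaps the children it spawns itself, but when it runs as PID 1 in a container it also inherits orphans that it never waits on.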

How to avoid:

Use the terminate npm package to kill process trees (parent + all children)
In children started with an IPC channel (fork(), or spawn() with 'ipc' in stdio), exit when the parent goes away: process.on('disconnect', () => process.exit(0))
For background daemons: use detached: true with child.unref() intentionally
Track all spawned PIDs and clean them up on shutdown
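
The PID-tracking advice above can be sketched as a small registry. This is a minimal illustration, not the project's implementation: ChildRegistry and its method names are hypothetical, and killAll() only signals direct children, so grandchildren still need a process-tree kill.

```typescript
import { spawn, ChildProcess } from "node:child_process";

// Hypothetical helper: remembers every child we spawn so shutdown can reach all of them.
class ChildRegistry {
  private children = new Set<ChildProcess>();

  spawnTracked(command: string, args: string[]): ChildProcess {
    const child = spawn(command, args, { stdio: "ignore" });
    this.children.add(child);
    // Forget the child once it has exited or failed to start.
    const forget = () => {
      this.children.delete(child);
    };
    child.once("exit", forget);
    child.once("error", forget);
    return child;
  }

  get size(): number {
    return this.children.size;
  }

  pids(): number[] {
    const out: number[] = [];
    this.children.forEach((c) => {
      if (c.pid !== undefined) out.push(c.pid);
    });
    return out;
  }

  // Signals each tracked child. This does not reach grandchildren.
  killAll(signal: NodeJS.Signals = "SIGTERM"): void {
    this.children.forEach((c) => {
      c.kill(signal);
    });
  }
}
```

A shutdown handler would call killAll() first and escalate to killAll("SIGKILL") only for children that are still tracked after a grace period.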

Warning signs:

Memory usage grows over time
ps aux | grep node shows unexpected processes
System becomes sluggish after extended use

Phase to address: Core Architecture (before spawning any agents)

2. SQLite WAL File Separation

What goes wrong: Database becomes corrupted or loses committed transactions when the -wal and -shm files are separated from the main .db file. This happens during backups, file moves, or container restarts.

Why it happens: SQLite WAL mode uses three files: .db, .db-wal (write-ahead log), and .db-shm (shared-memory index). The WAL file holds committed transactions that have not yet been checkpointed into the main database. Copying only the .db file silently drops those transactions, and pairing a .db file with a -wal file from a different point in time can corrupt the database.

How to avoid:

Never use fs.copyFile() on an active SQLite database
Use SQLite's backup API (sqlite3_backup_*) or sqlite3_rsync (v3.47.0+)
If you must copy: start an IMMEDIATE transaction first to block writers
Always copy all three files atomically
Run PRAGMA wal_checkpoint(TRUNCATE) before backups
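
For the narrow case of a database that no process has open (for example after a crash, before restart), keeping the files together can be sketched as below. The helper names are hypothetical and the demo files are empty placeholders, not real SQLite data. For a live database the backup API remains the right tool.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// The main database file plus whichever WAL-mode companions exist on disk.
function sqliteFileSet(dbPath: string): string[] {
  return [dbPath, `${dbPath}-wal`, `${dbPath}-shm`].filter((p) => fs.existsSync(p));
}

// Copies the whole set into destDir.
// Assumption: no connection has the database open while this runs.
function copyClosedDatabase(dbPath: string, destDir: string): string[] {
  fs.mkdirSync(destDir, { recursive: true });
  return sqliteFileSet(dbPath).map((src) => {
    const dest = path.join(destDir, path.basename(src));
    fs.copyFileSync(src, dest);
    return dest;
  });
}

// Demo in a throwaway directory with placeholder files.
const demoDir = fs.mkdtempSync(path.join(os.tmpdir(), "wal-demo-"));
const demoDb = path.join(demoDir, "app.db");
fs.writeFileSync(demoDb, "");
fs.writeFileSync(`${demoDb}-wal`, "");
const copied = copyClosedDatabase(demoDb, path.join(demoDir, "backup"));
console.log(copied.map((p) => path.basename(p)));
```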

Warning signs:

"database disk image is malformed" errors
Missing recent data after restart
-wal file grows unboundedly

Phase to address: Data Layer (first database operation)

3. SQLITE_BUSY Despite busy_timeout

What goes wrong: You set busy_timeout but still get "database is locked" errors. Transactions fail unexpectedly under concurrent load.

Why it happens: busy_timeout doesn't help when a DEFERRED transaction starts as a reader and then tries to upgrade to a writer while another connection holds, or has just used, the write lock. SQLite returns SQLITE_BUSY immediately without invoking the busy handler, because waiting cannot resolve the conflict: the transaction has to be rolled back and retried from the start.

How to avoid:

Use BEGIN IMMEDIATE for any transaction that will write
Never upgrade from read-only to read-write mid-transaction
Implement retry logic with exponential backoff at the application level
Consider using a write queue/serialization for all writes
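
Application-level retry with exponential backoff might look like the sketch below. It assumes the driver exposes SQLite's result code as a string `code` property on the thrown error, and that runTransaction re-runs the whole transaction (ideally one opened with BEGIN IMMEDIATE). All names here are illustrative.

```typescript
// Delay before retry number `attempt` (0-based): base * 2^attempt, capped.
function backoffDelayMs(attempt: number, baseMs = 50, capMs = 2000): number {
  return Math.min(capMs, baseMs * Math.pow(2, attempt));
}

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Assumption: busy errors carry a `code` string beginning with "SQLITE_BUSY".
function isBusyError(err: unknown): boolean {
  const code = (err as { code?: unknown } | null | undefined)?.code;
  return typeof code === "string" && code.startsWith("SQLITE_BUSY");
}

// Re-runs the whole transaction on busy errors; rethrows anything else at once.
async function retryOnBusy<T>(
  runTransaction: () => T | Promise<T>,
  maxAttempts = 5,
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await runTransaction();
    } catch (err) {
      if (!isBusyError(err) || attempt + 1 >= maxAttempts) throw err;
      await sleep(backoffDelayMs(attempt));
    }
  }
}
```

Retrying only the failed statement is not enough: after SQLITE_BUSY inside a transaction, the transaction itself must be rolled back and started again.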

Warning signs:

Sporadic "database is locked" errors under load
Errors only appear with concurrent operations
busy_timeout appears ineffective

Phase to address: Data Layer (transaction design)

4. Checkpoint Starvation in SQLite

What goes wrong: The WAL file grows indefinitely, disk space runs out, and query performance degrades progressively.

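Why it happens: A checkpoint can only run to completion and recycle the WAL when no reader is still using older frames in it. If you have constant overlapping read traffic (e.g., polling for agent status), there is never a gap in which the WAL can be reset, and a single long-running read transaction blocks it for as long as it stays open.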

How to avoid:

Run a checkpoint periodically, e.g. db.pragma('wal_checkpoint(RESTART)') in better-sqlite3
Use PRAGMA wal_autocheckpoint=N with appropriate N
Avoid long-running read transactions
Monitor WAL file size and alert when it exceeds threshold
Consider read connection pooling with short-lived connections
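
WAL size monitoring needs nothing beyond the filesystem. The sketch below is one possible shape: walStatus and watchWal are hypothetical names, and the threshold and interval are placeholders to tune.

```typescript
import * as fs from "node:fs";

interface WalStatus {
  bytes: number;
  overLimit: boolean;
}

// Size of the -wal file next to dbPath; a missing or unreadable file counts as 0 bytes.
function walStatus(dbPath: string, limitBytes: number): WalStatus {
  let bytes = 0;
  try {
    bytes = fs.statSync(`${dbPath}-wal`).size;
  } catch {
    // No WAL file (or not readable): nothing to report.
  }
  return { bytes, overLimit: bytes > limitBytes };
}

// Polls on an interval and reports when the limit is exceeded.
// Returns a function that stops the polling.
function watchWal(
  dbPath: string,
  limitBytes: number,
  onOverLimit: (status: WalStatus) => void,
  intervalMs = 30000,
): () => void {
  const timer = setInterval(() => {
    const status = walStatus(dbPath, limitBytes);
    if (status.overLimit) onOverLimit(status);
  }, intervalMs);
  return () => clearInterval(timer);
}
```

The callback is a natural place to force a checkpoint and to log which readers have been open the longest.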

Warning signs:

-wal file growing continuously
Slowing read performance over time
Disk space warnings

Phase to address: Data Layer (connection management)

5. Graceful Shutdown Failures

What goes wrong: Agents are killed mid-operation, database writes are lost, file handles leak, and users lose work. Docker/Kubernetes sends SIGKILL after timeout.

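Why it happens: Node.js doesn't handle shutdown gracefully by default. server.close() stops accepting new connections but may leave idle keep-alive connections open (server.closeIdleConnections() and server.closeAllConnections() exist for this since Node.js 18.2). process.exit() abandons pending async operations. docker stop sends SIGTERM and follows up with SIGKILL after a 10-second grace period by default; Kubernetes does the same after 30 seconds by default.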

How to avoid:

Listen for SIGTERM and SIGINT: process.on('SIGTERM', shutdown)
Set process.exitCode instead of calling process.exit() directly
Track in-flight operations and wait for completion
Implement timeout: graceful shutdown attempt, then force after N seconds
For child processes: propagate signals to all children before exiting
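
Tracking in-flight work and bounding the wait can be combined as in the sketch below. InFlightTracker and installShutdownHandlers are hypothetical names, and the 8-second default is an assumption chosen to stay under Docker's 10-second grace period.

```typescript
class InFlightTracker {
  private pending = new Set<Promise<unknown>>();

  // Registers an operation and removes it again once it settles either way.
  track<T>(op: Promise<T>): Promise<T> {
    this.pending.add(op);
    const done = () => {
      this.pending.delete(op);
    };
    op.then(done, done);
    return op;
  }

  get size(): number {
    return this.pending.size;
  }

  // Resolves true if everything settled, false if the deadline passed first.
  drain(timeoutMs: number): Promise<boolean> {
    const settled = Promise.all(
      Array.from(this.pending, (p) => p.then(() => undefined, () => undefined)),
    ).then(() => true);
    return new Promise<boolean>((resolve) => {
      const timer = setTimeout(() => resolve(false), timeoutMs);
      settled.then((ok) => {
        clearTimeout(timer);
        resolve(ok);
      });
    });
  }
}

function installShutdownHandlers(tracker: InFlightTracker, graceMs = 8000): void {
  let shuttingDown = false;
  const shutdown = async (signal: string) => {
    if (shuttingDown) return; // ignore repeated signals
    shuttingDown = true;
    const clean = await tracker.drain(graceMs);
    if (clean) {
      // Let the event loop drain; the caller still has to close servers and watchers.
      process.exitCode = 0;
    } else {
      console.error(`${signal}: ${tracker.size} operations still pending after ${graceMs} ms`);
      process.exit(1); // last resort once the grace period is spent
    }
  };
  process.on("SIGTERM", () => void shutdown("SIGTERM"));
  process.on("SIGINT", () => void shutdown("SIGINT"));
}
```

Once a SIGTERM listener is installed, Node.js no longer exits on that signal by itself, so the handler is responsible for making the process end.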

Warning signs:

Data loss after restarts
"Port already in use" after crash
Orphaned temporary files
Corrupted state files

Phase to address: Core Architecture (process lifecycle)

6. Git Worktree Orphaning

What goes wrong: Worktrees become "orphaned" - git thinks they exist but they don't, or they exist but git doesn't know about them. Operations fail silently or with cryptic errors.

Why it happens: Worktrees are moved or deleted without git worktree move/remove. The parent repository is moved, which breaks the links between it and its worktrees. A worktree on unmounted network or portable storage looks missing and gets auto-pruned.

How to avoid:

Always use git worktree remove to delete worktrees
Run git worktree prune periodically to clean stale entries
Use git worktree lock for worktrees on removable storage
After moving the main repo, run git worktree repair
Never rm -rf a worktree directory directly
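
Stale entries can be detected before they cause failures by parsing the output of git worktree list --porcelain, which prints one attribute per line and marks entries with locked or prunable where that applies. The parser below is a sketch; the WorktreeEntry shape is an assumption for illustration.

```typescript
interface WorktreeEntry {
  path: string;
  head?: string;
  branch?: string;
  locked: boolean;
  prunable: boolean;
}

// Parses the text printed by `git worktree list --porcelain`.
function parseWorktreeList(porcelain: string): WorktreeEntry[] {
  const entries: WorktreeEntry[] = [];
  let current: WorktreeEntry | undefined;
  for (const line of porcelain.split(/\r?\n/)) {
    if (line.startsWith("worktree ")) {
      // Each record starts with its path.
      current = { path: line.slice("worktree ".length), locked: false, prunable: false };
      entries.push(current);
    } else if (!current) {
      continue;
    } else if (line.startsWith("HEAD ")) {
      current.head = line.slice("HEAD ".length);
    } else if (line.startsWith("branch ")) {
      current.branch = line.slice("branch ".length);
    } else if (line === "locked" || line.startsWith("locked ")) {
      current.locked = true;
    } else if (line === "prunable" || line.startsWith("prunable ")) {
      current.prunable = true;
    }
  }
  return entries;
}
```

Entries with prunable set are the ones git worktree prune would drop; surfacing them to the user first avoids silently losing track of an agent's workspace.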

Warning signs:

"fatal: not a git repository" in worktree
git worktree list shows paths that don't exist
Operations succeed in main repo but fail in worktree

Phase to address: Git Integration (worktree lifecycle)

7. File Watcher Race Conditions

What goes wrong: Files created between directory scan and watcher setup are never detected. Events fire before files are fully written. Multiple watcher instances conflict.

Why it happens: Chokidar scans directory contents, then sets up fs.watch - any file created between these operations is missed. Large files emit 'add' before write completes. Multiple chokidar instances (e.g., your app + Vite) interfere with each other.

How to avoid:

Use awaitWriteFinish: true for large files (polls until size stabilizes)
Call .close() with await - it's async and returns a Promise
Avoid multiple watcher instances on same directory
Increase fs.inotify.max_user_watches on Linux
Add delay/debounce for rapid file changes
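
The debounce advice can be sketched independently of any watcher library. ChangeCoalescer is a hypothetical helper: it collects paths from watcher events and delivers each distinct path once per quiet period, which also absorbs duplicate events for the same file.

```typescript
class ChangeCoalescer {
  private pending = new Set<string>();
  private timer: ReturnType<typeof setTimeout> | undefined;
  private onFlush: (paths: string[]) => void;
  private quietMs: number;

  constructor(onFlush: (paths: string[]) => void, quietMs = 100) {
    this.onFlush = onFlush;
    this.quietMs = quietMs;
  }

  // Call from every watcher event; each call restarts the quiet-period timer.
  push(filePath: string): void {
    this.pending.add(filePath);
    if (this.timer !== undefined) clearTimeout(this.timer);
    this.timer = setTimeout(() => this.flush(), this.quietMs);
  }

  // Delivers the collected paths now (also useful during shutdown).
  flush(): void {
    if (this.timer !== undefined) {
      clearTimeout(this.timer);
      this.timer = undefined;
    }
    if (this.pending.size === 0) return;
    const paths = Array.from(this.pending);
    this.pending.clear();
    this.onFlush(paths);
  }
}
```

A watcher's add and change handlers would both call push(path), so downstream code sees one batch per burst of writes.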

Warning signs:

Newly created files sometimes not detected
Events fire twice for same file
"ENOSPC: System limit for file watchers reached"
HMR/hot reload stops working

Phase to address: File System UI (watcher setup)

8. process.exit() Kills Async Operations

What goes wrong: Log messages never reach their destination. Database writes are lost. Cleanup callbacks don't complete.

Why it happens: process.exit() forces immediate termination even with pending I/O. Writes to process.stdout and process.stderr can be asynchronous, depending on the platform and on whether the stream is a file, TTY, or pipe. 'exit' event listeners can only run synchronous code.

How to avoid:

Set process.exitCode = N instead of calling process.exit(N)
Let the event loop drain naturally
Use async-cleanup or node-cleanup packages for async cleanup
Keep cleanup fast (< 2 seconds) so it finishes before the supervisor (Docker, systemd, Kubernetes) escalates to SIGKILL
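
The exitCode pattern is short enough to show in full. runMain is a hypothetical wrapper: it records the outcome and then lets the event loop drain, so buffered log writes and pending database work can finish.

```typescript
// Runs an async entry point and records its result without calling process.exit().
function runMain(main: () => Promise<number>): Promise<void> {
  return main().then(
    (code) => {
      process.exitCode = code;
    },
    (err) => {
      console.error(err);
      process.exitCode = 1;
    },
  );
}
```

A CLI entry file would end with runMain(main). The process then exits with the recorded code once nothing is left on the event loop, which means lingering timers, servers, or watchers must still be closed explicitly.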

Warning signs:

Log messages truncated or missing
Final state not persisted
"Cleanup completed" message never appears

Phase to address: Core Architecture (exit handling)

9. Cross-Platform Path Handling

What goes wrong: Paths work on macOS but break on Windows. Case-sensitivity issues cause deployment failures. Scripts fail with "command not found."

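Why it happens: Windows uses backslash, Unix uses forward slash. Linux filesystems are case-sensitive, while Windows and default macOS filesystems are case-insensitive, so a mis-cased import that resolves on a laptop fails on a Linux CI runner. spawn without shell: true can't run .cmd/.bat files on Windows. Environment variable syntax differs (SET vs export).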

How to avoid:

Always use path.join() and path.normalize() - never string concatenation
Use path.sep for platform-specific separators
Match the on-disk file name case exactly in require() and import statements
Use cross-env for environment variables in npm scripts
Use shell: true or explicitly handle .cmd files on Windows
Use rimraf instead of rm -rf
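
The path advice can be checked on any machine, because Node.js exposes both flavors as path.posix and path.win32. The path segments and the spawnOptionsFor helper below are illustrative assumptions.

```typescript
import * as path from "node:path";

const parts = ["workspace", "agents", "agent-1", "output.log"];

// The same segments joined under each platform's rules.
console.log(path.posix.join(...parts)); // workspace/agents/agent-1/output.log
console.log(path.win32.join(...parts)); // workspace\agents\agent-1\output.log

// normalize() also repairs mixed separators and ".." segments.
console.log(path.win32.normalize("workspace/agents\\..\\logs")); // workspace\logs

// Hypothetical helper: .cmd/.bat shims on Windows need a shell to run.
// Only use this for fixed commands, never for strings built from user input.
function spawnOptionsFor(platform: string): { shell: boolean } {
  return { shell: platform === "win32" };
}
```

In application code, plain path.join() picks the right flavor for the current platform; the explicit posix and win32 variants are mainly useful in tests.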

Warning signs:

"File not found" only on certain platforms
Import works locally but fails in CI
npm scripts fail on Windows

Phase to address: Core Architecture (all file operations)

10. Multi-Agent State Race Conditions

What goes wrong: Two agents update the same resource simultaneously. Plan ownership is ambiguous. State gets duplicated or lost between agents. Deadlocks occur when agents wait on each other.

Why it happens: Race conditions increase quadratically with agent count: N agents have N(N-1)/2 potential pairwise interactions, so 4 agents have 6 and 10 agents have 45. Tools contend when one agent reads a resource while another writes it. There is no clear source of truth for shared state.

How to avoid:

Define clear resource ownership - one agent owns each resource
Use optimistic locking with version numbers
Implement file/resource locking for shared assets
Design for eventual consistency where possible
Use event-driven patterns with message queues
Add comprehensive workflow checkpointing for rollback
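
Optimistic locking with version numbers can be illustrated with an in-memory store. VersionedStore is a hypothetical stand-in for whatever holds shared state; the point is that a write succeeds only if the writer saw the latest version.

```typescript
interface Versioned<T> {
  value: T;
  version: number;
}

class VersionedStore<T> {
  private records = new Map<string, Versioned<T>>();

  // Returns a copy so callers cannot mutate the stored record.
  read(key: string): Versioned<T> | undefined {
    const rec = this.records.get(key);
    return rec ? { value: rec.value, version: rec.version } : undefined;
  }

  // Writes only if expectedVersion matches the stored one; 0 means "create".
  write(key: string, value: T, expectedVersion: number): boolean {
    const currentVersion = this.records.get(key)?.version ?? 0;
    if (currentVersion !== expectedVersion) return false; // someone else wrote first
    this.records.set(key, { value, version: currentVersion + 1 });
    return true;
  }
}
```

An agent whose write returns false re-reads the record, reapplies its change, and tries again. With SQLite as the store, the same check is an UPDATE whose WHERE clause includes the expected version, followed by a check of the affected row count.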

Warning signs:

Inconsistent state after concurrent operations
Operations that "should have worked" show wrong results
Agents appear stuck waiting

Phase to address: Agent Orchestration (coordination protocol)

Technical Debt Patterns

Pattern	Symptom	Fix Cost (Early)	Fix Cost (Late)
No process tree tracking	Orphan accumulation	2h	2d (debugging + cleanup)
String path concatenation	Cross-platform breaks	1h	8h (find all instances)
Missing graceful shutdown	Data loss, port conflicts	4h	2d (state recovery)
Synchronous DB operations	UI freezes, timeouts	4h	1w (async refactor)
No WAL checkpoint strategy	Disk full, slow queries	2h	1d (emergency cleanup)
Implicit transactions	SQLITE_BUSY under load	2h	3d (transaction audit)
Direct worktree deletion	Orphan worktrees	1h	4h (manual repair)
Single watcher instance assumed	Silent event drops	2h	1d (debugging)

Integration Gotchas

Integration	Gotcha	Mitigation
SQLite + Network FS	WAL doesn't work over NFS/SMB - silent corruption	Only use local filesystems
SQLite + Docker volumes	Shared memory files can't be shared across containers	One container per database
Git worktree + Docker	Worktree paths break when container paths differ	Use consistent mount paths
chokidar + Vite	Multiple watchers conflict, HMR stops	Single watcher, or namespace paths
Node.js + Docker	PID 1 doesn't forward signals	Use `--init` or tini
npm scripts + signals	npm doesn't forward SIGTERM to children properly	Run node directly, not via npm
spawn + Windows	`.cmd` files need `shell: true`	Check platform, use shell on Windows
better-sqlite3 + threads	Synchronous API blocks the event loop while waiting on busy_timeout	Use worker threads for writes

Performance Traps

Trap	Impact	Solution
Polling SQLite for agent status	CPU burn, battery drain	Use file watchers or IPC
Not batching SQLite writes	1000x slower inserts	Wrap in transaction
Watching entire repo	inotify limit exhaustion	Watch specific directories
Synchronous file operations	Event loop blocked	Use fs.promises, worker threads
Spawning shell for every command	10-100ms overhead per spawn	Use `shell: false` where possible
No connection reuse	Connection overhead per query	Reuse database connection
Logging to file synchronously	I/O blocks main thread	Async logging, buffer writes

Security Mistakes

Mistake	Risk	Mitigation
Storing credentials in SQLite	Database theft exposes secrets	Use OS keychain, env vars
Executing user input in shell	Command injection	Never `shell: true` with user input
World-readable socket files	Unauthorized access	Set restrictive permissions
No validation on IPC messages	Malicious command execution	Schema validation, allowlists
Logging sensitive data	Credential exposure in logs	Redact tokens, keys, passwords
Worktree in /tmp	Other users can access	Use user-specific directories
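
For the IPC row above, schema validation with an allowlist can be as small as the sketch below. The message shapes and the agent ID pattern are assumptions made for illustration, not the project's protocol.

```typescript
// Hypothetical allowlist of commands the daemon accepts over IPC.
type AgentCommand =
  | { type: "start"; agentId: string }
  | { type: "stop"; agentId: string }
  | { type: "status" };

// Assumed ID format: lowercase letters, digits, and hyphens, at most 64 characters.
const AGENT_ID = /^[a-z0-9][a-z0-9-]{0,63}$/;

// Returns a typed command, or null for anything that is not on the allowlist.
function parseCommand(raw: string): AgentCommand | null {
  let msg: unknown;
  try {
    msg = JSON.parse(raw);
  } catch {
    return null; // not JSON at all
  }
  if (typeof msg !== "object" || msg === null) return null;
  const { type, agentId } = msg as { type?: unknown; agentId?: unknown };
  if (type === "status") return { type: "status" };
  if (typeof agentId !== "string" || !AGENT_ID.test(agentId)) return null;
  if (type === "start") return { type: "start", agentId };
  if (type === "stop") return { type: "stop", agentId };
  return null;
}
```

Because the function rebuilds the command from known fields, extra properties on the incoming message never reach the code that acts on it.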

UX Pitfalls

Pitfall	User Experience	Fix
Silent failures	User thinks it worked	Always report errors clearly
No progress indication	User thinks it's frozen	Show spinners, progress bars
Cryptic error messages	User can't fix the problem	Include actionable guidance
Requiring too many flags	High learning curve	Interactive mode for discovery
Inconsistent command names	User guesses wrong	Follow conventions (git-style)
No examples in help	User doesn't know how to start	Show common use cases first
Breaking changes in config	User's setup stops working	Version configs, migrate automatically
Long startup time	User abandons tool	Lazy loading, fast path for common cases

"Looks Done But Isn't" Checklist

Process cleanup: Verified all child processes terminate on parent exit
Signal handling: SIGTERM, SIGINT handled; cleanup completes before SIGKILL timeout
Database integrity: WAL files included in backups; checkpoint strategy defined
Concurrent writes: Tested with multiple agents writing simultaneously
Worktree cleanup: Verified git worktree remove used, not rm -rf
File watcher coverage: Tested files created during watcher startup
Cross-platform: Tested on Windows (paths, signals, shell commands)
Error messages: All errors have actionable guidance
Graceful degradation: System handles missing files, network failures
Resource limits: Tested with max expected agents/worktrees/watchers
Restart recovery: System recovers correct state after crash/kill
Long-running stability: No memory leaks, file handle leaks over hours

Recovery Strategies

Failure	Detection	Recovery
Orphan processes	`ps` shows unexpected node processes	Kill process tree, track PIDs
Corrupted SQLite	"malformed" error on query	Restore from backup, rebuild
Orphan worktrees	`git worktree list` shows missing paths	`git worktree prune`
Stuck file watcher	Events stop firing	Restart watcher, check limits
Checkpoint starvation	WAL file > 100MB	Force checkpoint, close long readers
Zombie agents	Agent shows running but unresponsive	Track heartbeats, timeout + kill
State inconsistency	Agents report different state	Single source of truth, reconcile
Port in use	"EADDRINUSE" on restart	Track PID file, kill stale process
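
The heartbeat-and-timeout recovery for zombie agents can be sketched as follows. HeartbeatMonitor is a hypothetical helper, and the clock is injectable so the logic can be tested without waiting.

```typescript
class HeartbeatMonitor {
  private lastSeen = new Map<string, number>();
  private timeoutMs: number;
  private now: () => number;

  constructor(timeoutMs: number, now: () => number = Date.now) {
    this.timeoutMs = timeoutMs;
    this.now = now;
  }

  // Called whenever an agent reports in.
  beat(agentId: string): void {
    this.lastSeen.set(agentId, this.now());
  }

  // Called when an agent exits normally.
  forget(agentId: string): void {
    this.lastSeen.delete(agentId);
  }

  // Agents whose last heartbeat is older than the timeout.
  stale(): string[] {
    const cutoff = this.now() - this.timeoutMs;
    const out: string[] = [];
    this.lastSeen.forEach((seen, id) => {
      if (seen < cutoff) out.push(id);
    });
    return out;
  }
}
```

A supervisor loop would call stale() on an interval, kill the process tree of each listed agent, and then forget() it.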

Pitfall-to-Phase Mapping

Phase	Must Address
Core Architecture	Process lifecycle, signal handling, exit handling, cross-platform paths
Data Layer	SQLite WAL handling, transactions, checkpoint strategy, backups
Git Integration	Worktree lifecycle, repair/prune, lock handling
Agent Orchestration	Process tree tracking, state ownership, race condition prevention
File System UI	Watcher setup, race conditions, resource limits
CLI Interface	Error messages, interactive mode, help examples
