# Crash Marking Race Condition Fix

## Problem

Agents were incorrectly marked as `crashed` even though they had completed successfully and written a valid `signal.json`. The cause was a **race condition** in the output polling logic.

### Root Cause

In `src/agent/output-handler.ts`, the `handleCompletion()` method had two code paths for completion detection:

1. **Main path** (line 273): Checked for `signal.json` completion - ✅ **WORKED**
2. **Error path** (line 306): Marked agent as crashed when no new output detected - ❌ **PROBLEM**

The race condition occurred when:
1. Agent completes and writes `signal.json`
2. Polling happens before all output is flushed to disk
3. No new output detected in `output.jsonl`
4. Error path triggers `handleAgentError()` → marks agent as crashed
5. Completion detection never runs because agent already marked as crashed
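
The ordering problem can be illustrated with a small model. This is a simplified sketch, not the real `handleCompletion()`: the `PollState` and `Outcome` types and both function names are hypothetical, and the model assumes the main path is taken only when new output was seen.

```typescript
// Hypothetical, simplified model of a single polling decision.
type PollState = { hasNewOutput: boolean; signalFilePresent: boolean };
type Outcome = 'completed' | 'crashed' | 'undetermined';

// Pre-fix ordering: with no new output, the error path fires
// without ever looking at signal.json.
function decideBeforeFix(state: PollState): Outcome {
  if (!state.hasNewOutput) {
    return 'crashed';
  }
  // Main path: signal.json is consulted. What happens without a signal
  // file is outside the scope of this model.
  return state.signalFilePresent ? 'completed' : 'undetermined';
}

// Post-fix ordering: the error path consults signal.json first and
// only falls through to "crashed" when no signal file is present.
function decideAfterFix(state: PollState): Outcome {
  if (!state.hasNewOutput) {
    return state.signalFilePresent ? 'completed' : 'crashed';
  }
  return state.signalFilePresent ? 'completed' : 'undetermined';
}

// The racing case: signal.json is written, but no new output is visible yet.
const raced: PollState = { hasNewOutput: false, signalFilePresent: true };
console.log(decideBeforeFix(raced)); // crashed
console.log(decideAfterFix(raced)); // completed
```

In this model, an agent with neither new output nor a signal file is reported as crashed by both functions; only the racing case changes.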

### Specific Case: slim-wildebeest

- **Agent ID**: `t9itQywbC0aZBZyc_SL0V`
- **Status**: `crashed` (incorrect)
- **Exit Code**: `NULL` (agent marked crashed before process exit)
- **Signal File**: Valid `signal.json` with `status: "questions"` ✅
- **Problem**: Error path triggered before completion detection

## Solution

**Added completion detection in the error path** before marking agent as crashed.

### Code Change

In `src/agent/output-handler.ts` around line 305:

```typescript
// BEFORE (line 306)
log.warn({ agentId }, 'no result text from stream or file');
await this.handleAgentError(agentId, new Error('No output received'), provider, getAgentWorkdir);

// AFTER
// Before marking as crashed, check if the agent actually completed successfully
const agentWorkdir = getAgentWorkdir(agentId);
if (await this.checkSignalCompletion(agentWorkdir)) {
  const signalPath = join(agentWorkdir, '.cw/output/signal.json');
  const signalContent = await readFile(signalPath, 'utf-8');
  log.debug({ agentId, signalPath }, 'detected completion via signal.json in error path');
  this.filePositions.delete(agentId); // Clean up tracking
  await this.processSignalAndFiles(agentId, signalContent, agent.mode as AgentMode, getAgentWorkdir, active?.streamSessionId);
  return;
}

log.warn({ agentId }, 'no result text from stream or file');
await this.handleAgentError(agentId, new Error('No output received'), provider, getAgentWorkdir);
```

### Logic

The fix adds a **final completion check** right before an agent is marked as crashed. If the agent's workdir contains a valid `signal.json` with status `done`, `questions`, or `error`, the handler processes the completion instead of marking the agent as crashed.
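
The body of `checkSignalCompletion()` is not shown in this document. As a rough sketch of what "valid `signal.json`" could mean, the hypothetical helper below (`parseCompletionStatus` is an illustrative name, not part of the codebase) accepts the raw file content and returns the status only when it is one of the three listed values. It assumes the file is a JSON object with a top-level `status` field.

```typescript
// Statuses listed above as valid completion signals.
const COMPLETION_STATUSES = ['done', 'questions', 'error'] as const;
type CompletionStatus = (typeof COMPLETION_STATUSES)[number];

// Hypothetical helper: returns the status when `raw` is a valid
// completion signal, or null otherwise.
function parseCompletionStatus(raw: string): CompletionStatus | null {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    // A partially written or corrupt file is treated as "not complete".
    return null;
  }
  if (typeof parsed !== 'object' || parsed === null) {
    return null;
  }
  const status = (parsed as { status?: unknown }).status;
  return COMPLETION_STATUSES.find((s) => s === status) ?? null;
}

console.log(parseCompletionStatus('{"status":"questions"}')); // questions
console.log(parseCompletionStatus('{"status":"running"}')); // null
console.log(parseCompletionStatus('{"status":')); // null
```

Treating unparseable content as "not complete" keeps the crash path as the fallback, so a truncated signal file cannot be mistaken for a successful completion.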

## Impact

- ✅ **Prevents false crash marking** for agents that completed successfully
- ✅ **Negligible performance impact** - the extra check only runs on the error path, when no new output is detected (rare)
- ✅ **Backward compatible** - still marks truly crashed agents as crashed
- ✅ **Comprehensive** - uses same robust `checkSignalCompletion()` logic as main path

## Testing

Verified with completion detection tests:
```bash
npm test -- src/agent/completion-detection.test.ts
```

Manual verification showed that slim-wildebeest's `signal.json` would be detected:
- Signal file exists: `true`
- Status: `questions`
- Should complete: `true` ✅

## Future Improvements

1. **Unified completion detection** - Consider consolidating all completion logic into a single method
2. **Enhanced logging** - Add more detailed logs for completion vs crash decisions
3. **Metrics** - Track completion detection success rates to identify remaining edge cases
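
As one possible starting point for the metrics idea, the sketch below counts how each agent's outcome was determined. Everything here is hypothetical: the `CompletionMetrics` class, the `DetectionPath` labels, and the `recoveryRate()` definition are illustrative and do not exist in the codebase.

```typescript
// Hypothetical labels for how an agent's final state was determined.
type DetectionPath = 'main' | 'error-path-recovery' | 'crash';

class CompletionMetrics {
  private counts: Record<DetectionPath, number> = {
    main: 0,
    'error-path-recovery': 0,
    crash: 0,
  };

  // Record one finished agent under the path that decided its outcome.
  record(path: DetectionPath): void {
    this.counts[path] += 1;
  }

  // Share of finished agents whose completion was only caught by the
  // error-path check; returns 0 when nothing has been recorded.
  recoveryRate(): number {
    const total =
      this.counts.main + this.counts['error-path-recovery'] + this.counts.crash;
    return total === 0 ? 0 : this.counts['error-path-recovery'] / total;
  }

  // Copy of the raw counters, e.g. for inclusion in a structured log line.
  snapshot(): Record<DetectionPath, number> {
    return { ...this.counts };
  }
}

const metrics = new CompletionMetrics();
metrics.record('main');
metrics.record('main');
metrics.record('error-path-recovery');
metrics.record('crash');
console.log(metrics.recoveryRate()); // 0.25
```

A rising recovery rate would suggest the race is being hit more often, which could help prioritize the unified completion detection work.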

## Related Files

- `src/agent/output-handler.ts` - Main fix location
- `src/agent/completion-detection.test.ts` - Existing test coverage
- `src/agent/manager.ts` - Secondary crash handling (different logic)