fix: Prevent lost task completions after server restart

Three bugs causing empty phase diffs when server restarts during agent execution:

1. Startup ordering race: reconcileAfterRestart() emitted agent:stopped before orchestrator registered listeners; events lost. Moved reconciliation to after orchestrator.start().
2. Stuck in_progress tasks: recoverDispatchQueues() only re-queued pending tasks. Added recovery for in_progress tasks whose agents are dead (not running/waiting_for_input).
3. Branch force-reset destroys work: git branch -f wiped commits when a second agent was dispatched for the same task. Now checks if the branch has commits beyond baseBranch before resetting.

Also adds:

- agent:crashed handler with auto-retry (MAX_TASK_RETRIES=3)
- retryCount column on tasks table + migration
- retryCount reset on manual retryBlockedTask()
2026-03-06 12:19:59 +01:00
parent a69527b7d6
commit eac03862e3
9 changed files with 94 additions and 13 deletions
--- a/docs/dispatch-events.md
+++ b/docs/dispatch-events.md
@@ -112,9 +112,22 @@ InitiativeChangesRequestedEvent { initiativeId, phaseId, taskId }
 | Event | Action |
 |-------|--------|
 | `phase:queued` | Dispatch ready phases → dispatch their tasks to idle agents |
-| `agent:stopped` | Re-dispatch queued tasks (freed agent slot) |
+| `agent:stopped` | Auto-complete task (unless user_requested), re-dispatch queued tasks (freed agent slot) |
+| `agent:crashed` | Auto-retry crashed task up to `MAX_TASK_RETRIES` (3). Increments `retryCount`, resets status to `pending`, re-queues. Exceeding retries leaves task `in_progress` for manual intervention. |
 | `task:completed` | Merge task branch (if branch exists), check phase completion, dispatch next queued task |

+### Crash Recovery
+
+When an agent crashes (`agent:crashed` event), the orchestrator automatically retries the task:
+1. Finds the task associated with the crashed agent
+2. Checks `task.retryCount` against `MAX_TASK_RETRIES` (3)
+3. If under limit: increments `retryCount`, resets task to `pending`, re-queues for dispatch
+4. If over limit: logs warning, task stays `in_progress` for manual intervention
+
+On server restart, `recoverDispatchQueues()` also recovers stuck `in_progress` tasks whose agents are dead (status is not `running` or `waiting_for_input`). These are reset to `pending` and re-queued.
+
+Manual retry via `retryBlockedTask()` resets `retryCount` to 0, giving the task a fresh set of automatic retries.
+
 ### Coalesced Scheduling

 Multiple rapid events (e.g. several `phase:queued` from `queueAllPhases`) are coalesced into a single async dispatch cycle via `scheduleDispatch()`. The cycle loops `dispatchNextPhase()` + `dispatchNext()` until both queues are drained, then re-runs if new events arrived during execution.
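
For readers of this commit, the Crash Recovery steps documented in the hunk above could be sketched roughly as follows. This is not the orchestrator's actual code: the `Task` shape, the in-memory `queue`, and the function names `handleAgentCrashedSketch` and `retryBlockedTaskSketch` are hypothetical stand-ins. Only `MAX_TASK_RETRIES` (3), `retryCount`, and the `pending` / `in_progress` statuses come from the documentation. The sketch also assumes that a task whose `retryCount` has already reached the limit is not retried again, since the docs do not spell out that boundary.

```typescript
// Hypothetical types; the real task model is not shown in this diff.
type TaskStatus = "pending" | "in_progress" | "completed";

interface Task {
  id: string;
  status: TaskStatus;
  retryCount: number;
}

// Limit stated in the documentation.
const MAX_TASK_RETRIES = 3;

// Sketch of the agent:crashed handling for the task owned by the crashed agent.
// Returns true if the task was re-queued, false if it was left for manual intervention.
function handleAgentCrashedSketch(
  task: Task,
  queue: string[],
  log: (msg: string) => void = console.warn,
): boolean {
  // Assumption: reaching the limit counts as exhausted.
  if (task.retryCount >= MAX_TASK_RETRIES) {
    log(`task ${task.id} exceeded ${MAX_TASK_RETRIES} retries; leaving it in_progress`);
    return false;
  }
  task.retryCount += 1;    // record the automatic retry
  task.status = "pending"; // make the task dispatchable again
  queue.push(task.id);     // re-queue for dispatch
  return true;
}

// Sketch of the manual retry path: a fresh budget of automatic retries.
// Resetting the status and re-queuing here is an assumption, not confirmed by the docs.
function retryBlockedTaskSketch(task: Task, queue: string[]): void {
  task.retryCount = 0;
  task.status = "pending";
  queue.push(task.id);
}
```

Under these assumptions a task is retried automatically at most three times after crashes, and a manual retry restores the full budget.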