fix: Prevent lost task completions after server restart

Three bugs causing empty phase diffs when server restarts during agent
execution:

1. Startup ordering race: reconcileAfterRestart() emitted agent:stopped
   before orchestrator registered listeners — events lost. Moved
   reconciliation to after orchestrator.start().

2. Stuck in_progress tasks: recoverDispatchQueues() only re-queued
   pending tasks. Added recovery for in_progress tasks whose agents
   are dead (not running/waiting_for_input).

3. Branch force-reset destroys work: git branch -f wiped commits when
   a second agent was dispatched for the same task. Now checks if the
   branch has commits beyond baseBranch before resetting.

Also adds:
- agent:crashed handler with auto-retry (MAX_TASK_RETRIES=3)
- retryCount column on tasks table + migration
- retryCount reset on manual retryBlockedTask()
This commit is contained in:
Lukas May
2026-03-06 12:19:59 +01:00
parent a69527b7d6
commit eac03862e3
9 changed files with 94 additions and 13 deletions

View File

@@ -112,9 +112,22 @@ InitiativeChangesRequestedEvent { initiativeId, phaseId, taskId }
| Event | Action |
|-------|--------|
| `phase:queued` | Dispatch ready phases → dispatch their tasks to idle agents |
| `agent:stopped` | Re-dispatch queued tasks (freed agent slot) |
| `agent:stopped` | Auto-complete task (unless user_requested), re-dispatch queued tasks (freed agent slot) |
| `agent:crashed` | Auto-retry crashed task up to `MAX_TASK_RETRIES` (3). Increments `retryCount`, resets status to `pending`, re-queues. Exceeding retries leaves task `in_progress` for manual intervention. |
| `task:completed` | Merge task branch (if branch exists), check phase completion, dispatch next queued task |
### Crash Recovery
When an agent crashes (`agent:crashed` event), the orchestrator automatically retries the task:
1. Finds the task associated with the crashed agent
2. Checks `task.retryCount` against `MAX_TASK_RETRIES` (3)
3. If under limit: increments `retryCount`, resets task to `pending`, re-queues for dispatch
4. If over limit: logs warning, task stays `in_progress` for manual intervention
On server restart, `recoverDispatchQueues()` also recovers stuck `in_progress` tasks whose agents are dead (status is not `running` or `waiting_for_input`). These are reset to `pending` and re-queued.
Manual retry via `retryBlockedTask()` resets `retryCount` to 0, giving the task a fresh set of automatic retries.
### Coalesced Scheduling
Multiple rapid events (e.g. several `phase:queued` from `queueAllPhases`) are coalesced into a single async dispatch cycle via `scheduleDispatch()`. The cycle loops `dispatchNextPhase()` + `dispatchNext()` until both queues are drained, then re-runs if new events arrived during execution.