fix: Prevent lost task completions after server restart
Three bugs causing empty phase diffs when server restarts during agent execution: 1. Startup ordering race: reconcileAfterRestart() emitted agent:stopped before orchestrator registered listeners — events lost. Moved reconciliation to after orchestrator.start(). 2. Stuck in_progress tasks: recoverDispatchQueues() only re-queued pending tasks. Added recovery for in_progress tasks whose agents are dead (not running/waiting_for_input). 3. Branch force-reset destroys work: git branch -f wiped commits when a second agent was dispatched for the same task. Now checks if the branch has commits beyond baseBranch before resetting. Also adds: - agent:crashed handler with auto-retry (MAX_TASK_RETRIES=3) - retryCount column on tasks table + migration - retryCount reset on manual retryBlockedTask()
This commit is contained in:
@@ -112,9 +112,22 @@ InitiativeChangesRequestedEvent { initiativeId, phaseId, taskId }
|
||||
| Event | Action |
|
||||
|-------|--------|
|
||||
| `phase:queued` | Dispatch ready phases → dispatch their tasks to idle agents |
|
||||
| `agent:stopped` | Re-dispatch queued tasks (freed agent slot) |
|
||||
| `agent:stopped` | Auto-complete task (unless user_requested), re-dispatch queued tasks (freed agent slot) |
|
||||
| `agent:crashed` | Auto-retry crashed task up to `MAX_TASK_RETRIES` (3). Increments `retryCount`, resets status to `pending`, re-queues. Exceeding retries leaves task `in_progress` for manual intervention. |
|
||||
| `task:completed` | Merge task branch (if branch exists), check phase completion, dispatch next queued task |
|
||||
|
||||
### Crash Recovery
|
||||
|
||||
When an agent crashes (`agent:crashed` event), the orchestrator automatically retries the task:
|
||||
1. Finds the task associated with the crashed agent
|
||||
2. Checks `task.retryCount` against `MAX_TASK_RETRIES` (3)
|
||||
3. If under limit: increments `retryCount`, resets task to `pending`, re-queues for dispatch
|
||||
4. If over limit: logs warning, task stays `in_progress` for manual intervention
|
||||
|
||||
On server restart, `recoverDispatchQueues()` also recovers stuck `in_progress` tasks whose agents are dead (status is not `running` or `waiting_for_input`). These are reset to `pending` and re-queued.
|
||||
|
||||
Manual retry via `retryBlockedTask()` resets `retryCount` to 0, giving the task a fresh set of automatic retries.
|
||||
|
||||
### Coalesced Scheduling
|
||||
|
||||
Multiple rapid events (e.g. several `phase:queued` from `queueAllPhases`) are coalesced into a single async dispatch cycle via `scheduleDispatch()`. The cycle loops `dispatchNextPhase()` + `dispatchNext()` until both queues are drained, then re-runs if new events arrived during execution.
|
||||
|
||||
Reference in New Issue
Block a user