Run Lifecycle and Phases
This page describes the state machines that govern run execution and tool call processing, the phase enum, termination conditions, and checkpoint triggers.
RunStatus
Section titled “RunStatus”A run’s coarse lifecycle is captured by RunStatus:
Created --+--> Running --+--> Waiting --+--> Running (resume) | | | +--> Done +--> Done +--> Donepub enum RunStatus { Created, // Persisted but not yet executing Running, // Actively executing Waiting, // Paused, waiting for external decisions Done, // Terminal -- cannot transition further}Created -> Running: the runtime picks up the run and begins executing phases.Created -> Done: the run was cancelled or rejected before any phase ran.Running -> Waiting: a tool call suspends, the run pauses for external input.Waiting -> Running: decisions arrive, the run resumes.Running -> DoneorWaiting -> Done: terminal transition on completion, cancellation, or error.Done -> *: not allowed. Terminal state.
ToolCallStatus
Section titled “ToolCallStatus”Each tool call in a run has its own lifecycle:
New --> Running --+--> Succeeded (terminal) +--> Failed (terminal) +--> Cancelled (terminal) +--> Suspended --> Resuming --+--> Running +--> Suspended (re-suspend) +--> Succeeded/Failed/Cancelledpub enum ToolCallStatus { New, // Created, not yet executing Running, // Currently executing Suspended, // Waiting for external decision Resuming, // Decision received, about to re-execute Succeeded, // Completed successfully (terminal) Failed, // Completed with error (terminal) Cancelled, // Cancelled externally (terminal)}Key transitions:
Suspendedcan only move toResumingorCancelled— it cannot jump directly toRunningor a success/failure state.Resuminghas wide transitions: it can re-enterRunning, re-suspend, or reach any terminal state.- Terminal states (
Succeeded,Failed,Cancelled) cannot transition to any non-self state.
Phase Enum
Section titled “Phase Enum”The Phase enum defines the nine execution phases in order:
pub enum Phase { RunStart, StepStart, BeforeInference, AfterInference, ToolGate, BeforeToolExecute, AfterToolExecute, StepEnd, RunEnd,}RunStart — fires once at the beginning of a run. Plugins initialize run-scoped state.
StepStart — fires at the beginning of each inference round. Step counter increments.
BeforeInference — last chance to modify the inference request (system prompt, tools, parameters). Plugins can skip inference by setting a behavior flag.
AfterInference — fires after the LLM response arrives. Plugins can inspect the response, modify tool call lists, or request termination.
ToolGate — fires before each tool call is allowed to execute. It is a pure decision phase used for allow / block / suspend / set-result interception, and the runtime may re-evaluate it after earlier tool calls commit state.
BeforeToolExecute — fires only for tool calls that will actually execute. Use it for execution-time side effects such as metrics, counters, or preflight state updates.
AfterToolExecute — fires after tool results are available. Plugins can inspect results and trigger side effects.
StepEnd — fires at the end of each inference round. Checkpoint persistence happens here. Stop conditions (max rounds, token budget, loop detection) are evaluated.
RunEnd — fires once when the run terminates, regardless of reason. Cleanup and final state persistence.
TerminationReason
Section titled “TerminationReason”When a run ends, the TerminationReason records why:
pub enum TerminationReason { NaturalEnd, // LLM returned no tool calls BehaviorRequested, // A plugin requested inference skip Stopped(StoppedReason), // A stop condition fired (code + optional detail) Cancelled, // External cancellation signal Blocked(String), // Permission checker blocked the run Suspended, // Waiting for external tool-call resolution Error(String), // Error path}TerminationReason::to_run_status() maps each variant to the appropriate RunStatus:
Suspendedmaps toRunStatus::Waiting(the run can resume).- All other variants map to
RunStatus::Done.
Stop Conditions
Section titled “Stop Conditions”Declarative stop conditions are configured per agent via StopConditionSpec:
| Variant | Trigger |
|---|---|
MaxRounds { rounds } | Step count exceeds limit |
Timeout { seconds } | Wall-clock time exceeds limit |
TokenBudget { max_total } | Cumulative token usage exceeds budget |
ConsecutiveErrors { max } | Sequential tool errors exceed threshold |
StopOnTool { tool_name } | A specific tool is called |
ContentMatch { pattern } | LLM output matches a regex pattern |
LoopDetection { window } | Repeated identical tool calls within a sliding window |
Stop conditions are evaluated at StepEnd. When one fires, the run terminates with TerminationReason::Stopped.
Checkpoint Triggers
Section titled “Checkpoint Triggers”State is persisted at StepEnd after each inference round. The checkpoint includes:
- Thread messages (append-only)
- Run lifecycle state (
RunStatus, step count, termination reason) - Persistent state keys (those registered with
persistent: true) - Tool call states for suspended calls
Checkpoints enable resume from the last completed step after a crash or intentional suspension.
RunStatus Derived from ToolCall States
Section titled “RunStatus Derived from ToolCall States”A run’s status is a projection of all its tool call states. Each tool call has an independent lifecycle; the run status is the aggregate:
fn derive_run_status(calls: &HashMap<String, ToolCallState>) -> RunStatus { let mut has_suspended = false; for state in calls.values() { match state.status { // Any Running or Resuming call → Run is still executing ToolCallStatus::Running | ToolCallStatus::Resuming => { return RunStatus::Running; } ToolCallStatus::Suspended => { has_suspended = true; } // Succeeded / Failed / Cancelled are terminal — keep checking _ => {} } } if has_suspended { RunStatus::Waiting // No executing calls, but some await decisions } else { RunStatus::Done // All calls in terminal state → step complete }}Decision table:
Any Running/Resuming? | Any Suspended? | Run Status | Meaning |
|---|---|---|---|
| Yes | — | Running | Tools are actively executing |
| No | Yes | Waiting | All execution done, awaiting external decisions |
| No | No | Done | All calls terminal → proceed to next step |
Tool-call state timeline
Section titled “Tool-call state timeline”When an LLM returns multiple tool calls (for example, [tool_A, tool_B, tool_C]), each call has its own state. Suspended calls can wait for decisions
while allowed calls continue through the configured executor:
Time tool_A(approval-req) tool_B(approval-req) tool_C(normal) → Run Status────────────────────────────────────────────────────────────────t0 Created Created Created Running Step startst1 Suspended Created Running Running tool_A interceptedt2 Suspended Suspended Running Running tool_B intercepted, tool_C executingt3 Suspended Suspended Succeeded Waiting tool_C done, no Running callst4 Resuming Suspended Succeeded Running tool_A decision arrivest5 Succeeded Suspended Succeeded Waiting tool_A replay donet6 Succeeded Resuming Succeeded Running tool_B decision arrivest7 Succeeded Succeeded Succeeded Done All terminal → next stepAt every transition the run status is re-derived from the aggregate of all call
states. This means a single decision arriving does not end the wait — the run
stays in Waiting until all suspended calls are resolved.
Suspension Bridges Run and Tool-Call Layers
Section titled “Suspension Bridges Run and Tool-Call Layers”Current execution model
Section titled “Current execution model”Tool execution is split into three stages inside the step runner:
Stage 1 — ToolGate (serial, per-call): for each call: ToolGate hooks → resolve allow / block / suspend / set-result Suspend? → mark Suspended, set suspended=true, continue Block? → mark Failed, return immediately SetResult → mark with provided result, continue None → add to allowed_calls
Stage 2 — BeforeToolExecute (allowed_calls only): run execution-time hooks once for the calls that will execute
Stage 3 — Execute (allowed_calls only): Sequential mode: one by one, break on first suspension Parallel mode: batch execute, collect all resultsAfter both phases, if suspended == true, the step returns
StepOutcome::Suspended. The orchestrator then:
- Persists checkpoint (messages, tool call states)
- Emits
RunFinish(Suspended)to protocol encoders - Enters
wait_for_resume_or_cancelloop
wait_for_resume_or_cancel loop
Section titled “wait_for_resume_or_cancel loop”loop { let decisions = decision_rx.next().await; // block until decisions arrive emit_decision_events_and_messages(decisions); prepare_resume(decisions); // Suspended → Resuming detect_and_replay_resume(); // re-execute Resuming calls if !has_suspended_calls() { return WaitOutcome::Resumed; // all resolved → exit wait } // Some calls still Suspended → continue waiting}Key properties:
- The loop handles partial resume: if only tool_A’s decision arrives but tool_B is still suspended, tool_A is replayed immediately and the loop continues waiting for tool_B.
- Decisions can arrive in batches or one at a time.
- On
WaitOutcome::Resumed, the orchestrator re-enters the step loop for the next LLM inference round.
Resume replay
Section titled “Resume replay”detect_and_replay_resume scans ToolCallStates for calls with
status == Resuming and re-executes them through the standard tool pipeline.
The arguments field already reflects the resume mode (set by
prepare_resume):
| Resume Mode | Arguments on Replay | Behavior |
|---|---|---|
ReplayToolCall | Original arguments | Full re-execution |
UseDecisionAsToolResult | Decision result | A ToolGateHook returns ToolInterceptPayload::SetResult during replay |
PassDecisionToTool | Decision result | Tool receives decision as arguments |
Already-completed calls (Succeeded, Failed, Cancelled) are skipped.
Decisions during tool execution
Section titled “Decisions during tool execution”The default resolver installs SequentialToolExecutor, so allowed tool calls
run one at a time and decisions that arrive while a tool is still executing sit
in the decision channel until the step enters wait_for_resume_or_cancel.
ParallelToolExecutor can execute an allowed batch concurrently when installed
through a custom resolver or ResolvedAgent::with_tool_executor(...). Even in
that mode, the built-in resume loop consumes decisions after a step suspends,
prepares resume state, and replays suspended calls through the standard
ToolGate -> BeforeToolExecute -> tool -> AfterToolExecute pipeline. The
contract-level DecisionReplayPolicy values describe coordination behavior for
custom parallel HITL integrations; they are not an AgentSpec field.
Protocol Adapter: SSE Reconnection
Section titled “Protocol Adapter: SSE Reconnection”A backend run may span multiple frontend SSE connections. This is especially relevant for the AI SDK v6 protocol, where each HTTP request corresponds to one “turn” and produces one SSE stream.
Problem
Section titled “Problem”Turn 1 (user message): HTTP POST → SSE stream 1 → events flow → tool suspends → RunFinish(Suspended) → finish event → SSE stream 1 closes
The run is still alive in wait_for_resume_or_cancel. But the event_tx channel from SSE stream 1 has been dropped.
Turn 2 (tool output / approval): HTTP POST → SSE stream 2 → decision delivered to orchestrator → orchestrator resumes → emits events → events go to... the dropped event_tx? Lost.Solution: ReconnectableEventSink
Section titled “Solution: ReconnectableEventSink”Replace the fixed ChannelEventSink with a reconnectable wrapper that allows
swapping the underlying channel sender:
struct ReconnectableEventSink { inner: Arc<tokio::sync::Mutex<mpsc::UnboundedSender<AgentEvent>>>,}
impl ReconnectableEventSink { fn new(tx: mpsc::UnboundedSender<AgentEvent>) -> Self { Self { inner: Arc::new(tokio::sync::Mutex::new(tx)) } }
/// Replace the underlying channel. Called when a new SSE connection /// arrives for an existing suspended run. async fn reconnect(&self, new_tx: mpsc::UnboundedSender<AgentEvent>) { *self.inner.lock().await = new_tx; }}
#[async_trait]impl EventSink for ReconnectableEventSink { async fn emit(&self, event: AgentEvent) { let _ = self.inner.lock().await.send(event); }}Reconnection flow
Section titled “Reconnection flow”Turn 1: submit() → create (event_tx1, event_rx1) → ReconnectableEventSink(event_tx1) → spawn_execution (run starts) events → event_tx1 → event_rx1 → SSE stream 1 tool suspends → finish(tool-calls) → SSE stream 1 closes event_tx1 still held by ReconnectableEventSink (sends fail silently) run alive in wait_for_resume_or_cancel
Turn 2: new HTTP request with decisions arrives create (event_tx2, event_rx2) sink.reconnect(event_tx2) ← swap channel send_decision → decision channel → orchestrator resumes events → ReconnectableEventSink → event_tx2 → event_rx2 → SSE stream 2 run completes → RunFinish(NaturalEnd) → SSE stream 2 closesNo events are lost between SSE connections because:
- During suspend, the orchestrator is blocked in
wait_for_resume_or_canceland emits no events. reconnect()completes beforesend_decision(), so the first resume event (RunStart) goes to the new channel.
Protocol-specific behavior
Section titled “Protocol-specific behavior”| Protocol | Suspend Signal | Resume Mechanism |
|---|---|---|
| AI SDK v6 | finish(finishReason: "tool-calls") | New HTTP POST → reconnect → send_decision |
| AG-UI | RUN_FINISHED(outcome: "interrupt") | New HTTP POST → reconnect → send_decision |
| CopilotKit | renderAndWaitForResponse UI | Same SSE or new request via AG-UI |
See Also
Section titled “See Also”- HITL and Mailbox — suspension, resume, and decision handling
- Tool Execution Modes — Sequential vs Parallel execution
- State and Snapshot Model — how state is read and written during phases
- Architecture — three-layer overview
- Cancellation — auto-cancellation on new message