Skip to content

Run Lifecycle and Phases

This page describes the state machines that govern run execution and tool call processing, the phase enum, termination conditions, and checkpoint triggers.

A run’s coarse lifecycle is captured by RunStatus:

Created --+--> Running --+--> Waiting --+--> Running (resume)
| | |
+--> Done +--> Done +--> Done
pub enum RunStatus {
Created, // Persisted but not yet executing
Running, // Actively executing
Waiting, // Paused, waiting for external decisions
Done, // Terminal -- cannot transition further
}
  • Created -> Running: the runtime picks up the run and begins executing phases.
  • Created -> Done: the run was cancelled or rejected before any phase ran.
  • Running -> Waiting: a tool call suspends, the run pauses for external input.
  • Waiting -> Running: decisions arrive, the run resumes.
  • Running -> Done or Waiting -> Done: terminal transition on completion, cancellation, or error.
  • Done -> *: not allowed. Terminal state.

Each tool call in a run has its own lifecycle:

New --> Running --+--> Succeeded (terminal)
+--> Failed (terminal)
+--> Cancelled (terminal)
+--> Suspended --> Resuming --+--> Running
+--> Suspended (re-suspend)
+--> Succeeded/Failed/Cancelled
pub enum ToolCallStatus {
New, // Created, not yet executing
Running, // Currently executing
Suspended, // Waiting for external decision
Resuming, // Decision received, about to re-execute
Succeeded, // Completed successfully (terminal)
Failed, // Completed with error (terminal)
Cancelled, // Cancelled externally (terminal)
}

Key transitions:

  • Suspended can only move to Resuming or Cancelled — it cannot jump directly to Running or a success/failure state.
  • Resuming has wide transitions: it can re-enter Running, re-suspend, or reach any terminal state.
  • Terminal states (Succeeded, Failed, Cancelled) cannot transition to any non-self state.

The Phase enum defines the nine execution phases in order:

pub enum Phase {
RunStart,
StepStart,
BeforeInference,
AfterInference,
ToolGate,
BeforeToolExecute,
AfterToolExecute,
StepEnd,
RunEnd,
}

RunStart — fires once at the beginning of a run. Plugins initialize run-scoped state.

StepStart — fires at the beginning of each inference round. Step counter increments.

BeforeInference — last chance to modify the inference request (system prompt, tools, parameters). Plugins can skip inference by setting a behavior flag.

AfterInference — fires after the LLM response arrives. Plugins can inspect the response, modify tool call lists, or request termination.

ToolGate — fires before each tool call is allowed to execute. It is a pure decision phase used for allow / block / suspend / set-result interception, and the runtime may re-evaluate it after earlier tool calls commit state.

BeforeToolExecute — fires only for tool calls that will actually execute. Use it for execution-time side effects such as metrics, counters, or preflight state updates.

AfterToolExecute — fires after tool results are available. Plugins can inspect results and trigger side effects.

StepEnd — fires at the end of each inference round. Checkpoint persistence happens here. Stop conditions (max rounds, token budget, loop detection) are evaluated.

RunEnd — fires once when the run terminates, regardless of reason. Cleanup and final state persistence.

When a run ends, the TerminationReason records why:

pub enum TerminationReason {
NaturalEnd, // LLM returned no tool calls
BehaviorRequested, // A plugin requested inference skip
Stopped(StoppedReason), // A stop condition fired (code + optional detail)
Cancelled, // External cancellation signal
Blocked(String), // Permission checker blocked the run
Suspended, // Waiting for external tool-call resolution
Error(String), // Error path
}

TerminationReason::to_run_status() maps each variant to the appropriate RunStatus:

  • Suspended maps to RunStatus::Waiting (the run can resume).
  • All other variants map to RunStatus::Done.

Declarative stop conditions are configured per agent via StopConditionSpec:

VariantTrigger
MaxRounds { rounds }Step count exceeds limit
Timeout { seconds }Wall-clock time exceeds limit
TokenBudget { max_total }Cumulative token usage exceeds budget
ConsecutiveErrors { max }Sequential tool errors exceed threshold
StopOnTool { tool_name }A specific tool is called
ContentMatch { pattern }LLM output matches a regex pattern
LoopDetection { window }Repeated identical tool calls within a sliding window

Stop conditions are evaluated at StepEnd. When one fires, the run terminates with TerminationReason::Stopped.

State is persisted at StepEnd after each inference round. The checkpoint includes:

  • Thread messages (append-only)
  • Run lifecycle state (RunStatus, step count, termination reason)
  • Persistent state keys (those registered with persistent: true)
  • Tool call states for suspended calls

Checkpoints enable resume from the last completed step after a crash or intentional suspension.

A run’s status is a projection of all its tool call states. Each tool call has an independent lifecycle; the run status is the aggregate:

fn derive_run_status(calls: &HashMap<String, ToolCallState>) -> RunStatus {
let mut has_suspended = false;
for state in calls.values() {
match state.status {
// Any Running or Resuming call → Run is still executing
ToolCallStatus::Running | ToolCallStatus::Resuming => {
return RunStatus::Running;
}
ToolCallStatus::Suspended => {
has_suspended = true;
}
// Succeeded / Failed / Cancelled are terminal — keep checking
_ => {}
}
}
if has_suspended {
RunStatus::Waiting // No executing calls, but some await decisions
} else {
RunStatus::Done // All calls in terminal state → step complete
}
}

Decision table:

Any Running/Resuming?Any Suspended?Run StatusMeaning
YesRunningTools are actively executing
NoYesWaitingAll execution done, awaiting external decisions
NoNoDoneAll calls terminal → proceed to next step

When an LLM returns multiple tool calls (for example, [tool_A, tool_B, tool_C]), each call has its own state. Suspended calls can wait for decisions while allowed calls continue through the configured executor:

Time tool_A(approval-req) tool_B(approval-req) tool_C(normal) → Run Status
────────────────────────────────────────────────────────────────
t0 Created Created Created Running Step starts
t1 Suspended Created Running Running tool_A intercepted
t2 Suspended Suspended Running Running tool_B intercepted, tool_C executing
t3 Suspended Suspended Succeeded Waiting tool_C done, no Running calls
t4 Resuming Suspended Succeeded Running tool_A decision arrives
t5 Succeeded Suspended Succeeded Waiting tool_A replay done
t6 Succeeded Resuming Succeeded Running tool_B decision arrives
t7 Succeeded Succeeded Succeeded Done All terminal → next step

At every transition the run status is re-derived from the aggregate of all call states. This means a single decision arriving does not end the wait — the run stays in Waiting until all suspended calls are resolved.

Suspension Bridges Run and Tool-Call Layers

Section titled “Suspension Bridges Run and Tool-Call Layers”

Tool execution is split into three stages inside the step runner:

Stage 1 — ToolGate (serial, per-call):
for each call:
ToolGate hooks → resolve allow / block / suspend / set-result
Suspend? → mark Suspended, set suspended=true, continue
Block? → mark Failed, return immediately
SetResult → mark with provided result, continue
None → add to allowed_calls
Stage 2 — BeforeToolExecute (allowed_calls only):
run execution-time hooks once for the calls that will execute
Stage 3 — Execute (allowed_calls only):
Sequential mode: one by one, break on first suspension
Parallel mode: batch execute, collect all results

After both phases, if suspended == true, the step returns StepOutcome::Suspended. The orchestrator then:

  1. Persists checkpoint (messages, tool call states)
  2. Emits RunFinish(Suspended) to protocol encoders
  3. Enters wait_for_resume_or_cancel loop
loop {
let decisions = decision_rx.next().await; // block until decisions arrive
emit_decision_events_and_messages(decisions);
prepare_resume(decisions); // Suspended → Resuming
detect_and_replay_resume(); // re-execute Resuming calls
if !has_suspended_calls() {
return WaitOutcome::Resumed; // all resolved → exit wait
}
// Some calls still Suspended → continue waiting
}

Key properties:

  • The loop handles partial resume: if only tool_A’s decision arrives but tool_B is still suspended, tool_A is replayed immediately and the loop continues waiting for tool_B.
  • Decisions can arrive in batches or one at a time.
  • On WaitOutcome::Resumed, the orchestrator re-enters the step loop for the next LLM inference round.

detect_and_replay_resume scans ToolCallStates for calls with status == Resuming and re-executes them through the standard tool pipeline. The arguments field already reflects the resume mode (set by prepare_resume):

Resume ModeArguments on ReplayBehavior
ReplayToolCallOriginal argumentsFull re-execution
UseDecisionAsToolResultDecision resultA ToolGateHook returns ToolInterceptPayload::SetResult during replay
PassDecisionToToolDecision resultTool receives decision as arguments

Already-completed calls (Succeeded, Failed, Cancelled) are skipped.

The default resolver installs SequentialToolExecutor, so allowed tool calls run one at a time and decisions that arrive while a tool is still executing sit in the decision channel until the step enters wait_for_resume_or_cancel.

ParallelToolExecutor can execute an allowed batch concurrently when installed through a custom resolver or ResolvedAgent::with_tool_executor(...). Even in that mode, the built-in resume loop consumes decisions after a step suspends, prepares resume state, and replays suspended calls through the standard ToolGate -> BeforeToolExecute -> tool -> AfterToolExecute pipeline. The contract-level DecisionReplayPolicy values describe coordination behavior for custom parallel HITL integrations; they are not an AgentSpec field.

A backend run may span multiple frontend SSE connections. This is especially relevant for the AI SDK v6 protocol, where each HTTP request corresponds to one “turn” and produces one SSE stream.

Turn 1 (user message):
HTTP POST → SSE stream 1 → events flow → tool suspends
→ RunFinish(Suspended) → finish event → SSE stream 1 closes
The run is still alive in wait_for_resume_or_cancel.
But the event_tx channel from SSE stream 1 has been dropped.
Turn 2 (tool output / approval):
HTTP POST → SSE stream 2 → decision delivered to orchestrator
→ orchestrator resumes → emits events
→ events go to... the dropped event_tx? Lost.

Replace the fixed ChannelEventSink with a reconnectable wrapper that allows swapping the underlying channel sender:

struct ReconnectableEventSink {
inner: Arc<tokio::sync::Mutex<mpsc::UnboundedSender<AgentEvent>>>,
}
impl ReconnectableEventSink {
fn new(tx: mpsc::UnboundedSender<AgentEvent>) -> Self {
Self { inner: Arc::new(tokio::sync::Mutex::new(tx)) }
}
/// Replace the underlying channel. Called when a new SSE connection
/// arrives for an existing suspended run.
async fn reconnect(&self, new_tx: mpsc::UnboundedSender<AgentEvent>) {
*self.inner.lock().await = new_tx;
}
}
#[async_trait]
impl EventSink for ReconnectableEventSink {
async fn emit(&self, event: AgentEvent) {
let _ = self.inner.lock().await.send(event);
}
}
Turn 1:
submit() → create (event_tx1, event_rx1)
→ ReconnectableEventSink(event_tx1)
→ spawn_execution (run starts)
events → event_tx1 → event_rx1 → SSE stream 1
tool suspends → finish(tool-calls) → SSE stream 1 closes
event_tx1 still held by ReconnectableEventSink (sends fail silently)
run alive in wait_for_resume_or_cancel
Turn 2:
new HTTP request with decisions arrives
create (event_tx2, event_rx2)
sink.reconnect(event_tx2) ← swap channel
send_decision → decision channel → orchestrator resumes
events → ReconnectableEventSink → event_tx2 → event_rx2 → SSE stream 2
run completes → RunFinish(NaturalEnd) → SSE stream 2 closes

No events are lost between SSE connections because:

  • During suspend, the orchestrator is blocked in wait_for_resume_or_cancel and emits no events.
  • reconnect() completes before send_decision(), so the first resume event (RunStart) goes to the new channel.
ProtocolSuspend SignalResume Mechanism
AI SDK v6finish(finishReason: "tool-calls")New HTTP POST → reconnect → send_decision
AG-UIRUN_FINISHED(outcome: "interrupt")New HTTP POST → reconnect → send_decision
CopilotKitrenderAndWaitForResponse UISame SSE or new request via AG-UI