Run Lifecycle and Phases

This page describes the state machines that govern run execution and tool call processing, the phase enum, termination conditions, and checkpoint triggers.

RunStatus

A run’s coarse lifecycle is captured by RunStatus:

Created --+--> Running --+--> Waiting --+--> Running (resume)
          |              |              |
          +--> Done      +--> Done      +--> Done

pub enum RunStatus {
    Created,  // Persisted but not yet executing
    Running,  // Actively executing
    Waiting,  // Paused, waiting for external decisions
    Done,     // Terminal -- cannot transition further
}

Created -> Running: the runtime picks up the run and begins executing phases.
Created -> Done: the run was cancelled or rejected before any phase ran.
Running -> Waiting: a tool call suspends, the run pauses for external input.
Waiting -> Running: decisions arrive, the run resumes.
Running -> Done or Waiting -> Done: terminal transition on completion, cancellation, or error.
Done -> *: not allowed. Terminal state.

ToolCallStatus

Each tool call in a run has its own lifecycle:

New --> Running --+--> Succeeded (terminal)
                  +--> Failed (terminal)
                  +--> Cancelled (terminal)
                  +--> Suspended --> Resuming --+--> Running
                                                +--> Suspended (re-suspend)
                                                +--> Succeeded/Failed/Cancelled

pub enum ToolCallStatus {
    New,        // Created, not yet executing
    Running,    // Currently executing
    Suspended,  // Waiting for external decision
    Resuming,   // Decision received, about to re-execute
    Succeeded,  // Completed successfully (terminal)
    Failed,     // Completed with error (terminal)
    Cancelled,  // Cancelled externally (terminal)
}

Key transitions:

Suspended can only move to Resuming or Cancelled — it cannot jump directly to Running or a success/failure state.
Resuming has wide transitions: it can re-enter Running, re-suspend, or reach any terminal state.
Terminal states (Succeeded, Failed, Cancelled) cannot transition to any non-self state.

Phase Enum

The Phase enum defines the nine execution phases in order:

pub enum Phase {
    RunStart,
    StepStart,
    BeforeInference,
    AfterInference,
    ToolGate,
    BeforeToolExecute,
    AfterToolExecute,
    StepEnd,
    RunEnd,
}

RunStart — fires once at the beginning of a run. Plugins initialize run-scoped state.

StepStart — fires at the beginning of each inference round. Step counter increments.

BeforeInference — last chance to modify the inference request (system prompt, tools, parameters). Plugins can skip inference by setting a behavior flag.

AfterInference — fires after the LLM response arrives. Plugins can inspect the response, modify tool call lists, or request termination.

ToolGate — fires before each tool call is allowed to execute. It is a pure decision phase used for allow / block / suspend / set-result interception, and the runtime may re-evaluate it after earlier tool calls commit state.

BeforeToolExecute — fires only for tool calls that will actually execute. Use it for execution-time side effects such as metrics, counters, or preflight state updates.

AfterToolExecute — fires after tool results are available. Plugins can inspect results and trigger side effects.

StepEnd — fires at the end of each inference round. Checkpoint persistence happens here. Stop conditions (max rounds, token budget, loop detection) are evaluated.

RunEnd — fires once when the run terminates, regardless of reason. Cleanup and final state persistence.

TerminationReason

When a run ends, the TerminationReason records why:

pub enum TerminationReason {
    NaturalEnd,           // LLM returned no tool calls
    BehaviorRequested,    // A plugin requested inference skip
    Stopped(StoppedReason), // A stop condition fired (code + optional detail)
    Cancelled,            // External cancellation signal
    Blocked(String),      // Permission checker blocked the run
    Suspended,            // Waiting for external tool-call resolution
    Error(String),        // Error path
}

TerminationReason::to_run_status() maps each variant to the appropriate RunStatus:

Suspended maps to RunStatus::Waiting (the run can resume).
All other variants map to RunStatus::Done.

Stop Conditions

Declarative stop conditions are configured per agent via StopConditionSpec:

Variant	Trigger
`MaxRounds { rounds }`	Step count exceeds limit
`Timeout { seconds }`	Wall-clock time exceeds limit
`TokenBudget { max_total }`	Cumulative token usage exceeds budget
`ConsecutiveErrors { max }`	Sequential tool errors exceed threshold
`StopOnTool { tool_name }`	A specific tool is called
`ContentMatch { pattern }`	LLM output matches a regex pattern
`LoopDetection { window }`	Repeated identical tool calls within a sliding window

Stop conditions are evaluated at StepEnd. When one fires, the run terminates with TerminationReason::Stopped.

Checkpoint Triggers

State is persisted at StepEnd after each inference round. The checkpoint includes:

Thread messages (append-only)
Run lifecycle state (RunStatus, step count, termination reason)
Persistent state keys (those registered with persistent: true)
Tool call states for suspended calls

Checkpoints enable resume from the last completed step after a crash or intentional suspension.

RunStatus Derived from ToolCall States

A run’s status is a projection of all its tool call states. Each tool call has an independent lifecycle; the run status is the aggregate:

fn derive_run_status(calls: &HashMap<String, ToolCallState>) -> RunStatus {
    let mut has_suspended = false;
    for state in calls.values() {
        match state.status {
            // Any Running or Resuming call → Run is still executing
            ToolCallStatus::Running | ToolCallStatus::Resuming => {
                return RunStatus::Running;
            }
            ToolCallStatus::Suspended => {
                has_suspended = true;
            }
            // Succeeded / Failed / Cancelled are terminal — keep checking
            _ => {}
        }
    }
    if has_suspended {
        RunStatus::Waiting   // No executing calls, but some await decisions
    } else {
        RunStatus::Done      // All calls in terminal state → step complete
    }
}

Decision table:

Any `Running`/`Resuming`?	Any `Suspended`?	Run Status	Meaning
Yes	—	Running	Tools are actively executing
No	Yes	Waiting	All execution done, awaiting external decisions
No	No	Done	All calls terminal → proceed to next step

Tool-call state timeline

When an LLM returns multiple tool calls (for example, [tool_A, tool_B, tool_C]), each call has its own state. Suspended calls can wait for decisions while allowed calls continue through the configured executor:

Time  tool_A(approval-req)  tool_B(approval-req)  tool_C(normal)   → Run Status
────────────────────────────────────────────────────────────────
t0    Created        Created        Created        Running     Step starts
t1    Suspended      Created        Running        Running     tool_A intercepted
t2    Suspended      Suspended      Running        Running     tool_B intercepted, tool_C executing
t3    Suspended      Suspended      Succeeded      Waiting     tool_C done, no Running calls
t4    Resuming       Suspended      Succeeded      Running     tool_A decision arrives
t5    Succeeded      Suspended      Succeeded      Waiting     tool_A replay done
t6    Succeeded      Resuming       Succeeded      Running     tool_B decision arrives
t7    Succeeded      Succeeded      Succeeded      Done        All terminal → next step

At every transition the run status is re-derived from the aggregate of all call states. This means a single decision arriving does not end the wait — the run stays in Waiting until all suspended calls are resolved.

Suspension Bridges Run and Tool-Call Layers

Current execution model

Tool execution is split into three stages inside the step runner:

Stage 1 — ToolGate (serial, per-call):
  for each call:
    ToolGate hooks → resolve allow / block / suspend / set-result
    Suspend?  → mark Suspended, set suspended=true, continue
    Block?    → mark Failed, return immediately
    SetResult → mark with provided result, continue
    None      → add to allowed_calls

Stage 2 — BeforeToolExecute (allowed_calls only):
  run execution-time hooks once for the calls that will execute

Stage 3 — Execute (allowed_calls only):
  Sequential mode: one by one, break on first suspension
  Parallel mode:   batch execute, collect all results

After both phases, if suspended == true, the step returns StepOutcome::Suspended. The orchestrator then:

Persists checkpoint (messages, tool call states)
Emits RunFinish(Suspended) to protocol encoders
Enters wait_for_resume_or_cancel loop

wait_for_resume_or_cancel loop

loop {
    let decisions = decision_rx.next().await;  // block until decisions arrive
    emit_decision_events_and_messages(decisions);
    prepare_resume(decisions);       // Suspended → Resuming
    detect_and_replay_resume();      // re-execute Resuming calls
    if !has_suspended_calls() {
        return WaitOutcome::Resumed; // all resolved → exit wait
    }
    // Some calls still Suspended → continue waiting
}

Key properties:

The loop handles partial resume: if only tool_A’s decision arrives but tool_B is still suspended, tool_A is replayed immediately and the loop continues waiting for tool_B.
Decisions can arrive in batches or one at a time.
On WaitOutcome::Resumed, the orchestrator re-enters the step loop for the next LLM inference round.

Resume replay

detect_and_replay_resume scans ToolCallStates for calls with status == Resuming and re-executes them through the standard tool pipeline. The arguments field already reflects the resume mode (set by prepare_resume):

Resume Mode	Arguments on Replay	Behavior
`ReplayToolCall`	Original arguments	Full re-execution
`UseDecisionAsToolResult`	Decision result	A `ToolGateHook` returns `ToolInterceptPayload::SetResult` during replay
`PassDecisionToTool`	Decision result	Tool receives decision as arguments

Already-completed calls (Succeeded, Failed, Cancelled) are skipped.

Decisions during tool execution

The default resolver installs SequentialToolExecutor, so allowed tool calls run one at a time and decisions that arrive while a tool is still executing sit in the decision channel until the step enters wait_for_resume_or_cancel.

ParallelToolExecutor can execute an allowed batch concurrently when installed through a custom resolver or ResolvedAgent::with_tool_executor(...). Even in that mode, the built-in resume loop consumes decisions after a step suspends, prepares resume state, and replays suspended calls through the standard ToolGate -> BeforeToolExecute -> tool -> AfterToolExecute pipeline. The contract-level DecisionReplayPolicy values describe coordination behavior for custom parallel HITL integrations; they are not an AgentSpec field.

Protocol Adapter: SSE Reconnection

A backend run may span multiple frontend SSE connections. This is especially relevant for the AI SDK v6 protocol, where each HTTP request corresponds to one “turn” and produces one SSE stream.

Problem

Turn 1 (user message):
  HTTP POST → SSE stream 1 → events flow → tool suspends
  → RunFinish(Suspended) → finish event → SSE stream 1 closes

  The run is still alive in wait_for_resume_or_cancel.
  But the event_tx channel from SSE stream 1 has been dropped.

Turn 2 (tool output / approval):
  HTTP POST → SSE stream 2 → decision delivered to orchestrator
  → orchestrator resumes → emits events
  → events go to... the dropped event_tx? Lost.

Solution: ReconnectableEventSink

Replace the fixed ChannelEventSink with a reconnectable wrapper that allows swapping the underlying channel sender:

struct ReconnectableEventSink {
    inner: Arc<tokio::sync::Mutex<mpsc::UnboundedSender<AgentEvent>>>,
}

impl ReconnectableEventSink {
    fn new(tx: mpsc::UnboundedSender<AgentEvent>) -> Self {
        Self { inner: Arc::new(tokio::sync::Mutex::new(tx)) }
    }

    /// Replace the underlying channel. Called when a new SSE connection
    /// arrives for an existing suspended run.
    async fn reconnect(&self, new_tx: mpsc::UnboundedSender<AgentEvent>) {
        *self.inner.lock().await = new_tx;
    }
}

#[async_trait]
impl EventSink for ReconnectableEventSink {
    async fn emit(&self, event: AgentEvent) {
        let _ = self.inner.lock().await.send(event);
    }
}

Reconnection flow

Turn 1:
  submit() → create (event_tx1, event_rx1)
           → ReconnectableEventSink(event_tx1)
           → spawn_execution (run starts)
  events → event_tx1 → event_rx1 → SSE stream 1
  tool suspends → finish(tool-calls) → SSE stream 1 closes
  event_tx1 still held by ReconnectableEventSink (sends fail silently)
  run alive in wait_for_resume_or_cancel

Turn 2:
  new HTTP request with decisions arrives
  create (event_tx2, event_rx2)
  sink.reconnect(event_tx2)          ← swap channel
  send_decision → decision channel → orchestrator resumes
  events → ReconnectableEventSink → event_tx2 → event_rx2 → SSE stream 2
  run completes → RunFinish(NaturalEnd) → SSE stream 2 closes

No events are lost between SSE connections because:

During suspend, the orchestrator is blocked in wait_for_resume_or_cancel and emits no events.
reconnect() completes before send_decision(), so the first resume event (RunStart) goes to the new channel.

Protocol-specific behavior

Protocol	Suspend Signal	Resume Mechanism
AI SDK v6	`finish(finishReason: "tool-calls")`	New HTTP POST → reconnect → send_decision
AG-UI	`RUN_FINISHED(outcome: "interrupt")`	New HTTP POST → reconnect → send_decision
CopilotKit	`renderAndWaitForResponse` UI	Same SSE or new request via AG-UI