Optimize the Context Window

Use this when you need to control how the runtime manages conversation history to stay within model token limits.

Prerequisites

  • awaken crate added to Cargo.toml
  • An agent configured with AgentSpec

ContextWindowPolicy

Every agent has a ContextWindowPolicy that controls how conversation history is managed. Set it on your AgentSpec:

use awaken::{ContextCompactionMode, ContextWindowPolicy};

let policy = ContextWindowPolicy {
    max_context_tokens: 200_000,
    max_output_tokens: 16_384,
    min_recent_messages: 10,
    enable_prompt_cache: true,
    autocompact_threshold: Some(100_000),
    compaction_mode: ContextCompactionMode::KeepRecentRawSuffix,
    compaction_raw_suffix_messages: 2,
};

Fields

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| max_context_tokens | usize | 200_000 | Model’s total context window size in tokens |
| max_output_tokens | usize | 16_384 | Tokens reserved for model output |
| min_recent_messages | usize | 10 | Minimum number of recent messages to always preserve, even if over budget |
| enable_prompt_cache | bool | true | Whether to enable prompt caching |
| autocompact_threshold | Option<usize> | None | Token count that triggers auto-compaction. None disables auto-compaction |
| compaction_mode | ContextCompactionMode | KeepRecentRawSuffix | Strategy used when auto-compaction fires |
| compaction_raw_suffix_messages | usize | 2 | Number of recent raw messages to preserve in suffix compaction mode |

Truncation

When the conversation exceeds the available token budget, the runtime automatically drops the oldest messages to fit. The budget is calculated as:

available = max_context_tokens - max_output_tokens - tool_schema_tokens
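
As a worked example of the formula, the sketch below computes the budget for the default policy values. The helper name `available_budget`, the use of saturating subtraction, and the tool schema size of 3,616 tokens are illustrative assumptions, not part of the awaken API.

```rust
/// Hypothetical helper mirroring the budget formula above.
/// Saturating subtraction is an assumption: it keeps the result at zero
/// instead of panicking when the reserved tokens exceed the window.
fn available_budget(
    max_context_tokens: usize,
    max_output_tokens: usize,
    tool_schema_tokens: usize,
) -> usize {
    max_context_tokens
        .saturating_sub(max_output_tokens)
        .saturating_sub(tool_schema_tokens)
}
```

With the defaults (200_000 and 16_384) and an assumed 3,616 tokens of tool schemas, 180,000 tokens remain for conversation history.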

What truncation preserves

  • System messages – all system messages at the start of the history are kept regardless of budget.
  • Recent messages – at least min_recent_messages history messages are kept, even if they exceed the budget.
  • Tool call/result pairs – the split point is adjusted so that an assistant message with tool calls is never separated from its corresponding tool result messages.
  • Dangling tool calls – after truncation, any orphaned tool calls (whose results were dropped) are patched to prevent invalid message sequences.
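
The first three rules can be sketched as a small standalone function. This is a simplified model, not the runtime's ContextTransform: the `Role` and `Msg` types are hypothetical, token counts are supplied by the caller, and the split point is moved backward to keep the calling assistant message (the real adjustment direction may differ). Patching of dangling tool calls is not modeled.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Role { System, User, Assistant, Tool }

// Hypothetical message model: a role plus a precomputed token count.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Msg { role: Role, tokens: usize }

fn truncate(messages: &[Msg], budget: usize, min_recent: usize) -> Vec<Msg> {
    // Leading system messages always survive.
    let sys_end = messages.iter().take_while(|m| m.role == Role::System).count();
    let (system, history) = messages.split_at(sys_end);
    let system_tokens: usize = system.iter().map(|m| m.tokens).sum();
    let mut remaining = budget.saturating_sub(system_tokens);

    // Walk backward from the newest message. A message is kept if it fits
    // the remaining budget, or if fewer than `min_recent` are kept so far.
    let mut start = history.len();
    while start > 0 {
        let kept = history.len() - start;
        let cost = history[start - 1].tokens;
        if kept >= min_recent && cost > remaining {
            break;
        }
        remaining = remaining.saturating_sub(cost);
        start -= 1;
    }

    // Assumption: if the split lands on a tool result, move it back so the
    // assistant message that issued the call is kept with its results.
    while start > 0 && start < history.len() && history[start].role == Role::Tool {
        start -= 1;
    }

    system.iter().chain(&history[start..]).copied().collect()
}
```

Note that both the min_recent_messages rule and the pair adjustment can push the kept history slightly over budget, which matches the "even if they exceed the budget" wording above.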

Artifact compaction

Before truncation runs, oversized tool results are compacted automatically. A tool result whose text exceeds ARTIFACT_COMPACT_THRESHOLD_TOKENS (2048 tokens, estimated at ~8192 characters) is truncated to a preview of at most 1600 characters or 24 lines, whichever is shorter. The preview includes a compaction indicator showing the original size.

Non-tool messages (system, user, assistant) are never subject to artifact compaction.
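
A minimal sketch of the preview step, using the limits stated above. The constant names other than the documented threshold, the function name, the character-based size check, and the wording of the compaction indicator are all assumptions for illustration.

```rust
const THRESHOLD_CHARS: usize = 8192; // ~2048 tokens at roughly 4 chars/token
const PREVIEW_MAX_CHARS: usize = 1600;
const PREVIEW_MAX_LINES: usize = 24;

/// Hypothetical stand-in for the runtime's artifact compaction.
fn compact_tool_result(text: &str) -> String {
    let total_chars = text.chars().count();
    if total_chars <= THRESHOLD_CHARS {
        return text.to_string();
    }
    // Apply the line cap first, then the character cap; whichever limit is
    // hit first determines the preview length.
    let by_lines = text
        .lines()
        .take(PREVIEW_MAX_LINES)
        .collect::<Vec<_>>()
        .join("\n");
    let preview: String = by_lines.chars().take(PREVIEW_MAX_CHARS).collect();
    // The indicator format below is invented for this sketch.
    format!("{preview}\n[compacted: original was {total_chars} characters]")
}
```

Only tool results would be passed through a function like this; system, user, and assistant messages skip it entirely.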

Compaction

Compaction summarizes older conversation history into a condensed summary message, reducing token usage while preserving context. Unlike truncation (which drops messages), compaction replaces them with a summary.

Enabling auto-compaction

Set autocompact_threshold to trigger compaction when total message tokens exceed that value:

let policy = ContextWindowPolicy {
    autocompact_threshold: Some(100_000),
    compaction_mode: ContextCompactionMode::KeepRecentRawSuffix,
    compaction_raw_suffix_messages: 4,
    ..Default::default()
};

ContextCompactionMode

Two strategies are available:

  • KeepRecentRawSuffix (default) – keeps the most recent compaction_raw_suffix_messages messages as raw history. Everything before the compaction boundary is summarized.
  • CompactToSafeFrontier – compacts all messages up to the safe frontier (the latest point where all tool call/result pairs are complete).

The compaction boundary is chosen so that no tool call is separated from its result. The boundary finder walks the message history and only places boundaries where all open tool calls have been resolved.
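
The boundary search can be illustrated with a simplified model. The `Msg` enum, the numeric call IDs, and the function names below are hypothetical; the sketch only shows the idea of tracking open tool calls and how the two modes would pick different boundaries from the same set of safe points.

```rust
use std::collections::HashSet;

// Hypothetical message model for illustration.
enum Msg {
    Other,
    Assistant { tool_calls: Vec<u32> },
    ToolResult { call_id: u32 },
}

/// Returns every index `i` such that splitting the history into
/// `messages[..i]` and `messages[i..]` separates no call from its result.
fn safe_boundaries(messages: &[Msg]) -> Vec<usize> {
    let mut open: HashSet<u32> = HashSet::new();
    let mut out = Vec::new();
    for (i, m) in messages.iter().enumerate() {
        match m {
            Msg::Assistant { tool_calls } => open.extend(tool_calls.iter().copied()),
            Msg::ToolResult { call_id } => {
                open.remove(call_id);
            }
            Msg::Other => {}
        }
        if open.is_empty() {
            out.push(i + 1);
        }
    }
    out
}

/// CompactToSafeFrontier: the latest safe boundary.
fn safe_frontier(messages: &[Msg]) -> Option<usize> {
    safe_boundaries(messages).last().copied()
}

/// KeepRecentRawSuffix: the latest safe boundary that leaves at least
/// `raw_suffix` messages uncompacted.
fn suffix_boundary(messages: &[Msg], raw_suffix: usize) -> Option<usize> {
    let limit = messages.len().saturating_sub(raw_suffix);
    safe_boundaries(messages).into_iter().filter(|&b| b <= limit).last()
}
```

In this model a trailing assistant message with an unresolved tool call is never included in the compacted prefix, because no boundary is emitted while a call is still open.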

CompactionConfig

The compaction subsystem is configured through CompactionConfig, stored in the agent spec’s sections["compaction"] and read via CompactionConfigKey:

use awaken::CompactionConfig;

let config = CompactionConfig {
    summarizer_system_prompt: "You are a conversation summarizer. \
        Preserve all key facts, decisions, tool results, and action items. \
        Be concise but complete.".into(),
    summarizer_user_prompt: "Summarize the following conversation:\n\n{messages}".into(),
    summary_max_tokens: Some(1024),
    summary_model: Some("claude-3-haiku".into()),
    min_savings_ratio: 0.3,
};

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| summarizer_system_prompt | String | Conversation summarizer prompt | System prompt for the summarizer LLM call |
| summarizer_user_prompt | String | "Summarize...\n\n{messages}" | User prompt template; {messages} is replaced with the conversation transcript |
| summary_max_tokens | Option<u32> | None | Maximum tokens for the summary response |
| summary_model | Option<String> | None | Model for summarization (defaults to the agent’s model) |
| min_savings_ratio | f64 | 0.3 | Minimum token savings ratio (0.0-1.0) to accept a compaction |

The compaction pass only runs when the expected savings ratio exceeds min_savings_ratio. A minimum gain of 1024 tokens (MIN_COMPACTION_GAIN_TOKENS) is also required to justify the summarization LLM call.
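
The two gates combine roughly as follows. The function name and the way the expected post-compaction size is obtained are assumptions; only the 1024-token minimum and the ratio comparison come from the description above.

```rust
const MIN_COMPACTION_GAIN_TOKENS: usize = 1024;

/// Hypothetical check: is a compaction pass worth a summarization call?
fn should_compact(pre_tokens: usize, expected_post_tokens: usize, min_savings_ratio: f64) -> bool {
    if pre_tokens == 0 {
        return false;
    }
    let gain = pre_tokens.saturating_sub(expected_post_tokens);
    let ratio = gain as f64 / pre_tokens as f64;
    // Both conditions must hold: enough relative savings and enough
    // absolute savings.
    ratio > min_savings_ratio && gain >= MIN_COMPACTION_GAIN_TOKENS
}
```

For example, shrinking 10,000 tokens to an expected 2,000 passes both gates (ratio 0.8, gain 8,000), while shrinking 1,500 tokens to 600 passes the ratio gate (0.6) but fails the absolute gate (gain 900).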

DefaultSummarizer

The built-in DefaultSummarizer reads prompts from CompactionConfig and supports cumulative summarization. When a previous summary exists, it asks the LLM to update the existing summary with the new conversation rather than re-summarizing everything from scratch.

The transcript renderer filters out Visibility::Internal messages before sending the transcript to the summarizer, since system-injected context is re-injected each turn and should not be included in summaries.
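
The following sketch shows how transcript filtering, the {messages} template, and cumulative summarization could fit together. It is not the DefaultSummarizer implementation: the `Msg` struct, the `Visibility::Public` variant, the "role: text" transcript format, and the wording of the update prompt are all assumptions.

```rust
// Hypothetical types; the source only names `Visibility::Internal`.
#[derive(PartialEq)]
enum Visibility { Public, Internal }

struct Msg {
    role: &'static str,
    text: &'static str,
    visibility: Visibility,
}

/// Renders a transcript, skipping internal messages.
fn render_transcript(messages: &[Msg]) -> String {
    messages
        .iter()
        .filter(|m| m.visibility != Visibility::Internal)
        .map(|m| format!("{}: {}", m.role, m.text))
        .collect::<Vec<_>>()
        .join("\n")
}

/// Builds the summarizer's user prompt. With a previous summary, the model
/// is asked to update it; otherwise the configured template is filled in.
fn build_user_prompt(template: &str, previous_summary: Option<&str>, messages: &[Msg]) -> String {
    let transcript = render_transcript(messages);
    match previous_summary {
        // The update wording is invented for this sketch.
        Some(prev) => format!(
            "Update the existing summary with the new conversation.\n\n\
             Existing summary:\n{prev}\n\nNew conversation:\n{transcript}"
        ),
        None => template.replace("{messages}", &transcript),
    }
}
```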

Summary storage

Compaction summaries are stored as internal system messages tagged with <conversation-summary>. On load, trim_to_compaction_boundary drops all messages before the latest summary message, so already-summarized history is never re-loaded into the context window.
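
Load-time trimming can be pictured with the sketch below. It models messages as plain strings and uses a hypothetical function name, `trim_to_latest_summary`, so it should not be read as the signature of trim_to_compaction_boundary.

```rust
const SUMMARY_TAG: &str = "<conversation-summary>";

/// Hypothetical model of load-time trimming: keep the latest summary
/// message and everything after it, drop everything before it.
fn trim_to_latest_summary(messages: Vec<String>) -> Vec<String> {
    let latest = messages.iter().rposition(|m| m.starts_with(SUMMARY_TAG));
    match latest {
        Some(idx) => messages.into_iter().skip(idx).collect(),
        // No summary yet: the full history is loaded unchanged.
        None => messages,
    }
}
```

Because only the latest summary is kept, earlier summaries are dropped too; with cumulative summarization their content is expected to be carried forward into the newest one.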

Compaction boundaries are tracked durably via CompactionState, recording the summary text, pre/post token counts, and timestamp for each compaction event.

Truncation recovery

When the LLM stops due to MaxTokens with incomplete tool calls (argument JSON was truncated mid-generation), the runtime can automatically retry by injecting a continuation prompt asking the model to break its work into smaller pieces and continue. The retry count is tracked by TruncationState and bounded by a configurable maximum.
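
A bounded retry of this kind might look like the sketch below. The field names on `TruncationState`, the `Response` and `StopReason` types, the method name, and the continuation prompt text are hypothetical; the source only states that the retry count is tracked and capped.

```rust
#[derive(Debug, PartialEq)]
enum StopReason { EndTurn, MaxTokens }

// Hypothetical view of a model response.
struct Response {
    stop_reason: StopReason,
    has_incomplete_tool_calls: bool,
}

// Hypothetical fields; the real TruncationState may differ.
struct TruncationState {
    retries: u32,
    max_retries: u32,
}

// Invented wording for illustration.
const CONTINUATION_PROMPT: &str =
    "Your last response was cut off. Break the work into smaller pieces and continue.";

impl TruncationState {
    /// Returns the prompt to inject if a retry is allowed, otherwise None.
    fn next_continuation(&mut self, response: &Response) -> Option<&'static str> {
        let truncated = response.stop_reason == StopReason::MaxTokens
            && response.has_incomplete_tool_calls;
        if truncated && self.retries < self.max_retries {
            self.retries += 1;
            Some(CONTINUATION_PROMPT)
        } else {
            None
        }
    }
}
```

A response that ends normally, or one that hits MaxTokens without incomplete tool calls, never consumes a retry in this model.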

Key Files

  • crates/awaken-contract/src/contract/inference.rs – ContextWindowPolicy, ContextCompactionMode
  • crates/awaken-runtime/src/context/transform/mod.rs – ContextTransform (truncation)
  • crates/awaken-runtime/src/context/transform/compaction.rs – artifact compaction
  • crates/awaken-runtime/src/context/compaction.rs – boundary finding, load-time trimming
  • crates/awaken-runtime/src/context/summarizer.rs – ContextSummarizer, DefaultSummarizer
  • crates/awaken-runtime/src/context/plugin.rs – CompactionPlugin, CompactionConfig, CompactionState
  • crates/awaken-runtime/src/context/truncation.rs – TruncationState, continuation prompts