Context Management
Strategies for working within context window limits: summarization, selective loading, and memory patterns for agent skills.
Every AI agent operates within a context window: the total amount of text it can consider at once. This includes the system prompt, conversation history, tool descriptions, tool results, and the agent’s own reasoning. When that window fills up, the agent starts losing information and making worse decisions.
Context management is about making the most of this limited resource. It’s what determines whether your agent can handle a 50-file codebase refactor or runs out of room after reading three files. These strategies apply whether you’re designing individual skills or orchestrating multi-step workflows.
Understanding the context budget
Before you can optimize, you need to know where your context is going. A typical agent session breaks down roughly like this:
| Component | Typical size | Notes |
|---|---|---|
| System prompt | 500-2,000 tokens | Instructions, personality, constraints |
| Tool definitions | 1,000-5,000 tokens | Scales with number of available tools |
| Conversation history | 2,000-50,000 tokens | Grows with each turn |
| Tool results | 500-20,000+ tokens per call | Largest variable; a single file read can be huge |
| Agent reasoning | 1,000-5,000 tokens per turn | Chain-of-thought, planning |
The biggest offender is almost always tool results. A single file read can consume thousands of tokens. A search across a codebase might return hundreds of matches. Without careful management, a few tool calls can eat your entire budget.
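You can’t optimize what you don’t measure, so it helps to track rough per-component estimates as text enters context. Below is a minimal sketch, assuming the common approximation of roughly four characters per token for English text; `ContextBudget` and its methods are illustrative names, not a real library API.

// Minimal budget tracker. The chars/4 ratio is a rough heuristic,
// not a real tokenizer; use your model's tokenizer for exact counts.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

class ContextBudget {
  private entries: { component: string; tokens: number }[] = [];

  constructor(private limit: number) {}

  record(component: string, text: string): void {
    this.entries.push({ component, tokens: estimateTokens(text) });
  }

  used(): number {
    return this.entries.reduce((sum, e) => sum + e.tokens, 0);
  }

  remaining(): number {
    return this.limit - this.used();
  }
}

Even a crude estimate like this is enough to notice when a single tool result is about to dominate the budget.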
Summarization and compression strategies
The core idea is simple: don’t keep raw data in context when a summary will do.
Summarize tool results immediately
When a skill returns large results, summarize them before they enter the agent’s working memory. This can happen at the skill level (the skill itself returns a summary) or at the orchestration level (a post-processing step compresses the output).
import { promises as fs } from "node:fs";

// ToolResult, ReadOptions, detectLanguage, extractExports, and
// extractImports are assumed to be defined elsewhere in the skill.

// Bad: returning raw file contents into context
async function readFile(path: string): Promise<ToolResult> {
  const content = await fs.readFile(path, "utf-8");
  return { content }; // Could be 10,000+ tokens
}

// Better: return with metadata that helps the agent decide what to keep
async function readFile(
  path: string,
  options?: ReadOptions,
): Promise<ToolResult> {
  const content = await fs.readFile(path, "utf-8");
  const lines = content.split("\n");

  if (options?.summaryOnly) {
    return {
      path,
      lineCount: lines.length,
      language: detectLanguage(path),
      exports: extractExports(content),
      imports: extractImports(content),
      summary: `${lines.length} lines of ${detectLanguage(path)}. Key exports: ${extractExports(content).join(", ")}`,
    };
  }

  // If full content requested but file is large, truncate with guidance
  if (lines.length > 200) {
    return {
      path,
      content: lines.slice(0, 200).join("\n"),
      truncated: true,
      totalLines: lines.length,
      message:
        "File truncated at 200 lines. Use offset parameter to read specific sections.",
    };
  }

  return { path, content, truncated: false };
}
Progressive detail loading
Start with high-level summaries and drill down only where needed. This is the single most effective strategy for context management.
# invoke, count_files, detect_languages, read_file, and the
# extract_* helpers are assumed skill utilities.

async def explore_codebase(path: str) -> dict:
    """Level 1: Directory structure overview."""
    tree = await invoke("list_directory", path=path, recursive=True, depth=2)
    return {
        "structure": tree,
        "file_count": count_files(tree),
        "languages": detect_languages(tree),
        "hint": "Use read_file_summary for details on specific files.",
    }


async def read_file_summary(path: str) -> dict:
    """Level 2: File-level summary without full content."""
    content = await read_file(path)
    return {
        "path": path,
        "line_count": len(content.splitlines()),
        "functions": extract_function_signatures(content),
        "classes": extract_class_names(content),
        "imports": extract_imports(content),
        "hint": "Use read_file_section to read specific functions or line ranges.",
    }


async def read_file_section(path: str, start: int, end: int) -> dict:
    """Level 3: Specific section of a file (start/end are 1-indexed, inclusive)."""
    lines = (await read_file(path)).splitlines()
    return {
        "path": path,
        "range": f"lines {start}-{end} of {len(lines)}",
        "content": "\n".join(lines[start - 1 : end]),
    }
This three-level approach (overview, summary, detail) lets the agent navigate a large codebase while keeping context usage proportional to what it actually needs.
Selective context loading
Not everything needs to be in context at once. Skills should load information on demand rather than loading everything upfront.
Pattern: lazy loading with caching
class ContextManager {
  private cache = new Map<string, { data: unknown; accessedAt: Date }>();
  private maxCacheSize: number;

  constructor(maxCacheSize = 20) {
    this.maxCacheSize = maxCacheSize;
  }

  async get(key: string, loader: () => Promise<unknown>): Promise<unknown> {
    if (this.cache.has(key)) {
      const entry = this.cache.get(key)!;
      entry.accessedAt = new Date();
      return entry.data;
    }

    // Evict least recently accessed if at capacity
    if (this.cache.size >= this.maxCacheSize) {
      this.evictLeastRecent();
    }

    const data = await loader();
    this.cache.set(key, { data, accessedAt: new Date() });
    return data;
  }

  private evictLeastRecent(): void {
    let oldestKey = "";
    let oldestTime = new Date();
    for (const [key, entry] of this.cache) {
      if (entry.accessedAt < oldestTime) {
        oldestTime = entry.accessedAt;
        oldestKey = key;
      }
    }
    if (oldestKey) this.cache.delete(oldestKey);
  }
}
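A hypothetical usage, pairing the cache with the summary-mode `readFile` from earlier (the cache key scheme is illustrative):

const ctx = new ContextManager(20);

// First access loads from disk; later accesses hit the cache and
// refresh the entry's recency
const summary = await ctx.get("summary:src/index.ts", () =>
  readFile("src/index.ts", { summaryOnly: true }),
);

Evicting by least-recent access matches how agents tend to work: information touched recently is the most likely to be needed again.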
Pattern: relevance-based filtering
When a search returns many results, filter by relevance before adding them to context. This is especially important for search skills that might match hundreds of files.
def filter_search_results(
    results: list[SearchResult],
    query_context: str,
    max_results: int = 10,
) -> list[SearchResult]:
    """Filter search results to the most relevant subset."""
    scored = []
    for result in results:
        score = 0
        # Exact filename match scores highest
        if query_context.lower() in result.path.lower():
            score += 10
        # Results in src/ are usually more relevant than node_modules/
        if "/src/" in result.path:
            score += 5
        if "node_modules" in result.path or "vendor" in result.path:
            score -= 20
        # More recent files are often more relevant
        if result.modified_days_ago < 7:
            score += 3
        scored.append((score, result))
    scored.sort(key=lambda x: x[0], reverse=True)
    return [result for _, result in scored[:max_results]]
Memory patterns: short-term vs. long-term
Agent skills need different memory strategies depending on how long the information needs to stick around.
Short-term memory: conversation context
Short-term memory lives in the current conversation. It’s fast and directly accessible, but ephemeral and limited by the context window. Most workflow state lives here.
Best practices for short-term memory:
- Summarize completed steps rather than keeping full results
- Drop intermediate results once downstream steps have consumed them
- Use structured summaries the agent can quickly scan
// Instead of keeping all raw results:
const rawResults = {
  step1: {
    /* 2000 tokens of data */
  },
  step2: {
    /* 3000 tokens of data */
  },
  step3: {
    /* 1500 tokens of data */
  },
};

// Maintain a running summary:
const workingSummary = {
  completedSteps: ["fetch_data", "validate_schema", "transform"],
  keyFindings: [
    "Schema has 3 breaking changes in users table",
    "47 records failed date format validation",
    "Transform produced 10,234 clean records",
  ],
  nextStep: "load_to_destination",
  blockers: [],
};
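One way to keep that summary current is a small helper invoked whenever a step finishes, so the raw result can be dropped immediately afterward. A minimal sketch, with `WorkingSummary` mirroring the shape above and `recordStepCompletion` as a hypothetical helper:

interface WorkingSummary {
  completedSteps: string[];
  keyFindings: string[];
  nextStep: string;
  blockers: string[];
}

// Fold a completed step's findings into the running summary; the
// caller can then discard the step's raw output from context.
function recordStepCompletion(
  summary: WorkingSummary,
  stepName: string,
  findings: string[],
  nextStep: string,
): WorkingSummary {
  return {
    ...summary,
    completedSteps: [...summary.completedSteps, stepName],
    keyFindings: [...summary.keyFindings, ...findings],
    nextStep,
  };
}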
Long-term memory: persistent storage
For information that persists across conversations (user preferences, project context, learned patterns), use external storage accessed through skills. This avoids loading everything into context upfront.
import json
from datetime import datetime


class ProjectMemory:
    """Persistent memory for project-specific context."""

    def __init__(self, storage_path: str):
        self.storage_path = storage_path

    async def remember(self, key: str, value: str, category: str = "general") -> None:
        """Store a fact for later retrieval."""
        memories = await self._load()
        memories[key] = {
            "value": value,
            "category": category,
            "stored_at": datetime.now().isoformat(),
        }
        await self._save(memories)

    async def recall(self, category: str | None = None, query: str | None = None) -> list[dict]:
        """Retrieve relevant memories, optionally filtered."""
        memories = await self._load()
        results = []
        for key, entry in memories.items():
            if category and entry["category"] != category:
                continue
            if query and query.lower() not in entry["value"].lower():
                continue
            results.append({"key": key, **entry})
        return results

    async def _load(self) -> dict:
        """Read the store (simple JSON-file backing, for illustration)."""
        try:
            with open(self.storage_path) as f:
                return json.load(f)
        except FileNotFoundError:
            return {}

    async def _save(self, memories: dict) -> None:
        """Write the store back to disk."""
        with open(self.storage_path, "w") as f:
            json.dump(memories, f, indent=2)
Choosing the right memory strategy
| What you need | Strategy | Example |
|---|---|---|
| Current task state | Short-term (context) | Workflow progress, intermediate results |
| File contents being edited | Short-term with eviction | Keep only the files currently being modified |
| Project structure | Long-term, loaded on demand | Directory layout, tech stack, conventions |
| User preferences | Long-term, loaded at start | Coding style, preferred tools, common paths |
| Previous conversation outcomes | Long-term, searched when relevant | Past decisions, resolved issues |
Context window recovery
When context is running low mid-task, skills need strategies to keep going.
Pattern: context compression checkpoint
When a workflow detects it’s approaching context limits, it should compress its state before continuing.
// WorkflowContext (with a stepResults Map) and extractKeyOutputs are
// assumed to be provided by the workflow framework.
function compressWorkflowState(ctx: WorkflowContext): WorkflowContext {
  // Replace detailed step results with summaries
  for (const [step, result] of ctx.stepResults) {
    if (typeof result === "object" && result !== null) {
      ctx.stepResults.set(step, {
        summary: result.summary || `Step ${step} completed successfully`,
        keyOutputs: extractKeyOutputs(result),
        // Drop raw data, keep only what downstream steps need
      });
    }
  }
  return ctx;
}
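Deciding when to run the checkpoint is the other half of the pattern. A reasonable trigger, sketched below, is compressing once estimated usage crosses a fraction of the window; `estimateContextTokens`, the window size, and the 80% threshold are all illustrative assumptions:

const CONTEXT_LIMIT = 200_000; // assumed window size for illustration

function maybeCompress(ctx: WorkflowContext): WorkflowContext {
  // estimateContextTokens is a hypothetical helper, e.g. the length
  // of the serialized state divided by ~4 characters per token
  if (estimateContextTokens(ctx) > CONTEXT_LIMIT * 0.8) {
    return compressWorkflowState(ctx);
  }
  return ctx;
}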
This is where context management connects directly to error handling. Running out of context mid-workflow is a failure mode your skills should anticipate and handle gracefully, not one to degrade through silently.
Key takeaways
- Tool results are the biggest context consumer. Design skills that return right-sized responses with truncation, summarization, and pagination built in.
- Use progressive detail loading. Start with overviews, drill into specifics only where needed. The three levels (overview, summary, detail) cover most use cases.
- Summarize completed work aggressively. Once a workflow step is done and its output has been consumed, replace the raw data with a compact summary.
- Separate short-term and long-term memory. Not everything belongs in the context window. Persistent facts should live in external storage and be loaded selectively.
- Design for the worst case. Assume context will run low and build compression and recovery strategies into your workflows from the start.