Context Management
Strategies for working within context window limits: summarization, selective loading, and memory patterns for agent skills.
Every AI agent operates within a context window: the total amount of text it can consider at once. This includes the system prompt, conversation history, tool descriptions, tool results, and the agent’s own reasoning. When that window fills up, the agent starts losing information and making worse decisions.
Context management is about making the most of this limited resource. It’s what determines whether your agent can handle a 50-file codebase refactor or runs out of room after reading three files. These strategies apply whether you’re designing individual skills or orchestrating multi-step workflows.
Understanding the context budget
Before you can optimize, you need to know where your context is going. A typical agent session breaks down roughly like this:
| Component | Typical size | Notes |
|---|---|---|
| System prompt | 500-2,000 tokens | Instructions, personality, constraints |
| Tool definitions | 1,000-5,000 tokens | Scales with number of available tools |
| Conversation history | 2,000-50,000 tokens | Grows with each turn |
| Tool results | 500-20,000+ tokens per call | Largest variable; a single file read can be huge |
| Agent reasoning | 1,000-5,000 tokens per turn | Chain-of-thought, planning |
The biggest offender is almost always tool results. A single file read can consume thousands of tokens. A search across a codebase might return hundreds of matches. Without careful management, a few tool calls can eat your entire budget.
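You can’t optimize what you don’t measure, so it helps to track rough per-component estimates as text enters context. Below is a minimal sketch, assuming the common approximation of roughly four characters per token for English text; `ContextBudget` and its methods are illustrative names, not a real library API.

// Minimal budget tracker. The chars/4 ratio is a rough heuristic,
// not a real tokenizer; use your model's tokenizer for exact counts.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

class ContextBudget {
  private entries: { component: string; tokens: number }[] = [];

  constructor(private limit: number) {}

  record(component: string, text: string): void {
    this.entries.push({ component, tokens: estimateTokens(text) });
  }

  used(): number {
    return this.entries.reduce((sum, e) => sum + e.tokens, 0);
  }

  remaining(): number {
    return this.limit - this.used();
  }
}

Even a crude estimate like this is enough to notice when a single tool result is about to dominate the budget.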
Summarization and compression strategies
The core idea is simple: don’t keep raw data in context when a summary will do.
Summarize tool results immediately
When a skill returns large results, summarize them before they enter the agent’s working memory. This can happen at the skill level (the skill itself returns a summary) or at the orchestration level (a post-processing step compresses the output).
import { promises as fs } from "node:fs";

// ToolResult, ReadOptions, detectLanguage, extractExports, and
// extractImports are assumed to be defined elsewhere in the skill.

// Bad: returning raw file contents into context
async function readFile(path: string): Promise<ToolResult> {
  const content = await fs.readFile(path, "utf-8");
  return { content }; // Could be 10,000+ tokens
}

// Better: return with metadata that helps the agent decide what to keep
async function readFile(
  path: string,
  options?: ReadOptions,
): Promise<ToolResult> {
  const content = await fs.readFile(path, "utf-8");
  const lines = content.split("\n");

  if (options?.summaryOnly) {
    return {
      path,
      lineCount: lines.length,
      language: detectLanguage(path),
      exports: extractExports(content),
      imports: extractImports(content),
      summary: `${lines.length} lines of ${detectLanguage(path)}. Key exports: ${extractExports(content).join(", ")}`,
    };
  }

  // If full content requested but file is large, truncate with guidance
  if (lines.length > 200) {
    return {
      path,
      content: lines.slice(0, 200).join("\n"),
      truncated: true,
      totalLines: lines.length,
      message:
        "File truncated at 200 lines. Use offset parameter to read specific sections.",
    };
  }

  return { path, content, truncated: false };
}
Progressive detail loading
Start with high-level summaries and drill down only where needed. This is the single most effective strategy for context management.
# invoke, count_files, detect_languages, read_file, and the
# extract_* helpers are assumed skill utilities.

async def explore_codebase(path: str) -> dict:
    """Level 1: Directory structure overview."""
    tree = await invoke("list_directory", path=path, recursive=True, depth=2)
    return {
        "structure": tree,
        "file_count": count_files(tree),
        "languages": detect_languages(tree),
        "hint": "Use read_file_summary for details on specific files.",
    }


async def read_file_summary(path: str) -> dict:
    """Level 2: File-level summary without full content."""
    content = await read_file(path)
    return {
        "path": path,
        "line_count": len(content.splitlines()),
        "functions": extract_function_signatures(content),
        "classes": extract_class_names(content),
        "imports": extract_imports(content),
        "hint": "Use read_file_section to read specific functions or line ranges.",
    }


async def read_file_section(path: str, start: int, end: int) -> dict:
    """Level 3: Specific section of a file (start/end are 1-indexed, inclusive)."""
    lines = (await read_file(path)).splitlines()
    return {
        "path": path,
        "range": f"lines {start}-{end} of {len(lines)}",
        "content": "\n".join(lines[start - 1 : end]),
    }
This three-level approach (overview, summary, detail) lets the agent navigate a large codebase while keeping context usage proportional to what it actually needs.
Selective context loading
Not everything needs to be in context at once. Skills should load information on demand rather than loading everything upfront.
Pattern: lazy loading with caching
class ContextManager {
  private cache = new Map<string, { data: unknown; accessedAt: Date }>();
  private maxCacheSize: number;

  constructor(maxCacheSize = 20) {
    this.maxCacheSize = maxCacheSize;
  }

  async get(key: string, loader: () => Promise<unknown>): Promise<unknown> {
    if (this.cache.has(key)) {
      const entry = this.cache.get(key)!;
      entry.accessedAt = new Date();
      return entry.data;
    }

    // Evict least recently accessed if at capacity
    if (this.cache.size >= this.maxCacheSize) {
      this.evictLeastRecent();
    }

    const data = await loader();
    this.cache.set(key, { data, accessedAt: new Date() });
    return data;
  }

  private evictLeastRecent(): void {
    let oldestKey = "";
    let oldestTime = new Date();
    for (const [key, entry] of this.cache) {
      if (entry.accessedAt < oldestTime) {
        oldestTime = entry.accessedAt;
        oldestKey = key;
      }
    }
    if (oldestKey) this.cache.delete(oldestKey);
  }
}
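A hypothetical usage, pairing the cache with the summary-mode `readFile` from earlier (the cache key scheme is illustrative):

const ctx = new ContextManager(20);

// First access loads from disk; later accesses hit the cache and
// refresh the entry's recency
const summary = await ctx.get("summary:src/index.ts", () =>
  readFile("src/index.ts", { summaryOnly: true }),
);

Evicting by least-recent access matches how agents tend to work: information touched recently is the most likely to be needed again.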
Pattern: relevance-based filtering
When a search returns many results, filter by relevance before adding them to context. This is especially important for search skills that might match hundreds of files.
def filter_search_results(
    results: list[SearchResult],
    query_context: str,
    max_results: int = 10,
) -> list[SearchResult]:
    """Filter search results to the most relevant subset."""
    scored = []
    for result in results:
        score = 0
        # Exact filename match scores highest
        if query_context.lower() in result.path.lower():
            score += 10
        # Results in src/ are usually more relevant than node_modules/
        if "/src/" in result.path:
            score += 5
        if "node_modules" in result.path or "vendor" in result.path:
            score -= 20
        # More recent files are often more relevant
        if result.modified_days_ago < 7:
            score += 3
        scored.append((score, result))
    scored.sort(key=lambda x: x[0], reverse=True)
    return [result for _, result in scored[:max_results]]
Memory patterns: short-term vs. long-term
Agent skills need different memory strategies depending on how long the information needs to stick around.
Short-term memory: conversation context
Short-term memory lives in the current conversation. It’s fast and directly accessible, but ephemeral and limited by the context window. Most workflow state lives here.
Best practices for short-term memory:
- Summarize completed steps rather than keeping full results
- Drop intermediate results once downstream steps have consumed them
- Use structured summaries the agent can quickly scan
// Instead of keeping all raw results:
const rawResults = {
  step1: {
    /* 2000 tokens of data */
  },
  step2: {
    /* 3000 tokens of data */
  },
  step3: {
    /* 1500 tokens of data */
  },
};

// Maintain a running summary:
const workingSummary = {
  completedSteps: ["fetch_data", "validate_schema", "transform"],
  keyFindings: [
    "Schema has 3 breaking changes in users table",
    "47 records failed date format validation",
    "Transform produced 10,234 clean records",
  ],
  nextStep: "load_to_destination",
  blockers: [],
};
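One way to keep that summary current is a small helper invoked whenever a step finishes, so the raw result can be dropped immediately afterward. A minimal sketch, with `WorkingSummary` mirroring the shape above and `recordStepCompletion` as a hypothetical helper:

interface WorkingSummary {
  completedSteps: string[];
  keyFindings: string[];
  nextStep: string;
  blockers: string[];
}

// Fold a completed step's findings into the running summary; the
// caller can then discard the step's raw output from context.
function recordStepCompletion(
  summary: WorkingSummary,
  stepName: string,
  findings: string[],
  nextStep: string,
): WorkingSummary {
  return {
    ...summary,
    completedSteps: [...summary.completedSteps, stepName],
    keyFindings: [...summary.keyFindings, ...findings],
    nextStep,
  };
}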
Long-term memory: persistent storage
For information that persists across conversations (user preferences, project context, learned patterns), use external storage accessed through skills. This avoids loading everything into context upfront.
import json
from datetime import datetime


class ProjectMemory:
    """Persistent memory for project-specific context."""

    def __init__(self, storage_path: str):
        self.storage_path = storage_path

    async def remember(self, key: str, value: str, category: str = "general") -> None:
        """Store a fact for later retrieval."""
        memories = await self._load()
        memories[key] = {
            "value": value,
            "category": category,
            "stored_at": datetime.now().isoformat(),
        }
        await self._save(memories)

    async def recall(self, category: str | None = None, query: str | None = None) -> list[dict]:
        """Retrieve relevant memories, optionally filtered."""
        memories = await self._load()
        results = []
        for key, entry in memories.items():
            if category and entry["category"] != category:
                continue
            if query and query.lower() not in entry["value"].lower():
                continue
            results.append({"key": key, **entry})
        return results

    async def _load(self) -> dict:
        """Read the store (simple JSON-file backing, for illustration)."""
        try:
            with open(self.storage_path) as f:
                return json.load(f)
        except FileNotFoundError:
            return {}

    async def _save(self, memories: dict) -> None:
        """Write the store back to disk."""
        with open(self.storage_path, "w") as f:
            json.dump(memories, f, indent=2)
Choosing the right memory strategy
| What you need | Strategy | Example |
|---|---|---|
| Current task state | Short-term (context) | Workflow progress, intermediate results |
| File contents being edited | Short-term with eviction | Keep only the files currently being modified |
| Project structure | Long-term, loaded on demand | Directory layout, tech stack, conventions |
| User preferences | Long-term, loaded at start | Coding style, preferred tools, common paths |
| Previous conversation outcomes | Long-term, searched when relevant | Past decisions, resolved issues |
Context window recovery
When context is running low mid-task, skills need strategies to keep going.
Pattern: context compression checkpoint
When a workflow detects it’s approaching context limits, it should compress its state before continuing.
// WorkflowContext (with a stepResults Map) and extractKeyOutputs are
// assumed to be provided by the workflow framework.
function compressWorkflowState(ctx: WorkflowContext): WorkflowContext {
  // Replace detailed step results with summaries
  for (const [step, result] of ctx.stepResults) {
    if (typeof result === "object" && result !== null) {
      ctx.stepResults.set(step, {
        summary: result.summary || `Step ${step} completed successfully`,
        keyOutputs: extractKeyOutputs(result),
        // Drop raw data, keep only what downstream steps need
      });
    }
  }
  return ctx;
}
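Deciding when to run the checkpoint is the other half of the pattern. A reasonable trigger, sketched below, is compressing once estimated usage crosses a fraction of the window; `estimateContextTokens`, the window size, and the 80% threshold are all illustrative assumptions:

const CONTEXT_LIMIT = 200_000; // assumed window size for illustration

function maybeCompress(ctx: WorkflowContext): WorkflowContext {
  // estimateContextTokens is a hypothetical helper, e.g. the length
  // of the serialized state divided by ~4 characters per token
  if (estimateContextTokens(ctx) > CONTEXT_LIMIT * 0.8) {
    return compressWorkflowState(ctx);
  }
  return ctx;
}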
This is where context management connects directly to error handling. Running out of context mid-workflow is a failure mode your skills should anticipate and handle gracefully, not one to degrade through silently.
Key takeaways
- Tool results are the biggest context consumer. Design skills that return right-sized responses with truncation, summarization, and pagination built in.
- Use progressive detail loading. Start with overviews, drill into specifics only where needed. The three levels (overview, summary, detail) cover most use cases.
- Summarize completed work aggressively. Once a workflow step is done and its output has been consumed, replace the raw data with a compact summary.
- Separate short-term and long-term memory. Not everything belongs in the context window. Persistent facts should live in external storage and be loaded selectively.
- Design for the worst case. Assume context will run low and build compression and recovery strategies into your workflows from the start.