Context engineering is the discipline of deliberately designing and managing the information you put into an AI model’s context window. Andrej Karpathy — OpenAI co-founder — coined the term in mid-2025, defining it as “the art of filling the context window with exactly the right information at the right time.”
This matters now because AI usage has fundamentally changed. We’re not issuing single queries anymore — agents now complete an average of 20 autonomous actions before requiring human input (Anthropic’s 2026 Agentic Coding Trends Report), and 95% of professional developers use AI tools weekly. Context windows have grown to 200K tokens (Claude Opus 4.8), 128K (GPT-5.5), and 1M (Gemini 3.5 Flash). The question isn’t whether your model can handle complex tasks — it’s whether your context is structured well enough to make those tasks succeed.
This guide covers everything you need to get substantially better results from Claude Code, Cursor, Cline, and similar tools. Here’s what we cover:
- The difference between prompt engineering and context engineering
- Why context quality matters more than prompt quality (with the numbers to prove it)
- The six architectural layers of context engineering
- CLAUDE.md configuration with a production-ready example
- RAG design for codebases — what actually works
- Context compression strategies for long agent sessions
- Agentic workflow patterns and common mistakes
Context Engineering vs Prompt Engineering — What Is the Difference?
These are related but different disciplines. Here’s what each one focuses on:
Prompt engineering is about phrasing — crafting the specific text you send to a model to get the response you want. It asks: “How do I ask this question so the model answers correctly?” It’s a real skill, but it has a hard ceiling.
- Focuses on a single message or conversation template
- Optimizes word choice, role framing, and instruction clarity
- Works well for single-turn Q&A and simple completions
- A perfect prompt inside a bad context still produces bad results
Context engineering encompasses prompt engineering and goes much further. It asks: “What does this model need to know — and only know — to complete this task reliably?”
- Covers the full informational environment: system prompt, retrieved documents, conversation history, tool definitions, memory summaries
- Decides what information the model sees, in what order, and how it’s structured
- Manages retrieval, compression, and refresh across multi-step workflows
- Critical for agentic workflows — in a 20-step session, bad context compounds into failure
In a single-turn chat, the distinction barely matters. In an agentic coding session with 20+ sequential decisions, it’s the difference between a workflow that completes successfully and one that spirals until you restart from scratch.
| Dimension | Prompt Engineering | Context Engineering |
|---|---|---|
| Primary concern | Phrasing of a single instruction | Full information environment across a session |
| Scope | One message or template | System prompt + retrieved data + memory + tool definitions + ordering |
| When it matters most | Single-turn Q&A, simple completions | Multi-step agents, long coding sessions, RAG pipelines |
| Primary skill | Instruction clarity, role framing | Information architecture, retrieval design, compression strategy |
| Example | “You are a senior TypeScript engineer. Refactor this function for readability.” | Injecting only the relevant file, related type definitions, project conventions from CLAUDE.md, and the specific task spec — in the right order |
| Relationship | Foundational — still required | Builds on top of prompt engineering, does not replace it |
Why Context Quality Matters More Than You Think
A peer-reviewed study across 9,649 experiments found that context quality is a stronger predictor of output quality than prompt quality. This directly contradicts the assumption that better prompting is the main lever. Developers spending hours A/B testing prompt phrasings while ignoring context architecture are optimizing the wrong variable.
A related problem compounds this: the “lost in the middle” effect. A 2024 Stanford NLP study showed that models perform best on information at the beginning or end of the context window — accuracy drops measurably when key facts are buried in the middle. Key implications:
- In a 32K-token context, accuracy on centered information dropped significantly vs. information at the edges
- At 200K tokens (Claude Opus 4.8) or 1M tokens (Gemini 2.5 Pro), naively appending files to the end of a system prompt will systematically underperform
- Position in the context window is not neutral — it directly affects model attention
The token cost angle is equally concrete. Claude Opus 4.7 costs $5 per million input tokens. The math adds up fast:
- One API call at 200K tokens = $1.00 in input tokens before any output
- A 20-step agentic session at 200K tokens per step = $20.00 in input tokens
- If 40% of that context is irrelevant noise, $8 of that $20 is wasted
- At team scale, with 95% of developers using AI weekly, this becomes a real budget line
Key insight: The agent amplification problem makes context quality exponentially more important in agentic workflows. A 10% degradation in context quality in a single-turn chat might produce a response that is 10% worse. The same degradation in step one of a 20-step autonomous agent session compounds across every subsequent decision. By step 20, the agent may be operating on a completely incorrect model of the codebase, the task, or its own previous actions — and no amount of prompt cleverness will recover it.
The 6 Layers of Context Engineering — A Complete Architecture
Context engineering isn’t one technique — it’s a layered architecture. Understanding each layer and how they interact is what separates developers who get consistently strong AI results from those who don’t. The six layers are:
- System prompt architecture
- Retrieval-augmented generation (RAG)
- Tool and function selection
- Memory systems
- Context compression and summarization
- Information ordering
Layer 1: System Prompt Architecture
The system prompt is the foundation — it establishes the model’s role, constraints, output format, and behavioral guardrails before any user message or retrieved content enters the picture. A well-engineered system prompt is not “You are a helpful coding assistant.” It’s a structured document with distinct sections:
- Role definition — establishes expertise and perspective
- Constraints — explicit statements of what the model should not do
- Output format specification — removes ambiguity about response structure
- Project-specific context — anchors the model to the codebase, stack, and conventions
System prompts should be versioned and treated as production artifacts. A prompt that works today may behave differently after a model update. Track versions, maintain changelogs, and run regression tests on prompt changes. The system prompt is code — treat it with the same rigor as any other production configuration.
Layer 2: Retrieval-Augmented Generation (RAG)
RAG is the mechanism for injecting task-relevant external information at query time instead of including everything statically. Done well, the model has exactly what it needs and nothing more. Done poorly, it’s wading through thousands of tokens of tangentially related code to find the three functions that actually matter.
The key design decisions for a RAG system:
- Chunking strategy — how to split documents (structural vs. fixed-size)
- Embedding model selection
- Retrieval mechanism — vector search, BM25, or hybrid
- Metadata filtering — narrows the search space before embedding similarity runs
- Re-ranking — improves precision of the final result set
We cover RAG in detail in a dedicated section below.
Layer 3: Tool and Function Selection
Every tool definition you provide to an agent consumes tokens and adds cognitive surface area the model has to reason about. Giving an agent 40 tool definitions when it only needs 5 actively degrades reasoning quality — research on tool-augmented language models consistently shows this. The principle is the same as good API design: expose only what’s needed for the task at hand.
Implement dynamic tool provision — different tool subsets for different task phases:
- Planning phase: read-only tools (file search, code lookup, documentation retrieval)
- Execution phase: write tools (file creation, code modification, test runner)
- Validation phase: verification tools (linters, type checkers, test suites)
Providing all tools at all phases dilutes the signal and increases the probability of misuse.
Layer 4: Memory Systems
Memory in AI systems operates at three levels — each with different tradeoffs:
- Short-term memory: In-context conversation history. Fast and zero-latency, but ephemeral and bounded by the context window size.
- Long-term memory: External storage (vector databases, key-value stores, structured databases). Persistent and unbounded, but requires explicit retrieval.
- Episodic memory: Structured summaries of past sessions, decisions, and outcomes. Helps the agent maintain continuity across multiple context windows on a long-running project.
Episodic memory is the most underutilized and highest-value memory type for coding agents working on large projects over days or weeks. CLAUDE.md is a manual, file-based implementation of episodic memory for Claude Code users — injecting structured summaries of architectural decisions and established patterns at the start of each session.
Layer 5: Context Compression and Summarization
Long agentic sessions will fill their context windows. Without a compression strategy, the agent either fails at the limit or operates with truncated context that drops critical early information. Two primary techniques handle this:
- Progressive summarization: Periodically replace older, detailed exchanges with compressed summaries as the session grows. Best when the beginning of the session contains the most critical context (task spec, architectural decisions).
- Rolling windows: Maintain a fixed-size window of the most recent exchanges in full fidelity, while compressing or discarding older content. Best for tasks where recency dominates — iterative debugging, back-and-forth refinement.
Regardless of which technique you use, pin the elements that must always survive compression: the task specification, architectural constraints, and current objective.
Layer 6: Information Ordering
Given the “lost in the middle” finding, information ordering is non-negotiable. The rule: most critical information goes at the beginning and end of the context. Supporting information goes in the middle.
A practical ordering pattern for coding tasks:
- System prompt with role and constraints — first
- Specific task instruction — immediately after the system prompt
- Retrieved supporting code and documentation — in the middle
- Restatement of the core task requirement — at the very end of the user message
This “bookending” pattern anchors the model’s attention on what matters most, even as the volume of supporting context grows.
CLAUDE.md — The Practical Starting Point for Coding Contexts
CLAUDE.md is a plain-text Markdown file that Claude Code reads at the start of every session to establish project-specific context. It’s the most accessible context engineering tool available to Claude Code users, and it’s frequently underused. A well-maintained CLAUDE.md functions as persistent episodic memory — a structured briefing that answers: “What does this agent need to know about this codebase to work effectively from message one?”
The file lives at the root of your project repo and is automatically injected at session start. You can also place it in subdirectories for module-specific context — Claude Code merges them hierarchically. Its contents are processed as part of the system prompt layer, so they receive strong positional attention.
Here is a production-quality CLAUDE.md example for a Node.js/TypeScript project:
# CLAUDE.md — Project Context for Claude Code
## Project Overview
This is the API backend for Bindspace, a B2B SaaS project management platform.
Stack: Node.js 22, TypeScript 5.4, Express 5, PostgreSQL 16, Redis 7.
Primary consumers: React frontend (internal) and mobile apps via REST API.
Current phase: Active feature development. Stability is critical on /api/billing and /api/auth routes.
## Architecture
- Monolith with module boundaries enforced via barrel exports (src/modules/*)
- No microservices. Do not propose splitting into services.
- Database access only through the repository layer (src/repositories/*). Never write raw SQL in controllers or services.
- All business logic in service files (src/services/*). Controllers are thin routing layers only.
- Shared types in src/types/. Never redefine types that exist there.
## Technology Conventions
- TypeScript: strict mode always. No 'any' types. Use 'unknown' and narrow.
- Async: always use async/await. Never use callbacks or raw Promise chains.
- Validation: use zod for all input validation. Never use joi, express-validator, or manual checks.
- HTTP errors: use the HttpError class from src/utils/errors.ts. Never throw plain Error objects from controllers.
- Logging: use the logger from src/utils/logger.ts (pino-based). Never use console.log in production code.
- ORM: use Prisma client from src/db/client.ts. Never import PrismaClient directly.
- Testing: Vitest. Test files co-located with source files (*.test.ts). Coverage threshold: 80%.
## Prohibited Patterns
- DO NOT use var. Use const by default, let only when reassignment is required.
- DO NOT use default exports. All exports must be named exports.
- DO NOT write database migrations by hand. Use 'npx prisma migrate dev --name [description]'.
- DO NOT modify files in src/generated/* — these are auto-generated by Prisma.
- DO NOT add dependencies without noting them in your response for human review.
- DO NOT catch errors silently. All catch blocks must either rethrow or log + return an appropriate error response.
## Running the Project
- Dev server: npm run dev (nodemon + ts-node, port 3001)
- Tests: npm test (Vitest, watch mode) or npm run test:ci (single run)
- Type check: npm run typecheck
- Lint: npm run lint (ESLint + Prettier)
- DB migrations: npx prisma migrate dev
## Known Fragile Areas
- src/services/billing.ts: Complex Stripe webhook handling. Touch with extreme care. Always run the billing test suite after any changes here.
- src/middleware/rateLimiter.ts: Redis-dependent. Behavior in tests differs from production due to mock Redis. Do not refactor without discussing first.
- The auth token refresh flow in src/services/auth.ts lines 145-190 has a known race condition being tracked in issue #447. Do not modify this section.
## Environment
- .env.example contains all required environment variables with descriptions.
- Staging environment is automatically deployed from the 'develop' branch.
- Production deploys are manual via GitHub Actions workflow 'deploy-prod.yml'.
The most common CLAUDE.md mistakes to avoid:
- Writing it once and never updating it. Stale context can be worse than no context — it creates confident errors based on outdated conventions.
- Making it too generic. Boilerplate like “write clean code” adds tokens without adding information. Be specific or don’t include it.
- Skipping negative instructions. The “DO NOT” section is the highest-value part of CLAUDE.md. Explicitly naming prohibited patterns is far more reliable than just omitting examples of them — a CLAUDE.md that says “do not use callbacks” will dramatically reduce callback-based suggestions, even when the model’s training data skews toward them.
RAG for Coding Agents — How to Do It Right
RAG for codebases is one of the highest-leverage context engineering techniques available — and one of the most frequently implemented badly. Naive RAG (fixed-size text chunks, embed them, retrieve top-k by cosine similarity) produces mediocre results for code. Code has structural properties plain text doesn’t: functions have entry points and dependencies, classes have hierarchies, types reference each other across files. Cutting a function in half at a token boundary produces an embedding that represents nothing useful.
Chunking strategy: structural, not positional. Chunk at function, class, and module boundaries — not at fixed token counts.
- A 400-token function is one chunk
- A large class is chunked at method boundaries
- For very large codebases, use a two-level hierarchy: file-level summaries for broad navigation, function-level chunks for precise retrieval
Metadata filtering: Every code chunk should be stored with metadata, and you should filter on it before running vector similarity search.
- File path and programming language
- Last modified timestamp
- Module or package membership
- Relevant tags (e.g., “authentication”, “database”, “public-api”)
- If the task involves a TypeScript file in src/services/, filter to that path before running embedding search — precision improves dramatically
Hybrid search: Combine BM25 (keyword-based ranking) with vector similarity search. Pure vector search handles semantic queries well (“find code that does authentication”) but misses exact symbol matches. BM25 handles exact symbol matching precisely but fails at semantic queries. Combining them with a re-ranking step consistently outperforms either approach alone. Use hybrid search as the default for production codebase RAG.
Here is a conceptual example of a well-engineered codebase RAG query function:
// Conceptual codebase RAG query — TypeScript pseudocode for illustration
async function retrieveCodeContext(
query: string,
options: {
currentFilePath: string;
language: string;
maxTokens: number;
taskType: 'read' | 'modify' | 'debug';
}
): Promise<CodeContext> {
// Step 1: Metadata pre-filter — narrow the search space before embedding
const metadataFilter = {
language: options.language,
// For modifications, prioritize same module; for debugging, allow broader search
modulePath: options.taskType === 'modify'
? extractModulePath(options.currentFilePath)
: undefined,
// Deprioritize files not modified in 6+ months for active feature work
lastModifiedAfter: options.taskType !== 'debug'
? subMonths(new Date(), 6)
: undefined,
};
// Step 2: Run hybrid search (BM25 + vector, combined with RRF re-ranking)
const [vectorResults, bm25Results] = await Promise.all([
vectorStore.search(await embedQuery(query), { filter: metadataFilter, topK: 20 }),
bm25Index.search(query, { filter: metadataFilter, topK: 20 }),
]);
// Step 3: Reciprocal Rank Fusion (RRF) to merge result lists
const mergedResults = reciprocalRankFusion([vectorResults, bm25Results]);
// Step 4: Expand retrieved chunks to include their imports and type references
const expandedChunks = await expandWithDependencies(mergedResults.slice(0, 10));
// Step 5: Fit within token budget, prioritizing highest-ranked chunks
const fittedContext = fitToTokenBudget(expandedChunks, options.maxTokens);
// Step 6: Order for context injection — most relevant first (beginning of retrieved section)
// and second-most-relevant last (end of retrieved section) per "lost in the middle" findings
return orderForContextInjection(fittedContext);
}
The dependency expansion step (Step 4) deserves emphasis. When you retrieve a function that calls other functions defined elsewhere, the raw chunk may be nearly useless without its dependencies. A good RAG system for code automatically fetches the signatures — and ideally the full definitions — of functions called within a retrieved chunk, plus the type definitions for its parameters and return values. This “dependency hydration” significantly improves the model’s ability to reason accurately about retrieved code.
Context Compression — What to Do When You Hit the Limit
Context windows are finite, and in long agentic coding sessions they will fill up. The naive response — starting a new session and losing all accumulated context — is usually the most expensive option, both in tokens to re-establish context and in productivity cost. Context compression preserves the semantic content most critical to completing the remaining work while reducing size.
Two primary techniques:
- Progressive summarization: After every N exchange turns, replace the oldest turns with a structured summary (“Agent investigated the auth module, identified a bug in the token refresh logic on line 147, proposed a mutex lock fix, fix was approved and applied”). The history now contains the summary plus recent turns — far fewer tokens, decision trail preserved. Best when early context (task spec, architectural decisions) is most important.
- Rolling windows: Maintain a fixed-size window of the most recent N exchanges in full fidelity. Older content is compressed externally or discarded. Best for iterative debugging where recency dominates. Mitigate the risk of losing early context by pinning critical elements (task spec, constraints, current objective) that always appear at the beginning regardless of compression.
The cost math is straightforward. A 20-step agentic session without compression grows linearly toward the 200K token ceiling. With progressive summarization that halves context size every 5 steps, total input token consumption drops 40–60% across the session. At Claude Opus 4.7 pricing of $5/M tokens, a session that costs $20 in input tokens without compression might cost $8–12 with compression. For teams running hundreds of sessions per day, this becomes a real budget decision.
Practical rule of thumb: When your context is more than 60% full and you have more than a few steps remaining, start compressing. Waiting until you hit 90–95% forces rushed compression under pressure and risks losing critical state. Design compression triggers at the 60% threshold and pin your task specification, current objective, and top-level constraints so they always survive compression.
Context Engineering for Agentic Workflows — A Different Set of Rules
Agentic workflows require context strategies that are fundamentally different from single-turn interactions. In a single-turn chat, over-including information is relatively harmless — the model reads it once and you get the result. In a 20-step autonomous workflow, over-inclusion means carrying cognitive overhead through every step, and any misinterpretation compounds over time. The core principles for agentic context:
- Scoped context: Give the agent only the modules relevant to the current task — not the full codebase. Provide only the tools appropriate to the current task phase. Compress conversation history at regular intervals and pin the task specification.
- Checkpoint injection: After every 5 actions, inject a structured checkpoint that restates the original task objective, actions taken so far (from the compressed summary), current state of the codebase (which files have been modified), and remaining objectives. This prevents the common failure mode where an agent solves a subproblem well but in a direction that doesn’t serve the original goal.
- Tool phase gating: Restrict available tools to what’s appropriate for the current phase — planning, execution, or validation. Never provide all tools at all phases.
- Pinned invariant context: The task specification, architectural constraints, and core conventions must always be present and unchanged across every step of the workflow.
- Explicit out-of-scope declarations: Tell the agent what it should not do in the task spec. This reduces scope creep and prevents well-intentioned detours that accumulate as technical debt.
Here is what a well-engineered agent context looks like at initialization for a coding task:
// Agent context initialization structure for a coding task — annotated pseudocode
const agentContext = {
// ---- LAYER 1: System prompt (ALWAYS first — benefits from positional attention) ----
systemPrompt: `
You are a senior TypeScript engineer working on the Bindspace API.
Project conventions are defined below. Follow them strictly.
Work incrementally. Complete one logical change at a time.
Before modifying a file, read its current contents using the read_file tool.
After modifying a file, verify the change compiles using the typecheck tool.
If you encounter unexpected state (a file that does not match your mental model),
STOP and report the discrepancy rather than proceeding with assumptions.
[CLAUDE.md contents injected here — approximately 300-500 tokens]
`,
// ---- LAYER 2: Scoped task specification ----
// Specific, bounded, with explicit success criteria and out-of-scope declaration
taskSpec: `
TASK: Add rate limiting to the POST /api/auth/login endpoint.
SUCCESS CRITERIA:
- Rate limit: 5 attempts per IP per 15-minute window
- Use the existing Redis-based rate limiter in src/middleware/rateLimiter.ts
- Return 429 with a Retry-After header when the limit is exceeded
- Add a test in src/routes/auth.test.ts covering the rate-limited case
- Do NOT modify any other auth routes
OUT OF SCOPE: Do not refactor the existing rate limiter. Use it as-is.
`,
// ---- LAYER 3: Scoped tool set ----
// Only tools needed for this specific task — NOT all available tools
tools: [
'read_file', // Read source files before modifying
'write_file', // Write modified source files
'list_directory', // Navigate the project structure if needed
'run_typecheck', // Verify TypeScript compilation after each change
'run_tests', // Run auth test suite: npm test src/routes/auth.test.ts
// EXCLUDED: deploy, git_commit, database_migrate, install_package
// These exclusions prevent scope creep and reduce reasoning surface area
],
// ---- LAYER 4: Pre-retrieved relevant context (injected in MIDDLE of context) ----
// Structurally chunked at file/function level, dependency-hydrated
retrievedContext: [
'src/middleware/rateLimiter.ts', // The existing rate limiter — full file
'src/routes/auth.ts', // Route file to be modified — full file
'src/routes/auth.test.ts', // Test file to be extended — full file
'src/types/express.d.ts', // Request type augmentations — for context
],
// ---- LAYER 5: Checkpoint configuration ----
checkpointEveryNActions: 5,
pinnedContext: ['taskSpec'], // Always survives compression
};
Common Context Engineering Mistakes — and How to Fix Them
Context Pollution
This is the most common mistake: injecting every potentially relevant file, document, or conversation turn without curation. The intuition (“more information = better results”) is wrong in practice. More noise means less effective attention on the signal.
- Fix: Adopt a retrieval mindset. Before adding any content to the context, ask “does the model specifically need this for the current step?” If the answer is “it might help,” put it in a retrieval index — not the primary context. The context window is prime real estate.
Over-Relying on Recency
Developers often append the most important instructions at the very end of a long context, assuming the model reads sequentially and remembers recent content most clearly. The “lost in the middle” research shows this is only partially true. The end receives strong attention — but so does the beginning. Information buried 150K tokens into a 200K-token context receives degraded attention regardless of its importance.
- Fix: Explicit bookending — put what matters most at both the start and the end of the context, with supporting context in the middle.
Missing Negative Instructions
Models learn patterns from training data, including widely-used but locally-prohibited approaches. If your codebase bans a popular library, uses a non-standard error handling pattern, or has architectural decisions that violate common conventions, the absence of those patterns in your codebase is not enough to prevent the model from introducing them.
- Fix: Add explicit “DO NOT use X” instructions to your CLAUDE.md and system prompt. An explicit prohibition is far more reliable than relying on code-level omission.
Not Versioning System Prompts
When a model update changes how a system prompt is interpreted, or when a team member edits the system prompt and inadvertently breaks existing behavior, you need the ability to diff, rollback, and test. Without version control, these regressions are invisible.
- Fix: Store system prompts in version control. Tag them with the model version they were tested against. Run a baseline eval suite when you change them. Treat them exactly as you would any other production configuration file.
Inconsistent Context Across Agent Steps
If one step provides detailed type definitions and the next doesn’t, the agent’s internal model of the codebase becomes inconsistent. If tool definitions change between steps, the agent may try to use tools that are no longer available. These inconsistencies compound quickly in long workflows.
- Fix: Ensure the persistent, invariant context (task specification, conventions, constraints) is maintained identically across every step. Add a context consistency check before each agent step to verify invariant elements are present and unchanged.
| Mistake | Symptom | Fix |
|---|---|---|
| Context pollution | Model produces correct but unfocused or overly cautious responses | Retrieval-first mindset — only inject what is definitively needed |
| Over-relying on recency | Critical early instructions ignored in long sessions | Bookend critical info at beginning AND end of context |
| Missing negative instructions | Model introduces banned libraries or deprecated patterns | Add explicit DO NOT clauses in CLAUDE.md and system prompt |
| Unversioned system prompts | Unexplained behavior regressions after model updates | Version control all prompts; tag with model version; run evals on changes |
| Inconsistent agent context | Agent contradicts its earlier decisions; tool call failures | Pin invariant context; consistency check before each agent step |
| Stale CLAUDE.md | Confident errors based on outdated conventions | Quarterly CLAUDE.md reviews; remove anything no longer accurate |
Measuring Context Engineering Quality — Metrics That Actually Matter
You can’t improve what you don’t measure. The three metrics that matter most for context engineering quality:
- Task completion rate: What percentage of agent tasks complete successfully without human intervention?
- Correction rounds per task: How many times does a human need to redirect or correct the agent before the task is done?
- Token cost per successfully completed task: Total input plus output tokens divided by tasks that met acceptance criteria.
A/B testing system prompts is the most direct way to measure the impact of context engineering changes. Define a benchmark task set that represents your real workload, establish a baseline with your current configuration, introduce one change at a time, and measure on the same benchmark. Run 20–50 trials per variant for reliable signal. Use a fixed seed where the model supports it to reduce sampling variance.
Evaluation frameworks reduce the overhead of systematic measurement. The main options:
- LangSmith: Tracing and evaluation infrastructure that captures full context payloads alongside quality scores, making it straightforward to correlate context composition with output quality
- PromptFoo: Dedicated prompt and context evaluation framework with assertion-based test cases — define what a correct output looks like and it runs your prompts against a test suite automatically
- Anthropic’s Evals framework: Purpose-built for Claude-based systems with native Claude Code integration
Key insight: The 2026 Anthropic data shows that developers report only being able to fully delegate 0–20% of tasks to AI agents — despite agents completing 20 autonomous actions on average. This gap suggests that the bottleneck is not agent capability but context quality. Developers who invest in measuring and improving their context engineering systematically close this delegation gap faster than those who iterate intuitively.
Frequently Asked Questions About Context Engineering in 2026
What exactly is context engineering, and who coined the term?
Context engineering is the discipline of deliberately designing and managing the information placed into an AI model’s context window to maximize output quality across complex, multi-step tasks. Andrej Karpathy — OpenAI co-founder — coined the term in mid-2025, defining it as “the art of filling the context window with exactly the right information at the right time.” The term reflects a growing recognition that prompt phrasing is a second-order variable compared to the quality and composition of the full context the model operates within.
Is context engineering just a more complicated version of prompt engineering?
No — it’s a broader discipline that includes prompt engineering as one component. Context engineering covers the full informational architecture:
- System prompt design
- Retrieved document selection and ordering
- Conversation history management
- Tool definitions provided to agents
- Memory systems (short-term in-context and long-term external)
- Compression approach for long sessions
In a single-turn chat, the difference is minor. In a 20-step agentic coding workflow, context engineering decisions determine whether the agent succeeds or fails — and prompt engineering is just one of many levers.
What is the “lost in the middle” problem and how does it affect coding AI tools?
The “lost in the middle” problem, documented in a 2024 Stanford NLP study, is the empirical finding that language models perform significantly better on information at the beginning and end of the context window than on information buried in the middle. This is a structural property of how transformer attention works at long context lengths. For coding AI tools, the practical fix is to bookend critical information: put your most important instructions and constraints at the beginning (in the system prompt) and restate the core task at the end of your user message, with supporting context — files, documentation — in the middle.
How do I get started with CLAUDE.md for my project?
Create a file named CLAUDE.md at the root of your project repository. Start with these six sections:
- Project overview (2–3 sentences on what the project does and its primary constraints)
- Technology stack with exact version numbers
- Coding conventions that aren’t obvious from reading the code (naming patterns, async patterns, file organization)
- Explicitly prohibited patterns starting with “DO NOT” — this is the highest-value section
- Common commands for running, testing, and deploying
- Known fragile areas that require extra care
Keep it under 500 lines. Update it as the project evolves. Review it quarterly and remove anything that’s no longer accurate — stale context produces confident errors.
What is the token cost impact of poor context engineering at production scale?
The numbers add up fast. Claude Opus 4.7 costs $5 per million input tokens. A single API call with a full 200K-token context costs $1.00 in input tokens. A 20-step agentic session at 200K tokens per step costs $20.00 in input tokens alone. If 40% of that context is irrelevant content (files not needed, stale docs, duplicated type definitions), $8 of that $20 is being spent on noise. At team scale — with 95% of professional developers using AI tools weekly — the per-developer cost of poor context engineering multiplies rapidly. Treat context optimization as a cost engineering initiative, not just a quality initiative.
How does context engineering differ for agents versus single-turn AI chat?
Single-turn chat is forgiving of context engineering errors — the model makes one decision and a human evaluates the result immediately. Agents make sequences of decisions (20 on average in 2026 coding workflows) where each step depends on the previous ones. Errors compound. For agents, context engineering must address additional concerns that don’t apply to single-turn chat:
- Tool scoping — providing only the tools appropriate to the current task phase
- Checkpoint injection — re-anchoring the agent to the original objective at regular intervals
- Context compression — managing the accumulation of conversation history over many steps
- Consistency enforcement — ensuring invariant context elements are present and unchanged across every step
What is the relationship between RAG and context engineering?
RAG is one of the six layers of context engineering — the mechanism for dynamically selecting which external information gets injected into the context window at query time. Context engineering is the broader discipline that includes RAG design alongside system prompt architecture, tool selection, memory systems, compression strategies, and information ordering. Many developers implement RAG without thinking about the other layers, which limits its effectiveness. Retrieving highly relevant code chunks (good RAG) but injecting them in the middle of a long context (poor ordering) means the “lost in the middle” effect will degrade the model’s ability to use them. RAG design and context ordering must be considered together.
Should I invest time in context engineering if I only use AI tools occasionally?
The value scales with usage frequency. For truly occasional use (a few queries per week), a sophisticated context architecture probably isn’t worth the overhead. But 95% of professional developers now use AI tools weekly — the “occasional user” category is shrinking fast. For daily users, even a basic CLAUDE.md for each active project produces measurable improvements in output quality and fewer correction rounds per task. The 30–60 minutes to write a good CLAUDE.md pays back within a single focused coding session on any non-trivial project. The deeper techniques — RAG design, agent checkpointing, compression — are most valuable for developers building AI-native workflows or managing AI coding agents at team scale.
Wrapping Up — Context Is the New Code Quality
The evidence is clear: a 9,649-experiment peer-reviewed study confirms context quality outperforms prompt quality as a predictor of output quality, Stanford NLP has documented the structural attention biases that make information ordering a real engineering concern, and Anthropic’s 2026 data shows agents completing 20 autonomous actions before human input — making the compounding cost of context errors a first-order operational issue. The developers who get disproportionate value from AI coding tools in the next 12 months won’t find better prompts. They’ll build better context architectures.
Where to start:
- Write a CLAUDE.md for every active project — include the “DO NOT” section, keep it under 500 lines, review it quarterly
- Audit your RAG pipeline — move to structural chunking at function and class boundaries and implement hybrid search if you haven’t already
- Add checkpoint injection to long agent sessions — re-anchor to the original objective every 5 actions
- Version your system prompts — store them in source control, tag them with model version, run evals on changes
- Start measuring — track task completion rate, correction rounds per task, and token cost per successfully completed task so you have a baseline to improve against