Context Engineering in 2026: The Complete Developer’s Guide

June 8, 2026
4:56 am

Context engineering is the discipline of deliberately designing and managing the information you put into an AI model’s context window. Andrej Karpathy, an OpenAI co-founder, popularized the term in mid-2025, describing it as “the delicate art and science of filling the context window with just the right information for the next step.”

This matters now because AI usage has fundamentally changed. We’re not issuing single queries anymore — agents now complete an average of 20 autonomous actions before requiring human input (Anthropic’s 2026 Agentic Coding Trends Report), and 95% of professional developers use AI tools weekly. Context windows have grown to 200K tokens (Claude Opus 4.8), 128K (GPT-5.5), and 1M (Gemini 3.5 Flash). The question isn’t whether your model can handle complex tasks — it’s whether your context is structured well enough to make those tasks succeed.

This guide covers everything you need to get substantially better results from Claude Code, Cursor, Cline, and similar tools. Here’s what we cover:

The difference between prompt engineering and context engineering
Why context quality matters more than prompt quality (with the numbers to prove it)
The six architectural layers of context engineering
CLAUDE.md configuration with a production-ready example
RAG design for codebases — what actually works
Context compression strategies for long agent sessions
Agentic workflow patterns and common mistakes

Context Engineering vs Prompt Engineering — What Is the Difference?

These are related but different disciplines. Here’s what each one focuses on:

Prompt engineering is about phrasing — crafting the specific text you send to a model to get the response you want. It asks: “How do I ask this question so the model answers correctly?” It’s a real skill, but it has a hard ceiling.

Focuses on a single message or conversation template
Optimizes word choice, role framing, and instruction clarity
Works well for single-turn Q&A and simple completions
A perfect prompt inside a bad context still produces bad results

Context engineering encompasses prompt engineering and goes much further. It asks: “What does this model need to know — and only know — to complete this task reliably?”

Covers the full informational environment: system prompt, retrieved documents, conversation history, tool definitions, memory summaries
Decides what information the model sees, in what order, and how it’s structured
Manages retrieval, compression, and refresh across multi-step workflows
Critical for agentic workflows — in a 20-step session, bad context compounds into failure

In a single-turn chat, the distinction barely matters. In an agentic coding session with 20+ sequential decisions, it’s the difference between a workflow that completes successfully and one that spirals until you restart from scratch.

Dimension	Prompt Engineering	Context Engineering
Primary concern	Phrasing of a single instruction	Full information environment across a session
Scope	One message or template	System prompt + retrieved data + memory + tool definitions + ordering
When it matters most	Single-turn Q&A, simple completions	Multi-step agents, long coding sessions, RAG pipelines
Primary skill	Instruction clarity, role framing	Information architecture, retrieval design, compression strategy
Example	“You are a senior TypeScript engineer. Refactor this function for readability.”	Injecting only the relevant file, related type definitions, project conventions from CLAUDE.md, and the specific task spec — in the right order
Relationship	Foundational — still required	Builds on top of prompt engineering, does not replace it

Why Context Quality Matters More Than You Think

A peer-reviewed study across 9,649 experiments found that context quality is a stronger predictor of output quality than prompt quality. This directly contradicts the assumption that better prompting is the main lever. Developers spending hours A/B testing prompt phrasings while ignoring context architecture are optimizing the wrong variable.

A related problem compounds this: the “lost in the middle” effect. The Stanford-led “Lost in the Middle” study (Liu et al., published in 2024) showed that models perform best on information at the beginning or end of the context window, and that accuracy drops measurably when key facts are buried in the middle. Key implications:

In the study’s multi-document question answering tests, accuracy on information placed mid-context fell well below accuracy on information at the edges
At 200K tokens (Claude Opus 4.8) or 1M tokens (Gemini 2.5 Pro), naively appending files to the end of a system prompt will systematically underperform
Position in the context window is not neutral — it directly affects model attention

The token cost angle is equally concrete. Claude Opus 4.7 costs $5 per million input tokens. The math adds up fast:

One API call at 200K tokens = $1.00 in input tokens before any output
A 20-step agentic session at 200K tokens per step = $20.00 in input tokens
If 40% of that context is irrelevant noise, $8 of that $20 is wasted
At team scale, with 95% of developers using AI weekly, this becomes a real budget line
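
The arithmetic above is easy to reproduce. The sketch below uses the $5 per million input tokens price quoted in this section; the function names are illustrative, and it assumes every step sends a full 200K-token context, which is the worst case rather than a typical session.

```typescript
// Input-token cost arithmetic for an agentic session (illustrative sketch).
// The price is the figure quoted in this article, not a value fetched from any API.
const PRICE_PER_MILLION_INPUT_TOKENS_USD = 5;

/** Cost in USD of sending `tokens` input tokens once. */
function inputCostUsd(tokens: number, pricePerMillionUsd: number): number {
  return (tokens * pricePerMillionUsd) / 1_000_000;
}

/** Cost of a session in which every step sends the same number of input tokens. */
function sessionInputCostUsd(
  tokensPerStep: number,
  steps: number,
  pricePerMillionUsd: number,
): number {
  return inputCostUsd(tokensPerStep * steps, pricePerMillionUsd);
}

/** Share of a cost that was spent on irrelevant context. */
function wastedCostUsd(totalCostUsd: number, noiseFraction: number): number {
  return totalCostUsd * noiseFraction;
}

const singleCallCost = inputCostUsd(200_000, PRICE_PER_MILLION_INPUT_TOKENS_USD);
const sessionCost = sessionInputCostUsd(200_000, 20, PRICE_PER_MILLION_INPUT_TOKENS_USD);
const wastedCost = wastedCostUsd(sessionCost, 0.4);

console.log({ singleCallCost, sessionCost, wastedCost });
```

Swapping in your own price, step count, and estimated noise fraction gives a quick per-session estimate to multiply by your team’s daily session volume.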

Key insight: The agent amplification problem makes context quality far more important in agentic workflows, because errors compound instead of staying contained. A 10% degradation in context quality in a single-turn chat might produce a response that is 10% worse. The same degradation in step one of a 20-step autonomous agent session carries into every subsequent decision. By step 20, the agent may be operating on a completely incorrect model of the codebase, the task, or its own previous actions, and no amount of prompt cleverness will recover it.

The 6 Layers of Context Engineering — A Complete Architecture

Context engineering isn’t one technique — it’s a layered architecture. Understanding each layer and how they interact is what separates developers who get consistently strong AI results from those who don’t. The six layers are:

System prompt architecture
Retrieval-augmented generation (RAG)
Tool and function selection
Memory systems
Context compression and summarization
Information ordering

Layer 1: System Prompt Architecture

The system prompt is the foundation — it establishes the model’s role, constraints, output format, and behavioral guardrails before any user message or retrieved content enters the picture. A well-engineered system prompt is not “You are a helpful coding assistant.” It’s a structured document with distinct sections:

Role definition — establishes expertise and perspective
Constraints — explicit statements of what the model should not do
Output format specification — removes ambiguity about response structure
Project-specific context — anchors the model to the codebase, stack, and conventions

System prompts should be versioned and treated as production artifacts. A prompt that works today may behave differently after a model update. Track versions, maintain changelogs, and run regression tests on prompt changes. The system prompt is code — treat it with the same rigor as any other production configuration.
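
One way to make the four sections and the versioning advice concrete is to keep the prompt as structured data and render it to text. This is a minimal sketch: the `SystemPromptSpec` shape, its field names, and the example values are assumptions for illustration, not a standard schema.

```typescript
// A system prompt kept as a versioned, structured artifact (illustrative sketch).
// Field names and example content are hypothetical.
interface SystemPromptSpec {
  version: string;        // bump on every change, like any other config
  testedAgainst: string;  // model identifier the prompt was last evaluated on
  role: string;
  constraints: string[];
  outputFormat: string;
  projectContext: string;
}

/** Render the spec into one prompt string with clearly separated sections. */
function renderSystemPrompt(spec: SystemPromptSpec): string {
  const constraints = spec.constraints.map((c) => `- ${c}`).join("\n");
  return [
    `# Role\n${spec.role}`,
    `# Constraints\n${constraints}`,
    `# Output format\n${spec.outputFormat}`,
    `# Project context\n${spec.projectContext}`,
  ].join("\n\n");
}

const promptSpec: SystemPromptSpec = {
  version: "1.3.0",
  testedAgainst: "example-model-id", // placeholder, not a real model name
  role: "You are a senior TypeScript engineer working on the Bindspace API.",
  constraints: [
    "Never write raw SQL in controllers or services.",
    "Do not add dependencies without flagging them for review.",
  ],
  outputFormat: "Reply with a unified diff followed by a one-paragraph rationale.",
  projectContext: "Stack: Node.js 22, TypeScript 5.4, Express 5, PostgreSQL 16, Redis 7.",
};

const renderedSystemPrompt = renderSystemPrompt(promptSpec);
console.log(renderedSystemPrompt);
```

The `version` and `testedAgainst` fields are deliberately left out of the rendered text: they exist for diffs, changelogs, and regression runs, not for the model.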

Layer 2: Retrieval-Augmented Generation (RAG)

RAG is the mechanism for injecting task-relevant external information at query time instead of including everything statically. Done well, the model has exactly what it needs and nothing more. Done poorly, it’s wading through thousands of tokens of tangentially related code to find the three functions that actually matter.

The key design decisions for a RAG system:

Chunking strategy — how to split documents (structural vs. fixed-size)
Embedding model selection
Retrieval mechanism — vector search, BM25, or hybrid
Metadata filtering — narrows the search space before embedding similarity runs
Re-ranking — improves precision of the final result set

We cover RAG in detail in a dedicated section below.

Layer 3: Tool and Function Selection

Every tool definition you provide to an agent consumes tokens and adds surface area the model has to reason about. Giving an agent 40 tool definitions when it only needs 5 tends to degrade tool selection and reasoning quality, because the model must discriminate among more options on every step. The principle is the same as good API design: expose only what’s needed for the task at hand.

Implement dynamic tool provision — different tool subsets for different task phases:

Planning phase: read-only tools (file search, code lookup, documentation retrieval)
Execution phase: write tools (file creation, code modification, test runner)
Validation phase: verification tools (linters, type checkers, test suites)

Providing all tools at all phases dilutes the signal and increases the probability of misuse.
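
Phase-based tool provision can be sketched as a simple filter over a tool registry. The tool names and the `phases` field below are hypothetical; a real agent framework will have its own way of registering tools.

```typescript
// Dynamic tool provision: expose only the tools tagged for the current phase.
// Tool names are illustrative placeholders, not a real framework's tool set.
type Phase = "planning" | "execution" | "validation";

interface ToolDefinition {
  name: string;
  phases: Phase[];
}

const TOOL_REGISTRY: ToolDefinition[] = [
  { name: "search_files", phases: ["planning"] },
  { name: "read_file", phases: ["planning", "execution", "validation"] },
  { name: "write_file", phases: ["execution"] },
  { name: "run_tests", phases: ["execution", "validation"] },
  { name: "run_typecheck", phases: ["validation"] },
  { name: "run_linter", phases: ["validation"] },
];

/** Return the names of the tools that should be offered during `phase`. */
function toolNamesForPhase(registry: ToolDefinition[], phase: Phase): string[] {
  return registry
    .filter((tool) => tool.phases.includes(phase))
    .map((tool) => tool.name);
}

const planningTools = toolNamesForPhase(TOOL_REGISTRY, "planning");
const executionTools = toolNamesForPhase(TOOL_REGISTRY, "execution");
const validationTools = toolNamesForPhase(TOOL_REGISTRY, "validation");

console.log({ planningTools, executionTools, validationTools });
```

Keeping the phase tags next to each tool definition means that adding a tool forces a decision about when it is available, instead of defaulting to "always".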

Layer 4: Memory Systems

Memory in AI systems operates at three levels — each with different tradeoffs:

Short-term memory: In-context conversation history. Fast and zero-latency, but ephemeral and bounded by the context window size.
Long-term memory: External storage (vector databases, key-value stores, structured databases). Persistent and unbounded, but requires explicit retrieval.
Episodic memory: Structured summaries of past sessions, decisions, and outcomes. Helps the agent maintain continuity across multiple context windows on a long-running project.

Episodic memory is the most underutilized and highest-value memory type for coding agents working on large projects over days or weeks. CLAUDE.md is a manual, file-based implementation of episodic memory for Claude Code users — injecting structured summaries of architectural decisions and established patterns at the start of each session.

Layer 5: Context Compression and Summarization

Long agentic sessions will fill their context windows. Without a compression strategy, the agent either fails at the limit or operates with truncated context that drops critical early information. Two primary techniques handle this:

Progressive summarization: Periodically replace older, detailed exchanges with compressed summaries as the session grows. Best when the beginning of the session contains the most critical context (task spec, architectural decisions).
Rolling windows: Maintain a fixed-size window of the most recent exchanges in full fidelity, while compressing or discarding older content. Best for tasks where recency dominates — iterative debugging, back-and-forth refinement.

Regardless of which technique you use, pin the elements that must always survive compression: the task specification, architectural constraints, and current objective.
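
A rolling window with pinning can be sketched in a few lines. The `Message` shape and the `pinned` flag are assumptions for illustration; the point is that pinned items are exempt from the window and original order is preserved.

```typescript
// Rolling window with pinned messages (illustrative sketch).
interface Message {
  role: "system" | "user" | "assistant";
  content: string;
  pinned?: boolean; // pinned messages always survive trimming
}

/**
 * Keep every pinned message plus the most recent `windowSize` unpinned
 * messages, preserving the original order.
 */
function applyRollingWindow(history: Message[], windowSize: number): Message[] {
  const unpinnedIndexes = history
    .map((message, index) => (message.pinned ? -1 : index))
    .filter((index) => index >= 0);
  const firstKept = Math.max(0, unpinnedIndexes.length - windowSize);
  const keep = new Set(unpinnedIndexes.slice(firstKept));
  return history.filter((message, index) => message.pinned === true || keep.has(index));
}

const sessionHistory: Message[] = [
  { role: "user", content: "TASK: add rate limiting to POST /api/auth/login", pinned: true },
  { role: "assistant", content: "Reading src/routes/auth.ts" },
  { role: "user", content: "Go ahead." },
  { role: "assistant", content: "Applied the middleware." },
  { role: "user", content: "The test is failing." },
  { role: "assistant", content: "Fixed the Retry-After header." },
];

const trimmedHistory = applyRollingWindow(sessionHistory, 2);
console.log(trimmedHistory.map((m) => m.content));
```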

Layer 6: Information Ordering

Given the “lost in the middle” finding, information ordering is non-negotiable. The rule: most critical information goes at the beginning and end of the context. Supporting information goes in the middle.

A practical ordering pattern for coding tasks:

System prompt with role and constraints — first
Specific task instruction — immediately after the system prompt
Retrieved supporting code and documentation — in the middle
Restatement of the core task requirement — at the very end of the user message

This “bookending” pattern anchors the model’s attention on what matters most, even as the volume of supporting context grows.
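
The ordering pattern can be expressed as a small assembly function. This is a sketch under the assumption of a chat-style API with a system message and a user message; the section labels (`TASK`, `CONTEXT`, `REMINDER`) are arbitrary choices, not a required format.

```typescript
// Bookended prompt assembly: task first, supporting context in the middle,
// task restated at the very end (illustrative sketch).
interface PromptParts {
  systemPrompt: string;
  task: string;
  supportingContext: string[];
}

interface ChatMessage {
  role: "system" | "user";
  content: string;
}

function buildBookendedMessages(parts: PromptParts): ChatMessage[] {
  const userContent = [
    `TASK:\n${parts.task}`,
    ...parts.supportingContext.map((text, i) => `CONTEXT ${i + 1}:\n${text}`),
    `REMINDER, the task to complete is:\n${parts.task}`,
  ].join("\n\n");

  return [
    { role: "system", content: parts.systemPrompt },
    { role: "user", content: userContent },
  ];
}

const bookendTask = "Add rate limiting to POST /api/auth/login (5 attempts per IP per 15 minutes).";
const bookendedMessages = buildBookendedMessages({
  systemPrompt: "You are a senior TypeScript engineer. Follow the project conventions.",
  task: bookendTask,
  supportingContext: [
    "// src/middleware/rateLimiter.ts (contents elided)",
    "// src/routes/auth.ts (contents elided)",
  ],
});

console.log(bookendedMessages[1].content);
```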

CLAUDE.md — The Practical Starting Point for Coding Contexts

CLAUDE.md is a plain-text Markdown file that Claude Code reads at the start of every session to establish project-specific context. It’s the most accessible context engineering tool available to Claude Code users, and it’s frequently underused. A well-maintained CLAUDE.md functions as persistent episodic memory — a structured briefing that answers: “What does this agent need to know about this codebase to work effectively from message one?”

The file lives at the root of your project repo and is loaded automatically at session start. You can also place CLAUDE.md files in subdirectories for module-specific context, which Claude Code picks up alongside the root file. Because the contents are loaded ahead of your first message, they sit near the start of the context, where positional attention is strong.

Here is a production-quality CLAUDE.md example for a Node.js/TypeScript project:

# CLAUDE.md — Project Context for Claude Code

## Project Overview
This is the API backend for Bindspace, a B2B SaaS project management platform.
Stack: Node.js 22, TypeScript 5.4, Express 5, PostgreSQL 16, Redis 7.
Primary consumers: React frontend (internal) and mobile apps via REST API.
Current phase: Active feature development. Stability is critical on /api/billing and /api/auth routes.

## Architecture
- Monolith with module boundaries enforced via barrel exports (src/modules/*)
- No microservices. Do not propose splitting into services.
- Database access only through the repository layer (src/repositories/*). Never write raw SQL in controllers or services.
- All business logic in service files (src/services/*). Controllers are thin routing layers only.
- Shared types in src/types/. Never redefine types that exist there.

## Technology Conventions
- TypeScript: strict mode always. No 'any' types. Use 'unknown' and narrow.
- Async: always use async/await. Never use callbacks or raw Promise chains.
- Validation: use zod for all input validation. Never use joi, express-validator, or manual checks.
- HTTP errors: use the HttpError class from src/utils/errors.ts. Never throw plain Error objects from controllers.
- Logging: use the logger from src/utils/logger.ts (pino-based). Never use console.log in production code.
- ORM: use Prisma client from src/db/client.ts. Never import PrismaClient directly.
- Testing: Vitest. Test files co-located with source files (*.test.ts). Coverage threshold: 80%.

## Prohibited Patterns
- DO NOT use var. Use const by default, let only when reassignment is required.
- DO NOT use default exports. All exports must be named exports.
- DO NOT write database migrations by hand. Use 'npx prisma migrate dev --name [description]'.
- DO NOT modify files in src/generated/* — these are auto-generated by Prisma.
- DO NOT add dependencies without noting them in your response for human review.
- DO NOT catch errors silently. All catch blocks must either rethrow or log + return an appropriate error response.

## Running the Project
- Dev server: npm run dev (nodemon + ts-node, port 3001)
- Tests: npm test (Vitest, watch mode) or npm run test:ci (single run)
- Type check: npm run typecheck
- Lint: npm run lint (ESLint + Prettier)
- DB migrations: npx prisma migrate dev

## Known Fragile Areas
- src/services/billing.ts: Complex Stripe webhook handling. Touch with extreme care. Always run the billing test suite after any changes here.
- src/middleware/rateLimiter.ts: Redis-dependent. Behavior in tests differs from production due to mock Redis. Do not refactor without discussing first.
- The auth token refresh flow in src/services/auth.ts lines 145-190 has a known race condition being tracked in issue #447. Do not modify this section.

## Environment
- .env.example contains all required environment variables with descriptions.
- Staging environment is automatically deployed from the 'develop' branch.
- Production deploys are manual via GitHub Actions workflow 'deploy-prod.yml'.

The most common CLAUDE.md mistakes to avoid:

Writing it once and never updating it. Stale context can be worse than no context — it creates confident errors based on outdated conventions.
Making it too generic. Boilerplate like “write clean code” adds tokens without adding information. Be specific or don’t include it.
Skipping negative instructions. The “DO NOT” section is often the highest-value part of CLAUDE.md. Explicitly naming prohibited patterns is far more reliable than just omitting examples of them: a CLAUDE.md that says “do not use callbacks” gives the model a rule to follow, whereas a codebase that merely lacks callbacks leaves the model to fall back on whatever its training data favors.

RAG for Coding Agents — How to Do It Right

RAG for codebases is one of the highest-leverage context engineering techniques available — and one of the most frequently implemented badly. Naive RAG (fixed-size text chunks, embed them, retrieve top-k by cosine similarity) produces mediocre results for code. Code has structural properties plain text doesn’t: functions have entry points and dependencies, classes have hierarchies, types reference each other across files. Cutting a function in half at a token boundary produces an embedding that represents nothing useful.

Chunking strategy: structural, not positional. Chunk at function, class, and module boundaries — not at fixed token counts.

A 400-token function is one chunk
A large class is chunked at method boundaries
For very large codebases, use a two-level hierarchy: file-level summaries for broad navigation, function-level chunks for precise retrieval
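
To illustrate the idea of chunking at declaration boundaries, here is a deliberately naive sketch that splits a source file wherever an unindented `function` or `class` declaration starts. It is an assumption-laden stand-in: a production chunker would use a real parser (for example the TypeScript compiler API or tree-sitter) and would also handle arrow functions, interfaces, and method-level splits.

```typescript
// Naive structural chunking: split at top-level function/class declarations.
// A regex is used only to keep the sketch short; it is not a substitute for a parser.
interface CodeChunk {
  name: string;
  startLine: number; // 1-based, inclusive
  endLine: number;   // 1-based, inclusive (includes trailing blank lines)
  text: string;
}

const TOP_LEVEL_DECLARATION =
  /^(?:export\s+)?(?:default\s+)?(?:async\s+)?(?:abstract\s+)?(?:function|class)\s+([A-Za-z_$][\w$]*)/;

function chunkByDeclaration(source: string): CodeChunk[] {
  const lines = source.split("\n");

  // Find every line that starts a new top-level declaration.
  const starts: { index: number; name: string }[] = [];
  lines.forEach((line, index) => {
    const match = TOP_LEVEL_DECLARATION.exec(line);
    if (match !== null) starts.push({ index, name: match[1] });
  });

  // Anything before the first declaration (imports, constants) becomes a preamble chunk.
  if (starts.length === 0 || starts[0].index > 0) {
    starts.unshift({ index: 0, name: "(preamble)" });
  }

  return starts
    .map((start, i) => {
      const end = i + 1 < starts.length ? starts[i + 1].index : lines.length;
      const text = lines.slice(start.index, end).join("\n").trimEnd();
      return { name: start.name, startLine: start.index + 1, endLine: end, text };
    })
    .filter((chunk) => chunk.text.trim() !== "");
}

const sampleSource = [
  'import { z } from "zod";',
  "",
  "export function parse(input: string): number {",
  "  return Number(input);",
  "}",
  "",
  "export class Counter {",
  "  private n = 0;",
  "  increment(): number {",
  "    return ++this.n;",
  "  }",
  "}",
].join("\n");

const sampleChunks = chunkByDeclaration(sampleSource);
console.log(sampleChunks.map((c) => `${c.name}: lines ${c.startLine}-${c.endLine}`));
```

Each resulting chunk is a complete unit (imports, one function, one class), so its embedding describes something a developer would recognize rather than an arbitrary slice of text.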

Metadata filtering: Every code chunk should be stored with metadata, and you should filter on it before running vector similarity search.

File path and programming language
Last modified timestamp
Module or package membership
Relevant tags (e.g., “authentication”, “database”, “public-api”)

For example, if the task involves a TypeScript file in src/services/, filter to that path before running embedding search. Narrowing the candidate set this way typically improves precision.
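
The filter-then-rank order can be sketched with an in-memory index. Everything here is illustrative: the `IndexedChunk` and `MetadataFilter` shapes are assumptions, the two-dimensional embeddings are toy values, and a real vector database would apply the filter server-side.

```typescript
// Metadata pre-filtering before vector similarity (illustrative in-memory sketch).
interface ChunkMetadata {
  filePath: string;
  language: string;
  lastModified: Date;
  tags: string[];
}

interface IndexedChunk {
  id: string;
  embedding: number[];
  metadata: ChunkMetadata;
}

interface MetadataFilter {
  language?: string;
  pathPrefix?: string;
  modifiedAfter?: Date;
  requiredTag?: string;
}

function matchesFilter(meta: ChunkMetadata, filter: MetadataFilter): boolean {
  if (filter.language !== undefined && meta.language !== filter.language) return false;
  if (filter.pathPrefix !== undefined && !meta.filePath.startsWith(filter.pathPrefix)) return false;
  if (filter.modifiedAfter !== undefined &&
      meta.lastModified.getTime() <= filter.modifiedAfter.getTime()) return false;
  if (filter.requiredTag !== undefined && !meta.tags.includes(filter.requiredTag)) return false;
  return true;
}

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < Math.min(a.length, b.length); i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denominator = Math.sqrt(normA) * Math.sqrt(normB);
  return denominator === 0 ? 0 : dot / denominator;
}

/** Filter on metadata first, then rank the survivors by similarity. */
function filteredSearch(
  chunks: IndexedChunk[],
  queryEmbedding: number[],
  filter: MetadataFilter,
  topK: number,
): string[] {
  return chunks
    .filter((chunk) => matchesFilter(chunk.metadata, filter))
    .map((chunk) => ({ id: chunk.id, score: cosineSimilarity(chunk.embedding, queryEmbedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, topK)
    .map((result) => result.id);
}

const demoIndex: IndexedChunk[] = [
  { id: "auth-service", embedding: [1, 0],
    metadata: { filePath: "src/services/auth.ts", language: "typescript",
                lastModified: new Date("2026-05-01"), tags: ["authentication"] } },
  { id: "billing-service", embedding: [0.6, 0.8],
    metadata: { filePath: "src/services/billing.ts", language: "typescript",
                lastModified: new Date("2026-04-12"), tags: ["billing"] } },
  { id: "legacy-auth", embedding: [1, 0],
    metadata: { filePath: "legacy/auth.js", language: "javascript",
                lastModified: new Date("2023-01-15"), tags: ["authentication"] } },
];

const filteredIds = filteredSearch(
  demoIndex, [1, 0], { language: "typescript", pathPrefix: "src/services/" }, 5);
console.log(filteredIds);
```

Note that `legacy-auth` has an embedding identical to the query yet never reaches the ranking step, which is exactly the behavior the pre-filter is meant to guarantee.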

Hybrid search: Combine BM25 (keyword-based ranking) with vector similarity search. Pure vector search handles semantic queries well (“find code that does authentication”) but misses exact symbol matches. BM25 handles exact symbol matching precisely but fails at semantic queries. Combining them with a re-ranking step consistently outperforms either approach alone. Use hybrid search as the default for production codebase RAG.
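
A common way to combine the two result lists is Reciprocal Rank Fusion (RRF), which only needs each list’s ordering, not comparable scores. The sketch below works on lists of chunk IDs; the constant `k = 60` is the value commonly used in the RRF literature, and the IDs are made up for the example.

```typescript
// Reciprocal Rank Fusion: score(id) = sum over lists of 1 / (k + rank in that list),
// where rank is 1-based. Items that appear high in several lists rise to the top.
function reciprocalRankFusion(rankedLists: string[][], k = 60): string[] {
  const scores = new Map<string, number>();
  for (const list of rankedLists) {
    list.forEach((id, index) => {
      const rank = index + 1;
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank));
    });
  }
  return Array.from(scores.entries())
    .sort((a, b) => b[1] - a[1])
    .map(([id]) => id);
}

// Hypothetical result lists, best match first.
const vectorRanking = ["refreshToken", "verifySession", "hashPassword"];
const bm25Ranking = ["hashPassword", "refreshToken", "parseJwt"];

const fusedRanking = reciprocalRankFusion([vectorRanking, bm25Ranking]);
console.log(fusedRanking);
```

Here `refreshToken` ends up first because it sits near the top of both lists, while items found by only one retriever fall to the bottom. Because RRF ignores raw scores, there is no need to normalize BM25 scores against cosine similarities.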

Here is a conceptual example of a well-engineered codebase RAG query function:

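// Conceptual codebase RAG query: TypeScript pseudocode for illustration.
// vectorStore, bm25Index, embedQuery, extractModulePath, reciprocalRankFusion,
// expandWithDependencies, fitToTokenBudget, orderForContextInjection, and the
// CodeContext type are placeholders you would supply; they are not a real library API.
import { subMonths } from 'date-fns';
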
async function retrieveCodeContext(
  query: string,
  options: {
    currentFilePath: string;
    language: string;
    maxTokens: number;
    taskType: 'read' | 'modify' | 'debug';
  }
): Promise<CodeContext> {

  // Step 1: Metadata pre-filter — narrow the search space before embedding
  const metadataFilter = {
    language: options.language,
    // For modifications, prioritize same module; for debugging, allow broader search
    modulePath: options.taskType === 'modify'
      ? extractModulePath(options.currentFilePath)
      : undefined,
    // Exclude files not modified in the last 6 months during active feature work
    lastModifiedAfter: options.taskType !== 'debug'
      ? subMonths(new Date(), 6)
      : undefined,
  };

  // Step 2: Run hybrid search (BM25 + vector, combined with RRF re-ranking)
  const [vectorResults, bm25Results] = await Promise.all([
    vectorStore.search(await embedQuery(query), { filter: metadataFilter, topK: 20 }),
    bm25Index.search(query, { filter: metadataFilter, topK: 20 }),
  ]);

  // Step 3: Reciprocal Rank Fusion (RRF) to merge result lists
  const mergedResults = reciprocalRankFusion([vectorResults, bm25Results]);

  // Step 4: Expand retrieved chunks to include their imports and type references
  const expandedChunks = await expandWithDependencies(mergedResults.slice(0, 10));

  // Step 5: Fit within token budget, prioritizing highest-ranked chunks
  const fittedContext = fitToTokenBudget(expandedChunks, options.maxTokens);

  // Step 6: Order for context injection — most relevant first (beginning of retrieved section)
  // and second-most-relevant last (end of retrieved section) per "lost in the middle" findings
  return orderForContextInjection(fittedContext);
}

The dependency expansion step (Step 4) deserves emphasis. When you retrieve a function that calls other functions defined elsewhere, the raw chunk may be nearly useless without its dependencies. A good RAG system for code automatically fetches the signatures — and ideally the full definitions — of functions called within a retrieved chunk, plus the type definitions for its parameters and return values. This “dependency hydration” significantly improves the model’s ability to reason accurately about retrieved code.
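
Dependency hydration can be sketched as a bounded walk over a symbol table. The `SymbolInfo` shape and the symbol names below are hypothetical; in practice the table would come from a language server or a static analysis pass, and the depth limit keeps the expansion from pulling in half the codebase.

```typescript
// Dependency hydration: given retrieved symbols, also fetch what they call,
// up to a fixed depth (illustrative sketch over a hypothetical symbol table).
interface SymbolInfo {
  name: string;
  signature: string;
  calls: string[]; // names of functions this symbol calls
}

function hydrateDependencies(
  retrieved: string[],
  symbolTable: Map<string, SymbolInfo>,
  maxDepth = 1,
): SymbolInfo[] {
  const seen = new Set<string>();
  const result: SymbolInfo[] = [];
  let frontier = retrieved;

  for (let depth = 0; depth <= maxDepth && frontier.length > 0; depth++) {
    const next: string[] = [];
    for (const name of frontier) {
      if (seen.has(name)) continue;
      seen.add(name);
      const info = symbolTable.get(name);
      if (info === undefined) continue; // external or unindexed symbol
      result.push(info);
      next.push(...info.calls);
    }
    frontier = next;
  }
  return result;
}

const demoSymbols: SymbolInfo[] = [
  { name: "login", signature: "login(email: string, password: string): Promise<Session>",
    calls: ["verifyPassword", "issueToken"] },
  { name: "verifyPassword", signature: "verifyPassword(plain: string, hash: string): Promise<boolean>",
    calls: ["hashPassword"] },
  { name: "issueToken", signature: "issueToken(userId: string): string", calls: [] },
  { name: "hashPassword", signature: "hashPassword(plain: string): Promise<string>", calls: [] },
];
const demoSymbolTable = new Map(
  demoSymbols.map((symbol): [string, SymbolInfo] => [symbol.name, symbol]),
);

const hydrated = hydrateDependencies(["login"], demoSymbolTable, 1);
console.log(hydrated.map((s) => s.signature));
```

With `maxDepth = 1` the retrieved `login` chunk arrives with the signatures of its direct callees but not their callees, which is usually enough for the model to reason about the call sites without exhausting the token budget.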

Context Compression — What to Do When You Hit the Limit

Context windows are finite, and in long agentic coding sessions they will fill up. The naive response — starting a new session and losing all accumulated context — is usually the most expensive option, both in tokens to re-establish context and in productivity cost. Context compression preserves the semantic content most critical to completing the remaining work while reducing size.

Two primary techniques:

Progressive summarization: After every N exchange turns, replace the oldest turns with a structured summary (“Agent investigated the auth module, identified a bug in the token refresh logic on line 147, proposed a mutex lock fix, fix was approved and applied”). The history now contains the summary plus recent turns — far fewer tokens, decision trail preserved. Best when early context (task spec, architectural decisions) is most important.
Rolling windows: Maintain a fixed-size window of the most recent N exchanges in full fidelity. Older content is compressed externally or discarded. Best for iterative debugging where recency dominates. Mitigate the risk of losing early context by pinning critical elements (task spec, constraints, current objective) that always appear at the beginning regardless of compression.
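
Progressive summarization reduces to one operation: fold everything older than the last few turns into a single summary turn. In the sketch below the `summarize` callback is a stand-in; in a real system it would be a model call (and therefore asynchronous), and the `Turn` shape is an assumption.

```typescript
// Progressive summarization: replace the oldest turns with one summary turn.
interface Turn {
  role: "user" | "assistant" | "summary";
  content: string;
}

/**
 * Keep the last `keepRecent` turns verbatim and fold everything older,
 * including any earlier summary, into a single new summary turn.
 */
function compressOldTurns(
  history: Turn[],
  keepRecent: number,
  summarize: (turns: Turn[]) => string,
): Turn[] {
  if (history.length <= keepRecent) return history;
  const cut = history.length - keepRecent;
  const older = history.slice(0, cut);
  const recent = history.slice(cut);
  return [{ role: "summary", content: summarize(older) }, ...recent];
}

// Placeholder summarizer so the sketch runs without a model.
const countingSummarizer = (turns: Turn[]): string =>
  `Summary of ${turns.length} earlier turns`;

const longHistory: Turn[] = [
  { role: "user", content: "Investigate the auth module." },
  { role: "assistant", content: "Found a bug in token refresh." },
  { role: "user", content: "Propose a fix." },
  { role: "assistant", content: "Proposed a mutex lock." },
  { role: "user", content: "Approved, apply it." },
  { role: "assistant", content: "Applied and tests pass." },
];

const compressedHistory = compressOldTurns(longHistory, 2, countingSummarizer);
console.log(compressedHistory);
```

Calling the function again later folds the previous summary into the next one, so the history always has at most one summary turn followed by the recent exchanges.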

The cost math is straightforward. A 20-step agentic session without compression re-sends a growing history at every step, climbing toward the 200K token ceiling. With progressive summarization that halves context size every 5 steps, total input token consumption can drop by roughly 40–60% across the session. At Claude Opus 4.7 pricing of $5/M tokens, a session that would cost $20 in input tokens without compression (the 20 × 200K worst case) might cost $8–12 with compression. For teams running hundreds of sessions per day, this becomes a real budget decision.
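
The savings range can be sanity-checked with a toy simulation. The assumptions here are mine, not measurements: context grows by a fixed 10K tokens per step, the whole context is re-sent as input on every step, and summarization halves whatever carries forward after every fifth step.

```typescript
// Toy model of cumulative input tokens with and without periodic compression.
// Growth rate and compression ratio are illustrative assumptions.
function simulateTotalInputTokens(
  steps: number,
  growthPerStep: number,
  compressEvery: number | null, // null disables compression
  compressionRatio: number,     // fraction of context kept after compressing
): number {
  let context = 0;
  let totalInput = 0;
  for (let step = 1; step <= steps; step++) {
    context += growthPerStep;
    totalInput += context; // the full context is sent as input on this step
    if (compressEvery !== null && step % compressEvery === 0) {
      context *= compressionRatio; // summarization shrinks what carries forward
    }
  }
  return totalInput;
}

const uncompressedTokens = simulateTotalInputTokens(20, 10_000, null, 1);
const compressedTokens = simulateTotalInputTokens(20, 10_000, 5, 0.5);
const savingsFraction = 1 - compressedTokens / uncompressedTokens;

console.log({
  uncompressedTokens,
  compressedTokens,
  savingsPercent: Math.round(savingsFraction * 100),
});
```

Under these assumptions the saving lands inside the 40–60% band. Real sessions will differ, since tool outputs make growth uneven and summaries rarely compress by exactly half, so treat the model as a way to reason about the trend rather than a forecast.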

Practical rule of thumb: When your context is more than 60% full and you have more than a few steps remaining, start compressing. Waiting until you hit 90–95% forces rushed compression under pressure and risks losing critical state. Design compression triggers at the 60% threshold and pin your task specification, current objective, and top-level constraints so they always survive compression.
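
The rule of thumb translates into a small trigger function. The 60% threshold comes from the text above; treating "more than a few steps" as three or more remaining steps is my own assumption and should be tuned per workflow.

```typescript
// Compression trigger based on context fill level and remaining work (sketch).
interface CompressionPolicy {
  startAtFraction: number;   // compress once the window is fuller than this
  minStepsRemaining: number; // skip compression when the session is nearly done
}

const DEFAULT_COMPRESSION_POLICY: CompressionPolicy = {
  startAtFraction: 0.6,
  minStepsRemaining: 3, // assumption: "a few steps" means at least three
};

function shouldCompress(
  usedTokens: number,
  windowTokens: number,
  stepsRemaining: number,
  policy: CompressionPolicy = DEFAULT_COMPRESSION_POLICY,
): boolean {
  const fractionUsed = usedTokens / windowTokens;
  return fractionUsed > policy.startAtFraction && stepsRemaining >= policy.minStepsRemaining;
}

console.log(shouldCompress(130_000, 200_000, 8)); // 65% full, plenty of work left
console.log(shouldCompress(100_000, 200_000, 8)); // only 50% full
console.log(shouldCompress(190_000, 200_000, 1)); // nearly done, not worth compressing
```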

Context Engineering for Agentic Workflows — A Different Set of Rules

Agentic workflows require context strategies that are fundamentally different from single-turn interactions. In a single-turn chat, over-including information is relatively harmless — the model reads it once and you get the result. In a 20-step autonomous workflow, over-inclusion means carrying cognitive overhead through every step, and any misinterpretation compounds over time. The core principles for agentic context:

Scoped context: Give the agent only the modules relevant to the current task — not the full codebase. Provide only the tools appropriate to the current task phase. Compress conversation history at regular intervals and pin the task specification.
Checkpoint injection: After every 5 actions, inject a structured checkpoint that restates the original task objective, actions taken so far (from the compressed summary), current state of the codebase (which files have been modified), and remaining objectives. This prevents the common failure mode where an agent solves a subproblem well but in a direction that doesn’t serve the original goal.
Tool phase gating: Restrict available tools to what’s appropriate for the current phase — planning, execution, or validation. Never provide all tools at all phases.
Pinned invariant context: The task specification, architectural constraints, and core conventions must always be present and unchanged across every step of the workflow.
Explicit out-of-scope declarations: Tell the agent what it should not do in the task spec. This reduces scope creep and prevents well-intentioned detours that accumulate as technical debt.
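
Checkpoint injection, in particular, is simple to mechanize. The sketch below builds the checkpoint text from tracked progress; the `AgentProgress` shape and the section labels are hypothetical, and how the text is injected (as a user message, a tool result, or otherwise) depends on your agent framework.

```typescript
// Checkpoint injection: restate objective, progress, and remaining work
// after every N actions (illustrative sketch).
interface AgentProgress {
  objective: string;
  actionsTaken: string[];       // ideally taken from the compressed summary
  modifiedFiles: string[];
  remainingObjectives: string[];
}

function isCheckpointDue(actionCount: number, everyNActions = 5): boolean {
  return actionCount > 0 && actionCount % everyNActions === 0;
}

function buildCheckpoint(progress: AgentProgress): string {
  const list = (items: string[]): string =>
    items.length === 0 ? "- (none)" : items.map((item) => `- ${item}`).join("\n");

  return [
    "CHECKPOINT",
    `Original objective:\n${progress.objective}`,
    `Actions so far:\n${list(progress.actionsTaken)}`,
    `Files modified:\n${list(progress.modifiedFiles)}`,
    `Remaining objectives:\n${list(progress.remainingObjectives)}`,
  ].join("\n\n");
}

const checkpointText = buildCheckpoint({
  objective: "Add rate limiting to the POST /api/auth/login endpoint.",
  actionsTaken: ["Read rateLimiter.ts", "Wired limiter into login route"],
  modifiedFiles: ["src/routes/auth.ts"],
  remainingObjectives: ["Add a test for the 429 response"],
});

if (isCheckpointDue(5)) {
  console.log(checkpointText);
}
```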

Here is what a well-engineered agent context looks like at initialization for a coding task:

// Agent context initialization structure for a coding task — annotated pseudocode

const agentContext = {

  // ---- LAYER 1: System prompt (ALWAYS first — benefits from positional attention) ----
  systemPrompt: `
    You are a senior TypeScript engineer working on the Bindspace API.
    Project conventions are defined below. Follow them strictly.
    Work incrementally. Complete one logical change at a time.
    Before modifying a file, read its current contents using the read_file tool.
    After modifying a file, verify the change compiles using the run_typecheck tool.
    If you encounter unexpected state (a file that does not match your mental model),
    STOP and report the discrepancy rather than proceeding with assumptions.
    [CLAUDE.md contents injected here — approximately 300-500 tokens]
  `,

  // ---- LAYER 2: Scoped task specification ----
  // Specific, bounded, with explicit success criteria and out-of-scope declaration
  taskSpec: `
    TASK: Add rate limiting to the POST /api/auth/login endpoint.

    SUCCESS CRITERIA:
    - Rate limit: 5 attempts per IP per 15-minute window
    - Use the existing Redis-based rate limiter in src/middleware/rateLimiter.ts
    - Return 429 with a Retry-After header when the limit is exceeded
    - Add a test in src/routes/auth.test.ts covering the rate-limited case
    - Do NOT modify any other auth routes

    OUT OF SCOPE: Do not refactor the existing rate limiter. Use it as-is.
  `,

  // ---- LAYER 3: Scoped tool set ----
  // Only tools needed for this specific task — NOT all available tools
  tools: [
    'read_file',       // Read source files before modifying
    'write_file',      // Write modified source files
    'list_directory',  // Navigate the project structure if needed
    'run_typecheck',   // Verify TypeScript compilation after each change
    'run_tests',       // Run auth test suite once: npm run test:ci -- src/routes/auth.test.ts
    // EXCLUDED: deploy, git_commit, database_migrate, install_package
    // These exclusions prevent scope creep and reduce reasoning surface area
  ],

  // ---- LAYER 4: Pre-retrieved relevant context (injected in MIDDLE of context) ----
  // Structurally chunked at file/function level, dependency-hydrated
  retrievedContext: [
    'src/middleware/rateLimiter.ts',  // The existing rate limiter — full file
    'src/routes/auth.ts',             // Route file to be modified — full file
    'src/routes/auth.test.ts',        // Test file to be extended — full file
    'src/types/express.d.ts',         // Request type augmentations — for context
  ],

  // ---- LAYER 5: Checkpoint configuration ----
  checkpointEveryNActions: 5,
  pinnedContext: ['taskSpec'],  // Always survives compression
};

Common Context Engineering Mistakes — and How to Fix Them

Context Pollution

This is the most common mistake: injecting every potentially relevant file, document, or conversation turn without curation. The intuition (“more information = better results”) is wrong in practice. More noise means less effective attention on the signal.

Fix: Adopt a retrieval mindset. Before adding any content to the context, ask “does the model specifically need this for the current step?” If the answer is “it might help,” put it in a retrieval index — not the primary context. The context window is prime real estate.

Over-Relying on Recency

Developers often append the most important instructions at the very end of a long context, assuming the model reads sequentially and remembers recent content most clearly. The “lost in the middle” research shows this is only partially true. The end receives strong attention — but so does the beginning. Information buried 150K tokens into a 200K-token context receives degraded attention regardless of its importance.

Fix: Explicit bookending — put what matters most at both the start and the end of the context, with supporting context in the middle.

Missing Negative Instructions

Models learn patterns from training data, including widely-used but locally-prohibited approaches. If your codebase bans a popular library, uses a non-standard error handling pattern, or has architectural decisions that violate common conventions, the absence of those patterns in your codebase is not enough to prevent the model from introducing them.

Fix: Add explicit “DO NOT use X” instructions to your CLAUDE.md and system prompt. An explicit prohibition is far more reliable than relying on code-level omission.

Not Versioning System Prompts

When a model update changes how a system prompt is interpreted, or when a team member edits the system prompt and inadvertently breaks existing behavior, you need the ability to diff, rollback, and test. Without version control, these regressions are invisible.

Fix: Store system prompts in version control. Tag them with the model version they were tested against. Run a baseline eval suite when you change them. Treat them exactly as you would any other production configuration file.

Inconsistent Context Across Agent Steps

If one step provides detailed type definitions and the next doesn’t, the agent’s internal model of the codebase becomes inconsistent. If tool definitions change between steps, the agent may try to use tools that are no longer available. These inconsistencies compound quickly in long workflows.

Fix: Ensure the persistent, invariant context (task specification, conventions, constraints) is maintained identically across every step. Add a context consistency check before each agent step to verify invariant elements are present and unchanged.
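
A consistency check can be as simple as comparing the invariant fields against a frozen snapshot taken at session start. The `InvariantContext` shape is an assumption; for large contexts you might compare hashes instead of full strings.

```typescript
// Context consistency check: detect drift in invariant context between steps (sketch).
interface InvariantContext {
  taskSpec: string;
  conventions: string;
  constraints: string;
}

/** Take an immutable copy of the invariant context at session start. */
function snapshotInvariants(context: InvariantContext): Readonly<InvariantContext> {
  return Object.freeze({ ...context });
}

/** Return the names of invariant fields that differ from the snapshot. */
function findInvariantDrift(
  baseline: Readonly<InvariantContext>,
  current: InvariantContext,
): (keyof InvariantContext)[] {
  const keys: (keyof InvariantContext)[] = ["taskSpec", "conventions", "constraints"];
  return keys.filter((key) => baseline[key] !== current[key]);
}

const initialContext: InvariantContext = {
  taskSpec: "Add rate limiting to POST /api/auth/login.",
  conventions: "Named exports only. Use zod for validation.",
  constraints: "Do not refactor the existing rate limiter.",
};
const baselineSnapshot = snapshotInvariants(initialContext);

// Simulate a later step in which compression accidentally dropped the constraints.
const laterContext: InvariantContext = { ...initialContext, constraints: "" };
const driftedFields = findInvariantDrift(baselineSnapshot, laterContext);

if (driftedFields.length > 0) {
  console.log(`Invariant context drifted: ${driftedFields.join(", ")}`);
}
```

Running the check before each step turns a silent inconsistency into an explicit failure you can handle, for example by restoring the snapshot before the model is called.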

Mistake	Symptom	Fix
Context pollution	Model produces correct but unfocused or overly cautious responses	Retrieval-first mindset — only inject what is definitively needed
Over-relying on recency	Critical early instructions ignored in long sessions	Bookend critical info at beginning AND end of context
Missing negative instructions	Model introduces banned libraries or deprecated patterns	Add explicit DO NOT clauses in CLAUDE.md and system prompt
Unversioned system prompts	Unexplained behavior regressions after model updates	Version control all prompts; tag with model version; run evals on changes
Inconsistent agent context	Agent contradicts its earlier decisions; tool call failures	Pin invariant context; consistency check before each agent step
Stale CLAUDE.md	Confident errors based on outdated conventions	Quarterly CLAUDE.md reviews; remove anything no longer accurate

Measuring Context Engineering Quality — Metrics That Actually Matter

You can’t improve what you don’t measure. The three metrics that matter most for context engineering quality:

Task completion rate: What percentage of agent tasks complete successfully without human intervention?
Correction rounds per task: How many times does a human need to redirect or correct the agent before the task is done?
Token cost per successfully completed task: Total input plus output tokens divided by tasks that met acceptance criteria.
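
All three metrics can be computed from a simple per-task log. The `TaskRun` record below is a hypothetical shape, and the sample numbers are invented purely to exercise the function.

```typescript
// Computing the three context engineering metrics from per-task records (sketch).
interface TaskRun {
  completedWithoutIntervention: boolean;
  metAcceptanceCriteria: boolean;
  correctionRounds: number;
  inputTokens: number;
  outputTokens: number;
}

interface ContextMetrics {
  taskCompletionRate: number;      // share of tasks finished with no human intervention
  avgCorrectionRounds: number;     // mean human corrections per task
  tokensPerSuccessfulTask: number; // all tokens spent divided by accepted tasks
}

function computeMetrics(runs: TaskRun[]): ContextMetrics {
  const total = runs.length;
  const autonomous = runs.filter((run) => run.completedWithoutIntervention).length;
  const accepted = runs.filter((run) => run.metAcceptanceCriteria).length;
  const corrections = runs.reduce((sum, run) => sum + run.correctionRounds, 0);
  const tokens = runs.reduce((sum, run) => sum + run.inputTokens + run.outputTokens, 0);

  return {
    taskCompletionRate: total === 0 ? 0 : autonomous / total,
    avgCorrectionRounds: total === 0 ? 0 : corrections / total,
    tokensPerSuccessfulTask: accepted === 0 ? Infinity : tokens / accepted,
  };
}

// Invented sample data.
const sampleRuns: TaskRun[] = [
  { completedWithoutIntervention: true, metAcceptanceCriteria: true, correctionRounds: 0, inputTokens: 40_000, outputTokens: 2_000 },
  { completedWithoutIntervention: false, metAcceptanceCriteria: true, correctionRounds: 2, inputTokens: 90_000, outputTokens: 5_000 },
  { completedWithoutIntervention: false, metAcceptanceCriteria: false, correctionRounds: 4, inputTokens: 120_000, outputTokens: 8_000 },
  { completedWithoutIntervention: true, metAcceptanceCriteria: true, correctionRounds: 0, inputTokens: 30_000, outputTokens: 1_000 },
];

const contextMetrics = computeMetrics(sampleRuns);
console.log(contextMetrics);
```

Tokens spent on failed tasks stay in the numerator of the third metric on purpose: a context change that makes failures cheaper but no less frequent should not look like an improvement.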

A/B testing system prompts is the most direct way to measure the impact of context engineering changes. Define a benchmark task set that represents your real workload, establish a baseline with your current configuration, introduce one change at a time, and measure on the same benchmark. Run 20–50 trials per variant for reliable signal. Use a fixed seed where the model supports it to reduce sampling variance.

Evaluation frameworks reduce the overhead of systematic measurement. The main options:

LangSmith: Tracing and evaluation infrastructure that captures full context payloads alongside quality scores, making it straightforward to correlate context composition with output quality
promptfoo: Dedicated prompt and context evaluation framework with assertion-based test cases. You define what a correct output looks like and it runs your prompts against a test suite automatically
Anthropic’s Evals framework: Purpose-built for Claude-based systems with native Claude Code integration

Key insight: The 2026 Anthropic data shows that developers report only being able to fully delegate 0–20% of tasks to AI agents, even though agents complete 20 autonomous actions on average. One plausible reading of this gap is that the bottleneck is context quality as much as agent capability. Developers who measure and improve their context engineering systematically are better placed to close this delegation gap than those who iterate intuitively.

Frequently Asked Questions About Context Engineering in 2026

What exactly is context engineering, and who coined the term?

Context engineering is the discipline of deliberately designing and managing the information placed into an AI model’s context window to maximize output quality across complex, multi-step tasks. Andrej Karpathy — OpenAI co-founder — coined the term in mid-2025, defining it as “the art of filling the context window with exactly the right information at the right time.” The term reflects a growing recognition that prompt phrasing is a second-order variable compared to the quality and composition of the full context the model operates within.

Is context engineering just a more complicated version of prompt engineering?

No — it’s a broader discipline that includes prompt engineering as one component. Context engineering covers the full informational architecture:

System prompt design
Retrieved document selection and ordering
Conversation history management
Tool definitions provided to agents
Memory systems (short-term in-context and long-term external)
Compression approach for long sessions

In a single-turn chat, the difference is minor. In a 20-step agentic coding workflow, context engineering decisions determine whether the agent succeeds or fails — and prompt engineering is just one of many levers.

What is the “lost in the middle” problem and how does it affect coding AI tools?

The “lost in the middle” problem, documented in a 2024 Stanford NLP study, is the empirical finding that language models perform significantly better on information at the beginning and end of the context window than on information buried in the middle. This is a structural property of how transformer attention works at long context lengths. For coding AI tools, the practical fix is to bookend critical information: put your most important instructions and constraints at the beginning (in the system prompt) and restate the core task at the end of your user message, with supporting context — files, documentation — in the middle.
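The bookending pattern described above can be sketched as a small helper that builds the system prompt and user message. The function name, labels, and message layout are illustrative assumptions, not a format required by any particular tool.

```python
def bookend(constraints: list[str], task: str, supporting_context: list[str]) -> tuple[str, str]:
    """Return (system_prompt, user_message) with critical content at the edges.

    Constraints open the context, supporting material sits in the middle,
    and the task is restated at the very end of the user message.
    """
    system_prompt = "\n".join(["Critical constraints:"] + [f"- {c}" for c in constraints])
    user_message = "\n\n".join(
        [f"Task: {task}", *supporting_context, f"Reminder of the task: {task}"]
    )
    return system_prompt, user_message


system_prompt, user_message = bookend(
    constraints=["Do not change public function signatures"],
    task="Fix the failing date-parsing test",
    supporting_context=["<contents of parser.py>", "<contents of test_parser.py>"],
)
```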

How do I get started with CLAUDE.md for my project?

Create a file named CLAUDE.md at the root of your project repository. Start with these six sections:

Project overview (2–3 sentences on what the project does and its primary constraints)
Technology stack with exact version numbers
Coding conventions that aren’t obvious from reading the code (naming patterns, async patterns, file organization)
Explicitly prohibited patterns starting with “DO NOT” — this is the highest-value section
Common commands for running, testing, and deploying
Known fragile areas that require extra care

Keep it under 500 lines. Update it as the project evolves. Review it quarterly and remove anything that’s no longer accurate — stale context produces confident errors.
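Two of these rules are easy to check automatically. The following is a minimal sketch of such a check, assuming that prohibited patterns are written with the literal phrase "DO NOT"; the function name and messages are hypothetical.

```python
def lint_claude_md(text: str, max_lines: int = 500) -> list[str]:
    """Return a list of problems found; an empty list means both checks passed."""
    problems = []
    line_count = len(text.splitlines())
    if line_count >= max_lines:
        problems.append(f"{line_count} lines; keep the file under {max_lines}")
    if "DO NOT" not in text:
        problems.append('no "DO NOT" rules found for prohibited patterns')
    return problems
```

To run it against a real file, read the file first, for example `lint_claude_md(open("CLAUDE.md", encoding="utf-8").read())`, and fail your CI step if the returned list is non-empty.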

What is the token cost impact of poor context engineering at production scale?

The numbers add up fast. Claude Opus 4.7 costs $5 per million input tokens. A single API call with a full 200K-token context costs $1.00 in input tokens. A 20-step agentic session at 200K tokens per step costs $20.00 in input tokens alone. If 40% of that context is irrelevant content (files not needed, stale docs, duplicated type definitions), $8 of that $20 is being spent on noise. At team scale, with 95% of professional developers using AI tools weekly, that waste multiplies across every developer and every session. Treat context optimization as a cost engineering initiative, not just a quality initiative.
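The arithmetic can be reproduced in a few lines, which also makes it easy to plug in your own numbers. This sketch uses the price quoted in this answer; the function and variable names are illustrative, and the 40% noise share is the hypothetical figure from the example, not a measurement.

```python
def input_cost_usd(tokens_per_step: int, steps: int, price_per_million: float = 5.00) -> float:
    """Input-token cost in USD for a session of `steps` calls."""
    return tokens_per_step * steps * price_per_million / 1_000_000


single_call = input_cost_usd(200_000, steps=1)    # one full-context call
session = input_cost_usd(200_000, steps=20)       # a 20-step agentic session
irrelevant_share = 0.40                           # assumed fraction of context that is noise
wasted = session * irrelevant_share

print(f"single call: ${single_call:.2f}, session: ${session:.2f}, spent on noise: ${wasted:.2f}")
```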

How does context engineering differ for agents versus single-turn AI chat?

Single-turn chat is forgiving of context engineering errors — the model makes one decision and a human evaluates the result immediately. Agents make sequences of decisions (20 on average in 2026 coding workflows) where each step depends on the previous ones. Errors compound. For agents, context engineering must address additional concerns that don’t apply to single-turn chat:

Tool scoping — providing only the tools appropriate to the current task phase
Checkpoint injection — re-anchoring the agent to the original objective at regular intervals
Context compression — managing the accumulation of conversation history over many steps
Consistency enforcement — ensuring invariant context elements are present and unchanged across every step
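Checkpoint injection, for instance, can be as simple as appending a reminder of the objective after every N actions. The sketch below models the session as a list of strings and uses an interval of 5; the function names and reminder wording are hypothetical, and a real agent loop would inject the reminder as a message to the model.

```python
from typing import Optional


def checkpoint_message(objective: str, step: int, every: int = 5) -> Optional[str]:
    """Return a re-anchoring reminder on every `every`-th step, otherwise None."""
    if every <= 0:
        raise ValueError("every must be positive")
    if step > 0 and step % every == 0:
        return (
            f"Checkpoint after step {step}. Original objective: {objective}. "
            "Confirm the next action still serves it."
        )
    return None


def run_with_checkpoints(objective: str, actions: list[str], every: int = 5) -> list[str]:
    """Build a transcript of actions with checkpoint reminders interleaved."""
    transcript = [f"Objective: {objective}"]
    for step, action in enumerate(actions, start=1):
        transcript.append(f"Action {step}: {action}")
        reminder = checkpoint_message(objective, step, every)
        if reminder is not None:
            transcript.append(reminder)
    return transcript
```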

What is the relationship between RAG and context engineering?

RAG is one of the six layers of context engineering — the mechanism for dynamically selecting which external information gets injected into the context window at query time. Context engineering is the broader discipline that includes RAG design alongside system prompt architecture, tool selection, memory systems, compression strategies, and information ordering. Many developers implement RAG without thinking about the other layers, which limits its effectiveness. Retrieving highly relevant code chunks (good RAG) but injecting them in the middle of a long context (poor ordering) means the “lost in the middle” effect will degrade the model’s ability to use them. RAG design and context ordering must be considered together.
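One possible way to combine the two concerns is to reorder retrieved chunks so the most relevant ones sit at the start and end of the injected context, with the weakest matches in the middle. This is a sketch of that idea only, not a technique prescribed by this guide; it assumes the retriever returns chunks sorted from most to least relevant, and the function name is hypothetical.

```python
def reorder_for_edges(chunks_by_relevance: list[str]) -> list[str]:
    """Place the most relevant chunks at both ends and the least relevant in the middle.

    Input must be sorted from most to least relevant.
    """
    front: list[str] = []
    back: list[str] = []
    for i, chunk in enumerate(chunks_by_relevance):
        # Alternate: even ranks fill the front, odd ranks fill the back.
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]


ordered = reorder_for_edges(["rank1", "rank2", "rank3", "rank4", "rank5"])
```

With five chunks, the top-ranked chunk ends up first, the second-ranked chunk last, and the lowest-ranked chunk in the center.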

Should I invest time in context engineering if I only use AI tools occasionally?

The value scales with usage frequency. For truly occasional use (a few queries per week), a sophisticated context architecture probably isn’t worth the overhead. But 95% of professional developers now use AI tools weekly, so the “occasional user” category is shrinking fast. For daily users, even a basic CLAUDE.md for each active project produces measurable improvements in output quality and fewer correction rounds per task. The 30–60 minutes spent writing a good CLAUDE.md pays for itself within a single focused coding session on any non-trivial project. The deeper techniques (RAG design, agent checkpointing, compression) are most valuable for developers building AI-native workflows or managing AI coding agents at team scale.

Wrapping Up — Context Is the New Code Quality

The evidence points in one direction. A peer-reviewed study spanning 9,649 experiments found that context quality outperforms prompt quality as a predictor of output quality. Stanford NLP has documented the structural attention biases that make information ordering a real engineering concern. And Anthropic’s 2026 data shows agents completing 20 autonomous actions before human input, which makes the compounding cost of context errors a first-order operational issue. The developers who get disproportionate value from AI coding tools in the next 12 months won’t do it by finding better prompts. They’ll build better context architectures.

Where to start:

Write a CLAUDE.md for every active project — include the “DO NOT” section, keep it under 500 lines, review it quarterly
Audit your RAG pipeline — move to structural chunking at function and class boundaries and implement hybrid search if you haven’t already
Add checkpoint injection to long agent sessions — re-anchor to the original objective every 5 actions
Version your system prompts — store them in source control, tag them with model version, run evals on changes
Start measuring — track task completion rate, correction rounds per task, and token cost per successfully completed task so you have a baseline to improve against
