You have spent ten lessons learning how context engineering works. You know that context rots when left unmanaged, that most CLAUDE.md files contain more noise than signal, that position inside the context window determines how much attention an instruction receives, and that multi-session work requires explicit persistence. You understand the theory. Now it is time to prove the theory works by measuring it.
These exercises use a single evolving project: a Contract Review Agent that starts broken and gets progressively better as you apply each context engineering technique. The agent reviews legal contracts and produces risk assessments. In its starting state, it misses obvious red flags, contradicts its own earlier findings, and forgets critical requirements mid-session. By Module 7, the same agent will catch every issue, maintain consistency across sessions, and coordinate multiple specialist reviewers without contamination. The difference between the broken version and the production version is not a better model or more data. It is better context engineering.
Every module follows the same pattern: apply a technique, then measure the result. You will run the same three contract review tasks at each stage and score the output using a consistent rubric. This measurement discipline is the most important thing you will learn. Context engineering is not hand-waving about "better prompts." It is an engineering discipline where you can quantify the impact of every change you make.
Download Exercise Files
Download Context Engineering Exercises (ZIP)
After downloading, unzip the file. Each module has its own folder with an INSTRUCTIONS.md and any starter files you need.
If the download link doesn't work, visit the repository releases page directly.
Unlike standalone exercises where each problem is independent, the Context Lab is a single project that evolves across all 7 modules. You start with a broken Contract Review Agent and apply one context engineering technique per module. After each technique, you re-run the same three benchmark tasks and score the results. This lets you measure exactly how much each technique contributes.
The workflow for every module:
1. Read the module's INSTRUCTIONS.md and the linked lesson for the technique.
2. Apply the technique to the Contract Review Agent.
3. Re-run the same three benchmark tasks.
4. Score each output on the 4-criteria rubric.
5. Compare your scores against the previous module's results.
You do not need to complete all 7 modules in one sitting. Each module takes 20-40 minutes. But you must complete them in order because each module builds on the previous one.
Every time you run the three benchmark tasks, score each output on four criteria using a 1-5 scale: Accuracy, Completeness, Consistency, and Actionability.
Maximum score per task: 20 (4 criteria x 5 points). Maximum score per benchmark run: 60 (3 tasks x 20 points).
The starter agent typically scores 25-35 out of 60. By Module 7, students routinely reach 50-58.
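The scoring arithmetic can be sketched as a small helper. This is a minimal illustration; the function names are made up for the sketch, and the criteria are the four referenced throughout the exercises (Accuracy, Completeness, Consistency, Actionability).

```python
CRITERIA = ["accuracy", "completeness", "consistency", "actionability"]

def score_task(ratings: dict[str, int]) -> int:
    """Sum one task's four criterion ratings (each 1-5, max 20)."""
    assert set(ratings) == set(CRITERIA), "rate every criterion exactly once"
    assert all(1 <= r <= 5 for r in ratings.values()), "ratings use a 1-5 scale"
    return sum(ratings.values())

def score_run(task_scores: list[int]) -> int:
    """Sum three task scores into one benchmark-run score (max 60)."""
    assert len(task_scores) == 3, "one score per benchmark contract"
    return sum(task_scores)

run = score_run([
    score_task({"accuracy": 4, "completeness": 3, "consistency": 2, "actionability": 3}),
    score_task({"accuracy": 3, "completeness": 3, "consistency": 2, "actionability": 2}),
    score_task({"accuracy": 3, "completeness": 2, "consistency": 2, "actionability": 2}),
])
print(run)  # 31 -- inside the typical 25-35 baseline band
```

Recording scores this way makes the per-criterion deltas between modules trivial to compute.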
Technique: Diagnosing the four types of context rot (Lesson 1)
What you will learn: How to identify accumulation, contradiction, staleness, and poisoning in a real CLAUDE.md — and why each type degrades agent performance differently.
The Setup: Open the module-1-context-rot/exercise-1.1-rot-audit/ folder. You will find a starter-agent/ directory containing a CLAUDE.md, several rules files, and a skills directory. This is your Contract Review Agent in its broken starting state. The CLAUDE.md is 650 lines long, accumulated over months of ad-hoc additions by multiple team members.
Your Task: Read the entire CLAUDE.md and categorize every section into one of four rot types: accumulation, contradiction, staleness, or poisoning.
Produce a rot report: a table listing each section, its rot type, and the specific harm it causes. Count the total instructions and estimate what percentage are signal vs. noise.
What to Expect: Most students find the starter CLAUDE.md contains 15-25% signal. The rest is rot. The most dangerous category is usually poisoning — instructions that look helpful but actively degrade output quality. Staleness is the most common by volume.
Reflection Questions:
The Setup: Open the module-1-context-rot/exercise-1.2-baseline-measurement/ folder. You will find three benchmark contract files and a scoring template. The benchmark contracts are designed to test different aspects of review quality: Contract A has obvious liability issues, Contract B has subtle inconsistencies between clauses, and Contract C has standard terms that should NOT be flagged (testing for false positives).
Your Task: Run all three benchmark contracts through the starter agent (with the broken CLAUDE.md from Exercise 1.1). For each output, score it on the 4-criteria rubric. Record your scores in the provided tracking spreadsheet. This is your Module 1 baseline — every future module will be compared against these scores.
What to Expect: Typical baseline scores range from 25-35 out of 60. The agent usually catches the obvious issues in Contract A but misses the subtle inconsistencies in Contract B and produces false positives on Contract C. Consistency scores tend to be lowest because contradictory CLAUDE.md instructions produce contradictory output.
Reflection Questions:
Technique: The 4-question signal audit (Lesson 2)
What you will learn: How to separate actionable instructions from noise, and how dramatically a lean CLAUDE.md outperforms a bloated one.
The Setup: Open the module-2-signal-noise/exercise-2.1-four-question-audit/ folder. You will find the CLAUDE.md from Module 1 plus the 4-question audit framework. The four questions are: (1) Would Claude ask about this if not told? (2) Is this specific enough to act on? (3) Does this change Claude's default behavior? (4) Can compliance be verified?
Your Task: Apply the 4-question audit to every instruction in the CLAUDE.md. Remove everything that fails all four questions. Compress everything that passes some questions but not all. Keep only content where all four questions answer "yes." Your target is a CLAUDE.md of approximately 400 words — down from the original 650 lines.
Be ruthless. "Write high-quality reviews" fails all four questions (Claude already tries to write high-quality output, it is not specific, it does not change defaults, and compliance cannot be verified). "Flag any clause where liability exceeds 2x contract value" passes all four (Claude would not apply this threshold unprompted, it is specific, it changes behavior, and you can check compliance).
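The audit can be pictured as a keep/drop filter. A minimal sketch: the two instruction texts come from the example above, the per-question booleans are judgments a human supplies, and the class name is illustrative.

```python
from dataclasses import dataclass

@dataclass
class Audit:
    instruction: str
    unprompted_need: bool   # Q1: would Claude ask about this if not told?
    specific: bool          # Q2: is it specific enough to act on?
    changes_default: bool   # Q3: does it change Claude's default behavior?
    verifiable: bool        # Q4: can compliance be verified?

    def keep(self) -> bool:
        # Keep only instructions that pass all four questions.
        return all((self.unprompted_need, self.specific,
                    self.changes_default, self.verifiable))

audits = [
    Audit("Write high-quality reviews", False, False, False, False),
    Audit("Flag any clause where liability exceeds 2x contract value",
          True, True, True, True),
]
kept = [a.instruction for a in audits if a.keep()]
print(kept)  # only the liability threshold survives
```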
What to Expect: Students typically reduce the CLAUDE.md from 650 lines to 40-60 lines. The process is uncomfortable — it feels like you are throwing away important information. The measurement in Exercise 2.2 will show that less is more.
Reflection Questions:
The Setup: Open the module-2-signal-noise/exercise-2.2-quality-comparison/ folder. Use the same three benchmark contracts from Module 1, but now run them through the agent with your optimized CLAUDE.md.
Your Task: Re-run all three benchmark contracts. Score each output on the same 4-criteria rubric. Compare every score to your Module 1 baseline. Calculate the improvement per criterion and overall.
What to Expect: Students typically see an 8-15 point improvement (out of 60) from signal optimization alone. The biggest gains are usually in Consistency (removing contradictions eliminates contradictory output) and Actionability (specific instructions produce specific recommendations). Completeness may stay flat or even dip slightly — you will recover it in Module 3 when you redistribute removed content to the right tools.
Reflection Questions:
Technique: Mapping content to the right context tool (Lesson 3)
What you will learn: How to distribute information across CLAUDE.md, Skills, Hooks, and Subagents so that each piece lives where it gets the most attention at the lowest token cost.
The Setup: Open the module-3-architecture/exercise-3.1-tool-mapping/ folder. You will find the content you removed in Module 2 plus a tool mapping worksheet. The worksheet lists each context tool (CLAUDE.md Zones 1/2/3, Skills, Hooks, Subagents, External Files) with its characteristics: when it is loaded, how many tokens it consumes, and what it is best for.
Your Task: Take every piece of content you removed from CLAUDE.md in Module 2 and map it to the correct tool: CLAUDE.md (Zone 1, 2, or 3), a Skill, a Hook, a Subagent, or an External File.
Create the actual files: write the skills, define the hook logic, specify the subagent prompts. Your Contract Review Agent should now have a multi-file context architecture, not just a single CLAUDE.md.
What to Expect: Most students distribute content roughly: 30% stays in CLAUDE.md (across all three zones), 25% becomes skills, 15% becomes hook triggers, 10% defines subagent scope, and 20% moves to external reference files. The exact distribution depends on the content, but a common mistake is keeping too much in CLAUDE.md because it "feels safer."
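One way to sanity-check your own distribution is to tally the mapping as data. A minimal sketch; the item names and assignments are illustrative examples, not the exercise's answer key.

```python
from collections import Counter

# Illustrative mapping of removed content to destination tools.
mapping = {
    "liability threshold rule": "claude_md",
    "clause cross-reference checklist": "skill",
    "run format check before saving the report": "hook",
    "financial-terms deep dive": "subagent",
    "full internal style guide": "external_file",
}

counts = Counter(mapping.values())
total = len(mapping)
for tool, n in sorted(counts.items()):
    print(f"{tool}: {n / total:.0%}")  # each destination gets 20% in this toy mapping
```

Comparing your tally against the typical 30/25/15/10/20 split is a quick way to spot the "keep too much in CLAUDE.md" mistake.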
Reflection Questions:
The Setup: Open the module-3-architecture/exercise-3.2-token-budget/ folder. You will find a token budget calculator and instructions for estimating the token cost of your architecture.
Your Task: Calculate the token budget for two scenarios: (1) the monolithic architecture, with the original 650-line CLAUDE.md loaded in full, and (2) the distributed architecture you built in Exercise 3.1.
Produce a comparison table showing: tokens consumed at each measurement point, percentage of context window used, and estimated attention quality based on the utilization curves from Lesson 6.
What to Expect: The distributed architecture typically uses 40-60% fewer tokens at session start and degrades much more slowly over a 30-turn session. The most dramatic difference appears at turn 30, where the monolithic approach is often at 65-75% utilization (degraded attention) while the distributed approach stays at 35-45% (full attention).
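The comparison can be sketched with a rough estimator. The ~4-characters-per-token heuristic and the 200k window are assumptions for the sketch, and the turn-30 component counts are illustrative, not measured values.

```python
WINDOW = 200_000  # assumed context window size for the sketch

def est_tokens(text: str) -> int:
    """Rough token estimate using the common ~4 chars/token heuristic."""
    return len(text) // 4

def utilization(token_counts: dict[str, int]) -> float:
    """Fraction of the context window consumed by the listed components."""
    return sum(token_counts.values()) / WINDOW

# Illustrative turn-30 compositions for the two scenarios.
monolithic_turn_30 = {"claude_md": 20_000, "history": 60_000, "tool_outputs": 60_000}
distributed_turn_30 = {"claude_md": 3_000, "skills_loaded": 7_000,
                       "history": 30_000, "tool_outputs": 40_000}

print(f"monolithic:  {utilization(monolithic_turn_30):.0%}")   # 70% -- degraded attention
print(f"distributed: {utilization(distributed_turn_30):.0%}")  # 40% -- full attention
```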
Reflection Questions:
Technique: Task DAGs and tacit knowledge extraction (Lessons 4-5)
What you will learn: How to make your agent's work survive session boundaries — so that /clear does not destroy accumulated understanding.
The Setup: Open the module-4-persistence/exercise-4.1-tasks-and-knowledge/ folder. You will find a multi-session contract review scenario: a complex 50-page vendor agreement that requires analysis across multiple sessions. The scenario includes session transcripts showing how a naive agent loses context between sessions.
Your Task: Design two persistence artifacts for the Contract Review Agent: (1) a task DAG that breaks the vendor agreement review into dependent tasks, and (2) a tacit knowledge file that captures the rules left implicit in the session transcripts.
What to Expect: The task DAG typically has 8-12 tasks with 3-5 dependency chains. Students often underestimate dependencies — Section 15 (liability) depends not just on Section 1 (definitions) but also Section 8 (scope of work) because liability limits reference deliverable categories. The tacit knowledge file usually captures 10-15 rules that were implicit in the session transcripts.
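The dependency structure can be persisted as plain data and validated with a topological sort. A minimal sketch using Python's stdlib graphlib, with section names echoing the liability example above:

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks it depends on. As noted above,
# liability depends on both definitions and scope of work.
dag = {
    "sec1_definitions": set(),
    "sec8_scope_of_work": {"sec1_definitions"},
    "sec15_liability": {"sec1_definitions", "sec8_scope_of_work"},
    "final_risk_summary": {"sec15_liability"},
}

order = list(TopologicalSorter(dag).static_order())
print(order)  # definitions -> scope of work -> liability -> summary
```

Persisting the DAG as data means a fresh session can recompute a valid work order instead of relying on remembered intent; graphlib also raises CycleError if a dependency loop sneaks in.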
Reflection Questions:
The Setup: Open the module-4-persistence/exercise-4.2-survival-test/ folder. You will find instructions for the survival test: run a partial contract review session, execute /clear, then resume and verify continuity.
Your Task: Conduct this three-phase test: (1) run a partial contract review session, (2) execute /clear, and (3) resume in a fresh session and verify continuity.
Score the post-clear session on the same 4-criteria rubric and compare to pre-clear quality.
What to Expect: With well-designed persistence files, post-clear quality typically scores within 2-3 points of pre-clear quality. The most common failure is losing tacit knowledge — findings that were discovered during Phase 1 but not captured in the knowledge file. Students who wrote thorough knowledge files in Exercise 4.1 see nearly perfect continuity.
Reflection Questions:
Technique: Context zone monitoring and compaction strategy (Lessons 6-7)
What you will learn: How to keep your agent performing well across long sessions by actively managing context utilization and knowing when to compact.
The Setup: Open the module-5-lifecycle/exercise-5.1-zone-monitoring/ folder. You will find a 25-turn contract review session script and a zone monitoring worksheet. The worksheet tracks context utilization at each turn: how much is system prompt, CLAUDE.md, conversation history, tool outputs, and reserve.
Your Task: Execute the 25-turn session with your Contract Review Agent. At turns 1, 5, 10, 15, 20, and 25, estimate the context composition: system prompt, CLAUDE.md, conversation history, tool outputs, and remaining reserve.
Plot these on the utilization curve. Identify the turn where utilization crosses 60% (caution zone) and predict when it would cross 70% (degradation zone).
Add a progress file that the agent updates every 5 turns with: current task state, key findings since last update, and context health assessment.
What to Expect: Most agents cross 60% utilization between turns 15-20, depending on how many files they read. The progress file adds 200-400 tokens per update but pays for itself by enabling effective compaction later. Students often discover that tool outputs (reading contract sections) are the biggest context consumer — not conversation history.
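The monitoring worksheet can also be expressed as code. A sketch, assuming a 200k-token window, with illustrative per-component estimates; the 60% and 70% thresholds are the caution and degradation zones named above.

```python
WINDOW = 200_000
CAUTION, DEGRADED = 0.60, 0.70  # zone thresholds

# Illustrative per-component token estimates at three checkpoints.
checkpoints = {
    1:  {"system": 3_000, "claude_md": 3_000, "history": 1_000, "tools": 2_000},
    10: {"system": 3_000, "claude_md": 3_000, "history": 20_000, "tools": 50_000},
    20: {"system": 3_000, "claude_md": 3_000, "history": 40_000, "tools": 85_000},
}

for turn, parts in checkpoints.items():
    used = sum(parts.values()) / WINDOW
    zone = ("degraded" if used >= DEGRADED
            else "caution" if used >= CAUTION
            else "ok")
    print(f"turn {turn:>2}: {used:.0%} ({zone})")
```

Note how tool outputs, not history, dominate the growth in this sketch, matching what most students discover.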
Reflection Questions:
The Setup: Open the module-5-lifecycle/exercise-5.2-compaction-strategy/ folder. You will find compaction instruction templates and a comparison framework.
Your Task: Design and test three compaction strategies for your Contract Review Agent at the 60% threshold: (1) naive compaction (a generic instruction to summarize the conversation), (2) structured compaction (explicit instructions about which decisions and findings to preserve), and (3) progress-file compaction (persist critical state to the progress file first, then compact).
Run each strategy and continue the review for 10 more turns. Score the output quality at turn 35 (10 turns after compaction) using the 4-criteria rubric.
What to Expect: Naive compaction typically loses 30-50% of critical context. Structured compaction preserves most decisions but may lose rationale. Progress-file compaction consistently produces the best results because the critical context is safely persisted outside the context window before compaction discards it. The quality scores after compaction usually tell the story clearly.
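The progress-file strategy wins because critical state is written to disk before any summarization happens. A minimal sketch, with an assumed progress.json layout:

```python
import json
from pathlib import Path

def checkpoint(path: Path, state: dict) -> None:
    """Persist task state and key findings outside the context window."""
    path.write_text(json.dumps(state, indent=2))

def resume_context(path: Path) -> str:
    """Rebuild a compact context block from the persisted file."""
    state = json.loads(path.read_text())
    findings = "\n".join(f"- {f}" for f in state["key_findings"])
    return f"Current task: {state['current_task']}\nFindings so far:\n{findings}"

progress = Path("progress.json")
checkpoint(progress, {
    "current_task": "review section 15 (liability)",
    "key_findings": [
        "liability cap is 1x, below the 2x threshold",
        "section 1 definitions omit 'deliverable'",
    ],
})
print(resume_context(progress))  # survives any compaction of the transcript
```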
Reflection Questions:
Technique: Designing a memory corpus for domain expertise (Lesson 8)
What you will learn: How to build a persistent memory layer that makes your agent smarter over time — so turn 20 reviews are better than turn 1 reviews.
The Setup: Open the module-6-memory/exercise-6.1-memory-corpus/ folder. You will find a set of 10 previously reviewed contracts with annotated findings. These represent the "experience" your agent should learn from. You will also find a memory corpus template.
Your Task: Design a memory corpus for the contract review domain. For each of the 10 previously reviewed contracts, extract:
Organize these into a searchable memory structure. Define the injection strategy: which memories should be injected via hooks (PreToolUse), which should live in the knowledge file (always available), and which should be retrieved on-demand from external files.
What to Expect: Students typically extract 25-40 memories from the 10 contracts. The hardest part is deciding granularity — a memory that is too broad ("watch out for liability clauses") is noise, while one that is too narrow ("Contract #3, Section 12.4(b) had a typo") is not transferable. The best memories are pattern-level: specific enough to act on, general enough to apply to new contracts.
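Retrieval with deduplication can be sketched in a few lines. The keyword-overlap matching is deliberately naive and the memory texts are illustrative; the point is that a duplicate memory is never injected twice.

```python
# Illustrative memory texts; the third entry is a deliberate duplicate.
memories = [
    "Liability caps under 2x contract value are a recurring red flag.",
    "Auto-renewal clauses often hide in the notices section.",
    "Liability caps under 2x contract value are a recurring red flag.",
]

def retrieve(query: str, corpus: list[str], k: int = 3) -> list[str]:
    """Naive keyword-overlap retrieval that never returns a memory twice."""
    seen: set[str] = set()
    hits: list[str] = []
    q = set(query.lower().split())
    for memory in corpus:
        if memory in seen:
            continue  # deduplication: skip memories already injected
        if q & set(memory.lower().split()):
            seen.add(memory)
            hits.append(memory)
    return hits[:k]

print(retrieve("liability cap review", memories))  # the liability memory, once
```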
Reflection Questions:
The Setup: Open the module-6-memory/exercise-6.2-drift-measurement/ folder. You will find instructions for a controlled comparison test.
Your Task: Run a controlled experiment: review the same contract with memory injection enabled and again with it disabled, scoring output quality at turn 1 and at turn 20 in each run.
The key question: does your memory injection system maintain quality as the session progresses, or does accumulated conversation history dilute the injected memories?
What to Expect: With well-designed memory injection, turn 20 quality should be equal to or better than turn 1 — the agent has more context from the session's work. Without proper injection, turn 20 quality typically degrades by 3-5 points as conversation history crowds out memory content. Students who implemented deduplication in their hook design see the most stable results.
Reflection Questions:
Technique: Multi-agent pipeline with clean context boundaries (Lesson 9)
What you will learn: How to split a complex review into parallel specialist agents that produce better results than a single generalist — by keeping each agent's context clean and focused.
The Setup: Open the module-7-isolation/exercise-7.1-pipeline-design/ folder. You will find a complex contract (Contract E) that requires expertise in three domains: legal terms, financial analysis, and operational feasibility. You will also find a pipeline design template.
Your Task: Split the Contract Review Agent into a multi-agent pipeline: a legal-terms specialist, a financial-analysis specialist, an operational-feasibility specialist, and an orchestrator that dispatches the work and synthesizes their findings.
For each agent, define:
For the orchestrator, define:
What to Expect: The biggest design challenge is the orchestrator's synthesis prompt. Students who simply concatenate specialist outputs get confused results. Students who define structured return formats (Summary, Key Findings, Risk Level, Recommendations) and synthesize by category get clean, coherent final reports. The shared constraints document is often overlooked but critical — without it, specialists make incompatible assumptions.
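The structured return format and category-wise synthesis can be sketched as follows. The field names mirror the ones above (Summary, Key Findings, Risk Level, Recommendations); the report contents are illustrative.

```python
from typing import TypedDict

class SpecialistReport(TypedDict):
    agent: str
    summary: str
    key_findings: list[str]
    risk_level: str              # "low" / "medium" / "high"
    recommendations: list[str]

def synthesize(reports: list[SpecialistReport]) -> str:
    """Merge specialist reports by category rather than by agent."""
    lines = ["FINAL REVIEW"]
    for field in ("summary", "key_findings", "risk_level", "recommendations"):
        lines.append(f"\n{field.upper()}")
        for r in reports:
            value = r[field]
            text = "; ".join(value) if isinstance(value, list) else value
            lines.append(f"- [{r['agent']}] {text}")
    return "\n".join(lines)

reports: list[SpecialistReport] = [
    {"agent": "legal", "summary": "Indemnity terms are one-sided.",
     "key_findings": ["uncapped indemnity in section 14"], "risk_level": "high",
     "recommendations": ["negotiate a 2x liability cap"]},
    {"agent": "financial", "summary": "Payment terms are acceptable.",
     "key_findings": ["net-60 payment terms"], "risk_level": "low",
     "recommendations": ["no changes needed"]},
]
print(synthesize(reports))
```

Grouping by category is what prevents the "confused concatenation" failure: the final report reads as one review, not three stapled together.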
Reflection Questions:
The Setup: Open the module-7-isolation/exercise-7.2-clean-vs-dirty/ folder. You will find instructions for a head-to-head comparison test.
Your Task: Run Contract E through two architectures and compare: (1) the clean-slate pipeline from Exercise 7.1, where each specialist starts with an isolated context, and (2) a single dirty-slate agent that performs all three analyses in one shared context.
Compare scores across all four criteria. Pay special attention to Accuracy (does the dirty-slate agent conflate legal and financial concepts?) and Consistency (does it maintain clear boundaries between analysis domains?).
What to Expect: The clean-slate pipeline typically scores 5-10 points higher than the dirty-slate single agent. The largest gains are in Accuracy and Consistency — isolated contexts prevent the cross-contamination that happens when legal, financial, and operational reasoning share the same attention space. Completeness often improves too, because specialists catch domain-specific issues that a generalist overlooks.
Reflection Questions:
Choose one (or more). These combine all seven modules — no step-by-step guidance provided.
Capstones are different from the module exercises. There are no guided walkthroughs — you design the entire approach yourself. Each project requires applying multiple context engineering techniques together to solve a realistic problem.
Capstone A: Your Domain Agent
Open the capstone-A-your-domain-agent/ folder. You will find a project template and self-assessment rubric.
The Challenge: Build a production-quality agent for your own profession or domain using all seven context engineering techniques. This is not a contract review agent — it is an agent that does work you actually need done. A teacher might build a lesson planning agent. A marketer might build a campaign review agent. A developer might build a code review agent.
Apply every technique from the Context Lab: rot diagnosis, the 4-question signal audit, tool mapping across CLAUDE.md, Skills, Hooks, and Subagents, persistence artifacts (task DAG and tacit knowledge file), zone monitoring with a compaction strategy, a memory corpus, and clean multi-agent boundaries where the work warrants them.
Deliverable: A complete agent directory with CLAUDE.md, skills, hooks, persistence files, and memory corpus. Include a self-assessment scoring your agent on the Context Engineering Assessment Rubric (below).
Capstone B: Context Relay
Open the capstone-B-context-relay/ folder. You will find a 3-session project specification.
The Challenge: Execute a complex project across three separate Claude Code sessions. The project requires building a small application (specified in the folder). The constraint: each session must start fresh (/clear between sessions). Your only continuity comes from the persistence artifacts you create.
Scoring: Compare the quality of Session 3's output to what a single uninterrupted session would produce. Effective context engineering should make the multi-session version nearly as good as the single-session version.
Capstone C: Forensics Challenge
Open the capstone-C-forensics-challenge/ folder. You will find three broken agents, each failing for a different context engineering reason.
The Challenge: Diagnose each agent's failure without being told what is wrong. For each agent: (1) identify the failure type, (2) trace the root cause, and (3) implement a fix.
The three agents have different problems — one is a rot issue, one is an architecture issue, and one is an isolation issue. You must determine which is which.
Scoring: For each agent, assess: (1) Did you correctly identify the failure type? (2) Was your root cause analysis accurate? (3) Did your fix resolve the problem without introducing new issues?
Use this rubric to evaluate your overall context engineering skill after completing the modules. This is also the rubric for Capstone A.
If you completed all seven modules in order, you now have a complete measurement trail: baseline scores from Module 1, incremental improvements from each module, and a final score from Module 7. Review your tracking spreadsheet and answer these questions: