James stared at the WhatsApp response from TutorClaw. He had sent "Teach me about variables" and got back a wall of text: definitions, examples, exercises, all at once.
"It fetches the content and shows it," he said, scrolling through the message. "When I trained new hires at the warehouse, I never did this. I walked them through it. Asked them what they thought would happen before they tried the forklift. Watched them try. Then asked why it went sideways."
Emma looked at the screen. "You just described a pedagogical framework. Predict what happens, run it, investigate why the result was different from the prediction." She pulled up a chair. "That three-step loop has a name: PRIMM-Lite. Build it."
Your TutorClaw is doing exactly what James's is doing. Its content tools deliver raw material, but they do not teach. In this chapter, you describe two pedagogy tools to Claude Code that turn TutorClaw from a content dump into a tutor that walks learners through material the way James trained warehouse staff.
Before you describe anything to Claude Code, you need to understand what you are asking it to build. PRIMM-Lite is a simplified teaching methodology with three stages:

1. Predict: the learner commits to a prediction before seeing the result.
2. Run: the learner sees the actual result and compares it to the prediction.
3. Investigate: the learner digs into why the result matched or differed from the prediction.
The three stages cycle for each topic. A learner who predicts correctly advances quickly. A learner who predicts incorrectly spends more time in the Investigate stage, building understanding before moving forward.
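The cycling rule in the paragraph above can be sketched directly. This is an illustrative assumption about the logic, not the implementation Claude Code will produce; the stage names come from the chapter, everything else is a stand-in:

```python
# Illustrative sketch of the PRIMM-Lite stage cycle.
STAGES = ["predict", "run", "investigate"]

def next_stage(current: str, predicted_correctly: bool) -> str:
    """Advance through the cycle; wrong predictions earn time in investigate."""
    if current == "predict":
        return "run"
    if current == "run":
        # A correct prediction advances quickly to the next topic's predict stage;
        # an incorrect one goes to investigate to build understanding first.
        return "predict" if predicted_correctly else "investigate"
    return "predict"  # investigate always loops back to a fresh prediction
```

The asymmetry is the point: the tool should spend a struggling learner's time in Investigate, not rush them forward.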
Two pieces make this work as MCP tools: generate_guidance, which produces a stage-appropriate prompt for the learner, and assess_response, which evaluates the learner's answer and returns a confidence_delta.
Open Claude Code in your tutorclaw-mcp project. You need to explain PRIMM-Lite as a requirement, not as code: what generate_guidance takes (a concept, the current stage, the learner's confidence), what it returns (content for the learner plus a system_prompt_addition for the agent), and how the three stages differ. Send Claude Code a message covering those points.
The system_prompt_addition field is the key design insight in this tool. The content is what the learner sees. The system_prompt_addition is what the agent follows. This is a tool that returns instructions, not just data.
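To make the two-field split concrete, here is a minimal Python sketch. Only the names generate_guidance, content, and system_prompt_addition come from the chapter; the signature and the templates are assumptions for illustration:

```python
# Hypothetical sketch of the two-field return: content for the learner,
# system_prompt_addition for the agent.
def generate_guidance(concept: str, stage: str, confidence: float) -> dict:
    templates = {
        "predict": "Here is an example of {c}. What do you think will happen?",
        "run": "Here is the actual result for {c}. How does it compare to your prediction?",
        "investigate": "Let's dig into why {c} behaved that way.",
    }
    instructions = {
        "predict": "Do NOT reveal the answer. Wait for the learner's prediction.",
        "run": "Show the result, then ask the learner to compare it to their prediction.",
        "investigate": "Ask probing why-questions; give hints, not answers.",
    }
    return {
        "content": templates[stage].format(c=concept),    # what the learner sees
        "system_prompt_addition": instructions[stage],    # what the agent follows
    }
```

Notice that the predict-stage instruction actively forbids revealing the answer: the return value shapes the agent's behavior, not just its output.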
Review the spec Claude Code produces. The key things to check: the inputs include the stage and the learner's confidence, the return includes both content and system_prompt_addition, and the description is specific enough for the agent to know when to call the tool.
If the tool description is vague ("generates teaching stuff"), steer it: tell Claude Code the description should name the three PRIMM-Lite stages and make clear that system_prompt_addition carries instructions for the agent, not text for the learner.
Once the spec looks right, tell Claude Code to build it. Run the tests after it finishes and confirm they pass.
The second pedagogy tool evaluates whether the learner's answer demonstrates understanding. Send Claude Code a message describing assess_response as a requirement: it takes the learner's answer and returns a confidence_delta reflecting how well the answer demonstrates understanding of the current concept.
Review the spec. Watch for: an unbounded score instead of a bounded confidence_delta, and assessment criteria loose enough to score a vague answer the same as a specific one.
Approve the spec and have Claude Code build it. Run the tests when it finishes.
Now test the pedagogy tools together. Ask Claude Code to call them in sequence: generate_guidance for a concept, then assess_response on a sample learner answer.
You are looking for three things: the guidance matches the requested stage, the confidence_delta tracks the quality of the answer, and the agent chooses the right tool from its description rather than from hardcoded logic.
If any of these are wrong, describe the problem to Claude Code and steer the fix. The describe-steer-verify cycle from Module 9.2 applies to every tool you build.
Step back and look at what these two tools do together. Before this lesson, TutorClaw could store learner state and fetch content. It had a filing cabinet and a bookshelf. Now it has a teaching method.
When the agent needs to teach a concept, it calls generate_guidance to get a stage-appropriate prompt. When the learner responds, it calls assess_response to evaluate the answer. The confidence_delta feeds back into the learner state (via update_progress from Module 9.3, Chapter 3), and the next call to generate_guidance adjusts accordingly. A learner who keeps answering well moves through stages quickly. A learner who struggles gets more support at the current stage.
The agent orchestrates this loop by reading tool descriptions, not by following hardcoded logic. You described the methodology, Claude Code built the implementation, and the agent will use the descriptions to call the right tool at the right time.
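The loop described above can be sketched end to end. Every function body here is a stub standing in for the real tools, and in practice the agent drives this sequence by reading tool descriptions rather than by running a Python loop:

```python
# End-to-end sketch of the teaching loop; all implementations are stubs.
def generate_guidance(concept, stage, confidence):
    return {"content": f"[{stage}] prompt about {concept}",
            "system_prompt_addition": f"Follow {stage}-stage rules."}

def assess_response(answer):
    # Stand-in scoring: longer, specific answers nudge confidence up.
    return {"confidence_delta": 0.1 if len(answer.split()) > 5 else -0.1}

def update_progress(state, delta):
    # Clamp confidence to [0, 1], as a state tool reasonably would.
    state["confidence"] = min(1.0, max(0.0, state["confidence"] + delta))
    return state

state = {"confidence": 0.5}
for stage, answer in [("predict", "it prints 5 because x was reassigned"),
                      ("run", "matches"),
                      ("investigate", "reassignment replaces the old binding entirely")]:
    guidance = generate_guidance("variables", stage, state["confidence"])
    delta = assess_response(answer)["confidence_delta"]
    state = update_progress(state, delta)  # feeds the next generate_guidance call
```

The wiring, not the stubs, is what matters: each assessment adjusts the state that the next guidance call reads.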
Walk through one complete PRIMM-Lite cycle by calling generate_guidance at each stage: predict, then run, then investigate.
What you are learning: Each stage should produce a qualitatively different response. Predict asks for a prediction. Run reveals the answer and asks for comparison. Investigate probes deeper. If the three responses look similar, the tool's stage handling needs steering.
Test what happens at extreme confidence values, such as 0.1 and 0.9.
What you are learning: A learner at 0.1 confidence needs simpler, more supportive prompts than one at 0.9. If the tool produces identical prompts regardless of confidence, you may want to steer Claude Code to add confidence-aware difficulty scaling.
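If you do steer toward confidence-aware scaling, the fix might take a shape like this sketch; the thresholds and wording are purely illustrative assumptions:

```python
def support_level(confidence: float) -> str:
    """Map learner confidence to a prompt style; cutoffs are arbitrary examples."""
    if confidence < 0.3:
        return "supportive"   # simpler language, smaller steps, more hints
    if confidence < 0.7:
        return "standard"
    return "challenging"      # terser prompts, edge cases, fewer hints

def predict_prompt(concept: str, confidence: float) -> str:
    level = support_level(confidence)
    if level == "supportive":
        return f"Let's look at {concept} together. Take a guess: what happens here?"
    if level == "challenging":
        return f"Predict the exact output of this {concept} example, including edge cases."
    return f"What do you think this {concept} example will do?"
```

A sketch like this makes the steering request concrete: same stage, same concept, different prompt depending on confidence.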
Test assess_response with responses that are technically correct but miss the point: pair a vague answer that is not wrong with a specific one that explains the mechanism.
What you are learning: The quality gap between a vague answer and a specific one should produce a meaningful difference in confidence_delta. If both answers return similar scores, the assessment logic is too generous and needs tighter criteria.
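As a sanity check on that gap, here is a toy rubric. A real assess_response would likely use the model's judgment rather than keyword matching; the expected points, example answers, and delta range here are all assumptions:

```python
def assess_response(expected_points: list[str], answer: str) -> float:
    """Toy rubric: confidence_delta scales with how many expected points the answer hits."""
    if not expected_points:
        return 0.0
    hits = sum(1 for point in expected_points if point in answer.lower())
    coverage = hits / len(expected_points)
    # Map coverage [0, 1] onto a bounded delta [-0.1, +0.2].
    return round(-0.1 + 0.3 * coverage, 2)

points = ["reassign", "binding", "prints 5"]
vague = "The variable changes and the code works."
specific = "Reassigning x replaces the old binding, so it prints 5."
# The specific answer should earn a meaningfully higher delta than the vague one.
```

Whatever the real scoring mechanism, this is the property to verify: vague-but-correct and specific-and-correct must not land on the same delta.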
James called generate_guidance with a learner in the predict stage. The response came back: a code snippet with the question "What do you think this will print?"
"It teaches instead of dumping." He grinned. "At the warehouse, the new hires who predicted first always learned the safety protocols faster. Something about committing to an answer before you see the result."
Emma nodded. "That commitment is the whole trick. Predict forces the learner to engage before they can passively scroll." She paused. "I shipped a tutor once without any pedagogical framework. Just search and display. Users said it was a search engine with extra steps. We bolted PRIMM onto it in a weekend and retention tripled." She shrugged. "Should have done it from the start."
"So we have state tools, content tools, and now pedagogy tools." James counted on his fingers. "What is left?"
"Two more. A way to run code and a way to get paid." Emma pointed at his screen. "Module 9.3, Chapter 6."