Emma handed James a blank sheet. "Before you send a single WhatsApp message, write down what you think should happen. Not what the tools do. What the EXPERIENCE should be."
James thought about it. "The learner says 'teach me about variables.' The tutor should check who they are, pull the chapter content, and then... not dump it. Ask them to predict first. That is the whole point of PRIMM."
"Write that down. The exact sequence. What the agent should call, what it should say, what it should NOT say."
James wrote. Five minutes. A page of expectations. Then Emma said: "Now test it. Send the message. Compare what you wrote to what actually happens."
You are doing exactly what James is doing. Before testing, you write down what a good tutoring session looks like. Then you test and grade the gaps. The gaps you find here become the motivation for AGENTS.md (Module 9.3, Chapter 9) and context engineering (Module 9.3, Chapter 10).
Before opening WhatsApp, write down what SHOULD happen when a learner sends "Teach me about variables in Module 9.3, Chapter 1." Use this template:
Expected tool chain: get_learner_state (or register_learner if you are new) → get_chapter_content → generate_guidance
Expected response qualities: asks me to predict before explaining; does not dump the definition; draws on the actual chapter content rather than an answer recalled from training data
Write your version of this. Be specific. You are creating the acceptance criteria for your product.
Now send the message: "Teach me about variables in Module 9.3, Chapter 1"
Wait for the response. This is different from the Module 9.2 test, where one message triggered one tool. Here, the agent should chain multiple tools: get_learner_state (or register_learner), then get_chapter_content, then generate_guidance.
The response you receive asks you to think first. It does not hand you the definition. This is the pedagogy layer at work: generate_guidance is shaping the interaction into a teaching session.
Reply with a prediction. Something like: "I think a variable is a name that stores a value so the program can use it later."
The agent processes your reply through a second sequence of tools: assess_response, then update_progress, then generate_guidance.
You are having a tutoring conversation. The agent is not running a script. It is selecting tools based on what you said, evaluating your response, and adapting the next step. Each message triggers a different combination of tools.
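The per-message chaining described above can be sketched as a loop: the agent inspects the message and the session state, picks a combination of tools, and feeds the results into the next choice. Everything below is a hypothetical sketch for illustration — the tool names come from this chapter, but the dispatch logic is invented; the real agent selects tools with model judgment, not keyword rules.

```python
# Hypothetical sketch of per-message tool chaining. Tool names match this
# chapter; the selection rules are invented for illustration only.

def handle_message(message, state):
    """Return the list of tools the agent would chain for one message."""
    chain = []
    if state.get("learner") is None:
        chain.append("get_learner_state")    # who is this learner?
        state["learner"] = {"progress": {}}
    if "teach me" in message.lower():
        chain.append("get_chapter_content")  # pull the material
        chain.append("generate_guidance")    # ask for a prediction, don't dump
        state["awaiting_prediction"] = True
    elif state.pop("awaiting_prediction", False):
        chain.append("assess_response")      # evaluate the prediction
        chain.append("update_progress")      # record the outcome
        chain.append("generate_guidance")    # shape the next step
    return chain

state = {"learner": None}
first = handle_message("Teach me about variables", state)
reply = handle_message("I think a variable stores a value", state)
```

Note how the same function produces a different chain for each turn: the first message fires three tools, the reply fires a different three. That turn-by-turn variation is what you should see in the badge log.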
Go back to the dashboard. Find the conversation log for the messages you sent. For each message, you should see multiple tool badges showing which tools fired.
Your first message might show three or four badges: get_learner_state (or register_learner), get_chapter_content, generate_guidance.
Your reply might show two or three badges: assess_response, update_progress, generate_guidance.
This is the visible difference between Module 9.2 and Module 9.3. In Module 9.2, one message produced one badge. Here, one message produces multiple badges because the agent is orchestrating tools into a workflow.
The tool badges are your proof that the product is working. A well-phrased text response without badges could be the agent hallucinating an answer from training data. The badges confirm the tools actually ran.
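The badge check can also be made programmatic: any response recorded with zero tool calls is suspect. The log format below is invented for illustration — it assumes the dashboard exposes tool calls per message, which may differ from your actual export.

```python
# Hypothetical conversation-log check: flag responses that cite no tools
# (possible hallucination from training data). Log format is invented.

def suspicious_turns(log):
    """Return indices of responses that arrived with no tool calls."""
    return [i for i, turn in enumerate(log) if not turn.get("tool_calls")]

log = [
    {"text": "Before I explain, what do you think a variable does?",
     "tool_calls": ["get_learner_state", "get_chapter_content",
                    "generate_guidance"]},
    {"text": "A variable is a named box that stores a value.",
     "tool_calls": []},  # no badges: content may not come from your tools
]
```

Running `suspicious_turns(log)` here flags the second turn, the one with no badges.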
Now compare your expectations from Step 1 to what actually happened.
Write down every gap you found: a tool that did not fire, a definition handed over before you predicted, a step out of order. These gaps are your TODO list for the next chapters.
The gap list is the most valuable artifact in this lesson. It turns "this works" into "this works the way I intended."
What you are learning: The agent selects tools based on their descriptions and the user's message. Understanding the selection logic helps you predict which tools fire for different messages. Removing a tool does not cause an error; the agent works around it, but the experience degrades.
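A toy model of description-driven selection, and of what happens when a tool is removed. The keyword-overlap scoring below is invented for illustration — real agents use the model's judgment on the full descriptions — but the degradation behavior is the same: removing a tool raises no error, selection just falls back to a worse match.

```python
# Toy illustration of description-driven tool selection. Tool names match
# the chapter; the keyword-overlap scoring is invented for illustration.

TOOLS = {
    "get_chapter_content": "fetch the teaching content for a chapter",
    "get_exercises": "fetch practice exercises and quiz questions",
    "get_learner_state": "fetch the learner profile and progress",
    "generate_guidance": "generate the next tutoring step for the learner",
}

def select_tool(message, tools):
    """Pick the tool whose description shares the most words with the message."""
    words = set(message.lower().split())
    scores = {name: len(words & set(desc.split())) for name, desc in tools.items()}
    return max(scores, key=scores.get)

best = select_tool("quiz me with practice exercises", TOOLS)
# Removing a tool does not raise; selection degrades to a next-best match.
reduced = {k: v for k, v in TOOLS.items() if k != "get_exercises"}
fallback = select_tool("quiz me with practice exercises", reduced)
```

With all tools present, the quiz request lands on get_exercises; with it removed, the call still succeeds but returns a poorer fit. The experience degrades silently, which is exactly why you test against written expectations instead of waiting for errors.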
What you are learning: Different message types trigger different tool combinations. "Quiz me" should invoke get_exercises. "How am I doing" should invoke get_learner_state. "Upgrade" should invoke get_upgrade_url. Predicting before checking builds intuition for how tool descriptions drive selection.
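The predictions above can be written down as a checkable table before you test. The mapping restates the chapter's examples; the matching helper is hypothetical, for recording and checking your predictions, not the agent's real logic.

```python
# Predicted message-to-tool mapping from this chapter, written as a
# checkable table. The matching helper is invented for illustration.

PREDICTIONS = {
    "quiz me": "get_exercises",
    "how am i doing": "get_learner_state",
    "upgrade": "get_upgrade_url",
}

def predict_tool(message):
    """Return the predicted tool for a message, or None if no rule matches."""
    for phrase, tool in PREDICTIONS.items():
        if phrase in message.lower():
            return tool
    return None
```

Send each message, note your prediction, then check the badges. Where prediction and badge disagree, the tool description is probably steering the agent somewhere you did not expect.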
What you are learning: The connection process is identical. The product difference comes entirely from tool count and tool descriptions. More tools give the agent more choices. Better descriptions give it better judgment. The protocol does not change; the experience does.
James scrolled through the dashboard. Four tool badges on the first message. Three on his reply. The agent had selected different tools for each turn based on what he said.
"In Module 9.2, I had one tool and one badge," he said. "Now I have nine tools and the agent is chaining them into a tutoring session. The connection was the same two commands. The experience is completely different."
He took a screenshot of the conversation. The WhatsApp thread showed a tutor that asked him to predict, evaluated his answer, and adjusted the next step. The dashboard showed exactly which tools made that happen.
Emma looked at the screenshot and then at the dashboard. "Tool chaining does not equal coherent experience, though."
James looked up. "What do you mean? This worked perfectly."
"My first multi-tool product looked impressive in the dashboard. Five tools firing, badges everywhere. But the actual conversation felt disjointed. The agent would call generate_guidance and then immediately dump content without waiting for the learner to respond. Or it would call assess_response on a message that was not actually an answer." She paused. "The tools worked. The orchestration was wrong."
"So how did you fix it?"
"AGENTS.md." Emma pointed at the project directory. "A document that tells the agent how to use the tools. When to call each one. The order of operations for a tutoring session. You have nine working tools. Next lesson, you write the instruction manual that makes them work together coherently."