USMAN’S INSIGHTS
AI ARCHITECT
The Integration Gap: Measuring Expected vs. Actual AI Behavior


© 2026 Muhammad Usman Akbar. All rights reserved.


Expected vs Actual

Emma handed James a blank sheet. "Before you send a single WhatsApp message, write down what you think should happen. Not what the tools do. What the EXPERIENCE should be."

James thought about it. "The learner says 'teach me about variables.' The tutor should check who they are, pull the chapter content, and then... not dump it. Ask them to predict first. That is the whole point of PRIMM."

"Write that down. The exact sequence. What the agent should call, what it should say, what it should NOT say."

James wrote. Five minutes. A page of expectations. Then Emma said: "Now test it. Send the message. Compare what you wrote to what actually happens."


You are doing exactly what James is doing. Before testing, you write down what a good tutoring session looks like. Then you test and grade the gaps. The gaps you find here become the motivation for AGENTS.md (Module 9.3, Chapter 9) and context engineering (Module 9.3, Chapter 10).

Step 1: Write Your Expectations

Before opening WhatsApp, write down what SHOULD happen when a learner sends "I want to learn about variables in Module 9.3, Chapter 1." Use this template:

Expected tool chain:

  1. Agent calls get_learner_state (or register_learner if new)
  2. Agent calls get_chapter_content for Module 9.3, Chapter 1
  3. Agent calls generate_guidance for the predict stage
  4. Agent responds with a prediction question, NOT a content dump
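Written down as data, the expected chain becomes something you can check mechanically in Step 4. A minimal Python sketch; the tool names are the ones from this chapter, but the badge lists in the example calls are hypothetical stand-ins for what your dashboard shows:

```python
# Expected tool chain for "I want to learn about variables in Module 9.3, Chapter 1".
EXPECTED_CHAIN = [
    "get_learner_state",   # or "register_learner" for a brand-new learner
    "get_chapter_content",
    "generate_guidance",   # predict stage
]

def chain_matches(expected: list[str], actual: list[str]) -> bool:
    """True if the expected tools appear, in order, within the actual badge list.
    Extra tools in between are tolerated; a wrong order is not."""
    it = iter(actual)
    return all(tool in it for tool in expected)

# Correct order passes, even with register_learner prepended:
print(chain_matches(EXPECTED_CHAIN,
      ["register_learner", "get_learner_state", "get_chapter_content", "generate_guidance"]))  # True
# Content fetched before the learner check fails the start-order expectation:
print(chain_matches(EXPECTED_CHAIN,
      ["get_chapter_content", "get_learner_state", "generate_guidance"]))  # False
```

The subsequence check (rather than exact equality) is deliberate: you care that the protocol order holds, not that the agent never calls anything else.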

Expected response qualities:

  • Asks the learner to predict before showing answers (PRIMM predict stage)
  • Does not dump the entire chapter content
  • Mentions or references the specific topic (variables)
  • Feels like a tutor, not a search engine

Write your version of this. Be specific. You are creating the acceptance criteria for your product.
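The response qualities can also be rough executable checks. A sketch with made-up heuristics; the cue words and the 120-word threshold are assumptions to tune against your own acceptance criteria, not product rules:

```python
# Illustrative acceptance checks for the tutor's first reply.
def looks_like_prediction_prompt(reply: str) -> bool:
    """Does the reply ask the learner to predict rather than hand over answers?"""
    cues = ("predict", "what do you think", "before we look")
    return "?" in reply and any(c in reply.lower() for c in cues)

def looks_like_content_dump(reply: str, max_words: int = 120) -> bool:
    """A wall of text with no question reads as a dump, not a tutoring turn."""
    return len(reply.split()) > max_words and "?" not in reply

reply = "Before we look at the chapter: what do you think a variable does?"
print(looks_like_prediction_prompt(reply))  # True
print(looks_like_content_dump(reply))       # False
```

Crude checks like these will misfire on edge cases; their value is forcing you to state, in advance, what "feels like a tutor" means.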

Step 2: Test from WhatsApp

Now send the message:

I want to learn about variables in Module 9.3, Chapter 1

Wait for the response. This is different from the Module 9.2 test, where one message triggered one tool. Here, the agent should chain multiple tools:

  1. register_learner or get_learner_state: The agent checks if you are a known learner. If not, it registers you first.
  2. get_chapter_content: The agent fetches the content for Module 9.3, Chapter 1.
  3. generate_guidance with the predict stage: The agent produces a PRIMM-Lite prompt asking you to predict what a variable does before showing the answer.

The response you receive should ask you to think first, not hand you the definition. These are the pedagogy tools at work: generate_guidance is shaping the interaction into a teaching session.

Continue the Conversation

Reply with a prediction. Something like:

I think a variable stores a value so you can use it later

The agent processes your reply through a second sequence of tools:

  1. assess_response: Evaluates your prediction against the expected understanding.
  2. update_progress: Records the interaction and adjusts your confidence score.
  3. generate_guidance with the run stage: Produces the next part of the lesson, now showing the actual content with guidance tailored to your prediction.

You are having a tutoring conversation. The agent is not running a script. It is selecting tools based on what you said, evaluating your response, and adapting the next step. Each message triggers a different combination of tools.

Step 3: Verify Tool Badges in the Dashboard

Go back to the dashboard. Find the conversation log for the messages you sent. For each message, you should see multiple tool badges showing which tools fired.

Your first message might show three or four badges: get_learner_state (or register_learner), get_chapter_content, generate_guidance.

Your reply might show two or three badges: assess_response, update_progress, generate_guidance.

This is the visible difference between Module 9.2 and Module 9.3. In Module 9.2, one message produced one badge. Here, one message produces multiple badges because the agent is orchestrating tools into a workflow.

What You See         | What It Means
Single tool badge    | Agent called one tool (Module 9.2 pattern)
Multiple tool badges | Agent chained tools into a sequence (Module 9.3 pattern)
No tool badge        | Agent generated a response without calling any tool (check if something went wrong)

The tool badges are your proof that the product is working. A well-phrased text response without badges could be the agent hallucinating an answer from training data. The badges confirm the tools actually ran.
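The three rows in the table reduce to a badge count per message. A tiny sketch, assuming (hypothetically) that your dashboard gives you a turn's badge names as a list:

```python
# Classify one conversational turn by its tool badges.
def classify_turn(badges: list[str]) -> str:
    if not badges:
        return "no tools ran - check if something went wrong"
    if len(badges) == 1:
        return "single tool call (Module 9.2 pattern)"
    return "tool chain (Module 9.3 pattern)"

print(classify_turn([]))
print(classify_turn(["get_exercises"]))
print(classify_turn(["get_learner_state", "get_chapter_content", "generate_guidance"]))
```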

Step 4: Grade the Gaps

Now compare your expectations from Step 1 to what actually happened.

Expectation                         | Actual                                       | Gap?
Agent calls get_learner_state first | Did it? Check the first badge.               | If it called get_chapter_content first, the session start order is wrong
Response asks learner to predict    | Did the response ask a prediction question?  | If it dumped content, generate_guidance is not shaping the response
Does not dump entire chapter        | Was the response a wall of text?             | If so, the agent ignored the system_prompt_addition
Feels like a tutor                  | Would you come back to this tomorrow?        | If it feels like a search engine, identity is missing

Write down every gap you found. These gaps are your TODO list for the next chapters:

  • Tool ordering problems → Module 9.3, Chapter 9 (AGENTS.md) defines the session protocol
  • Wrong tool selected → Module 9.3, Chapter 10 (context engineering) rewrites descriptions
  • Tests needed for these behaviors → Module 9.3, Chapters 11-12 (test suite)
  • No personality → Module 9.3, Chapter 17 (dedicated agent with SOUL.md)

The gap list is the most valuable artifact in this lesson. It turns "this works" into "this works the way I intended."
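The gap list itself can be generated mechanically. A hypothetical sketch: you record by hand, from the dashboard, which expectations held, and each failed expectation maps to the chapter that fixes it (the mapping mirrors the list above):

```python
# Expectation -> the chapter that addresses a failure of that expectation.
EXPECTATIONS = {
    "calls get_learner_state first": "Module 9.3, Chapter 9 (AGENTS.md)",
    "asks learner to predict":       "Module 9.3, Chapter 10 (context engineering)",
    "does not dump the chapter":     "Module 9.3, Chapter 10 (context engineering)",
    "feels like a tutor":            "Module 9.3, Chapter 17 (SOUL.md)",
}

def gap_report(actual: dict[str, bool]) -> list[str]:
    """One TODO line per expectation that did not hold (missing means failed)."""
    return [f"GAP: {exp} -> fix in {chapter}"
            for exp, chapter in EXPECTATIONS.items()
            if not actual.get(exp, False)]

observed = {"calls get_learner_state first": True,
            "asks learner to predict": False,
            "does not dump the chapter": True,
            "feels like a tutor": False}
for line in gap_report(observed):
    print(line)
```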

Try With AI

Exercise 1: Map the Tool Chain

Map the Tool Selection Logic. Scene: The learner sends the message: "I want to learn about variables in Chapter 1". Investigation: Explain the agent's decision-making process. Why does it select those specific tool badges in that exact order? What would be the systemic impact if one of those tools was temporarily offline?

What you are learning: The agent selects tools based on their descriptions and the user's message. Understanding the selection logic helps you predict which tools fire for different messages. Removing a tool does not cause an error; the agent works around it, but the experience degrades.

Exercise 2: Stress Test with Ambiguity

Conduct a Cross-Functional Stress Test. Task: Predict the tool sequences for the following ambiguous messages, then verify your predictions against the dashboard badges:
  1. Quiz me on Chapter 2
  2. How am I doing overall?
  3. I want to upgrade to the paid plan
Analysis: Does the agent reliably map intent to the correct underlying tool? Identify any selection collisions.

What you are learning: Different message types trigger different tool combinations. "Quiz me" should invoke get_exercises. "How am I doing" should invoke get_learner_state. "Upgrade" should invoke get_upgrade_url. Predicting before checking builds intuition for how tool descriptions drive selection.
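The idea that descriptions drive selection can be made concrete with a toy scorer. This is an illustration only, not the agent's real selection algorithm; the descriptions are paraphrased from this chapter's tool surface:

```python
# Toy model of description-driven tool selection: score each tool by keyword
# overlap between its description and the incoming message.
TOOL_DESCRIPTIONS = {
    "get_exercises":     "quiz practice exercises questions for a chapter",
    "get_learner_state": "progress how the learner is doing overall confidence",
    "get_upgrade_url":   "upgrade paid plan subscription billing",
}

def pick_tool(message: str) -> str:
    words = set(message.lower().replace("?", "").split())
    scores = {tool: len(words & set(desc.split()))
              for tool, desc in TOOL_DESCRIPTIONS.items()}
    return max(scores, key=scores.get)

print(pick_tool("Quiz me on Chapter 2"))                # get_exercises
print(pick_tool("How am I doing overall?"))             # get_learner_state
print(pick_tool("I want to upgrade to the paid plan"))  # get_upgrade_url
```

Even this crude version shows why vague or overlapping descriptions cause selection collisions: two descriptions sharing the message's keywords tie, and the choice becomes arbitrary.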

Exercise 3: Compare Module 9.2 and Module 9.3

Compare Architecture vs Experience (Module 9.2 vs Module 9.3). Context: In Module 9.2, you tested a single function call from WhatsApp. In Module 9.3, you are testing a full pedagogical ecosystem. Question: What has changed in the agent's autonomous behavior? How does having a surface of nine tools transform the user experience from a utility to a product?

What you are learning: The connection process is identical. The product difference comes entirely from tool count and tool descriptions. More tools give the agent more choices. Better descriptions give it better judgment. The protocol does not change; the experience does.


James scrolled through the dashboard. Four tool badges on the first message. Three on his reply. The agent had selected different tools for each turn based on what he said.

"In Module 9.2, I had one tool and one badge," he said. "Now I have nine tools and the agent is chaining them into a tutoring session. The connection was the same two commands. The experience is completely different."

He took a screenshot of the conversation. The WhatsApp thread showed a tutor that asked him to predict, evaluated his answer, and adjusted the next step. The dashboard showed exactly which tools made that happen.

Emma looked at the screenshot and then at the dashboard. "Tool chaining does not equal coherent experience, though."

James looked up. "What do you mean? This worked perfectly."

"My first multi-tool product looked impressive in the dashboard. Five tools firing, badges everywhere. But the actual conversation felt disjointed. The agent would call generate_guidance and then immediately dump content without waiting for the learner to respond. Or it would call assess_response on a message that was not actually an answer." She paused. "The tools worked. The orchestration was wrong."

"So how did you fix it?"

"AGENTS.md." Emma pointed at the project directory. "A document that tells the agent how to use the tools. When to call each one. The order of operations for a tutoring session. You have nine working tools. Next lesson, you write the instruction manual that makes them work together coherently."