USMAN’S INSIGHTS
AI ARCHITECT
The Integration Gap: Measuring Expected vs. Actual AI Behavior


© 2026 Muhammad Usman Akbar. All rights reserved.


Expected vs Actual

Emma handed James a blank sheet. "Before you send a single WhatsApp message, write down what you think should happen. Not what the tools do. What the EXPERIENCE should be."

James thought about it. "The learner says 'teach me about variables.' The tutor should check who they are, pull the chapter content, and then... not dump it. Ask them to predict first. That is the whole point of PRIMM."

"Write that down. The exact sequence. What the agent should call, what it should say, what it should NOT say."

James wrote. Five minutes. A page of expectations. Then Emma said: "Now test it. Send the message. Compare what you wrote to what actually happens."


You are doing exactly what James is doing. Before testing, you write down what a good tutoring session looks like. Then you test and grade the gaps. The gaps you find here become the motivation for AGENTS.md (Module 9.3, Chapter 9) and context engineering (Module 9.3, Chapter 10).

Step 1: Write Your Expectations

Before opening WhatsApp, write down what SHOULD happen when a learner sends "I want to learn about variables in Module 9.3, Chapter 1." Use this template:

Expected tool chain:

  1. Agent calls get_learner_state (or register_learner if new)
  2. Agent calls get_chapter_content for Module 9.3, Chapter 1
  3. Agent calls generate_guidance for the predict stage
  4. Agent responds with a prediction question, NOT a content dump
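Written down as data, the expected chain becomes something you can check mechanically in Step 4. A minimal Python sketch; the tool names are the ones from this chapter, but the badge lists in the example calls are hypothetical stand-ins for what your dashboard shows:

```python
# Expected tool chain for "I want to learn about variables in Module 9.3, Chapter 1".
EXPECTED_CHAIN = [
    "get_learner_state",   # or "register_learner" for a brand-new learner
    "get_chapter_content",
    "generate_guidance",   # predict stage
]

def chain_matches(expected: list[str], actual: list[str]) -> bool:
    """True if the expected tools appear, in order, within the actual badge list.
    Extra tools in between are tolerated; a wrong order is not."""
    it = iter(actual)
    return all(tool in it for tool in expected)

# Correct order passes, even with register_learner prepended:
print(chain_matches(EXPECTED_CHAIN,
      ["register_learner", "get_learner_state", "get_chapter_content", "generate_guidance"]))  # True
# Content fetched before the learner check fails the start-order expectation:
print(chain_matches(EXPECTED_CHAIN,
      ["get_chapter_content", "get_learner_state", "generate_guidance"]))  # False
```

The subsequence check (rather than exact equality) is deliberate: you care that the protocol order holds, not that the agent never calls anything else.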

Expected response qualities:

  • Asks the learner to predict before showing answers (PRIMM predict stage)
  • Does not dump the entire chapter content
  • Mentions or references the specific topic (variables)
  • Feels like a tutor, not a search engine

Write your version of this. Be specific. You are creating the acceptance criteria for your product.
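The response qualities can also be rough executable checks. A sketch with made-up heuristics; the cue words and the 120-word threshold are assumptions to tune against your own acceptance criteria, not product rules:

```python
# Illustrative acceptance checks for the tutor's first reply.
def looks_like_prediction_prompt(reply: str) -> bool:
    """Does the reply ask the learner to predict rather than hand over answers?"""
    cues = ("predict", "what do you think", "before we look")
    return "?" in reply and any(c in reply.lower() for c in cues)

def looks_like_content_dump(reply: str, max_words: int = 120) -> bool:
    """A wall of text with no question reads as a dump, not a tutoring turn."""
    return len(reply.split()) > max_words and "?" not in reply

reply = "Before we look at the chapter: what do you think a variable does?"
print(looks_like_prediction_prompt(reply))  # True
print(looks_like_content_dump(reply))       # False
```

Crude checks like these will misfire on edge cases; their value is forcing you to state, in advance, what "feels like a tutor" means.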

Step 2: Test from WhatsApp

Now send the message:

I want to learn about variables in Module 9.3, Chapter 1

Wait for the response. This is different from the Module 9.2 test, where one message triggered one tool. Here, the agent should chain multiple tools:

  1. register_learner or get_learner_state: The agent checks if you are a known learner. If not, it registers you first.
  2. get_chapter_content: The agent fetches the content for Module 9.3, Chapter 1.
  3. generate_guidance with the predict stage: The agent produces a PRIMM-Lite prompt asking you to predict what a variable does before showing the answer.

The response you receive should ask you to think first, not hand you the definition. These are the pedagogy tools at work: generate_guidance is shaping the interaction into a teaching session.

Continue the Conversation

Reply with a prediction. Something like:

I think a variable stores a value so you can use it later

The agent processes your reply through a second sequence of tools:

  1. assess_response: Evaluates your prediction against the expected understanding.
  2. update_progress: Records the interaction and adjusts your confidence score.
  3. generate_guidance with the run stage: Produces the next part of the lesson, now showing the actual content with guidance tailored to your prediction.

You are having a tutoring conversation. The agent is not running a script. It is selecting tools based on what you said, evaluating your response, and adapting the next step. Each message triggers a different combination of tools.

Step 3: Verify Tool Badges in the Dashboard

Go back to the dashboard. Find the conversation log for the messages you sent. For each message, you should see multiple tool badges showing which tools fired.

Your first message might show three or four badges: get_learner_state (or register_learner), get_chapter_content, generate_guidance.

Your reply might show two or three badges: assess_response, update_progress, generate_guidance.

This is the visible difference between Module 9.2 and Module 9.3. In Module 9.2, one message produced one badge. Here, one message produces multiple badges because the agent is orchestrating tools into a workflow.

What You See         | What It Means
Single tool badge    | Agent called one tool (Module 9.2 pattern)
Multiple tool badges | Agent chained tools into a sequence (Module 9.3 pattern)
No tool badge        | Agent generated a response without calling any tool (check if something went wrong)

The tool badges are your proof that the product is working. A well-phrased text response without badges could be the agent hallucinating an answer from training data. The badges confirm the tools actually ran.
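The three rows in the table reduce to a badge count per message. A tiny sketch, assuming (hypothetically) that your dashboard gives you a turn's badge names as a list:

```python
# Classify one conversational turn by its tool badges.
def classify_turn(badges: list[str]) -> str:
    if not badges:
        return "no tools ran - check if something went wrong"
    if len(badges) == 1:
        return "single tool call (Module 9.2 pattern)"
    return "tool chain (Module 9.3 pattern)"

print(classify_turn([]))
print(classify_turn(["get_exercises"]))
print(classify_turn(["get_learner_state", "get_chapter_content", "generate_guidance"]))
```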

Step 4: Grade the Gaps

Now compare your expectations from Step 1 to what actually happened.

Expectation                         | Actual                                       | Gap?
Agent calls get_learner_state first | Did it? Check the first badge.               | If it called get_chapter_content first, the session start order is wrong
Response asks learner to predict    | Did the response ask a prediction question?  | If it dumped content, generate_guidance is not shaping the response
Does not dump entire chapter        | Was the response a wall of text?             | If so, the agent ignored the system_prompt_addition
Feels like a tutor                  | Would you come back to this tomorrow?        | If it feels like a search engine, identity is missing

Write down every gap you found. These gaps are your TODO list for the next chapters:

  • Tool ordering problems → Module 9.3, Chapter 9 (AGENTS.md) defines the session protocol
  • Wrong tool selected → Module 9.3, Chapter 10 (context engineering) rewrites descriptions
  • Tests needed for these behaviors → Module 9.3, Chapters 11-12 (test suite)
  • No personality → Module 9.3, Chapter 17 (dedicated agent with SOUL.md)

The gap list is the most valuable artifact in this lesson. It turns "this works" into "this works the way I intended."
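The gap list itself can be generated mechanically. A hypothetical sketch: you record by hand, from the dashboard, which expectations held, and each failed expectation maps to the chapter that fixes it (the mapping mirrors the list above):

```python
# Expectation -> the chapter that addresses a failure of that expectation.
EXPECTATIONS = {
    "calls get_learner_state first": "Module 9.3, Chapter 9 (AGENTS.md)",
    "asks learner to predict":       "Module 9.3, Chapter 10 (context engineering)",
    "does not dump the chapter":     "Module 9.3, Chapter 10 (context engineering)",
    "feels like a tutor":            "Module 9.3, Chapter 17 (SOUL.md)",
}

def gap_report(actual: dict[str, bool]) -> list[str]:
    """One TODO line per expectation that did not hold (missing means failed)."""
    return [f"GAP: {exp} -> fix in {chapter}"
            for exp, chapter in EXPECTATIONS.items()
            if not actual.get(exp, False)]

observed = {"calls get_learner_state first": True,
            "asks learner to predict": False,
            "does not dump the chapter": True,
            "feels like a tutor": False}
for line in gap_report(observed):
    print(line)
```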

Try With AI

Exercise 1: Map the Tool Chain

Map the Tool Selection Logic. Scene: The learner sends the message: "I want to learn about variables in Chapter 1". Investigation: Explain the agent's decision-making process. Why does it select those specific tool badges in that exact order? What would be the systemic impact if one of those tools was temporarily offline?

What you are learning: The agent selects tools based on their descriptions and the user's message. Understanding the selection logic helps you predict which tools fire for different messages. Removing a tool does not cause an error; the agent works around it, but the experience degrades.

Exercise 2: Stress Test with Ambiguity

Conduct a Cross-Functional Stress Test. Task: Predict the tool sequences for the following ambiguous messages, then verify your predictions against the dashboard badges:
  1. Quiz me on Chapter 2
  2. How am I doing overall?
  3. I want to upgrade to the paid plan
Analysis: Does the agent reliably map intent to the correct underlying tool? Identify any selection collisions.

What you are learning: Different message types trigger different tool combinations. "Quiz me" should invoke get_exercises. "How am I doing" should invoke get_learner_state. "Upgrade" should invoke get_upgrade_url. Predicting before checking builds intuition for how tool descriptions drive selection.
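The idea that descriptions drive selection can be made concrete with a toy scorer. This is an illustration only, not the agent's real selection algorithm; the descriptions are paraphrased from this chapter's tool surface:

```python
# Toy model of description-driven tool selection: score each tool by keyword
# overlap between its description and the incoming message.
TOOL_DESCRIPTIONS = {
    "get_exercises":     "quiz practice exercises questions for a chapter",
    "get_learner_state": "progress how the learner is doing overall confidence",
    "get_upgrade_url":   "upgrade paid plan subscription billing",
}

def pick_tool(message: str) -> str:
    words = set(message.lower().replace("?", "").split())
    scores = {tool: len(words & set(desc.split()))
              for tool, desc in TOOL_DESCRIPTIONS.items()}
    return max(scores, key=scores.get)

print(pick_tool("Quiz me on Chapter 2"))                # get_exercises
print(pick_tool("How am I doing overall?"))             # get_learner_state
print(pick_tool("I want to upgrade to the paid plan"))  # get_upgrade_url
```

Even this crude version shows why vague or overlapping descriptions cause selection collisions: two descriptions sharing the message's keywords tie, and the choice becomes arbitrary.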

Exercise 3: Compare Module 9.2 and Module 9.3

Compare Architecture vs Experience (Module 9.2 vs Module 9.3). Context: In Module 9.2, you tested a single function call from WhatsApp. In Module 9.3, you are testing a full pedagogical ecosystem. Question: What has changed in the agent's autonomous behavior? How does having a surface of nine tools transform the user experience from a utility to a product?

What you are learning: The connection process is identical. The product difference comes entirely from tool count and tool descriptions. More tools give the agent more choices. Better descriptions give it better judgment. The protocol does not change; the experience does.


James scrolled through the dashboard. Four tool badges on the first message. Three on his reply. The agent had selected different tools for each turn based on what he said.

"In Module 9.2, I had one tool and one badge," he said. "Now I have nine tools and the agent is chaining them into a tutoring session. The connection was the same two commands. The experience is completely different."

He took a screenshot of the conversation. The WhatsApp thread showed a tutor that asked him to predict, evaluated his answer, and adjusted the next step. The dashboard showed exactly which tools made that happen.

Emma looked at the screenshot and then at the dashboard. "Tool chaining does not equal coherent experience, though."

James looked up. "What do you mean? This worked perfectly."

"My first multi-tool product looked impressive in the dashboard. Five tools firing, badges everywhere. But the actual conversation felt disjointed. The agent would call generate_guidance and then immediately dump content without waiting for the learner to respond. Or it would call assess_response on a message that was not actually an answer." She paused. "The tools worked. The orchestration was wrong."

"So how did you fix it?"

"AGENTS.md." Emma pointed at the project directory. "A document that tells the agent how to use the tools. When to call each one. The order of operations for a tutoring session. You have nine working tools. Next lesson, you write the instruction manual that makes them work together coherently."