USMAN’S INSIGHTS
AI ARCHITECT
The Technical Litmus: Designing a 24-Point Test Matrix for AI Agents
© 2026 Muhammad Usman Akbar. All rights reserved.

The Test Suite

Emma walked in to find James scrolling through his WhatsApp conversation with TutorClaw. Nine tools, all working. He had tested each one by hand over the past three sessions: register a learner, fetch content, generate guidance, assess a response, submit code, get an upgrade URL.

"Nine tools. All working from WhatsApp. How do you know they still work tomorrow?"

James looked up. "I test them."

"Every tool, every time? By hand?"

"That is the point of the test suite." James paused. "But I do not know where to start. Nine tools, each with different behaviors. Some are gated by tier. Some write to JSON files. That is a lot of combinations."

Emma pulled a chair over. "Start with a matrix. List every tool down the left side. List every kind of test across the top. Fill in the cells. That is your test design. Then describe it to Claude Code and let it write the actual tests."

"I design what to test. Claude Code writes how to test it."

"Exactly. The matrix is the hard part. The pytest code is the easy part."


You are doing exactly what James is doing. You will design a test matrix for all 9 TutorClaw tools, describe your test requirements to Claude Code, and let it generate a complete pytest suite. Some tests will fail. That is normal. Module 9.3, Chapter 12 is dedicated to fixing failures and adding edge cases.

Step 1: Design the Test Matrix

Before sending anything to Claude Code, design the matrix on paper (or in a text file). Every tool gets tested across four categories. Not every category applies to every tool.

The four test categories:

| Category | What It Tests | Which Tools Need It |
| --- | --- | --- |
| Valid input | Correct parameters produce correct response | All 9 tools |
| Invalid input | Missing or wrong parameters produce a clear error | All 9 tools |
| Tier gating | Free tier blocked from premium content, paid tier gets access | get_chapter_content, get_exercises, submit_code |
| State persistence | Register a learner, simulate a restart, verify data survives | register_learner, get_learner_state, update_progress |

Your test matrix:

| Tool | Valid Input | Invalid Input | Tier Gating | State Persistence |
| --- | --- | --- | --- | --- |
| register_learner | Yes | Yes | No | Yes |
| get_learner_state | Yes | Yes | No | Yes |
| update_progress | Yes | Yes | No | Yes |
| get_chapter_content | Yes | Yes | Yes | No |
| get_exercises | Yes | Yes | Yes | No |
| generate_guidance | Yes | Yes | No | No |
| assess_response | Yes | Yes | No | No |
| submit_code | Yes | Yes | Yes | No |
| get_upgrade_url | Yes | Yes | No | No |

Count the cells: 9 tools with valid input (9 tests) + 9 tools with invalid input (9 tests) + 3 tools with tier gating (3 tests) + 3 tools with state persistence (3 tests) = at least 24 tests across the suite.

This matrix is the design artifact. It tells you exactly what coverage you need before a single line of test code exists.
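To keep the count honest, the matrix can even be encoded as plain data and tallied. The dictionary below is just an illustration of the Step 1 matrix, not part of the TutorClaw codebase:

```python
# The test matrix from Step 1, encoded as data.
# Each tool maps to the set of test categories that apply to it.
MATRIX = {
    "register_learner":    {"valid", "invalid", "persistence"},
    "get_learner_state":   {"valid", "invalid", "persistence"},
    "update_progress":     {"valid", "invalid", "persistence"},
    "get_chapter_content": {"valid", "invalid", "tier"},
    "get_exercises":       {"valid", "invalid", "tier"},
    "generate_guidance":   {"valid", "invalid"},
    "assess_response":     {"valid", "invalid"},
    "submit_code":         {"valid", "invalid", "tier"},
    "get_upgrade_url":     {"valid", "invalid"},
}

# One test per (tool, category) cell: 9 + 9 + 3 + 3 = 24.
total_tests = sum(len(categories) for categories in MATRIX.values())
print(total_tests)  # 24
```

If the dictionary and the table ever disagree, the table wins; the point is only that 24 filled cells means at least 24 tests.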

Step 2: Plan the Test Organization

Tests are easier to maintain when they are grouped by tool category, not crammed into one giant file. The test files match the tool groups from Module 9.3, Chapters 3 through 6:

| Test File | Tools Covered | Why This Grouping |
| --- | --- | --- |
| tests/test_state_tools.py | register_learner, get_learner_state, update_progress | All three write and read JSON state files |
| tests/test_content_tools.py | get_chapter_content, get_exercises | Both read local content and have tier gating |
| tests/test_pedagogy_tools.py | generate_guidance, assess_response | Both implement PRIMM-Lite logic |
| tests/test_code_tools.py | submit_code | Code execution with tier gating |
| tests/test_monetization_tools.py | get_upgrade_url | Stripe checkout link generation |

Five test files. Each covers a logical group. When a state tool breaks, you look in test_state_tools.py, not in a 500-line monolith.
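In practice, the five files usually share their setup through a tests/conftest.py. The sketch below is hypothetical, not the code Claude Code will generate; in particular, the TUTORCLAW_STATE_DIR environment variable is an assumption about how the server might locate its JSON files:

```python
# tests/conftest.py -- hypothetical sketch of shared fixtures
import json

import pytest


@pytest.fixture
def state_dir(tmp_path, monkeypatch):
    """Point every test at a throwaway state directory."""
    d = tmp_path / "state"
    d.mkdir()
    # Assumption: the server reads its state location from an env var.
    monkeypatch.setenv("TUTORCLAW_STATE_DIR", str(d))
    return d


@pytest.fixture
def seeded_learner(state_dir):
    """Write one known learner record so read-path tests have data."""
    record = {"learner_id": "L1", "name": "Ada", "tier": "free"}
    (state_dir / "learners.json").write_text(json.dumps({"L1": record}))
    return record
```

Any test that accepts state_dir or seeded_learner as a parameter gets a fresh copy, which is exactly the isolation requirement described in Step 3.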

Step 3: Describe Test Requirements to Claude Code

Open Claude Code in your tutorclaw-mcp project. Send this prompt:

```text
I need a robust Pytest suite for the TutorClaw MCP server.

1. Test Organization
Generate the following files to maintain logical isolation:
- tests/test_state_tools.py: [register_learner, get_learner_state, update_progress]
- tests/test_content_tools.py: [get_chapter_content, get_exercises]
- tests/test_pedagogy_tools.py: [generate_guidance, assess_response]
- tests/test_code_tools.py: [submit_code]
- tests/test_monetization_tools.py: [get_upgrade_url]

2. Test Categories
Apply these to each tool based on the following matrix logic:
- Valid Input: Parameters -> Expected Response (Assert success).
- Invalid Input: Missing/Wrong values -> Error Code (Assert no crashes).
- Tier Gating: [get_chapter_content, get_exercises, submit_code] (Assert Free vs. Paid boundaries).
- Persistence: [register_learner, get_learner_state, update_progress] (Assert JSON reload survival).

3. Isolation Requirements
- Use tmp_path or equivalent fixtures for state and content directories.
- Ensure ZERO interaction with production data.
- Mock external dependencies where necessary.

Execute and verify coverage.
```

Notice what this prompt contains:

  • File organization (which test file covers which tools)
  • Test categories with clear criteria for each
  • Which categories apply to which tools (the matrix from Step 1)
  • Isolation requirement (temporary directories)

It does not contain any Python. No def test_ functions. No assert statements. No fixture code. Those are implementation decisions. Claude Code handles them.

Step 4: Review What Claude Code Built

Claude Code generates the five test files. Before running anything, review the structure. Ask:

```text
Show me the test function names in each file. I want to verify coverage against my test matrix before running anything.
```

Claude Code should list something like:

tests/test_state_tools.py:

  • test_register_learner_valid
  • test_register_learner_missing_name
  • test_register_learner_persists_after_reload
  • test_get_learner_state_valid
  • test_get_learner_state_unknown_learner
  • test_get_learner_state_persists_after_reload
  • test_update_progress_valid
  • test_update_progress_missing_learner_id
  • test_update_progress_persists_after_reload
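For orientation, a name like test_register_learner_persists_after_reload might expand into something like the sketch below. The register_learner and get_learner_state functions here are simplified stand-ins for the real TutorClaw tools, present only so the reload pattern is concrete:

```python
import json
from pathlib import Path


def register_learner(state_dir: Path, learner_id: str, name: str) -> dict:
    """Simplified stand-in: append a learner record to learners.json."""
    path = state_dir / "learners.json"
    data = json.loads(path.read_text()) if path.exists() else {}
    data[learner_id] = {"learner_id": learner_id, "name": name, "tier": "free"}
    path.write_text(json.dumps(data))
    return data[learner_id]


def get_learner_state(state_dir: Path, learner_id: str):
    """Simplified stand-in: re-read learners.json from disk."""
    path = state_dir / "learners.json"
    if not path.exists():
        return None
    return json.loads(path.read_text()).get(learner_id)


def test_register_learner_persists_after_reload(tmp_path):
    # tmp_path is pytest's built-in temporary-directory fixture.
    register_learner(tmp_path, "L1", "Ada")
    # Nothing is cached in memory, so this read simulates a restart:
    # the data must come straight from the JSON file on disk.
    state = get_learner_state(tmp_path, "L1")
    assert state is not None
    assert state["name"] == "Ada"
```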

tests/test_content_tools.py:

  • test_get_chapter_content_valid
  • test_get_chapter_content_invalid_chapter
  • test_get_chapter_content_free_tier_blocked
  • test_get_chapter_content_paid_tier_allowed
  • test_get_exercises_valid
  • test_get_exercises_invalid_chapter
  • test_get_exercises_free_tier_blocked
  • test_get_exercises_paid_tier_allowed

Compare these names against your matrix. Every cell in the matrix should have at least one matching test function. If a cell is missing, steer Claude Code:

```text
The matrix says submit_code needs a tier gating test. I do not see test_submit_code_free_tier_blocked in test_code_tools.py. Please add it.
```

This is the same describe-steer-verify cycle from building the tools. Now you are applying it to tests.
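As a concrete target for that steering prompt, a tier-gating test might look like the sketch below. This submit_code is a simplified stand-in that only enforces the free/paid boundary; the real tool also executes the submitted code:

```python
def submit_code(tier: str, code: str) -> dict:
    """Simplified stand-in: enforce the tier gate, skip real execution."""
    if tier != "paid":
        return {"error": "upgrade_required",
                "message": "Code submission is a paid feature."}
    return {"status": "accepted", "chars": len(code)}


def test_submit_code_free_tier_blocked():
    result = submit_code(tier="free", code="print('hi')")
    assert result.get("error") == "upgrade_required"


def test_submit_code_paid_tier_allowed():
    result = submit_code(tier="paid", code="print('hi')")
    assert result["status"] == "accepted"
```

The important part is the pair: one test proves the free tier is blocked, the other proves the paid tier is not. A gate with only one of the two tests can silently fail open or fail closed.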

Step 5: Run the Suite

```bash
uv run pytest -v
```

The -v flag shows each test name and its result. You will see a mix of passes and failures.

Expected output looks something like:

```text
tests/test_state_tools.py::test_register_learner_valid PASSED
tests/test_state_tools.py::test_register_learner_missing_name PASSED
tests/test_state_tools.py::test_register_learner_persists PASSED
tests/test_content_tools.py::test_get_chapter_content_valid PASSED
tests/test_content_tools.py::test_get_chapter_content_free_tier FAILED
tests/test_pedagogy_tools.py::test_generate_guidance_valid PASSED
tests/test_pedagogy_tools.py::test_assess_response_valid FAILED
tests/test_code_tools.py::test_submit_code_valid PASSED
tests/test_code_tools.py::test_submit_code_free_tier FAILED
...
```

Some tests pass. Some fail. That is the point. A test suite that passes 100% on the first run is either trivial or lying. The failures tell you where the problems are.

Step 6: Read the Failures (Do Not Fix Yet)

Look at each failure message. Note which tool, which category, and what went wrong. Do not ask Claude Code to fix anything yet. That is Module 9.3, Chapter 12.

For now, record the failures:

```text
Summarize the current test failures. For each failing test, tell me:
1. Which tool is involved.
2. Which test category (valid, invalid, tier gating, persistence).
3. What the expected result was vs. what the actual result was.
Do not fix anything yet. Just report the analysis.
```

This is your test report. It is a precise map of what works and what does not. No guessing, no "it seemed fine from WhatsApp." The suite proves it.


James ran uv run pytest -v and watched the output scroll. Green dots. Red dots. A final summary line.

"Three failures." He frowned.

"Good." Emma leaned forward and read the screen. "Three failures out of twenty-four tests. You know exactly where the problems are. That is the point of a test suite."

"But they were working from WhatsApp."

"Working from WhatsApp means the happy path works when you test it by hand, once, with the exact input you happened to type. Working from a test suite means every path works, every time, with every kind of input." She pointed at the three red lines. "Those failures are a gift. You know the tool name, the test category, and the exact mismatch between expected and actual."

James stared at the failures. "I thought I would have to debug all of this myself."

"You describe the failures to Claude Code. It fixes them. Same workflow as building the tools." Emma paused, then added more quietly: "I will be honest. I never know if a test suite is thorough enough. Three categories per tool is a solid start. Edge cases always surprise you later. But you have a foundation now, and that matters more than perfection."

She tapped the screen. "Module 9.3, Chapter 12. Fix the failures. Add the edge cases that this run revealed. Get to green."

Try With AI

Exercise 1: Audit the Test Matrix

Review the test matrix and ask Claude Code whether any categories are missing:

```text
Analyze my test matrix for TutorClaw:
- Valid / Invalid Input: All 9 tools.
- Tier Gating: [get_chapter_content, get_exercises, submit_code].
- State Persistence: [register_learner, get_learner_state, update_progress].
Identify any missing categories. Consider boundary conditions such as empty strings, extreme string lengths, or special characters in learner names. Suggest additional test categories but do not write the tests yet.
```

What you are learning: A test matrix is a living document. The first version covers the obvious categories. Reviewing it with AI reveals gaps you had not considered, like input boundary conditions and concurrent access patterns.
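One family of tests such an audit usually surfaces is boundary conditions on learner names. The sketch below uses a hypothetical validate_learner_name helper (not a real TutorClaw function) just to show the pytest.mark.parametrize pattern that covers many boundaries in one test:

```python
import pytest


def validate_learner_name(name: str) -> bool:
    """Hypothetical helper: non-empty after stripping, at most 100 chars."""
    return 0 < len(name.strip()) <= 100


@pytest.mark.parametrize("name,ok", [
    ("Ada", True),          # ordinary name
    ("", False),            # empty string
    ("   ", False),         # whitespace only
    ("x" * 101, False),     # over the length limit
    ("José-María", True),   # non-ASCII characters and a hyphen
])
def test_learner_name_boundaries(name, ok):
    assert validate_learner_name(name) is ok
```

Each tuple in the parametrize list is one boundary case, so adding a newly discovered edge case costs one line, not one new test function.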

Exercise 2: Explain the Isolation Requirement

Ask Claude Code to explain what would happen without test isolation:

```text
My TutorClaw tests use temporary directories for JSON files. Explain the consequences if the tests read from and wrote to the production data/learners.json file instead. Identify three specific risks or failure modes that would occur.
```

What you are learning: Test isolation is not a best practice you memorize. It is a design decision with concrete consequences. Understanding the failure mode (corrupted production data, tests that pass locally but fail on a clean machine, tests that interfere with each other) makes the decision obvious rather than arbitrary.
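The failure mode is easy to demonstrate with two simulated tests sharing one file. Everything below is illustrative stdlib code, not TutorClaw itself:

```python
import json
import tempfile
from pathlib import Path


def register(path: Path, learner_id: str) -> None:
    """Append a learner record to a shared JSON file."""
    data = json.loads(path.read_text()) if path.exists() else {}
    data[learner_id] = {"tier": "free"}
    path.write_text(json.dumps(data))


with tempfile.TemporaryDirectory() as d:
    shared = Path(d) / "learners.json"  # one file shared by both "tests"
    register(shared, "L1")              # test A runs first
    register(shared, "L2")              # test B runs second...
    count = len(json.loads(shared.read_text()))
    # Test B expected to see exactly 1 learner but sees 2: its result now
    # depends on whether test A ran first. That is test-order coupling.
    print(count)  # 2
```

Give each test its own tmp_path directory and the coupling disappears, because no test can observe another test's writes.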

Exercise 3: Compare to Manual Testing

Ask Claude Code to compare your automated suite to the manual WhatsApp testing you did in Module 9.3, Chapter 8:

```text
Compare the current automated Pytest suite against the manual WhatsApp testing performed in Module 9.3, Chapter 8.
Analysis Points:
- What does the Pytest suite catch that WhatsApp testing misses?
- What does WhatsApp testing catch that Pytest misses?
- When should each method be prioritized?
```

What you are learning: Automated tests and manual integration tests serve different purposes. Neither replaces the other. Knowing when to use each saves you from both false confidence (all tests pass, but the real flow is broken) and wasted effort (manually testing what a script can verify in seconds).