Emma walked in to find James scrolling through his WhatsApp conversation with TutorClaw. Nine tools, all working. He had tested each one by hand over the past three sessions: register a learner, fetch content, generate guidance, assess a response, submit code, get an upgrade URL.
"Nine tools. All working from WhatsApp. How do you know they will still work tomorrow?"
James looked up. "I test them."
"Every tool, every time? By hand?"
"That is why I need a test suite." James paused. "But I do not know where to start. Nine tools, each with different behaviors. Some are gated by tier. Some write to JSON files. That is a lot of combinations."
Emma pulled a chair over. "Start with a matrix. List every tool down the left side. List every kind of test across the top. Fill in the cells. That is your test design. Then describe it to Claude Code and let it write the actual tests."
"I design what to test. Claude Code writes how to test it."
"Exactly. The matrix is the hard part. The pytest code is the easy part."
You are doing exactly what James is doing. You will design a test matrix for all 9 TutorClaw tools, describe your test requirements to Claude Code, and let it generate a complete pytest suite. Some tests will fail. That is normal. Module 9.3, Chapter 12 is dedicated to fixing failures and adding edge cases.
Before sending anything to Claude Code, design the matrix on paper (or in a text file). Every tool gets tested across four categories. Not every category applies to every tool.
The four test categories:

- Valid input: the tool returns the expected result when called correctly.
- Invalid input: the tool rejects missing or malformed input with a clear error instead of crashing.
- Tier gating: tools restricted by subscription tier refuse learners below the required tier.
- State persistence: tools that write to JSON files actually save their data and can read it back.
Your test matrix: one row per tool, one column per category, and a check in each cell where that category applies to that tool.
Count the cells: 9 tools with valid input (9 tests) + 9 tools with invalid input (9 tests) + 3 tools with tier gating (3 tests) + 3 tools with state persistence (3 tests) = at least 24 tests across the suite.
This matrix is the design artifact. It tells you exactly what coverage you need before a single line of test code exists.
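The matrix can be captured as plain data before any test code exists. The sketch below is one way to do that in Python; the tool names and which tools fall into which category are assumptions inferred from the WhatsApp walkthrough, and only the shape (9 tools, 4 categories, at least 24 cells) comes from the chapter:

```python
# Hypothetical sketch of the test matrix as data. Tool names and the
# category assignments are illustrative; the counts (9 + 9 + 3 + 3 = 24)
# match the chapter's minimum.
MATRIX = {
    "register_learner": {"valid", "invalid", "persists"},
    "get_progress":     {"valid", "invalid", "persists"},
    "update_progress":  {"valid", "invalid", "persists"},
    "get_content":      {"valid", "invalid"},
    "search_content":   {"valid", "invalid"},
    "get_guidance":     {"valid", "invalid", "tier_gated"},
    "assess_response":  {"valid", "invalid", "tier_gated"},
    "submit_code":      {"valid", "invalid", "tier_gated"},
    "get_upgrade_url":  {"valid", "invalid"},
}

# Each checked cell becomes at least one test function.
total_cells = sum(len(categories) for categories in MATRIX.values())
assert total_cells == 24
```

Writing the matrix down as data has a side benefit: the cell count is computed, not hand-tallied, so the "at least 24 tests" claim stays honest as tools are added.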
Tests are easier to maintain when they are grouped by tool category, not crammed into one giant file. The test files match the tool groups from Module 9.3, Chapters 3 through 6:
Five test files. Each covers a logical group. When a state tool breaks, you look in test_state_tools.py, not in a 500-line monolith.
Open Claude Code in your tutorclaw-mcp project. Send this prompt:
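The exact wording is yours to write. A prompt along these lines would work; the tier-gating phrasing and the unnamed file groups are assumptions, since only two of the five file names appear in this chapter:

```text
Generate a pytest suite for the 9 TutorClaw MCP tools in this project.

Organize the tests into five files matching the tool groups, including
test_state_tools.py and test_content_tools.py.

For every tool, write one test with valid input and one with invalid
input. For the 3 tier-gated tools, add a test that a learner below the
required tier is refused. For the 3 tools that write JSON files, add a
test that state persists, using a temporary directory so tests never
touch the real data files.

That is at least 24 tests. Use fixtures for shared setup.
```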
Notice what this prompt contains: the tool count, the four test categories, the file organization, and the minimum test count. And notice what it leaves out.
It does not contain any Python. No def test_ functions. No assert statements. No fixture code. Those are implementation decisions. Claude Code handles them.
Claude Code generates the five test files. Before running anything, review the structure. Ask it to list every test function it generated, grouped by file.
Claude Code should list something like this (the function names are illustrative; yours will differ):

tests/test_state_tools.py:
    test_register_learner_valid_input
    test_register_learner_invalid_input
    test_get_progress_valid_input
    test_get_progress_persists_state

tests/test_content_tools.py:
    test_get_content_valid_input
    test_get_content_invalid_input
Compare these names against your matrix. Every cell in the matrix should have at least one matching test function. If a cell is missing, steer Claude Code:
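A steering message can be as short as naming the missing cell. For example (the tool and test names here are hypothetical):

```text
The matrix calls for a state persistence test on every tool that writes
JSON files, but I don't see one for register_learner in
test_state_tools.py. Add test_register_learner_persists_state.
```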
This is the same describe-steer-verify cycle from building the tools. Now you are applying it to tests.
Run the suite with uv run pytest -v. The -v flag shows each test name and its result. You will see a mix of passes and failures.
Expected output looks something like:
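Illustrative only; your test names, timings, and failure mix will differ:

```text
$ uv run pytest -v
========================= test session starts =========================
tests/test_state_tools.py::test_register_learner_valid_input PASSED
tests/test_state_tools.py::test_register_learner_invalid_input PASSED
tests/test_state_tools.py::test_register_learner_persists_state FAILED
tests/test_content_tools.py::test_get_content_valid_input PASSED
...
==================== 3 failed, 21 passed in 1.24s =====================
```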
Some tests pass. Some fail. That is the point. A test suite that passes 100% on the first run is either trivial or lying. The failures tell you where the problems are.
Look at each failure message. Note which tool, which category, and what went wrong. Do not ask Claude Code to fix anything yet. That is Module 9.3, Chapter 12.
For now, record the failures: for each one, note the tool name, the test category, and the exact mismatch between expected and actual output.
This is your test report. It is a precise map of what works and what does not. No guessing, no "it seemed fine from WhatsApp." The suite proves it.
James ran uv run pytest -v and watched the output scroll. Green dots. Red dots. A final summary line.
"Three failures." He frowned.
"Good." Emma leaned forward and read the screen. "Three failures out of twenty-four tests. You know exactly where the problems are. That is the point of a test suite."
"But they were working from WhatsApp."
"Working from WhatsApp means the happy path works when you test it by hand, once, with the exact input you happened to type. Working from a test suite means every path works, every time, with every kind of input." She pointed at the three red lines. "Those failures are a gift. You know the tool name, the test category, and the exact mismatch between expected and actual."
James stared at the failures. "I thought I would have to debug all of this myself."
"You describe the failures to Claude Code. It fixes them. Same workflow as building the tools." Emma paused, then added more quietly: "I will be honest. I never know if a test suite is thorough enough. Three categories per tool is a solid start. Edge cases always surprise you later. But you have a foundation now, and that matters more than perfection."
She tapped the screen. "Module 9.3, Chapter 12. Fix the failures. Add the edge cases that this run revealed. Get to green."
Review the test matrix and ask Claude Code whether any categories are missing:
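One way to phrase the review (the wording is illustrative):

```text
Here is my test matrix for the 9 TutorClaw tools. Every tool is tested
for valid input and invalid input; 3 tools for tier gating; 3 tools for
state persistence. What categories of test am I missing?
```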
What you are learning: A test matrix is a living document. The first version covers the obvious categories. Reviewing it with AI reveals gaps you had not considered, like input boundary conditions and concurrent access patterns.
Ask Claude Code to explain what would happen without test isolation:
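For example (illustrative wording, grounded in the fact that the tools write to JSON files):

```text
My test suite uses temporary directories so tests never write to the
real JSON files. Explain what would go wrong if the tests wrote to the
production files directly: what happens to my data, and what happens
when two tests run against the same file?
```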
What you are learning: Test isolation is not a best practice you memorize. It is a design decision with concrete consequences. Understanding the failure mode (corrupted production data, tests that pass locally but fail on a clean machine, tests that interfere with each other) makes the decision obvious rather than arbitrary.
Ask Claude Code to compare your automated suite to the manual WhatsApp testing you did in Module 9.3, Chapter 8:
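For example (illustrative wording):

```text
In Module 9.3, Chapter 8 I tested all 9 tools by hand from WhatsApp.
Now I have an automated pytest suite. Compare the two: what does each
kind of testing catch that the other misses, and when should I still
test manually?
```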
What you are learning: Automated tests and manual integration tests serve different purposes. Neither replaces the other. Knowing when to use each saves you from both false confidence (all tests pass, but the real flow is broken) and wasted effort (manually testing what a script can verify in seconds).