Emma walked in to find James scrolling through his WhatsApp conversation with TutorClaw. Nine tools, all working. He had tested each one by hand over the past three sessions: register a learner, fetch content, generate guidance, assess a response, submit code, get an upgrade URL.
"Nine tools. All working from WhatsApp. How do you know they will still work tomorrow?"
James looked up. "I test them."
"Every tool, every time? By hand?"
"That is why I need a test suite." James paused. "But I do not know where to start. Nine tools, each with different behaviors. Some are gated by tier. Some write to JSON files. That is a lot of combinations."
Emma pulled a chair over. "Start with a matrix. List every tool down the left side. List every kind of test across the top. Fill in the cells. That is your test design. Then describe it to Claude Code and let it write the actual tests."
"I design what to test. Claude Code writes how to test it."
"Exactly. The matrix is the hard part. The pytest code is the easy part."
You are doing exactly what James is doing. You will design a test matrix for all 9 TutorClaw tools, describe your test requirements to Claude Code, and let it generate a complete pytest suite. Some tests will fail. That is normal. Module 9.3, Chapter 12 is dedicated to fixing failures and adding edge cases.
Before sending anything to Claude Code, design the matrix on paper (or in a text file). Every tool gets tested across four categories. Not every category applies to every tool.
The four test categories:

- Valid input: the tool returns the expected result when called correctly.
- Invalid input: the tool rejects missing or malformed input with a clear error instead of crashing.
- Tier gating: tools restricted by subscription tier refuse learners below the required tier.
- State persistence: tools that write to JSON files actually save their data and can read it back.
Your test matrix: one row per tool, one column per category, and a check in each cell where that category applies to that tool.
Count the cells: 9 tools with valid input (9 tests) + 9 tools with invalid input (9 tests) + 3 tools with tier gating (3 tests) + 3 tools with state persistence (3 tests) = at least 24 tests across the suite.
This matrix is the design artifact. It tells you exactly what coverage you need before a single line of test code exists.
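The matrix can be captured as plain data before any test code exists. The sketch below is one way to do that in Python; the tool names and which tools fall into which category are assumptions inferred from the WhatsApp walkthrough, and only the shape (9 tools, 4 categories, at least 24 cells) comes from the chapter:

```python
# Hypothetical sketch of the test matrix as data. Tool names and the
# category assignments are illustrative; the counts (9 + 9 + 3 + 3 = 24)
# match the chapter's minimum.
MATRIX = {
    "register_learner": {"valid", "invalid", "persists"},
    "get_progress":     {"valid", "invalid", "persists"},
    "update_progress":  {"valid", "invalid", "persists"},
    "get_content":      {"valid", "invalid"},
    "search_content":   {"valid", "invalid"},
    "get_guidance":     {"valid", "invalid", "tier_gated"},
    "assess_response":  {"valid", "invalid", "tier_gated"},
    "submit_code":      {"valid", "invalid", "tier_gated"},
    "get_upgrade_url":  {"valid", "invalid"},
}

# Each checked cell becomes at least one test function.
total_cells = sum(len(categories) for categories in MATRIX.values())
assert total_cells == 24
```

Writing the matrix down as data has a side benefit: the cell count is computed, not hand-tallied, so the "at least 24 tests" claim stays honest as tools are added.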
Tests are easier to maintain when they are grouped by tool category, not crammed into one giant file. The test files match the tool groups from Module 9.3, Chapters 3 through 6:
Five test files. Each covers a logical group. When a state tool breaks, you look in test_state_tools.py, not in a 500-line monolith.
Open Claude Code in your tutorclaw-mcp project. Send this prompt:
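The exact wording is yours to write. A prompt along these lines would work; the tier-gating phrasing and the unnamed file groups are assumptions, since only two of the five file names appear in this chapter:

```text
Generate a pytest suite for the 9 TutorClaw MCP tools in this project.

Organize the tests into five files matching the tool groups, including
test_state_tools.py and test_content_tools.py.

For every tool, write one test with valid input and one with invalid
input. For the 3 tier-gated tools, add a test that a learner below the
required tier is refused. For the 3 tools that write JSON files, add a
test that state persists, using a temporary directory so tests never
touch the real data files.

That is at least 24 tests. Use fixtures for shared setup.
```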
Notice what this prompt contains: the tool count, the four test categories, the file organization, and the minimum test count. And notice what it leaves out.
It does not contain any Python. No def test_ functions. No assert statements. No fixture code. Those are implementation decisions. Claude Code handles them.
Claude Code generates the five test files. Before running anything, review the structure. Ask it to list every test function it generated, grouped by file.
Claude Code should list something like this (the function names are illustrative; yours will differ):

tests/test_state_tools.py:
    test_register_learner_valid_input
    test_register_learner_invalid_input
    test_get_progress_valid_input
    test_get_progress_persists_state

tests/test_content_tools.py:
    test_get_content_valid_input
    test_get_content_invalid_input
Compare these names against your matrix. Every cell in the matrix should have at least one matching test function. If a cell is missing, steer Claude Code:
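A steering message can be as short as naming the missing cell. For example (the tool and test names here are hypothetical):

```text
The matrix calls for a state persistence test on every tool that writes
JSON files, but I don't see one for register_learner in
test_state_tools.py. Add test_register_learner_persists_state.
```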
This is the same describe-steer-verify cycle from building the tools. Now you are applying it to tests.
Run the suite with uv run pytest -v. The -v flag shows each test name and its result. You will see a mix of passes and failures.
Expected output looks something like:
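Illustrative only; your test names, timings, and failure mix will differ:

```text
$ uv run pytest -v
========================= test session starts =========================
tests/test_state_tools.py::test_register_learner_valid_input PASSED
tests/test_state_tools.py::test_register_learner_invalid_input PASSED
tests/test_state_tools.py::test_register_learner_persists_state FAILED
tests/test_content_tools.py::test_get_content_valid_input PASSED
...
==================== 3 failed, 21 passed in 1.24s =====================
```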
Some tests pass. Some fail. That is the point. A test suite that passes 100% on the first run is either trivial or lying. The failures tell you where the problems are.
Look at each failure message. Note which tool, which category, and what went wrong. Do not ask Claude Code to fix anything yet. That is Module 9.3, Chapter 12.
For now, record the failures: for each one, note the tool name, the test category, and the exact mismatch between expected and actual output.
This is your test report. It is a precise map of what works and what does not. No guessing, no "it seemed fine from WhatsApp." The suite proves it.
James ran uv run pytest -v and watched the output scroll. Green dots. Red dots. A final summary line.
"Three failures." He frowned.
"Good." Emma leaned forward and read the screen. "Three failures out of twenty-four tests. You know exactly where the problems are. That is the point of a test suite."
"But they were working from WhatsApp."
"Working from WhatsApp means the happy path works when you test it by hand, once, with the exact input you happened to type. Working from a test suite means every path works, every time, with every kind of input." She pointed at the three red lines. "Those failures are a gift. You know the tool name, the test category, and the exact mismatch between expected and actual."
James stared at the failures. "I thought I would have to debug all of this myself."
"You describe the failures to Claude Code. It fixes them. Same workflow as building the tools." Emma paused, then added more quietly: "I will be honest. I never know if a test suite is thorough enough. Three categories per tool is a solid start. Edge cases always surprise you later. But you have a foundation now, and that matters more than perfection."
She tapped the screen. "Module 9.3, Chapter 12. Fix the failures. Add the edge cases that this run revealed. Get to green."
Review the test matrix and ask Claude Code whether any categories are missing:
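One way to phrase the review (the wording is illustrative):

```text
Here is my test matrix for the 9 TutorClaw tools. Every tool is tested
for valid input and invalid input; 3 tools for tier gating; 3 tools for
state persistence. What categories of test am I missing?
```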
What you are learning: A test matrix is a living document. The first version covers the obvious categories. Reviewing it with AI reveals gaps you had not considered, like input boundary conditions and concurrent access patterns.
Ask Claude Code to explain what would happen without test isolation:
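For example (illustrative wording, grounded in the fact that the tools write to JSON files):

```text
My test suite uses temporary directories so tests never write to the
real JSON files. Explain what would go wrong if the tests wrote to the
production files directly: what happens to my data, and what happens
when two tests run against the same file?
```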
What you are learning: Test isolation is not a best practice you memorize. It is a design decision with concrete consequences. Understanding the failure mode (corrupted production data, tests that pass locally but fail on a clean machine, tests that interfere with each other) makes the decision obvious rather than arbitrary.
Ask Claude Code to compare your automated suite to the manual WhatsApp testing you did in Module 9.3, Chapter 8:
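For example (illustrative wording):

```text
In Module 9.3, Chapter 8 I tested all 9 tools by hand from WhatsApp.
Now I have an automated pytest suite. Compare the two: what does each
kind of testing catch that the other misses, and when should I still
test manually?
```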
What you are learning: Automated tests and manual integration tests serve different purposes. Neither replaces the other. Knowing when to use each saves you from both false confidence (all tests pass, but the real flow is broken) and wasted effort (manually testing what a script can verify in seconds).