James counted on his fingers. "Seven tools done. Two to go. Code execution and the upgrade link."
Emma looked up from her terminal. "The code tool is the dangerous one. A learner sends Python code and your server runs it. Think about that for a second."
"So we need a sandbox."
"Eventually. Right now we need a mock that is good enough to test the tutoring flow. A subprocess with a timeout. Five seconds. If the code finishes, return the output. If it hangs, kill it. Basic safety: block imports of os and subprocess so nobody deletes your files from inside a tutoring session."
"And the upgrade URL?"
"Placeholder. A string that looks like a Stripe checkout link. Real Stripe comes after tests." She turned back to her screen. "Two tools, both mocked. One hour."
You are doing exactly what James is doing. Two more tools to build, both using mock implementations that are good enough to test the full product flow. Production-grade sandboxing and real Stripe integration come later. Right now, the goal is completing the tool surface.
This tool lets a learner submit Python code and get the output back. In production, you would run that code in a Docker container or a browser-based sandbox like Pyodide. For now, a subprocess with a timeout is sufficient to test the tutoring loop.
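To make the shape of this mock concrete, here is a minimal sketch of what such a subprocess-with-timeout tool might look like. The function name run_submitted_code, the BLOCKED_IMPORTS list, and the return shape are all assumptions for illustration, not what Claude Code will necessarily produce, and the plain string check for imports is deliberately naive — this is a mock, not a sandbox:

```python
import subprocess
import sys

# MOCK: subprocess-based execution, not production-safe.
# Replace with a real sandbox (Docker, Pyodide) later.
BLOCKED_IMPORTS = ("os", "subprocess", "shutil")  # assumed block list

def run_submitted_code(code: str, timeout: float = 5.0) -> dict:
    """Run learner code in a fresh Python subprocess, killed after `timeout` seconds."""
    # Naive string check: reject code that imports a blocked module.
    for name in BLOCKED_IMPORTS:
        if f"import {name}" in code:
            return {"ok": False, "error": f"Blocked import: {name}"}
    try:
        result = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout,  # subprocess.run kills the child on timeout
        )
    except subprocess.TimeoutExpired:
        return {"ok": False, "error": f"Timed out after {timeout} seconds"}
    return {
        "ok": result.returncode == 0,
        "stdout": result.stdout,
        "stderr": result.stderr,
    }
```

The point of the sketch is the shape: reject before running, run in isolation, bound the runtime, return structured output either way.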
Open Claude Code in your tutorclaw-mcp project and describe the tool:
Notice what you told Claude Code and what you left out. You described the behavior (run code, capture output), the constraints (timeout, blocked imports), and the scope (mock, not production). You did not describe how to implement subprocess calls or how to parse import statements. Those are implementation decisions for Claude Code.
When Claude Code returns the spec, check three things:
If the description is vague, steer it:
Once the spec looks right, approve the build:
After Claude Code finishes building, test the tool with three cases:
Case 1: Valid code that finishes quickly
Call submit_code with print("hello"). You should get back stdout containing hello.
Case 2: Code that runs forever
Call submit_code with while True: pass. The tool should return a timeout error after 5 seconds, not hang indefinitely.
Case 3: Blocked import
Call submit_code with import os; os.listdir("."). The tool should reject the code before running it, returning an error that names the blocked import.
All three cases passing means the mock sandbox works for testing purposes. It is not production-safe (a determined user could still cause problems), but it is sufficient to test the tutoring flow where a learner submits exercises.
This tool creates a checkout link for learners who want to upgrade from free to paid. Real Stripe integration comes in Module 9.3, Chapter 14. For now, a placeholder URL is enough to prove the upgrade flow works.
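A sketch of what the placeholder might look like. The STATE_FILE path, the function name, and the mock URL format are assumptions for illustration; the only load-bearing ideas are reading the learner's tier from the JSON state file and returning a Stripe-shaped string:

```python
import json

STATE_FILE = "learner_state.json"  # assumed path for the JSON state file

def get_upgrade_url(learner_id: str) -> dict:
    """MOCK: return a Stripe-shaped placeholder URL. Real Stripe comes later."""
    with open(STATE_FILE) as f:
        learners = json.load(f)
    learner = learners.get(learner_id)
    if learner is None:
        return {"ok": False, "error": f"Unknown learner: {learner_id}"}
    if learner.get("tier") == "paid":
        return {"ok": False, "error": "Learner is already on the paid tier."}
    # Placeholder that looks like a Stripe checkout link.
    return {"ok": True, "url": f"https://checkout.stripe.com/mock/{learner_id}"}
```

When real Stripe integration lands, only the final return changes; the tier gate stays.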
Describe the tool to Claude Code:
Two test cases:
Case 1: Free-tier learner requests upgrade
Call get_upgrade_url with a learner ID that has a free tier. You should get back the mock URL.
Case 2: Paid-tier learner requests upgrade
Call get_upgrade_url with a learner ID that has a paid tier (you may need to manually edit the JSON state file to set a learner's tier to "paid" for this test). You should get an error message indicating the learner is already upgraded.
Both cases passing means the upgrade flow will work when you wire all nine tools together in Module 9.3, Chapter 7.
Every tool was built the same way: describe what you need, steer the spec, let Claude Code implement, verify the result. The tools themselves are different (state management, content delivery, teaching methodology, code execution, payments), but the workflow never changed.
Two of those tools are mocks. That is intentional. The submit_code tool runs real code in a subprocess, which is enough to test whether the tutoring loop works. The get_upgrade_url tool returns a placeholder, which is enough to test whether the tier-gating flow works. Neither mock blocks progress on the product. Both will be replaced when the product needs production infrastructure.
When Mocks Become Debt
A mock becomes technical debt when you forget it exists. Write a comment in both tools: "MOCK: Replace with real implementation in Module 9.3, Chapter 14 (Stripe) and production sandbox." The comment is a reminder, not an apology.
The current submit_code mock blocks os, subprocess, shutil, and open(). There are other dangerous operations a learner could attempt. Describe an additional constraint to Claude Code:
Run the tests after the change. Does the new constraint work without breaking existing tests?
What you are learning: Safety constraints are additive. Each one narrows what untrusted code can do. The skill is knowing which constraints matter for your threat model and which ones are overkill for a mock.
Ask Claude Code to evaluate where your mock falls short:
What you are learning: Mocks have known limitations. The value of a mock is not that it is safe. The value is that it lets you test the product flow while you defer the safety investment. Understanding the gap between mock and production helps you decide when to replace it.
When get_upgrade_url is called for a paid learner, the error message matters because the agent reads it and decides what to tell the learner. Ask Claude Code to improve the error:
What you are learning: Tool error messages are instructions to the agent. A good error message tells the agent what to do next, not just what went wrong. This is the same context engineering principle from tool descriptions, applied to error paths.
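As an illustration of the difference, compare two versions of the paid-tier error. Both strings are hypothetical wording, not the message Claude Code will write:

```python
# A terse error forces the agent to guess what to do next:
BAD_ERROR = "Error: learner already paid."

# An agent-facing error states the situation and the next action:
GOOD_ERROR = (
    "This learner is already on the paid tier, so no checkout link is needed. "
    "Tell the learner they have full access and continue the session. "
    "Do not retry this tool."
)
```

The second version answers the question the agent actually has — "what do I say to the learner now?" — instead of only reporting the failure.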
James scrolled through his terminal. "Nine tools. register_learner, get_learner_state, update_progress, get_chapter_content, get_exercises, generate_guidance, assess_response, submit_code, get_upgrade_url." He counted them off on his fingers. "Four chapters. All working."
"All working in isolation," Emma corrected. "Each tool does its job when you call it directly. But nobody is calling them in sequence yet. A real tutoring session starts with register, moves to content, generates guidance, assesses a response, maybe executes submitted code. That is a chain of tool calls, not nine separate calls."
"So the next step is wiring them together?"
Emma nodded, then paused. "I will say this, though. I have shipped mocks that became permanent. Three years later someone finds a subprocess call with a five-second timeout in production and wonders who thought that was acceptable." She pointed at his screen. "Set a reminder. Write a comment. Something that says: this is a mock, replace it by Module 9.3, Chapter 14. If you do not mark it, you will forget it."
"You sound like you are speaking from experience."
"I am speaking from a post-mortem." She picked up her coffee. "Module 9.3, Chapter 7: wire all nine tools into one server and run a complete tutoring flow. Isolation is over. Integration starts."