Lesson 7 taught you how to build the scenario set and score the outputs. This lesson teaches what to do with the results: how to read the failure patterns, how to rewrite the SKILL.md without breaking what already works, how to enter shadow mode when the threshold is met, and how to manage the graduated transition from human-reviewed operation to autonomous deployment.
The Validation Loop is the process that takes you from a first draft (which encodes the extraction material faithfully but has not been tested) to a production-ready SKILL.md that produces reliable outputs across the full range of queries it will encounter. The loop is iterative: test, interpret, rewrite, re-test, and repeat until the threshold is reached. Most first-draft SKILL.md files require two to three iterations before achieving the ninety-five percent pass rate.
The skill this lesson develops is diagnostic. You are not just fixing individual failing scenarios. You are reading the pattern of failures to identify the systematic gap in the SKILL.md, fixing that gap, and confirming the fix does not introduce new problems. That diagnostic skill (tracing a failure to its root cause in the SKILL.md) is what makes the Validation Loop efficient rather than a cycle of trial and error.
The value of the scenario testing is not the overall score. It is the failure pattern. Most first-draft SKILL.md files fail in clusters: the same category of error appears across multiple scenarios, indicating a gap or ambiguity in a specific section of the SKILL.md.
Failures concentrated in standard cases indicate a structural problem with the core Persona or Questions sections. The agent does not know what it is for clearly enough to perform its primary function reliably. If the credit analyst agent produces generic summaries rather than data-grounded analysis for multiple standard cases, the Persona likely lacks specificity about analytical standards, or the Questions section does not define the core function precisely enough.
Failures concentrated in edge cases indicate a gap in the Out of Scope definition or an ambiguity in the boundary between in-scope and out-of-scope queries. If the agent attempts to answer lending decisions or market outlook queries rather than redirecting them, the Out of Scope section of the Questions is not clear enough, or the Persona does not establish the professional boundary firmly enough to govern behaviour at the edge.
Failures concentrated in adversarial cases indicate a gap in the Principles section. A category of input exists that the agent encounters but has no explicit instruction for handling. If the agent accepts unverified user-provided figures without checking them against the attached data, the Principles section lacks a source verification instruction. If the agent relaxes its professional boundary when the request is framed informally, the Persona's identity constraint is not robust enough to hold under conversational pressure.
Failures concentrated in high-stakes cases indicate a problem with the escalation logic. Either the escalation conditions are not specific enough to trigger reliably, or the routing mechanism has not been configured correctly. If the agent produces board presentation materials without flagging them for review, the escalation condition for board-facing outputs is missing or too vague to match the scenario.
| Failure Cluster | SKILL.md Section | Root Cause | Fix Approach |
| --- | --- | --- | --- |
| Standard cases | Persona / Questions | Agent unclear on core function | Sharpen professional identity and capability definition |
| Edge cases | Questions (Out of Scope) | Boundaries not precise enough | Add specific boundary conditions and redirect instructions |
| Adversarial cases | Principles | Missing instructions for input category | Add specific Principles for the uncovered situation |
| High-stakes cases | Principles (escalation) | Escalation conditions too vague | Make triggers specific and testable |
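The cluster-to-section mapping can be expressed as a simple lookup. The sketch below is illustrative only, not part of any Cowork tooling; the category labels and result fields are assumptions chosen to mirror the table.

```python
# Hypothetical sketch: map a failure cluster to the SKILL.md section to revise.
# Category labels and field names are illustrative, not a Cowork API.
from collections import Counter

DIAGNOSIS = {
    "standard": ("Persona / Questions",
                 "Sharpen professional identity and capability definition"),
    "edge": ("Questions (Out of Scope)",
             "Add specific boundary conditions and redirect instructions"),
    "adversarial": ("Principles",
                    "Add specific Principles for the uncovered situation"),
    "high_stakes": ("Principles (escalation)",
                    "Make escalation triggers specific and testable"),
}

def diagnose(failed_scenarios):
    """Group failures by category and report the dominant cluster first."""
    clusters = Counter(s["category"] for s in failed_scenarios)
    return [
        {"category": cat, "count": n,
         "section": DIAGNOSIS[cat][0], "fix": DIAGNOSIS[cat][1]}
        for cat, n in clusters.most_common()
    ]
```

Sorting by cluster size keeps the focus on the systematic gap rather than on whichever scenario happened to fail first.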
Treat each failure cluster as a rewriting task in the relevant SKILL.md section. The approach is targeted, not global: rewrite the two weakest instructions in the affected section, re-run the scenario set, and confirm that the rewrite resolves the failures without introducing new ones elsewhere.
The targeted approach matters because of regression risk. The most common cause of regression after a targeted rewrite is over-specification: adding an instruction that handles the failed scenario perfectly but conflicts with an instruction elsewhere in the SKILL.md. A new Principle that says "always verify user-provided figures against attached data" resolves the adversarial scenario where the user provides an incorrect DSCR. But if the SKILL.md also has a Principle that says "when the user provides contextual information not in the attached data, incorporate it into the analysis" (a legitimate instruction for situations where the user has information the attached data does not contain), the two Principles conflict.
The prevention protocol is straightforward: read the full section after every targeted rewrite before re-running the scenario set. Check whether the new instruction conflicts with or contradicts any existing instruction. If it does, resolve the conflict explicitly: typically by adding a condition that distinguishes the two situations. "When the user provides a figure that can be verified against attached data, verify it and flag any discrepancy. When the user provides contextual information that is not in the attached data, incorporate it with a note that it has not been independently verified."
The rewrite-and-retest cycle continues until the scenario set reaches the ninety-five percent threshold. Most first-draft SKILL.md files reach this threshold in two to three iterations. If the threshold is not reached after five iterations, the extraction material may be insufficient: return to the interview or document extraction to fill the gap before continuing the Validation Loop.
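The cycle can be expressed as a loop with the two stopping conditions just described: the ninety-five percent threshold and the five-iteration cap. In this sketch, `run_scenarios` and `targeted_rewrite` are hypothetical stand-ins for the manual testing and editing steps; only the control flow is the point.

```python
# Illustrative loop structure only. `run_scenarios` and `targeted_rewrite`
# stand in for the manual testing and editing work described in the text.
PASS_THRESHOLD = 0.95
MAX_ITERATIONS = 5

def validation_loop(skill_md, scenarios, run_scenarios, targeted_rewrite):
    """Test, rewrite, and re-test until the threshold or iteration cap is hit."""
    for iteration in range(1, MAX_ITERATIONS + 1):
        results = run_scenarios(skill_md, scenarios)
        pass_rate = sum(r["passed"] for r in results) / len(results)
        if pass_rate >= PASS_THRESHOLD:
            return skill_md, pass_rate, iteration
        failures = [r for r in results if not r["passed"]]
        skill_md = targeted_rewrite(skill_md, failures)
    # Threshold not reached after five iterations: the extraction material
    # may be insufficient, so return to interview or document extraction.
    raise RuntimeError("Return to extraction before continuing the loop.")
```

Note that every iteration re-runs the full scenario set, not just the previously failed scenarios, which is what catches regressions introduced by a targeted rewrite.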
When the scenario testing reaches the ninety-five percent pass rate, and only then, the agent is ready for shadow mode deployment. Shadow mode runs the agent in production context with human review of every output before it is acted upon.
Shadow mode serves a different purpose from scenario testing. Where scenario testing validates the SKILL.md against constructed inputs, shadow mode validates the agent against real production inputs that the scenario set could not fully anticipate. The distinction matters because production context is more varied, more ambiguous, and more combinatorially complex than any constructed scenario set, no matter how well designed.
Shadow mode continues for a minimum of thirty days. During that period, every output is reviewed and scored by the domain expert using the same three-component rubric used in scenario testing: accuracy, calibration, and boundary compliance. The additional data from real production inputs typically surfaces two to three SKILL.md gaps that the scenario set did not reach: situations that arise naturally in production context but that even a well-designed adversarial scenario set will not reliably generate.
The thirty-day minimum is not negotiable. It exists because production patterns are not uniform across shorter periods. Weekly cycles, monthly reporting cycles, and quarterly events produce different types of queries. A shadow mode period shorter than thirty days may miss an entire category of production input.
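A minimal sketch of the shadow mode bookkeeping, assuming the three-component rubric and the thresholds stated above; the field names and function signature are illustrative assumptions, not a prescribed tool.

```python
# Hypothetical shadow-mode gate: minimum duration plus production accuracy.
# Field names (accuracy, calibration, boundary_compliance) mirror the
# three-component rubric from scenario testing; they are illustrative.
from datetime import date

MIN_SHADOW_DAYS = 30
ACCURACY_THRESHOLD = 0.95

def shadow_mode_ready(reviews, start, today):
    """Check whether the shadow period and accuracy rate support transition."""
    days_elapsed = (today - start).days
    if days_elapsed < MIN_SHADOW_DAYS:
        return False, f"only {days_elapsed} of {MIN_SHADOW_DAYS} days elapsed"
    passed = sum(
        1 for r in reviews
        if r["accuracy"] and r["calibration"] and r["boundary_compliance"]
    )
    rate = passed / len(reviews)
    if rate < ACCURACY_THRESHOLD:
        return False, f"production accuracy {rate:.0%} below threshold"
    return True, f"production accuracy {rate:.0%} over {days_elapsed} days"
```

Both conditions must hold: a high accuracy rate over fewer than thirty days does not open the gate, because the period may have missed an entire category of production input.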
When the shadow mode period is complete and the production accuracy rate is at or above ninety-five percent, the transition to autonomous operation can be considered. The decision requires three sign-offs.
| Sign-Off | Who | What They Confirm |
| --- | --- | --- |
| Governance | Cowork administrator | Governance conditions are met: permissions, audit trail, HITL gates configured |
| Domain | Domain expert | The failure modes in the remaining five percent are acceptable given the review mechanisms for escalated outputs |
| Operational | Deploying team | The agent's integration with production systems is stable and the escalation routing works correctly |
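The three-sign-off gate is an all-or-nothing check, which can be sketched as follows; the role identifiers follow the table above and are otherwise assumptions.

```python
# Illustrative transition gate: all three sign-offs must be recorded before
# autonomous operation is enabled. Role names follow the sign-off table.
REQUIRED_SIGNOFFS = {"governance", "domain", "operational"}

def missing_signoffs(recorded):
    """Return the sign-offs still outstanding; an empty set opens the gate."""
    return REQUIRED_SIGNOFFS - {s.lower() for s in recorded}
```

Returning the missing set, rather than a bare boolean, makes the gate's output directly actionable: it names who still has to confirm.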
The transition to autonomous operation is not a switch that flips once. It is a gradient that expands as the agent's track record accumulates.
Most organisations begin with partial autonomy: autonomous operation for standard cases, human review retained for high-stakes cases. The credit analyst agent might operate autonomously for routine financial summaries and ratio calculations but continue to route board presentation materials, regulatory filing inputs, and credit decisions above a defined threshold for human review.
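Partial autonomy amounts to a routing rule. This is a hypothetical sketch, with case labels assumed from the scenario categories; the autonomous set starts small and is widened as the evidence accumulates.

```python
# Illustrative partial-autonomy routing: standard cases run autonomously,
# everything else goes to human review. Case labels are assumptions.
def route(case_type, autonomous_types=frozenset({"standard"})):
    """Decide whether a case runs autonomously or goes to human review."""
    if case_type in autonomous_types:
        return "autonomous"
    return "human_review"
```

Extending autonomy is then a deliberate configuration change, adding a case type to the autonomous set only once its handling has proved reliable, rather than an implicit drift.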
The extension from partial to broader autonomy is evidence-based. As the agent's performance record during partial autonomy continues to support extension (the accuracy rate holds, the escalation triggers work correctly, and the production gaps identified during shadow mode have been addressed), the scope of autonomous operation is expanded. Standard cases first, then edge cases as the boundary handling proves reliable, then selected adversarial-case types as the Principles prove robust.
High-stakes cases are often the last to transition to autonomous operation, and in many domains (financial services, healthcare, legal) they remain under human review indefinitely. This is not a limitation of the technology. It is the correct governance response to situations where the consequences of failure exceed what any error rate, however low, can justify.
The graduated model reflects a fundamental principle: trust is earned through demonstrated performance, not assumed from a successful validation exercise. A ninety-five percent pass rate on a scenario set and a successful thirty-day shadow mode period produce evidence that justifies partial autonomy. Sustained performance in partial autonomy produces evidence that justifies extending it. At no point does the agent earn blanket trust: it earns specific trust for specific types of queries, and that trust is always conditioned on continued performance.
This lesson and the seven that preceded it form a complete methodology. The sequence from problem identification to production deployment is:
| Step | Lesson | What It Produces |
| --- | --- | --- |
| Identify the knowledge problem | L01 | Understanding of why extraction is necessary |
| Extract from expert heads | L02, L03 | Interview notes, north star summary |
| Extract from documents | L04 | Candidate instructions, contradiction map, gap list |
| Choose and combine methods | L05 | Extraction plan with reconciliation decisions |
| Write the SKILL.md | L06 | First-draft SKILL.md (Persona, Questions, Principles) |
| Build validation scenarios | L07 | Twenty-scenario set across four categories |
| Validate and deploy | L08 | Production-ready SKILL.md, shadow mode, graduated autonomy |
The methodology is designed to be followed in sequence for a first SKILL.md and revisited selectively for revisions. When a production agent encounters a new failure mode, the fix path traces back through the methodology: is the failure a missing Principle (return to L06), an extraction gap (return to L02-L04), or a validation coverage issue (return to L07)? The methodology is not a one-time process: it is the maintenance framework for the life of the deployed agent.
Use these prompts in Anthropic Cowork or your preferred AI assistant to practise the Validation Loop skills.
What you're learning: Failure pattern diagnosis is the core skill of the Validation Loop. Most failures are not random: they cluster in ways that point to specific SKILL.md sections. Practising the trace from failure to root cause to targeted rewrite builds the diagnostic efficiency that separates a productive validation cycle from trial-and-error editing.
What you're learning: Shadow mode is not passive observation: it is a structured validation protocol with specific outputs. Designing the protocol before entering shadow mode ensures that the thirty-day period produces the evidence needed for the autonomy transition decision. The query-type analysis also builds your understanding of what production context adds beyond scenario testing.
What you're learning: The transition from shadow to autonomous operation is a governance decision, not a technical one. Designing the graduated autonomy plan requires thinking about risk tolerance, monitoring requirements, and rollback protocols: the same considerations that compliance and risk functions will evaluate. Framing the plan as a governance document builds the communication skill needed to gain organisational approval for autonomous agent deployment.
The Validation Loop converts a first-draft SKILL.md into a production-ready one through iterative testing: interpret failure patterns to diagnose which SKILL.md section needs revision, execute targeted rewrites without regression, re-run the full scenario set, and repeat until the 95% threshold is met. Shadow mode then validates against real production inputs, and graduated autonomy manages the transition to independent operation.