The scaling problem
A human reviewer can meaningfully evaluate maybe 50-100 agent decisions per day. At that volume, human-in-the-loop works. But production agents run at 500, 5,000, or 50,000 decisions per day. Review becomes cursory. The reviewer trusts the agent, stops checking, and the "human-in-the-loop" becomes a human-next-to-the-loop who isn't looking.
This isn't a discipline failure. It's human nature.
Human-on-the-loop: the model that scales
In this model, the human doesn't review every decision — they monitor aggregate performance, investigate anomalies, and intervene on escalations. The agent operates autonomously within defined boundaries.
Tier 1: Full autonomy (80-90% of decisions)
The agent handles routine, low-risk, high-confidence decisions end-to-end. Examples: Classifying a P3 support ticket. Categorizing a $47 expense. Enriching a contact record. Requirement: accuracy must exceed 95%.
Tier 2: Notification + auto-execute (8-15% of decisions)
The agent acts but notifies a human, who can reverse the action within a defined window (e.g., 24 hours). Examples: Routing a lead to a specific rep. Approving a $2,000 invoice.
Tier 3: Human approval required (2-5% of decisions)
The agent recommends but does not execute; a human must approve within a defined response SLA. Examples: Approving an invoice over $10,000. Flagging a compliance violation.
Tier 4: Human-only (< 1% of decisions)
The agent routes the case to a human without making a recommendation. Novel scenarios, high-stakes decisions, or edge cases outside training data.
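The four tiers can be enforced mechanically at decision time. A minimal routing sketch in Python — the field names and threshold values here are illustrative assumptions, not a standard:

```python
from dataclasses import dataclass
from enum import Enum

class Tier(Enum):
    FULL_AUTONOMY = 1       # agent executes end-to-end
    NOTIFY_AND_EXECUTE = 2  # agent executes, human can reverse
    HUMAN_APPROVAL = 3      # agent recommends, human approves
    HUMAN_ONLY = 4          # agent routes to human, no recommendation

@dataclass
class Decision:
    confidence: float  # model confidence, 0.0-1.0
    amount: float      # dollar value, 0 if not monetary
    is_novel: bool     # outside known scenario space

def route(d: Decision) -> Tier:
    """Route a decision to an autonomy tier. Thresholds are illustrative."""
    if d.is_novel:
        return Tier.HUMAN_ONLY
    if d.amount > 10_000:
        return Tier.HUMAN_APPROVAL
    if d.amount >= 2_000 or d.confidence < 0.95:
        return Tier.NOTIFY_AND_EXECUTE
    return Tier.FULL_AUTONOMY
```

The point of encoding the policy as a single function is auditability: every production decision passes through one place where the tier boundaries are explicit and testable.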
The governance mechanics
This model only works with three things in place:
Tier boundaries defined in writing. Every production agent should have a document specifying which decisions fall into which tier, what confidence thresholds trigger escalation, and what dollar/risk thresholds require human approval.
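That written document works best when it is also machine-readable, so the runtime enforces exactly the limits humans signed off on. A hypothetical spec sketch — every key and value below is an illustrative assumption:

```python
# Hypothetical tier-boundary spec; all names and thresholds are illustrative.
TIER_POLICY = {
    "tier_1": {"max_amount": 2_000, "min_confidence": 0.95},
    "tier_2": {"max_amount": 10_000, "min_confidence": 0.85,
               "reversal_window_hours": 24},
    "tier_3": {"approval_sla_hours": 4},
    "tier_4": {"route_to": "human_queue"},
}
```

Keeping the policy in version control gives you a change history for every boundary adjustment, which is the audit trail most governance reviews ask for.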
Performance monitored continuously. Track accuracy, escalation rate, reversal rate, and SLA compliance per tier. If Tier 1 accuracy drops below 95%, automatically shrink the tier: raise the confidence threshold so fewer decisions qualify for full autonomy.
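Continuous monitoring can be as simple as recording each decision's outcome per tier and checking the Tier 1 accuracy floor. A sketch under assumed names (nothing here is a real library API):

```python
from collections import defaultdict

class TierMonitor:
    """Track per-tier outcomes; flag when Tier 1 accuracy breaches its floor.
    The 95% floor and all names are illustrative assumptions."""

    def __init__(self, min_tier1_accuracy: float = 0.95):
        self.min_tier1_accuracy = min_tier1_accuracy
        self.outcomes = defaultdict(list)   # tier -> list of bools (was the decision correct?)
        self.reversals = defaultdict(int)   # tier -> count of human reversals

    def record(self, tier: int, correct: bool, reversed_by_human: bool = False):
        self.outcomes[tier].append(correct)
        if reversed_by_human:
            self.reversals[tier] += 1

    def accuracy(self, tier: int):
        results = self.outcomes[tier]
        return sum(results) / len(results) if results else None

    def tier1_breached(self) -> bool:
        acc = self.accuracy(1)
        return acc is not None and acc < self.min_tier1_accuracy
```

In practice `tier1_breached()` would be polled on a schedule and wired to the boundary-adjustment step, so a breach shrinks Tier 1 without waiting for a human to notice the dashboard.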
Tier boundaries evolve. As accuracy improves, decisions graduate from Tier 2 to Tier 1. As new edge cases emerge, decisions move up. The system is dynamic, not static.
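Graduation and demotion can both be expressed as one small adjustment rule: loosen the confidence threshold when measured accuracy beats the target (more decisions graduate into full autonomy), tighten it when accuracy falls short. A purely illustrative policy — the step size and bounds are assumptions:

```python
def adjust_confidence_threshold(current: float, accuracy: float,
                                target: float = 0.95, step: float = 0.01,
                                lo: float = 0.80, hi: float = 0.99) -> float:
    """Nudge the Tier 1 confidence threshold after each review period.
    Illustrative policy: lower the bar when accuracy beats target,
    raise it when accuracy falls short, clamped to [lo, hi]."""
    if accuracy >= target:
        return max(lo, current - step)  # grow Tier 1: more decisions qualify
    return min(hi, current + step)      # shrink Tier 1: fewer decisions qualify
```

A small fixed step keeps the system dynamic without oscillating wildly; a breach moves the boundary one notch, not all the way.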
What to ask your vendor
If your vendor says "human-in-the-loop," ask three questions: What percentage of decisions require human approval vs. run autonomously? What happens when the reviewer doesn't respond within the SLA? How do the autonomy tiers change over time?
The goal isn't to remove humans from the loop. It's to put them in the right part of the loop.