Evals, or it's vibes — AI & Agentic POV

Every production AI agent needs an eval harness. Without one, you’re shipping vibes.

This is not a hot take. It’s the most common production failure mode I see across the agentic GTM and retail engagements I run. And it’s almost entirely preventable.

The pattern at scale

It plays out the same way every time:

A team ships an agent. It works in demo. The CEO is excited. Confidence is high.
Two months in, the agent makes a wrong call on something subtle. A scoring agent disqualifies a high-fit lead. A drafting agent produces a hallucinated stat. A routing agent sends a Tier 1 prospect to the SDR queue. The team notices, patches the prompt, ships the patch. Confidence is still high.
Six months in, the agent has been making the same kind of wrong call quietly across thousands of customer interactions. Nobody noticed at the individual level because each instance looked plausible. Nobody noticed at the aggregate level because nobody measured.

This is the difference between an experiment and a system. Experiments fail visibly. Systems fail silently — until the failure mode is structural and the operational debt is enormous.

What a real eval harness looks like

Five components. If your production agent doesn’t have all five, you’re not running a system. You’re running a long-form demo.

1. Held-out cohort. Not the data the agent was prompted on. Real production data the agent has not seen during prompt design or fine-tuning. Refreshed quarterly. The agent’s score on the held-out cohort is the only number that translates to real-world performance.

2. Scheduled weekly run. Not “when we get around to it.” Not “when the engineer remembers.” The eval runs every week, on the same day, automatically. The output goes to a Slack channel or dashboard the team actually reads. If the eval doesn’t run, the agent doesn’t ship to that week’s cohort. Hard gate.

3. Precision, recall, calibration drift — the whole curve. Not just one number. Calibration drift especially: a scoring agent that was 95% confident in January and 95% confident in May with very different precision underneath is the silent killer. Calibration tells you whether the agent’s confidence still tracks reality.

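4. Time-to-first-action. Latency on agent decisions matters at scale. An agent that takes 90 seconds per decision is fine for 100 leads/day: that is 2.5 hours of sequential processing. Catastrophic at 10,000 leads/day: that is 250 hours of processing for every 24-hour day, so the queue never clears without heavy parallelism. Track P50, P95, P99 of decision latency in the same harness.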

5. Failure-mode taxonomy. When the agent gets it wrong, why? Categorized. Trended over time. Common categories I see in scoring/drafting agents: stale-data error, ambiguous-input error, model-overconfidence error, missing-context error, hallucination, conflict-with-CRM-truth. The taxonomy gets sharper over time; the eval surfaces which category is growing.
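
To make components 3 and 4 concrete, here is a minimal sketch of the per-run numbers for a disqualification decision. The `EvalRow` shape, the field names, and the choice of "mean confidence minus observed accuracy" as the calibration gap are my assumptions for illustration, not the only way to measure calibration. Percentiles use the nearest-rank method.

```typescript
// Hypothetical record for one evaluated lead; field names are illustrative.
interface EvalRow {
  predictedDisqualified: boolean; // the agent's call
  actualDisqualified: boolean;    // the human label
  confidence: number;             // agent's stated confidence in its own call, 0..1
  latencyMs: number;              // time from input to decision
}

interface EvalMetrics {
  precision: number;
  recall: number;
  calibrationGap: number; // mean confidence minus observed accuracy
  p50: number;
  p95: number;
  p99: number;
}

// Nearest-rank percentile: smallest sample with at least p% of samples at or below it.
function percentile(values: number[], p: number): number {
  if (values.length === 0) return NaN;
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
  return sorted[rank - 1];
}

function computeMetrics(rows: EvalRow[]): EvalMetrics {
  let tp = 0, fp = 0, fn = 0, correct = 0, confidenceSum = 0;
  for (const r of rows) {
    if (r.predictedDisqualified && r.actualDisqualified) tp++;
    else if (r.predictedDisqualified) fp++;
    else if (r.actualDisqualified) fn++;
    if (r.predictedDisqualified === r.actualDisqualified) correct++;
    confidenceSum += r.confidence;
  }
  const n = rows.length;
  const latencies = rows.map((r) => r.latencyMs);
  return {
    precision: tp + fp === 0 ? NaN : tp / (tp + fp),
    recall: tp + fn === 0 ? NaN : tp / (tp + fn),
    calibrationGap: n === 0 ? NaN : (confidenceSum - correct) / n,
    p50: percentile(latencies, 50),
    p95: percentile(latencies, 95),
    p99: percentile(latencies, 99),
  };
}

// Drift is the change in calibration gap between two runs, e.g. January vs May.
function calibrationDrift(baseline: EvalMetrics, current: EvalMetrics): number {
  return current.calibrationGap - baseline.calibrationGap;
}
```

A gap near zero means stated confidence still tracks accuracy. A gap that grows from one run to the next is the drift to watch, even when the confidence number itself has not moved.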

What this looks like in production

For the GTMify Cortex scoring agent, the eval harness is a Supabase Edge Function on a weekly cron. It pulls the held-out cohort from a labeled training table, runs the production scoring prompt against each lead, compares the agent’s score band + disqualifier reasoning to the human label, computes precision/recall/calibration per band, and posts the report to Slack.
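
The shape of that run can be sketched roughly as follows. This is not the Cortex code: every name here (`HarnessDeps`, `fetchCohort`, `scoreLead`, `postReport`, the band labels) is hypothetical, the dependencies are injected so the sketch makes no claims about Supabase or Slack APIs, and it only compares score bands, leaving out disqualifier reasoning and calibration.

```typescript
// All names are hypothetical; the real harness's tables and helpers are not shown here.
interface LabeledLead {
  id: string;
  humanBand: string; // band assigned by a human labeler
}

interface ScoredLead {
  humanBand: string;
  predictedBand: string;
}

interface BandStats {
  precision: number;
  recall: number;
}

// Injected so the core logic stays testable without a database or a webhook.
interface HarnessDeps {
  fetchCohort: () => Promise<LabeledLead[]>;         // e.g. a query against the labeled table
  scoreLead: (lead: LabeledLead) => Promise<string>; // runs the production scoring prompt
  postReport: (text: string) => Promise<void>;       // e.g. a POST to a chat webhook
}

// One-vs-rest precision and recall for every band seen in labels or predictions.
function summarizeByBand(scored: ScoredLead[]): Record<string, BandStats> {
  const counts: Record<string, { tp: number; fp: number; fn: number }> = {};
  const at = (band: string) => {
    if (!counts[band]) counts[band] = { tp: 0, fp: 0, fn: 0 };
    return counts[band];
  };
  for (const s of scored) {
    if (s.predictedBand === s.humanBand) {
      at(s.predictedBand).tp++;
    } else {
      at(s.predictedBand).fp++; // agent claimed this band wrongly
      at(s.humanBand).fn++;     // agent missed the true band
    }
  }
  const stats: Record<string, BandStats> = {};
  for (const band of Object.keys(counts).sort()) {
    const { tp, fp, fn } = counts[band];
    stats[band] = {
      precision: tp + fp === 0 ? NaN : tp / (tp + fp),
      recall: tp + fn === 0 ? NaN : tp / (tp + fn),
    };
  }
  return stats;
}

function formatReport(stats: Record<string, BandStats>, cohortSize: number): string {
  const lines = Object.keys(stats).sort().map(
    (band) =>
      `${band}: precision ${stats[band].precision.toFixed(2)}, recall ${stats[band].recall.toFixed(2)}`
  );
  return [`Weekly eval (${cohortSize} leads)`, ...lines].join("\n");
}

// The weekly entry point: pull, score, compare, report.
async function runWeeklyEval(deps: HarnessDeps): Promise<string> {
  const cohort = await deps.fetchCohort();
  const scored: ScoredLead[] = [];
  for (const lead of cohort) {
    scored.push({ humanBand: lead.humanBand, predictedBand: await deps.scoreLead(lead) });
  }
  const report = formatReport(summarizeByBand(scored), cohort.length);
  await deps.postReport(report);
  return report;
}
```

Keeping the comparison and formatting pure is a design choice: the same two functions can run in a unit test, a cron-triggered function, or a notebook, and only `HarnessDeps` changes.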

About 40 lines of TypeScript on top of the agent. Roughly four hours of build time the first time. Half an hour to evolve when the agent prompt changes. That’s the entire ongoing operational cost.

For a drafting agent (the kind that auto-drafts POVs from a topic queue), the eval harness is different but the principle is identical. Held-out topics with human-written gold drafts. Weekly run. Side-by-side compare on voice, structure, fact-density, and CODN-framing presence. Same Slack report.

The pattern is the same across agent shapes. Held-out, scheduled, multi-metric, taxonomized failures.
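
The "taxonomized failures" part is the easiest to skip and the cheapest to build. A minimal sketch, assuming each wrong call has already been tagged with one category and the week of the run (the `FailureRecord` shape and function names are hypothetical; the category labels are the ones listed under component 5):

```typescript
// Category labels come from the taxonomy above; the type and record shape are illustrative.
type FailureMode =
  | "stale-data"
  | "ambiguous-input"
  | "model-overconfidence"
  | "missing-context"
  | "hallucination"
  | "conflict-with-CRM-truth";

interface FailureRecord {
  week: string; // identifier of the eval run, e.g. an ISO date
  mode: FailureMode;
}

// Count tagged failures per category for a single weekly run.
function countByMode(records: FailureRecord[], week: string): Map<FailureMode, number> {
  const counts = new Map<FailureMode, number>();
  for (const r of records) {
    if (r.week !== week) continue;
    counts.set(r.mode, (counts.get(r.mode) ?? 0) + 1);
  }
  return counts;
}

// Return the category whose count rose the most between two runs, or null if none rose.
function fastestGrowing(
  records: FailureRecord[],
  previousWeek: string,
  currentWeek: string
): { mode: FailureMode; delta: number } | null {
  const prev = countByMode(records, previousWeek);
  const curr = countByMode(records, currentWeek);
  let best: { mode: FailureMode; delta: number } | null = null;
  for (const [mode, count] of Array.from(curr.entries())) {
    const delta = count - (prev.get(mode) ?? 0);
    if (delta > 0 && (best === null || delta > best.delta)) {
      best = { mode, delta };
    }
  }
  return best;
}
```

Raw counts are enough to start. Once cohort sizes vary between runs, dividing by the number of evaluated items gives a rate that trends more fairly.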

The objection: “evals are expensive to set up”

They are not. The objection is almost always a proxy for “we didn’t budget for them at design time.”

Building an eval harness after an agent is in production is harder than building one before. The retrofit requires labeling cohorts the team didn’t prepare, instrumenting tool calls that weren’t designed for observability, and reverse-engineering failure taxonomies from incident postmortems. That cost is real.

Building an eval harness alongside the agent is comparatively trivial. Four to ten hours for the first agent. Faster for every agent after that, because the harness shape is reusable.

If your team’s first response to “we need evals” is “we don’t have time,” you’re already in the failure pattern. The cost is not the harness. The cost is the operational debt the missing harness is silently accumulating.

The CODN angle

By the time vibes catch up with you, the agent has acted on bad data hundreds of thousands of times. The reputational and operational debt compounds quarter over quarter.

Concretely:

In a scoring agent, the cost is misallocated SDR time. If the agent wrongly disqualifies 12% of the 50,000 leads you process per quarter, that’s 6,000 falsely disqualified leads, plus the revenue that would have come from them.
In a drafting agent, the cost is brand erosion. If 8% of drafts contain a fact you wouldn’t have asserted, and they go out under your name, the reputational cost compounds with audience growth.
In a routing agent, the cost is response-time decay. If the agent routes Tier 1 prospects to the wrong queue 5% of the time, your fastest-moving opportunities are also the ones waiting longest for a response.
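
The scoring arithmetic is worth writing down, because the quarterly number hides the compounding. A back-of-envelope sketch, reading the 12% as a share of all leads processed (which is what the 6,000 figure implies) and assuming, for simplicity, that volume and error rate stay flat across quarters. The function names are mine.

```typescript
// Expected wrong decisions in one quarter: volume times error rate.
function silentFailuresPerQuarter(volumePerQuarter: number, errorRate: number): number {
  return Math.round(volumePerQuarter * errorRate);
}

// Total wrong decisions between deployment and the first real measurement,
// assuming flat volume and a flat error rate (a simplification).
function cumulativeSilentFailures(
  volumePerQuarter: number,
  errorRate: number,
  quartersUnmeasured: number
): number {
  return silentFailuresPerQuarter(volumePerQuarter, errorRate) * quartersUnmeasured;
}

// The scoring example from the text: 50,000 leads/quarter at a 12% error rate.
const perQuarter = silentFailuresPerQuarter(50_000, 0.12); // 6000
// Two unmeasured quarters, matching the six-month mark in the pattern above.
const afterSixMonths = cumulativeSilentFailures(50_000, 0.12, 2); // 12000
```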

The CODN of skipping evals is not the eval cost. It’s the cumulative cost of every silent failure between deployment and the day you finally measure.

The bottom line

If you’re shipping a production agent without an eval harness, you are not shipping a system. You are shipping a long-form demo that costs production money.

Evals — or it’s vibes.