The news
Anthropic ran an internal experiment where 69 AI agents traded on behalf of employees in a simulated marketplace for one week. The result: stronger models consistently negotiated better outcomes. The people paired with weaker agents didn't notice they were losing.
Our take
This experiment was run inside Anthropic, by people who think about AI for a living. And even they couldn't tell when their agent was underperforming.
That should land hard for GTM teams.
Most B2B teams making AI model decisions right now are choosing based on cost, familiarity, or whatever came bundled with their existing stack. "We're already paying for Copilot" or "the free Claude tier works fine" are real reasons dAIs hears constantly. What this study surfaces is the cost of that logic: when an AI agent is working on your behalf — qualifying leads, drafting follow-up sequences, summarizing calls, triaging inbound — a weaker model doesn't fail loudly. It just quietly gets worse outcomes. And you don't notice.
This is the silent version of a bad hire. A rep who sounds competent in 1:1s but underperforms on quota. Except the AI agent runs at scale, 24/7, across every account it touches.
The dAIs position is straightforward: model selection is a performance variable, not a cost line. Teams that treat all models as interchangeable because "they all do the same thing" are losing measurable pipeline, and they can't see where it's leaking. The benchmark that matters isn't a leaderboard score. It's how the model performs on your specific workflows, with your data, against your actual GTM motion.
The so-what
The uncomfortable truth is that most GTM teams have no way to detect this kind of quiet underperformance. Here's where to start:
- Define what "good" looks like per workflow before you pick a model. Response quality for SDR email drafts is not the same benchmark as accuracy for CRM data enrichment.
- Run head-to-head tests on real tasks, not demos. Give two models the same 20 qualification-summary tasks and have a human score the outputs blind, without knowing which model wrote which.
- Treat model upgrades as a standing agenda item, not a one-time setup decision. The gap between frontier models and models one tier down is widening, not closing.
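The blind head-to-head test in the second step can be sketched in a few lines of Python. This is a minimal illustration, not a finished evaluation harness: the function names (`build_blind_pairs`, `tally`) and the labels `model_a` and `model_b` are hypothetical, and it assumes you have already collected each model's output for the same list of tasks.

```python
import random
from collections import Counter


def build_blind_pairs(outputs_a, outputs_b, seed=0):
    """Pair two models' outputs per task and randomize which comes first.

    Returns (pairs, key). `pairs` is what the reviewer sees, with no model
    names attached. `key` records which model sits in each slot and should
    stay hidden until scoring is finished.
    """
    if len(outputs_a) != len(outputs_b):
        raise ValueError("both models must be run on the same tasks")
    rng = random.Random(seed)  # fixed seed so the shuffle is reproducible
    pairs, key = [], []
    for out_a, out_b in zip(outputs_a, outputs_b):
        if rng.random() < 0.5:
            pairs.append((out_a, out_b))
            key.append(("model_a", "model_b"))
        else:
            pairs.append((out_b, out_a))
            key.append(("model_b", "model_a"))
    return pairs, key


def tally(choices, key):
    """Unblind the reviewer's choices and count wins per model.

    Each choice is 0 (first output was better), 1 (second was better),
    or None (no preference).
    """
    wins = Counter()
    for choice, slots in zip(choices, key):
        wins["tie" if choice is None else slots[choice]] += 1
    return wins


if __name__ == "__main__":
    # Toy placeholder outputs; in practice these would be real summaries.
    summaries_a = ["summary A1", "summary A2", "summary A3"]
    summaries_b = ["summary B1", "summary B2", "summary B3"]
    pairs, key = build_blind_pairs(summaries_a, summaries_b, seed=42)
    reviewer_choices = [0, 1, None]  # entered by a human who sees only `pairs`
    print(tally(reviewer_choices, key))
```

Keeping `key` separate from `pairs` is the point of the design: the reviewer only ever sees two unlabeled outputs per task, so familiarity with one model's style or vendor can't steer the score.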
You can't manage what you can't measure — and right now, most teams aren't measuring their AI agents at all.