AI for Builders

You reach up because the task looks hard. Only the invoice changes.

On short work you can check, our routing eval found no Opus-to-Fable capability separation at either effort level we ran. Buy down, not up: Fable 5's 2x buys a refusal tax and a fallback to the cheaper model.

By LatentEval. Published 2026-07-02.

A parser with a nasty edge battery. A constraint puzzle stacked four deep. The task looks hard, so you do the careful thing and reach for the top of the model list, then turn the effort up. When a script can grade the answer, that careful instinct is the expensive one. We ran the hardest checkable tasks we could write across three Claude tiers, and the priciest model bought no capability the cheapest one did not already have.

Claude Fable 5 came back on July 1 metered at twice Opus 4.8’s price, and the retrained safety classifier it shipped with flags benign coding and debugging work more often.¹² So the reach-for-the-top habit now costs double on exactly the tasks where our data says it changes nothing, and on a slice of them it refuses to answer at all.

Route by whether a task is checkable rather than by how hard it looks. Send deterministic, unit-testable work to the cheapest tier that clears your checks, and reserve Fable 5’s 2x for the multi-step, ungraded work this suite never measured. On work you can check, buy down, not up. Buy up here and the doubled price costs you twice: once as a refusal tax on benign work, and again on the calls that fallback rescues, where you get the cheaper model’s answer at the premium price.

The premium bought no correctness on a task you can check

We wrote original, benign, deterministically scored tasks aimed straight at a frontier model’s failure band: compositional multi-step reasoning, parsers and state machines with adversarial edge batteries, deeply stacked constraint and format puzzles. Then we pre-registered a bar for calling two models apart and ran the tiers against it.³ The bar was strict on purpose: Fable had to beat Opus by at least 15 percentage points of capability and show a disjoint 95% Wilson interval. The two conditions were never met together, at either effort, in either direction.

Capability pass rate on non-refused trials with 95% Wilson intervals. Low effort: Sonnet 5 96.4% (27/28), Opus 4.8 89.3% (25/28), Fable 5 71.4% (5/7). Extended-high: Opus 96.4% (27/28), Fable 100% (4/4). All intervals overlap; Fable's points sit on tiny n=7 and n=4 non-refused samples, indicative only. Verdict: no Opus-Fable separation, deterministic short-horizon tasks only.

At low effort, Sonnet 5 passed 96.4% of non-refused trials (27/28, 95% CI [82.3, 99.4]), Opus 4.8 passed 89.3% (25/28, CI [72.8, 96.3]), and Fable 5 passed 71.4% on the handful of tasks it did not refuse (5/7, CI [35.9, 91.8]). At extended-high effort, Opus held 96.4% (27/28, CI [82.3, 99.4]) and Fable passed all four it answered (4/4, CI [51.0, 100.0]). The Fable-minus-Opus gap was minus 17.9 points at low and plus 3.6 at extended-high, and at both efforts the two models’ intervals overlap so heavily that neither gap is distinguishable from zero. The pre-registered stop rule fired: no measurable Opus-to-Fable capability separation, so we stopped escalating instead of hunting for one. The statistics of why a small sample cannot license a ranking are the research desk’s job, not this piece’s; the count of runs it takes to earn one is covered there too.

Read Fable’s numbers with the denominator in front of you. Its capability sits on 7 non-refused trials at low and 4 at extended-high, which is why its interval runs from a coin flip to near-certain. That is far too little signal to rank Fable against Opus in either direction, which is the point: on tasks you can check, the expensive model did not produce a difference big enough to see.
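For readers who want to rerun the arithmetic, here is a minimal sketch of the Wilson interval and the two-part separation bar described above, applied to the reported counts. The function names and the `separated` helper are illustrative, not the harness's actual code; the 15-point gap and the disjoint-interval condition come from the pre-registered bar as stated in this piece.

```python
from statistics import NormalDist


def wilson_interval(passes: int, trials: int, confidence: float = 0.95) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion, as fractions in [0, 1]."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    p = passes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * ((p * (1 - p) / trials + z * z / (4 * trials * trials)) ** 0.5) / denom
    return max(0.0, center - half), min(1.0, center + half)


def separated(winner: tuple[int, int], loser: tuple[int, int], min_gap: float = 0.15) -> bool:
    """True only if `winner` leads by min_gap AND its interval sits wholly above `loser`'s."""
    gap = winner[0] / winner[1] - loser[0] / loser[1]
    winner_low, _ = wilson_interval(*winner)
    _, loser_high = wilson_interval(*loser)
    return gap >= min_gap and winner_low > loser_high


# (passes, non-refused trials) as reported in this piece
arms = {
    "low": {"opus": (25, 28), "fable": (5, 7)},
    "extended-high": {"opus": (27, 28), "fable": (4, 4)},
}

for effort, arm in arms.items():
    for a, b in (("fable", "opus"), ("opus", "fable")):
        print(f"{effort}: {a} over {b} separated = {separated(arm[a], arm[b])}")
```

Every check comes back False. At low effort Opus does lead Fable by more than 15 points on the point estimate, but the intervals overlap, so the bar is still not met.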

Correctness is a property of the task, not of the model’s price.

Extended-high effort did not separate the tiers either

The reach for the top of the list usually rides with a second reflex: crank the reasoning effort and let the premium think. Here that bought nothing measurable. Opus went from 89.3% to 96.4% (25/28 to 27/28) between low and extended-high, a change a two-proportion test cannot separate from noise (p is about 0.30). We use that only to support the null: effort did not open daylight in either direction. The intermediate level was pre-registered to run only if a gap was already emerging. None was, so we did not run it, and we did not manufacture one.
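The p-value above can be reproduced with a short sketch. This piece does not say which two-proportion test was used; the pooled z-test below is an assumption that happens to land on the same figure, and `two_proportion_p_value` is an illustrative name.

```python
from math import erfc, sqrt


def two_proportion_p_value(x1: int, n1: int, x2: int, n2: int) -> float:
    """Two-sided p-value from a pooled two-proportion z-test (assumed test)."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if se == 0:
        return 1.0  # both arms all-pass or all-fail: no evidence of a difference
    z = (p2 - p1) / se
    return erfc(abs(z) / sqrt(2))  # equals 2 * (1 - Phi(|z|))


# Opus 4.8: 25/28 at low effort, 27/28 at extended-high
p_value = two_proportion_p_value(25, 28, 27, 28)
print(f"p = {p_value:.2f}")
```

With only a handful of failures in each arm the normal approximation is rough, so treat the result as support for "cannot separate from noise" rather than as a precise probability.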

Effort is a real knob for hard reasoning. On short checkable tasks at this difficulty, it did not change the ranking.

The cheapest tier returns the most correctness per dollar

If capability ties, price breaks the tie. In the prior 40-task round we scored quality per dollar as raw pass rate over mean cost per task, and the cheapest tier led by a wide margin.

Quality per dollar (raw pass rate over mean cost per task), prior 40-task round, 2 runs each: Sonnet 5 580.5, Opus 4.8 187.0, Fable 5 100.7. Sonnet's lead is about 3.8x over Fable at sticker pricing. Fable's lower value is the operational cyber-refusal tax, not lower capability. Deterministic short-horizon tasks only.

Sonnet 5 returned 580.5, Opus 4.8 returned 187.0, and Fable 5 returned 100.7 (40 tasks, 2 runs each).⁴ Sonnet’s lead leans on its introductory pricing of $2/$10 per million tokens, in effect through August 31; at sticker ($3/$15) the lead narrows to about 3.8x over Fable.² Either way the knee is at the bottom of the price ladder.
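The sticker-price adjustment is simple enough to check by hand. The sketch below assumes Sonnet's 580.5 was computed at introductory pricing, and that moving to sticker scales cost per task by exactly 1.5 because both the input and output rates rise by that factor ($2 to $3, $10 to $15). `reprice` is an illustrative helper, not part of the harness.

```python
def reprice(quality_per_dollar: float, price_multiplier: float) -> float:
    """Rescale a pass-rate-per-dollar figure when every token price changes by one factor."""
    return quality_per_dollar / price_multiplier


# Reported composites from the prior 40-task round.
reported = {"sonnet-5": 580.5, "opus-4.8": 187.0, "fable-5": 100.7}

# Assumption: 580.5 reflects introductory pricing ($2/$10 per million tokens).
STICKER_OVER_INTRO = 3 / 2  # $3/$15 sticker over $2/$10 intro
sonnet_at_sticker = reprice(reported["sonnet-5"], STICKER_OVER_INTRO)

lead_at_intro = reported["sonnet-5"] / reported["fable-5"]
lead_at_sticker = sonnet_at_sticker / reported["fable-5"]
print(f"Sonnet over Fable: {lead_at_intro:.1f}x at intro, {lead_at_sticker:.1f}x at sticker")
```

Under those assumptions the sticker figure matches the roughly 3.8x quoted above, and Sonnet stays well ahead of Opus's 187.0 at either price.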

One honest asterisk: Fable’s low score here is dragged down by its benign refusals, which the grader scores as misses, not by weak capability. The failure is operational, not cognitive, and you pay for it either way. (We do not quote the newest round’s per-dollar figure, because refusals return short outputs that deflate Fable’s per-task cost and flip the ratio in a way that flatters it for the wrong reason.)

On checkable work, the correctness-per-dollar knee sits at the cheapest tier, by roughly 3.8x at sticker pricing.

The premium did not buy consistency either

If not peak capability, maybe the premium buys steadiness: the same task, right on every run. We measured that with reliability@k, the mean over tasks of each task’s pass fraction raised to the power k, at k=2 runs per task. It rewards a model for being consistently right rather than merely counting the tasks that pass on every run, and its full definition lives in the glossary.

It did not favor the premium. Reliability@k came in at 0.875 for Opus at low effort and 0.946 for Sonnet at low and for Opus at extended-high. These are point composites over two runs per task, with no interval attached, so the small gaps between them will not carry a ranking: read them as one tight band, not a leaderboard. What the band does rule out is the premium story. Consistency did not climb with price, and it did not climb with effort. (Fable’s raw reliability@k collapses under the same refusal tax, an operational failure rather than an inconsistent answer, so it cannot carry a consistency claim in either direction.)
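As a worked illustration of the definition, here is a minimal sketch of reliability@k. The per-task split is hypothetical: we are not reproducing the harness's raw data, only showing one split of 25 passes across 14 two-run tasks that yields the reported 0.875.

```python
def reliability_at_k(per_task: list[tuple[int, int]], k: int) -> float:
    """Mean over tasks of (passes / runs) ** k."""
    if not per_task:
        raise ValueError("need at least one task")
    return sum((passes / runs) ** k for passes, runs in per_task) / len(per_task)


# Hypothetical split, (passes, runs) per task: 12 tasks pass both runs,
# 1 passes once, 1 never passes. That is 25 passes in 28 trials.
opus_low = [(2, 2)] * 12 + [(1, 2)] + [(0, 2)]

# Same idea for 27 passes in 28 trials: 13 clean tasks and 1 split task.
sonnet_low = [(2, 2)] * 13 + [(1, 2)]

print(round(reliability_at_k(opus_low, k=2), 3))
print(round(reliability_at_k(sonnet_low, k=2), 3))
```

A task that passes one run in two contributes 0.25 rather than 0.5, which is how the metric penalizes flakiness more heavily than a plain pass rate does.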

Consistency did not scale with price or effort here. The reliability@k figures sit in one tight band, not a ranking this sample can defend.

What the 2x actually buys on this work: a refusal tax and a detour to Opus

The premium does buy something on checkable tasks. It buys two things, both measured, and both a long way short of the capability, value, or consistency the 2x implies.

First, a refusal tax. Across the hard set Fable refused 75% of these benign tasks at low effort and 86% at extended-high (21/28 and 24/28), and 44 of the 45 refusals were API-labeled cyber false positives on plain parsers, state machines, string transforms, simulations, and constraint puzzles. The raw pass rate a caller actually experiences was 17.9% at low (5/28, 95% CI [7.9, 35.6]) and 14.3% at extended-high (4/28, CI [5.7, 31.5]), against Opus’s 89 to 96%.

Second, server-side fallback catches those refusals by handing the call to Opus. With fallback on, the handoff fired on 20 of 28 calls (71.4%), left 0 still refused, and lifted the pass rate to 96.4% (CI [82.3, 99.4]). The fix for Fable’s refusals is to serve the cheaper model most of the time. You pay Fable’s 2x and read Opus’s answer.

The swap is silent, which is its own hazard: the response comes back looking normal, so nothing tells you a different model wrote it unless you logged it. The instrumentation for catching that is its own piece, “log stop_reason and response.model”; the consumer-side version of this same routing math is “do you need Fable 5.”
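A minimal sketch of that logging, under stated assumptions: a plain dict with `model` and `stop_reason` keys stands in for the API response, the model IDs are the ones used in this piece's router, and the prefix match is our guess at how to tolerate a dated suffix on the served model ID. `served_by_fallback` is an illustrative name, not an SDK function.

```python
import logging

logger = logging.getLogger("routing")


def served_by_fallback(requested_model: str, response: dict) -> bool:
    """Log which model answered and why it stopped; return True if another model answered."""
    served_model = response.get("model")
    stop_reason = response.get("stop_reason")
    # Assumption: a served ID may carry a dated suffix, so compare by prefix.
    swapped = served_model is not None and not served_model.startswith(requested_model)
    logger.info(
        "requested=%s served=%s stop_reason=%s swapped=%s",
        requested_model, served_model, stop_reason, swapped,
    )
    return swapped


# Stand-in responses: one answered by the requested model, one rescued by fallback.
direct = {"model": "claude-fable-5", "stop_reason": "end_turn"}
rescued = {"model": "claude-opus-4-8", "stop_reason": "end_turn"}

print(served_by_fallback("claude-fable-5", direct))
print(served_by_fallback("claude-fable-5", rescued))
```

Counting the True results over a day of traffic gives you your own version of the 20-of-28 figure above.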

On checkable work the 2x buys a refusal tax, and the fix for that tax is a fallback that bills you the premium for the cheaper model’s answer.

Where the 2x may still be worth it: everything this suite could not measure

Here is the boundary this piece will not cross. Our suite scores short-horizon tasks with a deterministic grader: exact match, unit tests, parseable final answers. That design is what makes the no-separation claim clean, and it is also the design’s blind spot. It does not measure long-horizon agentic runs where a small early error compounds across dozens of steps, and it does not measure open-ended quality where there is no key to grade against. Those are the workloads a more capable model is built for, and they are the plausible home of the 2x. We did not measure them, so we do not claim them in either direction. If the 2x pays anywhere, it is there: that unmeasured lane is the insurance you are actually buying.

Routing by task type. For structured extraction, tool-calling, deterministic reasoning, format and string transforms, and unit-testable code, the cheapest tier suffices and Opus 4.8 is the value knee, with no measured Fable edge. Long-horizon agentic and open-ended writing rows are hatched as NOT MEASURED by this deterministic exact-match suite; “premium may pay there” is a hypothesis, not a result.

You do not have to take our task set’s word for the routing, either. Anthropic’s own guidance is to start with Opus 4.8 for most work and reach for Fable 5 only when you need the highest available capability. Our data lands in the same place from the reliability side: reserve the premium for what a grader cannot check.

Reserve the premium for the open-ended runs a grader cannot score. On everything a grader can check, the routing decision is already made.

Route by checkability, then measure your own set

Default checkable work to the cheapest tier that clears your checks. Extraction, tool-calls, deterministic reasoning, format and string transforms, unit-testable code: capability ties across the tiers and the value knee is at the bottom of the ladder.
Reserve Fable 5’s 2x for long-horizon agentic and open-ended work. Those are the workloads this suite does not grade and the only lane where the premium has room to pay.
If you must route benign, cyber-adjacent work to Fable, expect the classifier’s false positives and instrument for them. Turn on fallback and log response.model, because a rescue is an Opus answer on a Fable bill.
Do not route by the leaderboard. “Most capable” is a peak-benchmark claim. Whether a model gets your task right, run after run, is a measurement you make on your own set.

A router that reads this way is a few lines and one honest predicate:

def route(task):
    # Checkable = you can grade the answer with a test, not a human or a judge.
    if task.is_deterministically_checkable:
        return "claude-sonnet-5"    # cheapest tier that clears your checks; capability ties here
    if task.is_long_horizon or task.is_open_ended:
        return "claude-fable-5"     # the only lane where the 2x has room to pay
    return "claude-opus-4-8"        # sensible default, matching Anthropic's own routing
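The first branch leans on "the cheapest tier that clears your checks," which is something you measure rather than assume. A minimal sketch of that selection follows; the 0.85 bar and the helper name are hypothetical, the counts are the raw low-effort results reported in this piece (refusals scored as misses), and the prices are the output rates per million tokens from the models overview, with Sonnet at sticker.

```python
from typing import Optional


def cheapest_tier_that_clears(
    results: dict[str, tuple[int, int]],
    price: dict[str, float],
    min_pass_rate: float,
) -> Optional[str]:
    """Return the lowest-priced model whose measured pass rate meets the bar, else None."""
    for model in sorted(results, key=lambda m: price[m]):
        passes, trials = results[model]
        if trials > 0 and passes / trials >= min_pass_rate:
            return model
    return None


# (passes, trials) per model on your own checkable task set.
results = {
    "claude-sonnet-5": (27, 28),
    "claude-opus-4-8": (25, 28),
    "claude-fable-5": (5, 28),
}
# Output price in dollars per million tokens.
price = {"claude-sonnet-5": 15.0, "claude-opus-4-8": 25.0, "claude-fable-5": 50.0}

print(cheapest_tier_that_clears(results, price, min_pass_rate=0.85))
```

A stricter variant would compare a Wilson lower bound against the bar instead of the point pass rate, at the cost of needing more runs before any tier qualifies.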

The rule is downstream of a measurement only you can make: does model X get your task right, run after run, at a cost you can defend? That is the reliability question sitting under the pricing one, and it is the one our research desk keeps, in “how to measure agent reliability past a single pass rate” and “reliable AI agents in production.” Measure your own task set. The model at the top of the list is not the answer to a question you never ran.

1. Anthropic, Redeploying Claude Fable 5, https://www.anthropic.com/news/redeploying-fable-5: the June 12 suspension and July 1 redeploy, the retrained safety classifier that “comes at the cost of flagging benign requests more often during routine coding and debugging tasks,” and the notify-and-reroute-to-Opus-4.8 behavior inside Claude’s own apps.
2. Anthropic, Models overview, https://platform.claude.com/docs/en/about-claude/models/overview (as of 2026-07-02): Claude Fable 5 at $10 in / $50 out per million tokens, Claude Opus 4.8 at $5 / $25 (so Fable is 2x Opus), Claude Sonnet 5 at introductory $2 / $10 through August 31, 2026 and sticker $3 / $15; comparative latency slower / moderate / fast; and the guidance to start with Opus 4.8 and reach for Fable 5 only for the highest available capability.
3. Our own routing-eval harness: deterministic exact-match / unit-test scoring (no LLM judge), Anthropic API, pricing re-verified live 2026-07-02; effort sweep of 14 pruned tasks x 2 runs = 28 trials per arm (prior round n=40), 95% Wilson intervals, pre-registered stop rule. It measures short-horizon checkable correctness only, not long-horizon agentic or open-ended quality.
4. Quality per dollar here is a point composite: raw pass rate divided by mean cost per task, one figure per tier with no interval on the ratio itself. Its uncertainty lives in the component pass rates, which carry the 95% Wilson intervals shown elsewhere in this piece; the cross-tier gap is wide enough (roughly 3.8x at sticker pricing) that the ordering survives, but read the number as a value ranking, not an interval-bounded estimate. Prior 40-task round, 2 runs each.