AI for Builders

You picked the model at the top. The harness picked it for you.

BenchJack, a Berkeley auditing tool, found 219 flaws in ten popular agent benchmarks and gamed nine to near-perfect scores. Why a benchmark number is a claim about the harness, and how to check it.

By LatentEval Published 2026-07-02

You pick a model the way everyone does: read the leaderboard, believe the number. It might be an agent at the top of a leaderboard, or a green pass rate on the eval you wrote last week. A Berkeley team just published a preprint that should make you read that number differently. Their auditing tool, BenchJack, synthesized reward-hacking exploits that reached near-perfect scores on 9 of 10 popular agent benchmarks without solving a single task, and surfaced 219 distinct flaws across eight recurring classes.¹

On nine of those ten harnesses, the score you read as “this agent can do the work” was also reachable by an agent that did none of it.

Read a benchmark score as a claim about the harness until you have read the scoring code. The exploitable classes are concrete and checkable: can the agent see the reference solution or the tests, is the grader a substring match, can the agent write to the file the grader reads, can it reach an LLM judge and talk to it.

The score stopped measuring the agent

Reward hacking is the failure where an agent maximizes the score without performing the intended task,¹ and the Berkeley result is what that looks like at scale. BenchJack reads a benchmark’s evaluation code, maps how points get awarded, finds where the agent and the grader are not actually isolated, and writes an end-to-end exploit for each gap.² Pointed at ten popular agent benchmarks spanning software engineering, web navigation, desktop, and terminal work, it drove nine of them to near-perfect scores while the underlying tasks went unsolved.¹

That gap has a name. When a metric stops tracking the thing it was built to measure, its construct validity is gone: the number still moves, it just no longer means what the leaderboard says it means. A benchmark score is supposed to be a proxy for capability. On nine of ten harnesses, the proxy came apart from the thing it stood for.

The grader is a referee who only ever checks the scoreboard. BenchJack reached over and typed the score in.

A benchmark number is a measurement, and a measurement you cannot reproduce or defend is not evidence. On the audited harnesses, a near-perfect score was fully consistent with zero real work.

The flaws are the same shortcuts your own harness ships with

None of the 219 flaws are exotic. They are the shortcuts you reach for when you stand up an eval in an afternoon. The paper groups them into eight recurring classes,¹ and four of them will look familiar the moment you have written a grader:

The reference solution is in the sandbox. The expected output, or the tests themselves, sit somewhere the agent can read. It finds the answer key.
The grader is a substring match. Award points when the output contains the expected string, and an agent that prints that string a hundred different ways passes without computing anything.
The agent can write where the grader reads. With no real isolation between the workspace and the scoring, the agent edits the file the grader checks, or the test file itself. Your failure attribution is now built on a log that says pass when nothing passed.
The LLM judge is reachable by the thing it judges. If the agent’s output flows into a judge prompt, that output can carry instructions to the judge. That is prompt injection against your own scorer, and it is why an LLM-as-a-judge score needs its own threat model.

Here is the substring trap in the smallest form that still bites:

# gameable: the agent only has to make the string appear
def grade(output: str, expected: str) -> bool:
    return expected in output          # "contains" is not "computed"

The agent prints expected inside an apology, a comment, a JSON blob, and scores on the string alone. The fix grades the behavior itself and keeps the answer out of the sandbox:

# defended: exact, structured, and the reference never enters the sandbox
def grade(result, expected) -> bool:
    return result.value == expected and result.trace.solved_via != "read_fixture"

The second version is dull on purpose. It refuses to accept a string as proof that work happened.

Every flaw class is a question you can ask about a harness in five minutes: what can the agent see, what can it write, and does the grader check behavior or text. If you build evals, this is your regression suite against your own optimism.

What a gamed number is still worth

This is a preprint: a strong existence proof, with peer review still ahead of it.¹ And benchmarks are not worthless for it. A score reported without a threat model is simply an unfalsified claim, and the same tool that broke the harnesses also showed they can be repaired. BenchJack’s adversarial pipeline drove the hackable-task ratio from near 100% to under 10% on four benchmarks, and fully patched WebArena and OSWorld within three iterations.¹

So an audited harness and an unaudited one are different instruments even at the same headline score. The question a number cannot answer on its own is the one worth asking before you trust it: was anyone adversarial to this metric before I was? Whether a benchmark score generalizes to your traffic is a separate measurement entirely, and reliability under real load is its own discipline.

A benchmark that has survived an adversary is worth more than a higher one that only ever survived polite users.

Audit the harness before you trust the number

Before a benchmark number moves a decision, run it through the questions the harness authors did not:

Read the scoring code first. The eval logic is the actual measurement. If you have not read it, you are trusting a number whose definition you have never seen.
Find the reference solution and the tests. If the agent’s environment can reach them, the score rewards a lookup.
Check the grader’s strictness. Substring and fuzzy matches are gameable by construction. Prefer exact, structured, behavior-level checks.
Confirm the isolation boundary. The agent must not be able to write anything the grader reads: not the output file, not the tests, not the judge prompt.
Give any LLM judge a threat model. Treat the judged output as untrusted input to the judge, the way you treat user input to a parser, and report the judge’s own error rate alongside its verdicts.
Prefer audited harnesses, and say so. “This score survived an adversarial audit” is a stronger claim than a higher number that did not.

The Berkeley team packaged their version of this as an Agent-Eval Checklist for benchmark designers.¹ You can start without their tool. Treat every green number as a measurement still waiting for its first adversary.

A leaderboard tells you which model peaked on a test. Whether your agent does the work, on your traffic, run after run, is the measurement underneath, and it is the one an exploit cannot fake. That gap between the score and the capability is where reliability lives: observability shows you what your agent did; evaluation is the harder claim that it did the right thing. Both bottom out in the same question: did anyone ever try to break this number? Read the scoring code before the score.

Hao Wang, Hanchen Li, Qiuyang Mang, Alvin Cheung, Koushik Sen, Dawn Song, “Do Androids Dream of Breaking the Game? Systematically Auditing AI Agent Benchmarks with BenchJack,” arXiv:2605.12673 (v1, 12 May 2026). The reward-hacking definition, BenchJack, near-perfect scores on 9 of 10 audited benchmarks without solving a single task, 219 distinct flaws across eight recurring classes, the Agent-Eval Checklist, and the adversarial-patching results (hackable-task ratio from near 100% to under 10% on four benchmarks; WebArena and OSWorld fully patched within three iterations): https://arxiv.org/abs/2605.12673 ↩ ↩² ↩³ ↩⁴ ↩⁵ ↩⁶ ↩⁷
Center for Responsible, Decentralized Intelligence (Berkeley RDI), “How We Broke Top AI Agent Benchmarks: And What Comes Next” (April 2026). BenchJack’s probe-then-exploit method and the audited benchmark set (SWE-bench, WebArena, OSWorld, Terminal-Bench, and others): https://rdi.berkeley.edu/blog/trustworthy-benchmarks-cont/ ↩

The score stopped measuring the agent

The flaws are the same shortcuts your own harness ships with

What a gamed number is still worth

Audit the harness before you trust the number

Footnotes