How we measure

Every number we publish about AI-agent reliability is a choice: which method, which assumptions, which source. This page explains those choices in plain language, so you can decide whether to trust a result and check our work if you want to.

Published by LatentEval

We publish meta-analyses, original benchmarks, and experiments on AI-agent reliability. The goal is the same every time: a clear, defensible number, produced by a method you can check and reported with an honest sense of how much to trust it. To earn that trust we hold ourselves to a few fixed rules about how results are produced and described.

How we pick a method

For most measurements there is more than one accepted method. When we publish an analysis, a benchmark, or an experiment, we choose by the same order of preference every time:

  1. A documented statistical procedure, where one exists: for example, a bias-corrected bootstrap confidence interval, or a paired significance test for comparing two systems on the same runs.
  2. A widely cited, peer-reviewed method, when there is no single standard: for example, a published estimator for eval variance or for inter-judge agreement, cited to the paper it comes from.
  3. Plain arithmetic with a single agreed definition, for things like a containment rate (faults contained divided by faults injected), where the result is not a matter of opinion once the runs are fixed.

Where a respected alternative exists, we say so in the write-up and explain why we chose the default. We never silently average competing methods to manufacture a single tidy number.
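To make the first item in the list concrete, here is a minimal sketch of one paired significance test for two systems scored pass/fail on the same runs. It uses the exact McNemar test, which is one reasonable choice rather than a statement of which test any given write-up uses. The function name mcnemar_exact and the sample outcomes are assumptions made for illustration.

```python
from math import comb


def mcnemar_exact(a_pass, b_pass):
    """Two-sided exact McNemar test for paired pass/fail outcomes.

    a_pass and b_pass are equal-length lists of booleans: the outcome of
    system A and system B on the same run. Only discordant runs (one
    system passed, the other failed) say anything about the difference.
    Returns (runs only A passed, runs only B passed, p-value).
    """
    if len(a_pass) != len(b_pass):
        raise ValueError("a paired test needs one outcome per system per run")

    only_a = sum(1 for a, b in zip(a_pass, b_pass) if a and not b)
    only_b = sum(1 for a, b in zip(a_pass, b_pass) if b and not a)
    n = only_a + only_b
    if n == 0:
        # The systems never disagree, so there is no evidence of a difference.
        return only_a, only_b, 1.0

    # Under the null hypothesis each discordant run favours A or B with
    # probability 1/2, so the smaller count follows Binomial(n, 0.5).
    k = min(only_a, only_b)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return only_a, only_b, min(1.0, 2 * tail)


# Hypothetical outcomes on 20 shared runs (illustrative data, not a real result).
a_pass = [True] * 9 + [False] * 1 + [True] * 10
b_pass = [False] * 9 + [True] * 1 + [True] * 10

only_a, only_b, p_value = mcnemar_exact(a_pass, b_pass)
print(only_a, only_b, round(p_value, 4))  # prints: 9 1 0.0215
```

The test ignores runs where both systems agree, which is why pairing matters: comparing the two overall pass rates as if they came from independent samples would discard the fact that both systems saw the same runs.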

What “estimate” actually means

Most of our results are estimates, and we use that word on purpose. An estimate is a careful, defensible projection from measured runs, reported with an honest account of its uncertainty. It describes the runs and conditions it was measured under, and stops there. Two things make a result an estimate:

So when a benchmark reports “containment 82% (95% CI 76–88%),” the point estimate is exact on the runs sampled, and still an estimate, because a different fault mix or a rerun could move it. Where we can, we show that margin of error, so a single number isn't mistaken for a guarantee.
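As a rough sketch of how a rate and its margin of error fit together, the Python below computes a containment rate and a 95% interval around it. It uses the Wilson score interval as a simple closed-form stand-in; it is not the bias-corrected bootstrap mentioned earlier, and a different method gives somewhat different bounds. The function names and the counts (164 contained out of 200 injected) are hypothetical and are not meant to reproduce the figures quoted above.

```python
from math import sqrt
from statistics import NormalDist


def containment_rate(contained, injected):
    """Faults contained divided by faults injected."""
    if injected <= 0:
        raise ValueError("injected must be positive")
    if not 0 <= contained <= injected:
        raise ValueError("contained must be between 0 and injected")
    return contained / injected


def wilson_interval(successes, trials, confidence=0.95):
    """Wilson score interval for a binomial proportion."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    # Two-sided critical value, about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half_width = z * sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return center - half_width, center + half_width


# Hypothetical counts, for illustration only.
contained, injected = 164, 200
rate = containment_rate(contained, injected)
low, high = wilson_interval(contained, injected)
print(f"containment {rate:.0%} (95% CI {low:.0%} to {high:.0%})")
```

The rate itself is plain arithmetic on the runs sampled. The interval is where the uncertainty shows up: fewer injected faults widen it, more narrow it.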

How we cite our sources

When an analysis or benchmark relies on an external figure (a metric definition, a statistical procedure, a baseline number), we cite it with three fixed parts: the publisher that actually issued it (the standards body, the journal, the paper's authors; never a blog that re-posted it), a direct link so you can confirm it in one click, and the retrieved date, meaning the date we read and recorded it. That is our standard citation format across research write-ups: a benchmark pins the paper a method came from the same way a meta-analysis pins each study it pools. Here is exactly how a pinned source reads:

Example: how we pin a source

  1. e-Handbook of Statistical Methods: confidence intervals and hypothesis tests. U.S. National Institute of Standards and Technology (NIST/SEMATECH). Retrieved .
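One way to keep the three fixed parts from drifting is to make them required fields on a record. The Python below is a minimal sketch under that assumption: the class name PinnedSource, its field names, the render format, the example.org link, and the retrieved date are all hypothetical placeholders, not our actual tooling or the real link and date for this source.

```python
from dataclasses import dataclass
from datetime import date
from urllib.parse import urlparse


@dataclass(frozen=True)
class PinnedSource:
    """A citation with the three fixed parts: publisher, link, retrieved date."""

    title: str
    publisher: str
    url: str
    retrieved: date

    def __post_init__(self):
        # Refuse to build a citation that is missing any of the fixed parts.
        if not self.publisher.strip():
            raise ValueError("publisher is required")
        parts = urlparse(self.url)
        if parts.scheme not in ("http", "https") or not parts.netloc:
            raise ValueError("url must be a direct http(s) link")
        if not isinstance(self.retrieved, date):
            raise ValueError("retrieved must be a date")

    def render(self):
        """Format the citation as a single line of text."""
        return (
            f"{self.title}. {self.publisher}. {self.url}. "
            f"Retrieved {self.retrieved.isoformat()}."
        )


# Title and publisher come from the example above; the URL and the date
# are placeholders for illustration only.
source = PinnedSource(
    title="e-Handbook of Statistical Methods: confidence intervals and hypothesis tests",
    publisher="U.S. National Institute of Standards and Technology (NIST/SEMATECH)",
    url="https://example.org/placeholder-link",
    retrieved=date(2024, 1, 15),
)
print(source.render())
```

Making the record immutable (frozen=True) means a citation cannot be edited after it is pinned; a changed link or a fresh read would produce a new record with a new retrieved date.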

How often we review

Publishing a result is the start of maintaining it. We sort published work into review bands by how fast it goes stale:

Benchmarks & datasets
Re-run and re-reported when the underlying models, harness, or baselines change (the events that move a measured result), and we note when each result was last reviewed.
Published analyses
Reviewed when the statistical methods they rest on are revised, or when new studies materially change a pooled conclusion.
Owner
The LatentEval editorial process owns these reviews. The maintainers stand behind the math; we do not attach a named individual “reviewed by” byline (see below).

Our no-fabrication rule

This is the rule that overrides the others. We do not invent numbers, sources, reviewers, accuracy claims, or outcomes to make a result look more authoritative than it is. Concretely:

More on how we work

Everything here is meant to be checkable. If a figure looks wrong or a source has moved, that's a bug we want to fix.