Construct validity (benchmarks)

Construct validity is the degree to which a benchmark measures the specific capability it claims to measure, rather than a proxy that a system can score highly on without having that capability. A benchmark is construct-valid only when its top score cannot be earned without the capability it advertises.

Construct validity asks whether a benchmark measures the capability it names, or a proxy a system can score highly on without possessing it. The concept comes from measurement theory, where it applies whenever a test is read as a measure of an attribute that is not “operationally defined” (Cronbach & Meehl, 1955, peer-reviewed). A review of 445 LLM benchmarks found recurring patterns that undermine the validity of the claims those benchmarks make (Bean et al., Measuring what Matters, NeurIPS 2025, peer-reviewed).

The operative test is simple. A benchmark is construct-valid only when a top score cannot be earned without the capability it advertises.

Consider a reliability benchmark that ranks agents by single-run pass rate. A system that passes once by luck, and would fail its next two runs, posts the same number as one that passes every time, so the score credits a lucky draw and calls it reliability. By contrast, pass^k credits a task only when all k runs of that task succeed, and its suite-level aggregate, reliability@k, ties the score to the construct.
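The contrast can be made concrete with a minimal Python sketch. It rests on assumptions not stated above: run outcomes are stored as a list of booleans per task, pass^k is read as "the first k recorded runs all passed", and reliability@k is taken to be the fraction of tasks that satisfy pass^k. The function names and the sample data are hypothetical, chosen only for illustration.

```python
from typing import Dict, List

Runs = Dict[str, List[bool]]  # task id -> outcomes of repeated runs (assumed layout)


def single_run_pass_rate(runs: Runs) -> float:
    """Fraction of tasks whose first recorded run passed."""
    return sum(outcomes[0] for outcomes in runs.values()) / len(runs)


def pass_hat_k(outcomes: List[bool], k: int) -> bool:
    """pass^k for one task: True only if the first k runs all passed."""
    if len(outcomes) < k:
        raise ValueError(f"need at least {k} runs, got {len(outcomes)}")
    return all(outcomes[:k])


def reliability_at_k(runs: Runs, k: int) -> float:
    """Assumed suite-level aggregate: fraction of tasks that satisfy pass^k."""
    return sum(pass_hat_k(outcomes, k) for outcomes in runs.values()) / len(runs)


# Hypothetical agents: one passes each task once and then fails, one always passes.
lucky = {"task_a": [True, False, False], "task_b": [True, False, False]}
steady = {"task_a": [True, True, True], "task_b": [True, True, True]}

for name, runs in (("lucky", lucky), ("steady", steady)):
    print(name, single_run_pass_rate(runs), reliability_at_k(runs, k=3))
```

Under these assumptions both agents score 1.0 on single-run pass rate, while reliability@3 is 0.0 for the lucky agent and 1.0 for the steady one. Only the second metric separates the two behaviours that the word "reliability" is meant to distinguish.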

This differs from eval reproducibility, which asks only whether re-running returns the same number; a benchmark can be perfectly reproducible and still measure the wrong thing. Check what a benchmark measures before trusting its ranking; see AI agent evaluation for where this check sits among eval-rigor tests.