The prompt wording is a hyperparameter you never swept.
Rewording the same task swings a model's pass rate: format, option order, even a 'please'. A one-phrasing eval samples one point from a spread you never measured. Pin the prompt and measure it.
You wrote the prompt once. It read fine, the eval went green, you shipped. What you never ran was the same prompt with the multiple-choice options in a different order, or a "please" bolted on the front, or the few-shot examples flipped, because those are the same task. To you they are. On LLaMA-2-13B, moving between prompt formats that mean exactly the same thing swung accuracy by as much as 76 points [1], on tasks where the meaning never changed.
One weak model would be easy to wave off. The catch is that this is the number you just shipped your reliability against. Your eval ran one phrasing, so it reported one point out of a distribution you never looked at, and you have no idea whether you drew the lucky end or the unlucky one.
Pin the exact prompt string as a versioned artifact, then run a handful of trivial rewordings of the same task and measure the pass-rate spread across them. A wide spread means the prompt is underspecified and your single-phrasing eval number is noise wearing a metric’s clothes. Constrain the format and pin the order until the spread closes, and only then trust the number.
Here is the catalog of changes that feel like nothing and move the number:
| Lever | What changes | Measured effect | Scope / caveat | Source |
|---|---|---|---|---|
| Format | Q: or Question:, (A) or A., a newline or a space after the label | Up to 76 accuracy points on LLaMA-2-13B; median 7.5 points across models and few-shot counts; 6.4-point median across 320 formats for GPT-3.5 | 76 is the ceiling off a small, older model; sensitivity persists across model size, few-shot count, and instruction tuning; 53 tasks from Super-NaturalInstructions | Sclar, Choi, Tsvetkov, Suhr (FormatSpread) |
| Order | The sequence of your few-shot examples, or which slot holds the correct answer | Permuting the same examples can move a model between near state-of-the-art and near-random-guess; a standing selection bias toward option IDs like A moves the score with the slot | Ordering measured across eleven text-classification tasks, and a good permutation does not transfer between models; selection bias measured across 20 models and three benchmarks | Lu et al. (ACL 2022); Zheng et al. (ICLR 2024) |
| Shots | Zero-shot, or the same task with two examples pasted above it | A different input that moves the number; the article reports no isolated effect size for shot count | One of the four changes that read as the same task while changing the model’s input | |
| Tone | A please on the front, a curt instruction versus a courteous one | Rudest phrasing 84.8% against politest 80.8% across 50 questions on one frontier model | Sign is unstable: a cross-lingual study found impolite prompts often hurt, with the best politeness level shifting by language; small, single-model study | Dobariya and Kumar; Yin et al. |
All four read as the same task to you. Each is a different input to the model, and the model answers that input, whatever you meant it to say.
Reword the format, keep the meaning, lose the score
Melanie Sclar and colleagues built a tool called FormatSpread that takes one task and re-renders it across choices any of us would call cosmetic: Q: versus Question:, a space or a newline after the label, (A) versus A), the separator between fields. Same task, same examples, same answer key. On LLaMA-2-13B they measured a spread of up to 76 accuracy points between the best and worst of those formats [1]. Call that the spread: the best pass rate across your variants minus the worst.
Seventy-six is the ceiling, and it comes off a small, older model, so discount it if you like. The median is the part that should worry you. Across their models and few-shot counts the median spread was 7.5 accuracy points [1], and even GPT-3.5 carried a median spread of 6.4 points across 320 formats [1]. A 7-point swing from nothing but punctuation is wider than most of the model-to-model gaps teams argue about in review. And it did not wash out with scale: the paper reports the sensitivity holding as they increased model size, added few-shot examples, and applied instruction tuning [1].
The formatting you picked by reflex is a hyperparameter you never tuned, and it can be worth more points than the model choice you spent a week on.
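To see how quickly cosmetic choices multiply, here is a minimal sketch that renders one multiple-choice item across a small grid of formats. It is not the FormatSpread tool: the function name `render_formats`, the example question, and the particular grid of labels, separators, and option styles are all assumptions picked for illustration.

```python
from itertools import product

# Hypothetical grid of cosmetic choices; none of them changes what is being asked.
QUESTION_LABELS = ["Q:", "Question:"]
LABEL_SEPARATORS = [" ", "\n"]
OPTION_STYLES = ["({id})", "{id}.", "{id})"]


def render_formats(question: str, options: list[str]) -> list[str]:
    """Render one multiple-choice item in every combination of the grid above."""
    ids = "ABCDEFGH"[: len(options)]
    prompts = []
    for label, sep, style in product(QUESTION_LABELS, LABEL_SEPARATORS, OPTION_STYLES):
        lines = [f"{label}{sep}{question}"]
        for opt_id, text in zip(ids, options):
            lines.append(f"{style.format(id=opt_id)} {text}")
        lines.append("Answer:")
        prompts.append("\n".join(lines))
    return prompts


prompts = render_formats("Which planet is closest to the sun?", ["Venus", "Mercury", "Mars"])
print(len(prompts))  # 2 labels x 2 separators x 3 option styles = 12
print(prompts[0])
```

Three tiny knobs already give twelve prompts that a human reads as identical. Each one is a separate input you could score.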
Reorder the same examples and the model can slide from near-SOTA to chance
Formatting is the cosmetic layer. Ordering is worse, because you almost certainly froze an order by accident. Yao Lu and colleagues found that permuting the same few-shot examples, identical content and only the sequence changed, was enough to move a model between near state-of-the-art and near-random-guess performance across eleven text-classification tasks [2]. That is the gap between shipping and a coin flip, from reordering a list you assumed was inert.
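Producing those orderings is cheap. The sketch below samples a few distinct permutations of one fixed example set; the helper name `fewshot_orderings` and the toy sentiment examples are assumptions, and it enumerates every permutation first, which is only sensible for a handful of examples.

```python
import random
from itertools import permutations


def fewshot_orderings(examples, k, seed=0):
    """Return up to k distinct orderings of the same few-shot examples."""
    all_orders = list(permutations(examples))  # n! orderings; fine for small n only
    rng = random.Random(seed)                  # fixed seed keeps the sweep reproducible
    rng.shuffle(all_orders)
    return all_orders[:k]


# Toy examples, invented for illustration.
examples = [
    ("great film", "positive"),
    ("dull plot", "negative"),
    ("loved it", "positive"),
    ("a waste of time", "negative"),
]
orders = fewshot_orderings(examples, k=6)
for order in orders:
    print([text for text, _ in order])
```

Every ordering carries exactly the same content, so any difference in score between them is attributable to sequence alone.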
Multiple choice has its own version of this. Language models are not robust option selectors: across 20 models and three benchmarks they carry a selection bias, a standing preference to answer with a particular option ID like A regardless of what A actually says [3]. Move the correct answer into a different slot and the score moves with it, because the model’s answer is partly a vote for the position the option sits in.
If your prompt has a fixed few-shot order or a fixed option layout, that order is part of your model. You just never measured what it was contributing.
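One way to measure what a fixed option layout contributes is to score the same items once per answer slot. This is a sketch under assumptions: `answer_fn` stands in for whatever calls your model and returns the index it picked, and the `always_a` stub is a deliberately extreme caricature of selection bias, not a claim about any real model.

```python
def place_answer(correct: str, distractors: list[str], slot: int) -> list[str]:
    """Return the option list with the correct answer sitting in `slot`."""
    options = list(distractors)
    options.insert(slot, correct)
    return options


def accuracy_by_slot(items, answer_fn):
    """Score the same items once per slot the correct answer can occupy.

    `items` is a list of (question, correct, distractors) with equal-length
    distractor lists; `answer_fn(question, options)` returns the chosen index.
    """
    n_slots = len(items[0][2]) + 1
    result = {}
    for slot in range(n_slots):
        hits = 0
        for question, correct, distractors in items:
            options = place_answer(correct, distractors, slot)
            hits += answer_fn(question, options) == slot
        result["ABCDEFGH"[slot]] = hits / len(items)
    return result


def always_a(question, options):
    """Stand-in for a model with total selection bias: it always answers A."""
    return 0


items = [
    ("2 + 2 = ?", "4", ["3", "5", "22"]),
    ("Capital of France?", "Paris", ["Rome", "Oslo", "Bern"]),
]
print(accuracy_by_slot(items, always_a))
# {'A': 1.0, 'B': 0.0, 'C': 0.0, 'D': 0.0}
```

A model with no position preference would score about the same in every slot. The gap between the best and worst slot is the part of your accuracy that the layout was supplying.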
Tone moves the score, and which way it moves flips by model and language
So write the polite, well-structured version and be done, you say. Tone moves the number too, and nobody can tell you in advance which direction. One small study rewrote 50 questions into five tone levels and ran them through a single frontier model; the rudest phrasing scored highest and the politest lowest, 84.8% against 80.8% across the 50 questions [4]. Read that as a headline and you would start being rude to your model. Do not: a cross-lingual study found impolite prompts often hurt, with the best politeness level shifting by language [5].
Hold those two results next to each other. Tone clearly moves the score. The sign of the move is not stable across models, studies, or languages. The effect ignores your intuitions, so you cannot reason your way to the right phrasing from your chair. The wording that helps one model on one task can hurt the next, and the only instrument that reads the effect is your own eval.
Your eval samples one phrasing, so it reports one draw
Here is where prompt sensitivity turns into a measurement bug. Every result above says the same structural thing: for a fixed task and model, pass rate is a distribution over phrasings that only reads as a scalar. Your eval picks one phrasing, runs it, and hands you a number. That number is one draw from the distribution, and by construction the eval cannot show you the width of what it sampled from.
You already treat run-to-run variance this way. You know a single run is not a reliability estimate, which is why you run the task enough times to get an interval instead of trusting one pass. Phrasing variance is the same problem one level up, and most evals ignore it entirely. They pin the prompt precisely so the eval is reproducible, and in doing so they measure a single point with false confidence. Reproducibility across runs of one phrasing is not the same property as robustness across phrasings, and a green eval quietly conflates them. The pass rate you report is a point estimate whose confidence interval you never computed, because the variance you left out lives between the phrasings you did not run.
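The two sources of variance can be written down separately. As a sketch, assume runs are independent pass/fail trials given the phrasing, let $p(\varphi)$ be the true pass rate of phrasing $\varphi$, and let $\hat{p}$ be the rate observed over $n$ runs of a phrasing drawn from the meaning-preserving rewordings (the symbols are introduced here for illustration). The law of total variance then gives:

$$
\operatorname{Var}(\hat{p})
= \underbrace{\mathbb{E}_{\varphi}\!\left[\frac{p(\varphi)\,\bigl(1 - p(\varphi)\bigr)}{n}\right]}_{\text{run-to-run}}
+ \underbrace{\operatorname{Var}_{\varphi}\bigl(p(\varphi)\bigr)}_{\text{between phrasings}}
$$

More runs shrink only the first term, since it scales as $1/n$. The second term does not depend on $n$ at all, so an eval that pins one phrasing cannot see it however many times it runs.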
The eval did its job. It answered a question about one string.
Pin the prompt, sweep the variants, measure the spread
The fix is to measure the spread and drive it down until the prompt is specified tightly enough that phrasing stops mattering. Three moves.
Pin it. Treat the exact prompt string as a versioned artifact, byte for byte, checked in next to the code. The string itself is the artifact, down to the punctuation. If you cannot diff the prompt that produced last week’s eval against this week’s, you are running a fresh experiment each time and comparing the results as if they were the same one.
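A checked-in string plus a fingerprint makes drift loud. The sketch below is one way to do it, not a prescribed mechanism: the names `prompt_fingerprint`, `assert_prompt_unchanged`, and the example template are assumptions, and in practice the expected hash would be a literal recorded when the prompt was pinned.

```python
import hashlib


def prompt_fingerprint(prompt: str) -> str:
    """SHA-256 of the exact UTF-8 bytes, so any whitespace or punctuation change shows up."""
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()


# Hypothetical pinned template; the real one lives in version control next to the code.
PINNED_PROMPT = "Question: {question}\n(A) {a}\n(B) {b}\nAnswer:"
# Computed here so the sketch runs; record it as a literal at pin time in real use.
PINNED_SHA256 = prompt_fingerprint(PINNED_PROMPT)


def assert_prompt_unchanged(prompt: str, expected_sha256: str) -> None:
    """Fail the eval run if the prompt no longer matches the pinned version."""
    actual = prompt_fingerprint(prompt)
    if actual != expected_sha256:
        raise RuntimeError(
            f"prompt drifted: expected {expected_sha256[:12]}, got {actual[:12]}"
        )


assert_prompt_unchanged(PINNED_PROMPT, PINNED_SHA256)  # passes silently
```

Storing the fingerprint with each eval result also tells you, after the fact, which string a given number was measured on.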
Sweep it. Generate a handful of trivial, meaning-preserving variants of the same task and score each one the way you score the original.
```python
# Pin the exact string, then vary only the wording of the SAME task.
variants = [
    base,                    # the prompt you shipped
    reorder_options(base),   # same options, different slots
    reorder_fewshot(base),   # same examples, different order
    to_zero_shot(base),      # drop the examples entirely
    add_politeness(base),    # a "please" and a "thank you"
]

# Each rate is itself a distribution over runs, so give each one an
# interval, not a single pass/fail. See how-many-runs-for-a-reliable-eval.
rates = [pass_rate(v, runs=50) for v in variants]
spread = max(rates) - min(rates)  # wide spread == underspecified prompt
```
Measure it. Read the spread as a first-class reliability signal, standing level with the pass rate. A tight spread means the prompt is specified enough that wording no longer swings the outcome, and your single number is trustworthy. A wide spread means the opposite, and the fix is to constrain the prompt until it closes: pin the option order, fix the output format, add the structure and explicit instructions that leave the model less room to answer a slightly different question each time.
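For a version of the sweep you can actually execute, here is a self-contained sketch that scores each variant and attaches a Wilson score interval to its pass rate. Everything model-facing is an assumption: `run_once` stands in for your own call-and-grade function, `make_fake_model` is a random stub, and its 0.9 and 0.6 pass probabilities are invented purely to exercise the code.

```python
import math
import random


def wilson_interval(passes: int, runs: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a pass rate (z=1.96 is roughly 95%)."""
    if runs <= 0:
        raise ValueError("need at least one run")
    p = passes / runs
    denom = 1 + z * z / runs
    centre = (p + z * z / (2 * runs)) / denom
    half = z * math.sqrt(p * (1 - p) / runs + z * z / (4 * runs * runs)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


def sweep(run_once, variants: dict[str, str], runs: int = 50):
    """Score every variant; return {name: (rate, low, high)} and best-minus-worst rate."""
    results = {}
    for name, prompt in variants.items():
        passes = sum(bool(run_once(prompt)) for _ in range(runs))
        low, high = wilson_interval(passes, runs)
        results[name] = (passes / runs, low, high)
    rates = [rate for rate, _, _ in results.values()]
    return results, max(rates) - min(rates)


def make_fake_model(seed: int = 0):
    """Random stub: passes more often when the prompt starts with 'Question:'."""
    rng = random.Random(seed)
    return lambda prompt: rng.random() < (0.9 if prompt.startswith("Question:") else 0.6)


variants = {
    "base": "Question: Is 17 prime?\nAnswer:",
    "short_label": "Q: Is 17 prime?\nAnswer:",
    "polite": "Please answer. Q: Is 17 prime?\nAnswer:",
}
results, spread = sweep(make_fake_model(), variants, runs=200)
for name, (rate, low, high) in results.items():
    print(f"{name:12s} {rate:.2f}  [{low:.2f}, {high:.2f}]")
print(f"spread = {spread:.2f}")
```

When the intervals of two variants do not overlap, run-to-run noise is an unlikely explanation for the gap between them, and the phrasing is the more plausible cause. When they do overlap, run more before concluding anything about the spread.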
That is the difference between watching your agent and testing it, which is the whole argument of "observability is not evals": a production log shows you what one phrasing did once, the sweep shows you how much the phrasing itself was carrying.
Ship the spread next to the pass rate. A pass rate with no phrasing spread is a benchmark you already know how to game, and you gamed it by accident the moment you wrote exactly one prompt.
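What shipping the two numbers together might look like, as a sketch: a small record that carries the prompt fingerprint, the pass rate, and the spread, plus a check against a budget. The field names, the `spread_within_budget` helper, the budget, and the values in the example are all placeholders rather than measurements or recommendations.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class EvalReport:
    prompt_sha256: str      # fingerprint of the pinned prompt string
    pass_rate: float        # pass rate of the shipped phrasing
    phrasing_spread: float  # best minus worst pass rate across the variants
    n_variants: int
    runs_per_variant: int

    def to_json(self) -> str:
        return json.dumps(asdict(self), sort_keys=True)


def spread_within_budget(report: EvalReport, max_spread: float) -> bool:
    """True when the phrasing spread is at or under the budget you chose."""
    return report.phrasing_spread <= max_spread


# Placeholder values, not results from any real eval.
report = EvalReport(
    prompt_sha256="0" * 64,
    pass_rate=0.90,
    phrasing_spread=0.12,
    n_variants=5,
    runs_per_variant=50,
)
print(report.to_json())
print(spread_within_budget(report, max_spread=0.05))  # False: 0.12 exceeds 0.05
```

Keeping the spread in the same record as the pass rate means nobody can quote one without the other sitting next to it.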
Every pass rate inherits the spread of the prompt behind it
The lazy prompter puts in the effort. The laziness is in one assumption: that the wording they happened to type is the task, when it is one sample from a spread the model can feel and they cannot. Closing that spread sits upstream of every reliability number you report, which is why it belongs in the same discipline as "how to measure agent reliability past a single pass rate" and "agent reliability testing". Re-run the sweep every time you swap models, because a good phrasing for one does not carry to the next [2]. Measure the prompt first. Everything downstream is inheriting its variance.
Footnotes
1. Sclar, Choi, Tsvetkov, and Suhr, “Quantifying Language Models’ Sensitivity to Spurious Features in Prompt Design.” Meaning-preserving formatting changes produced accuracy differences of up to 76 points on LLaMA-2-13B; a median spread of 7.5 accuracy points across models and few-shot counts; a median spread of 6.4 points across 320 formats for GPT-3.5; sensitivity persisting across model size, few-shot count, and instruction tuning; 53 tasks from Super-NaturalInstructions: https://arxiv.org/abs/2310.11324
2. Lu, Bartolo, Moore, Riedel, and Stenetorp, “Fantastically Ordered Prompts and Where to Find Them: Overcoming Few-Shot Prompt Order Sensitivity,” ACL 2022. Permuting the same few-shot examples can be the difference between near state-of-the-art and near-random-guess performance across eleven text-classification tasks for GPT-family models, and a good permutation for one model does not transfer to another: https://aclanthology.org/2022.acl-long.556/
3. Zheng, Zhou, Meng, Zhou, and Huang, “Large Language Models Are Not Robust Multiple Choice Selectors,” ICLR 2024. Across 20 models and three benchmarks, models show a selection bias, a prior preference for particular option IDs (such as A), that makes accuracy vulnerable to reordering the options: https://arxiv.org/abs/2309.03882
4. Dobariya and Kumar, “Mind Your Tone: Investigating How Prompt Politeness Affects LLM Accuracy.” A small study: 50 base questions rewritten into five tone levels (250 prompts) on a single model (GPT-4o), where the “very rude” phrasing scored 84.8% and the “very polite” phrasing 80.8%: https://arxiv.org/abs/2510.04950
5. Yin, Wang, Horio, Kawahara, and Sekine, “Should We Respect LLMs? A Cross-Lingual Study on the Influence of Prompt Politeness on LLM Performance,” 2024. Impolite prompts often degrade performance, and the best-performing politeness level differs by language, so the direction of the tone effect is not stable: https://arxiv.org/abs/2402.14531