How to evaluate a RAG pipeline beyond a single score

Evaluate a RAG pipeline by scoring retrieval and generation separately, putting a confidence interval on every metric, and attributing each failure to the stage that produced it.

By LatentEval Published 2026-04-01 Updated 2026-05-25

A RAG pipeline hands back a wrong answer, and your evaluation prints one number: 0.71, a green check, a label that reads “good.” That number cannot settle the only question you have to answer before you change any code: which stage broke? Did retrieval fail to fetch the passage that held the fact, or did the generator hold the right passage and write a claim the passage never made? Those are two different bugs with two different owners, and a blended score averages them into a value nobody can act on.

Evaluating a RAG pipeline well starts by refusing that average. You score retrieval and generation as separate systems, you put a confidence interval on each metric so a rerun cannot quietly move your ranking, and you read every failure back to the stage that produced it. The method below runs those three in order.

Grade retrieval and generation as two systems

A RAG system is two machines in series: a retrieval module that selects context, and an LLM that generates an answer from it. The peer-reviewed RAGAS framework (Es et al., EACL 2024 System Demonstrations) scores those machines on separate axes, evaluating “the ability of the retrieval system to identify relevant and focused context passages, the ability of the LLM to exploit such passages faithfully, and the quality of the generation itself.” Keep the axes apart and a single failing number turns into a diagnosis.

Metric	Stage it grades	The question it answers	A low score means	Where the fix lives
Context recall	Retrieval	Did the retrieved set contain the passage that holds the answer?	The evidence never reached the generator, so no prompt change can recover it	Chunking, embeddings, top-k recall
Context precision	Retrieval	Of the passages retrieved, are the relevant ones present and ranked first?	Distractors crowd the window and pull the answer off target	Reranking, a per-passage relevance filter
Faithfulness	Generation	Can every claim in the answer be inferred from the retrieved context?	The generator asserts beyond its evidence	Grounding constraints, an entailment check, an abstain path
Answer relevance	Generation	Does the answer directly and completely address the question?	The answer is off target, incomplete, or padded	Prompt, decoding, answer-shaping

RAGAS defines faithfulness precisely: an answer is faithful when “the claims that are made in the answer can be inferred from the context.” Answer relevance, in the same framework, asks whether the answer directly addresses the question, setting factual correctness aside and penalizing answers that are incomplete or padded with redundant text. On the retrieval side, its context-relevance idea is that the retrieved context should hold only what the question needs. Teams operationalize the retrieval axis as two familiar quantities: recall, whether the answer-bearing passage was retrieved at all, and precision, whether the retrieved set is free of distractors.
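The two retrieval quantities can be sketched in a few lines. This is a minimal set-based version, assuming you have gold labels for which passage IDs hold the answer; the function names and the example IDs are hypothetical, and the precision here ignores rank order, so it is not the RAGAS implementation.

```python
from typing import Sequence, Set


def context_recall(retrieved: Sequence[str], gold: Set[str]) -> float:
    """Fraction of answer-bearing passages that appear in the retrieved set."""
    if not gold:
        raise ValueError("gold set must not be empty")
    return len(gold.intersection(retrieved)) / len(gold)


def context_precision(retrieved: Sequence[str], gold: Set[str]) -> float:
    """Fraction of retrieved passages that are answer-bearing (1.0 means no distractors)."""
    if not retrieved:
        return 0.0
    return sum(1 for passage_id in retrieved if passage_id in gold) / len(retrieved)


# Hypothetical example: four passages retrieved, two passages hold the answer.
retrieved_ids = ["p7", "p2", "p9", "p4"]
gold_ids = {"p2", "p5"}

recall = context_recall(retrieved_ids, gold_ids)        # one of two gold passages found
precision = context_precision(retrieved_ids, gold_ids)  # one of four retrieved is relevant
print(f"recall={recall:.2f} precision={precision:.2f}")
```

A rank-aware precision would weight relevant passages by position; the set-based form is kept here because it maps directly onto "free of distractors."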

The split matters because the two axes fail independently and can mask each other. An answer can score high on faithfulness while being flatly wrong: faithfulness measures fidelity to the retrieved context, and when that context missed the fact, a faithful reading of bad evidence still lands on a false answer. Read generation metrics alone and that failure is invisible.

Put an interval on every metric

Every RAG metric is an average over a finite test set, which makes it an estimate carrying sampling error, not a property of the pipeline. Each per-question judgment (faithful or not, relevant or not) is a binary outcome, so the metric is a proportion whose uncertainty follows the binomial, and the width of its interval is set by how many questions you scored. Report the point value with its confidence interval, or you are ranking on sampling noise.

Use an interval that behaves well. The standard textbook (Wald) interval is a poor choice here: a study of binomial intervals found that “the chaotic coverage properties of the Wald interval are far more persistent than is appreciated” and recommended alternatives, the Wilson score interval among them (Brown, Cai and DasGupta, Statistical Science, 2001). The Wilson interval is what the numbers below use.

Test-set size	Answers scored faithful	Point estimate	~95% Wilson interval	What it lets you claim
20	18	0.90	0.70 to 0.97	Almost nothing; the band spans a quarter of the scale
100	90	0.90	0.83 to 0.94	A rough level, too coarse for a fine ranking
400	360	0.90	0.87 to 0.93	A band tight enough to compare two builds

Those intervals are computed from stated assumptions (Wilson score interval at a nominal 95% level, faithfulness point estimate 0.90), not measured. A faithfulness score of 0.90 on twenty questions could sit anywhere from roughly 0.70 to 0.97, which certifies nothing you would stake a release on. The same 0.90 on four hundred questions tightens to about 0.87 to 0.93.
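The table's intervals can be reproduced with a short sketch of the Wilson score interval. It assumes z = 1.96 for a nominal 95% level; the function name is illustrative and the counts are the stated assumptions from the table, not measurements.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


# Same point estimate (0.90) at three test-set sizes.
for passed, total in [(18, 20), (90, 100), (360, 400)]:
    low, high = wilson_interval(passed, total)
    print(f"n={total}: {passed / total:.2f} [{low:.2f}, {high:.2f}]")
# n=20: 0.90 [0.70, 0.97]
# n=100: 0.90 [0.83, 0.94]
# n=400: 0.90 [0.87, 0.93]
```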

Comparing two builds extends the same discipline, and it calls for a significance test, not a check on whether two intervals overlap. Reading a ranking off overlapping intervals is the textbook mistake: two confidence intervals can overlap while the difference between them is real. Two pipeline versions scoring 0.90 and 0.93 on the same hundred-question set were graded on identical items, and a shared item set calls for a paired significance test rather than a comparison of two separate intervals. The sensitive instrument is McNemar’s test on the discordant items, the questions one build passed and the other failed. It cancels the item difficulty common to both runs, and on shared items it can turn an unpaired “not significant” into a resolved difference. Run it on this delta, though, and a net swing of three questions out of a hundred rests on too few discordant pairs to separate from a coin toss, so the 3-point gain is not distinguishable from zero at this sample size. This is why reporting an evaluation with its intervals, and confirming that a score survives a rerun before you trust it, is the difference between a ranking and a guess.
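A minimal sketch of the exact (binomial) form of McNemar's test shows why a net swing of three does not resolve. The discordant counts below are hypothetical: one split consistent with 0.90 versus 0.93 on a hundred shared questions, where the new build fixed five questions and broke two. The function name is illustrative.

```python
from math import comb


def mcnemar_exact_p(only_a_passed: int, only_b_passed: int) -> float:
    """Two-sided exact McNemar p-value computed from the discordant counts only."""
    n = only_a_passed + only_b_passed
    if n == 0:
        return 1.0  # no discordant items, so no evidence of a difference
    k = min(only_a_passed, only_b_passed)
    # Under the null, each discordant item favors either build with probability 0.5.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical split: old build alone passed 2 items, new build alone passed 5.
p_value = mcnemar_exact_p(only_a_passed=2, only_b_passed=5)
print(f"p = {p_value:.3f}")  # p = 0.453
```

Even the most favorable split for a net swing of three, three discordant items all going one way, gives p = 0.25 under this test, so the conclusion does not depend on the particular counts assumed here.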

Faithfulness is a judged number, so judge it too

There is a second source of uncertainty the sample-size interval does not capture. RAGAS computes faithfulness and answer relevance with an LLM, evaluating those dimensions “without relying on ground truth human annotations.” That makes each score an LLM-as-judge output, so it inherits the judge’s documented biases and its run-to-run variance on top of the sampling error already in the average.

A faithfulness number is only as trustworthy as the judge model’s agreement with human raters on the same traces, which is a construct-validity question you answer before you let the metric rank anything. The practical move is to put an interval on the judge as well: rescore a sample twice, treat the spread between runs as part of your error budget, and report the metric bias-corrected where you can. A model-scored number without that discipline is a point estimate wearing a decimal it did not earn.
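Both checks can be sketched on a small labeled sample. Cohen's kappa is used here as one common chance-corrected agreement statistic; that choice is an assumption, since the text does not name one. The labels are hypothetical, and a real check would use far more than eight traces.

```python
from typing import Sequence


def agreement_rate(a: Sequence[bool], b: Sequence[bool]) -> float:
    """Share of items on which two label sequences agree."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohen_kappa(a: Sequence[bool], b: Sequence[bool]) -> float:
    """Chance-corrected agreement between two binary raters."""
    observed = agreement_rate(a, b)
    rate_a = sum(a) / len(a)
    rate_b = sum(b) / len(b)
    expected = rate_a * rate_b + (1 - rate_a) * (1 - rate_b)
    if expected == 1.0:
        return 1.0  # both raters gave one constant, identical label
    return (observed - expected) / (1 - expected)


# Hypothetical faithfulness labels on the same eight traces.
human      = [True, True,  False, True, False, True, True, False]
judge_run1 = [True, True,  False, True, True,  True, True, False]
judge_run2 = [True, False, False, True, True,  True, True, False]

flip_rate = 1 - agreement_rate(judge_run1, judge_run2)  # run-to-run instability
kappa = cohen_kappa(human, judge_run1)                  # agreement with the human rater
print(f"flip rate={flip_rate:.3f} kappa={kappa:.3f}")
```

The flip rate feeds the error budget; the kappa tells you whether the judge is measuring what the human rater measures.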

Attribute the failure to a stage

The payoff of the split is attribution. The pattern of high and low metrics tells you which stage owns a wrong answer, so you tune the right machine instead of the loud one. The table reads a metric signature back to a stage, which is the operational form of stage-level failure attribution for a two-stage pipeline.

Metric signature	What actually happened	Stage at fault	What it propagates into
Recall low, faithfulness high	The answer is faithful to context that missed the fact	Retrieval (recall)	A well-supported answer that is wrong
Recall and precision adequate, faithfulness low	The generator asserted claims the context does not entail	Generation	A fabricated claim with a citation attached
Precision low, faithfulness high, answer relevance low	Distractors entered the window and the generator anchored on them	Retrieval (precision)	An on-domain answer to the wrong sub-question
Retrieval metrics adequate, answer relevance low	The generator under-used or ignored the right context	Generation	An incomplete answer despite good evidence
Retrieval low and faithfulness low together	Both stages failed, and the retrieval fault came first	Retrieval, then generation	A compound error whose earliest stage you fix first

Two rules keep the reading honest. Attribute a compound failure to the earliest failing stage first, because a retrieval error becomes the generator’s premise and every step after it inherits the fault. And treat a high faithfulness score sitting on a wrong answer as a retrieval signal, since the generator did its job faithfully over bad input and grading it as a generation success points you at the wrong machine.
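The table and the two rules can be written down as a small decision function. This is a sketch under stated assumptions: the single 0.7 threshold for "low" is hypothetical and would need calibrating per metric, and the class and function names are illustrative.

```python
from dataclasses import dataclass


@dataclass
class StageScores:
    recall: float
    precision: float
    faithfulness: float
    answer_relevance: float


def attribute_failure(scores: StageScores, threshold: float = 0.7) -> str:
    """Map a metric signature to the stage at fault, earliest failing stage first."""
    recall_low = scores.recall < threshold
    precision_low = scores.precision < threshold
    faithfulness_low = scores.faithfulness < threshold
    relevance_low = scores.answer_relevance < threshold

    # Compound failure: retrieval came first, so it is fixed first.
    if (recall_low or precision_low) and faithfulness_low:
        return "retrieval, then generation"
    # High faithfulness over bad evidence is still a retrieval signal.
    if recall_low:
        return "retrieval (recall)"
    if precision_low:
        return "retrieval (precision)"
    # Retrieval was adequate, so any remaining fault belongs to the generator.
    if faithfulness_low or relevance_low:
        return "generation"
    return "no stage flagged"


# A faithful answer built on context that missed the fact.
verdict = attribute_failure(StageScores(0.40, 0.80, 0.95, 0.80))
print(verdict)  # retrieval (recall)
```

Checking retrieval before generation is what encodes the earliest-stage rule: the generator is only blamed once its input has been cleared.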

When retrieval poisons generation

Stage attribution matters because RAG faults propagate forward. A missed or distractor-heavy retrieval does not stay in the retrieval stage; it enters the generator as trusted evidence, and a faithful generation step then dresses it as a sourced, well-formed answer that is wrong. That is the same error-propagation dynamic that turns one bad agent hand-off into a system failure, worked through in our analysis of cascading failures in agent systems. The retrieval-to-generation edge is one hop of that chain, and it is the hop your metrics can actually localize.

Containment is the lever that bounds it. A grounding gate that checks whether the retrieved context entails the answer, and lets the generator abstain when it does not, catches a retrieval fault at the seam before it reaches a user. The fraction of injected retrieval faults stopped at that gate is a containment-rate reading for your pipeline, and like every metric here it deserves an interval, measured by fault injection rather than assumed from a clean run.
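A containment reading can be sketched as a gate plus a proportion with its interval. Everything named here is hypothetical: the `entails` callable stands in for whatever entailment check you use, the abstain message is a placeholder, and the fault-injection log (50 injected faults, 41 stopped) is illustrative data, not a measurement.

```python
import math
from typing import Callable, List, Tuple

ABSTAIN = "I cannot answer from the retrieved context."


def grounding_gate(context: str, answer: str,
                   entails: Callable[[str, str], bool]) -> str:
    """Release the answer only if the entailment check accepts it; otherwise abstain."""
    return answer if entails(context, answer) else ABSTAIN


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


def containment_rate(stopped: List[bool]) -> Tuple[float, float, float]:
    """Fraction of injected faults stopped at the gate, with its Wilson interval."""
    if not stopped:
        raise ValueError("need at least one injected fault")
    caught = sum(stopped)
    low, high = wilson_interval(caught, len(stopped))
    return caught / len(stopped), low, high


# Hypothetical log: True means the gate abstained on an injected retrieval fault.
stopped_at_gate = [True] * 41 + [False] * 9
rate, low, high = containment_rate(stopped_at_gate)
print(f"containment={rate:.2f} [{low:.2f}, {high:.2f}]")
```

In a real run, each entry of the log would come from calling `grounding_gate` on a deliberately corrupted retrieval and recording whether it returned `ABSTAIN`.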

What the calculator will, and will not, claim

This site points toward a reliability profiler, and the RAG view it is designed to include would compute the stage-split scores above, each with its interval, from your own eval runs. That is design intent: no score is measured yet, so the page quotes none, and every external number here traces back to a source you can open.

The next action is small, and it changes what your eval can tell you. Stop reporting one RAG score. Report retrieval and generation separately, attach an interval to each, and attribute a failure to its stage before you tune anything. An evaluation built that way tells a team what to fix; a blended point estimate only tells them that something, somewhere, went wrong. The reliability glossary and the research index hold the measurement vocabulary this method leans on.