Research

Long-form research on AI-agent reliability: how multi-agent systems fail, how far faults propagate, and whether the evals meant to certify them hold up.

Eval rigor 1 Apr 2026 Flagship pillar
AI agent evaluation that follows the whole trajectory
AI agent evaluation breaks when it scores the final answer and skips the path. Evaluate the trajectory, catch early-step corruption, and report pass rates with intervals.
Reliability testing 1 Apr 2026 Topic hub
AI agent reliability, from consistency to containment
AI agent reliability is a discipline of five properties: consistency, robustness, predictability, safety, and error propagation. This hub maps where each one is measured.
Eval rigor 1 Apr 2026
Bias-correct your LLM-as-a-judge eval before reporting it
An LLM judge is an imperfect classifier, so its raw pass rate is biased. Correct it with the judge's sensitivity and specificity, then report a calibration-aware confidence interval.
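As a rough illustration of the correction this summary describes, the sketch below applies the standard Rogan-Gladen estimator. It is an assumption that the linked article uses this exact form. The function name and the example numbers are hypothetical, and the confidence interval is not covered here.

```python
def corrected_pass_rate(observed: float, sensitivity: float, specificity: float) -> float:
    """Rogan-Gladen estimate of the true pass rate from a judge's raw pass rate.

    observed:    fraction of items the judge marked as passing
    sensitivity: P(judge says pass | item truly passes)
    specificity: P(judge says fail | item truly fails)
    """
    denom = sensitivity + specificity - 1.0
    if denom <= 0:
        # A judge at or below chance carries no usable signal to correct with.
        raise ValueError("sensitivity + specificity must exceed 1")
    estimate = (observed + specificity - 1.0) / denom
    # Sampling noise can push the raw estimate outside [0, 1]; clip it back.
    return min(1.0, max(0.0, estimate))


# Illustrative numbers only (not from the article).
print(corrected_pass_rate(0.70, sensitivity=0.90, specificity=0.80))
```

With these made-up inputs the corrected rate lands slightly above the raw 0.70, because this hypothetical judge wrongly fails more true passes than it wrongly passes true failures.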
Propagation & containment 1 Apr 2026
Do microservices resilience patterns port to AI agents?
Five proven microservices resilience patterns, from circuit breaker to timeout budget, mapped to their AI-agent equivalents, with a judgment on how far each analogy actually holds.
Propagation & containment 1 Apr 2026 Flagship pillar
Error propagation and cascade containment in agent systems
OWASP's ASI08 files cascading failures under security. Reframed as error-propagation engineering, one fault becomes two measurable quantities, blast radius and containment rate, mapped to topologies.
Eval rigor 1 Apr 2026
How many runs a reliable eval needs to catch a regression
How many runs a reliable eval needs is a power calculation set by the regression you must catch, your target power, and the baseline pass rate. Includes a runs-needed table and the formula behind it.
Pattern reliability 1 Apr 2026
How to evaluate a RAG pipeline beyond a single score
Evaluate a RAG pipeline by scoring retrieval and generation separately, putting a confidence interval on every metric, and attributing each failure to the stage that produced it.
Reliability testing 1 Apr 2026
How to measure agent reliability past a single pass rate
How to measure agent reliability with metrics that capture consistency, not just capability: pass@k versus pass^k, a reliability@k suite aggregate, and a confidence interval on every rate.
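To make the pass@k versus pass^k contrast concrete, here is a minimal sketch under a simplifying assumption: every attempt succeeds independently with the same probability p. The function names are hypothetical, and estimators computed from a finite sample of runs (as an article on this topic may use) differ from these idealized forms.

```python
def pass_at_k(p: float, k: int) -> float:
    """Probability that at least one of k independent attempts succeeds."""
    return 1.0 - (1.0 - p) ** k


def pass_hat_k(p: float, k: int) -> float:
    """Probability that all k independent attempts succeed (pass^k)."""
    return p ** k


# Illustrative per-attempt success rate; not a figure from the article.
p = 0.8
for k in (1, 3, 5):
    print(k, round(pass_at_k(p, k), 3), round(pass_hat_k(p, k), 3))
```

As k grows, pass@k rises toward 1 while pass^k falls toward 0, which is why the first reads as capability and the second as consistency.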
Reliability testing 1 Apr 2026 Flagship pillar
How to test agent reliability beyond a single eval run
Single-run eval samples agent reliability once. Rigorous testing measures it across many runs with confidence intervals, statistical power, pass^k, and fault injection for cascade propagation.
Eval rigor 1 Apr 2026
Is your eval difference statistically significant?
Two eval runs land a few points apart. Separate a real gain from run-to-run noise with a paired McNemar test on the same items: a p-value and a confidence interval on the pass-rate delta.
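A minimal sketch of the paired test named above, using the exact (binomial) form of McNemar's test on the discordant items only. The function name and the counts are hypothetical, and this computes the p-value only, not the confidence interval on the delta.

```python
from math import comb


def mcnemar_exact_p(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value.

    b: items that passed in run A but failed in run B
    c: items that failed in run A but passed in run B
    Items where both runs agree do not enter the test.
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant items, so no evidence of a difference
    # Under the null hypothesis each discordant item is a fair coin flip.
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


# Illustrative counts only: 10 items regressed, 2 items improved.
print(mcnemar_exact_p(10, 2))
```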
Eval rigor 1 Apr 2026 Topic hub
Is your LLM-as-a-judge reliable? Test the evaluator
An LLM-as-a-judge is a fallible evaluator. Its reliability breaks along three axes (agreement, calibration, and bias), each with a test and a correction. This hub routes to all three.
Eval rigor 1 Apr 2026 Topic hub
LLM evals: which methods to trust and where they lie
LLM evals report whether a model passed. Whether that score is valid is a separate question. This hub maps the eval methods and the four ways an eval number lies, each routed to its fix.
Eval rigor 1 Apr 2026 Flagship pillar
LLM-as-a-judge bias, and the tests that catch it
LLM-as-a-judge bias is systematic, measurable distortion in an evaluator. A per-bias map pairs each bias with a detection test and a correction, so you can tell when a judge's ranking would flip.
Reliability testing 1 Apr 2026
Make AI agents reliable in production: a budget playbook
Reliable AI agents in production start with the compounding math: a 0.95 success rate per step over 20 steps is ~36% end to end. A topology-aware playbook of levers: retry, fallback, checkpoint, human-gate, degrade.
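The compounding figure above can be checked directly. The sketch below assumes steps succeed independently, and the retry variant additionally assumes that failures are detected and that a retry is an independent attempt. Both assumptions and the function names are illustrative and do not come from the playbook.

```python
def end_to_end(per_step: float, steps: int) -> float:
    """Success rate of a chain in which every step must succeed."""
    return per_step ** steps


def with_retries(per_step: float, retries: int) -> float:
    """Per-step success rate when a detected failure is retried up to `retries` times."""
    return 1.0 - (1.0 - per_step) ** (retries + 1)


baseline = end_to_end(0.95, 20)
one_retry = end_to_end(with_retries(0.95, 1), 20)
print(f"no retry:  {baseline:.3f}")
print(f"one retry: {one_retry:.3f}")
```

Under these assumptions a single retry per step moves the end-to-end rate far more than a small per-step improvement would, which is the intuition behind treating retry as a lever.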
Multi-agent failures 1 Apr 2026
Multi-agent orchestration patterns and the failures they amplify
The five multi-agent orchestration patterns (supervisor, sequential-pipeline, swarm, debate, and blackboard), mapped to how errors cascade in each and the failure modes each one amplifies.
Multi-agent failures 1 Apr 2026 Topic hub
Multi-agent systems, defined by how they fail
A multi-agent system is defined by its failure surface: agent, orchestration, coordination, shared state, and topology. Each term is defined through the failure it enables, then routed to the research.
Eval rigor 1 Apr 2026
What LLM evals are, and what each type can certify
LLM evals come in four instruments (offline benchmark, LLM-as-a-judge, human, and online), each answering a different question. Also covered: the benchmark-vs-product line and the rigor behind a trustworthy score.
Pattern reliability 1 Apr 2026 Flagship pillar
When RAG retrieves wrong chunks: failure modes and containment
RAG pipeline failure modes are containment failures at the retrieval-to-generation boundary: five modes mapped to how each propagates, its detection signal, and the gate that holds it.
Pattern reliability 1 Apr 2026
Where RAGAS wins at RAG evaluation, and where it stops
A positioning read on RAGAS for RAG evaluation: its real strengths, a limitation its own authors flag, and the rigor gaps around confidence intervals, significance, and propagation-aware attribution.
Multi-agent failures 1 Apr 2026 Flagship pillar
Why multi-agent LLM systems fail, and how to contain it
Why multi-agent LLM systems fail, grounded in the MAST failure taxonomy and mapped to how each failure propagates across agent topologies and the containment levers that bound the blast radius.