Research
Long-form research on AI-agent reliability: how multi-agent systems fail, how far faults propagate, and whether the evals meant to certify them hold up.
Showing all 20 pieces
- AI agent evaluation that follows the whole trajectory
AI agent evaluation breaks when it scores the final answer and skips the path. Evaluate the trajectory, catch early-step corruption, and report pass rates with intervals.
- AI agent reliability, from consistency to containment
AI agent reliability is a discipline of five properties: consistency, robustness, predictability, safety, and error propagation, with a map of where each is measured.
- Bias-correct your LLM-as-a-judge eval before reporting it
An LLM judge is an imperfect classifier, so its raw pass rate is biased. Correct it with the judge's sensitivity and specificity, then report a calibration-aware confidence interval.
- Do microservices resilience patterns port to AI agents?
Five proven microservices resilience patterns, from circuit breaker to timeout budget, mapped to their AI-agent equivalents, with a judgment on how far each analogy actually holds.
- Error propagation and cascade containment in agent systems
OWASP's ASI08 files cascading failures under security. Reframed as error-propagation engineering, one fault becomes two measurable quantities, blast radius and containment rate, mapped to topologies.
- How many runs a reliable eval needs to catch a regression
How many runs a reliable eval needs is a power calculation set by the regression you must catch, your target power, and the baseline pass rate. Includes a runs-needed table and the formula behind it.
- How to evaluate a RAG pipeline beyond a single score
Evaluate a RAG pipeline by scoring retrieval and generation separately, putting a confidence interval on every metric, and attributing each failure to the stage that produced it.
- How to measure agent reliability past a single pass rate
How to measure agent reliability with metrics that capture consistency, not just capability: pass@k versus pass^k, a reliability@k suite aggregate, and a confidence interval on every rate.
- How to test agent reliability beyond a single eval run
A single-run eval samples agent reliability once. Rigorous testing measures it across many runs with confidence intervals, statistical power, pass^k, and fault injection for cascade propagation.
- Is your eval difference statistically significant?
Two eval runs a few points apart. Separate a real gain from run-to-run noise with a paired McNemar test on the same items: a p-value and a confidence interval on the pass-rate delta.
- Is your LLM-as-a-judge reliable? Test the evaluator
An LLM-as-a-judge is a fallible evaluator. Its reliability breaks along three axes: agreement, calibration, and bias, each with a test and a correction. This hub routes to all three.
- LLM evals: which methods to trust and where they lie
LLM evals report whether a model passed. Whether that score is valid is a separate question. This hub maps the eval methods and the four ways an eval number lies, each routed to its fix.
- LLM-as-a-judge bias, and the tests that catch it
LLM-as-a-judge bias is systematic, measurable distortion in an evaluator. A per-bias map pairs each bias with a detection test and a correction, so you can tell when a judge's ranking would flip.
- Make AI agents reliable in production: a budget playbook
Reliable AI agents in production start with the compounding math: a 0.95 success rate per step over 20 steps is ~36% end to end. A topology-aware playbook of levers: retry, fallback, checkpoint, human-gate, degrade.
- Multi-agent orchestration patterns and the failures they amplify
The five multi-agent orchestration patterns (supervisor, sequential-pipeline, swarm, debate, and blackboard), mapped to how errors cascade in each and the failure modes each one amplifies.
- Multi-agent systems, defined by how they fail
A multi-agent system is defined by its failure surface: agent, orchestration, coordination, shared state, and topology, each defined through the failure it enables, then routed to the research.
- What LLM evals are, and what each type can certify
LLM evals cover four instruments: offline benchmark, LLM-as-a-judge, human, and online, each answering a different question, plus the benchmark-vs-product line and the rigor behind a trustworthy score.
- When RAG retrieves wrong chunks: failure modes and containment
RAG pipeline failure modes are containment failures at the retrieval-to-generation boundary: five modes mapped to how each propagates, its detection signal, and the gate that holds it.
- Where RAGAS wins at RAG evaluation, and where it stops
A positioning read on RAGAS for RAG evaluation: its real strengths, a limitation its own authors flag, and the rigor gaps around confidence intervals, significance, and propagation-aware attribution.
- Why multi-agent LLM systems fail, and how to contain it
Why multi-agent LLM systems fail, grounded in the MAST failure taxonomy and mapped to how each failure propagates across agent topologies and the containment levers that bound the blast radius.