AI for Builders
Practical, measured guidance for engineers building agentic systems: instrument the failures, size the risk, and ship reliability you can actually verify.
- A refused Claude call returns HTTP 200, empty content. Dashboards log success.
When Claude Fable 5 refuses, it returns HTTP 200 with an empty content array, so error dashboards read it as a success. With fallback enabled, the request can also be swapped to Opus 4.8 without any visible signal. Instrument now.
- The MCP changelog told you what's deprecated. It stayed silent on what breaks.
Roots, Sampling, and Logging are only annotation-deprecated; SEP-2567 deletes Mcp-Session-Id and sticky routing on July 28 with no grace period. Audit the session layer first.
- The prompt wording is a hyperparameter you never swept.
Rewording the same task (format, option order, even a 'please') swings a model's pass rate. A one-phrasing eval samples one point from a spread you never measured. Pin the prompt and measure it.
- You picked the model at the top. The harness picked it for you.
BenchJack, a Berkeley auditing tool, found 219 flaws in ten popular agent benchmarks and gamed nine to near-perfect scores. Why a benchmark number is a claim about the harness, and how to check it.
- You reach up because the task looks hard. Only the invoice changes.
On short work you can check, our routing eval found no Opus-to-Fable capability separation at any effort. Buy down, not up: Fable 5's 2x buys a refusal tax and a fallback to the cheaper model.
- Your dashboard stays green while your users still get wrong answers.
LangChain's State of Agent Engineering survey of 1,340 practitioners: 89% run observability, only 52.4% run offline evals. An offline eval is what scores whether the answer was right.
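The first item above describes a refusal that arrives as HTTP 200 with an empty content array. A minimal sketch of how a client-side check might separate that case from a real success, assuming the response body has already been parsed into a dict with a `content` list and a `model` string; the function names and labels here are hypothetical, not part of any SDK:

```python
from typing import Any, Mapping


def classify_response(status_code: int, body: Mapping[str, Any]) -> str:
    """Label a completed call as 'error', 'empty_success', or 'ok'.

    Assumption: `body` is the parsed JSON response and carries a
    `content` list. An HTTP 200 with no content is reported separately
    so a dashboard does not count it as a plain success.
    """
    if status_code != 200:
        return "error"
    if not body.get("content"):
        # HTTP 200 but nothing to show the user: track this on its own.
        return "empty_success"
    return "ok"


def served_by_other_model(requested_model: str, body: Mapping[str, Any]) -> bool:
    """Return True when the response names a model other than the one requested.

    Assumption: the response body reports the serving model under `model`.
    """
    served = body.get("model")
    return served is not None and served != requested_model
```

Emitting `empty_success` and the model mismatch as their own metrics is one way to make both events visible; the exact field names should be checked against the API you actually call.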
Reliability how-tos from the research library
Deep, cited method pieces that live in the research lane: the measurement backbone behind everything here.
- How to test agent reliability beyond a single eval run
A single-run eval samples agent reliability once. Rigorous testing measures it across many runs, using confidence intervals, statistical power, pass^k, and fault injection to trace how failures cascade.
- How to evaluate a RAG pipeline beyond a single score
Evaluate a RAG pipeline by scoring retrieval and generation separately, putting a confidence interval on every metric, and attributing each failure to the stage that produced it.
- How many runs a reliable eval needs to catch a regression
How many runs a reliable eval needs is a power calculation set by the regression you must catch, your target power, and the baseline pass rate. Includes a runs-needed table and the formula behind it.
- Is your eval difference statistically significant?
Two eval runs a few points apart. Separate a real gain from run-to-run noise with a paired McNemar test on the same items: a p-value and a confidence interval on the pass-rate delta.
- Bias-correct your LLM-as-a-judge eval before reporting it
An LLM judge is an imperfect classifier, so its raw pass rate is biased. Correct it with the judge's sensitivity and specificity, then report a calibration-aware confidence interval.
- Multi-agent orchestration patterns and the failures they amplify
The five multi-agent orchestration patterns (supervisor, sequential-pipeline, swarm, debate, and blackboard), mapped to how errors cascade in each and the failure modes each one amplifies.
- Make AI agents reliable in production: a budget playbook
Reliable AI agents in production start with the compounding math: 0.95 per step over 20 steps is ~36% end to end. A topology-aware playbook of levers: retry, fallback, checkpoint, human-gate, degrade.
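The compounding math in the last item can be checked in a few lines. This is a sketch that assumes every step succeeds independently with the same probability, which real pipelines only approximate; the helper names are illustrative:

```python
def end_to_end_success(per_step: float, steps: int) -> float:
    """Probability that all `steps` succeed, assuming independent, identical steps."""
    if not 0.0 <= per_step <= 1.0:
        raise ValueError("per_step must be a probability in [0, 1]")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return per_step ** steps


def required_per_step(target: float, steps: int) -> float:
    """Per-step success rate needed to reach `target` end to end (same assumption)."""
    if not 0.0 < target <= 1.0:
        raise ValueError("target must be in (0, 1]")
    if steps <= 0:
        raise ValueError("steps must be positive")
    return target ** (1.0 / steps)


if __name__ == "__main__":
    # 0.95 per step over 20 steps: roughly 0.358, i.e. about 36% end to end.
    print(f"{end_to_end_success(0.95, 20):.3f}")
    # Per-step rate needed for 90% end to end over the same 20 steps.
    print(f"{required_per_step(0.90, 20):.4f}")
```

Inverting the formula shows why per-step gains matter: hitting 90% end to end over 20 steps needs each step to succeed well above 99% of the time.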