May 20, 2026

The Yardstick Problem


You can’t unit-test a probability distribution. Run the same prompt twice, you might get two different outputs. Run it on next quarter’s model and you definitely will. Somewhere in there, the product is either getting better or getting worse, and you’re supposed to know which.

This is the problem every team shipping LLM-based products has to solve, and most don’t solve it well. We didn’t either, at first.

The ground keeps moving

The naive picture of an LLM is a function: input goes in, output comes out. The picture is wrong in two directions. Same input, same model, two runs: different outputs, by design.¹ Same input, new model version: also different outputs, because each release is in some sense a different function.

The second one is sneakier. When Sonnet 4.5 became Sonnet 4.6, we quietly inherited a new default: more emojis. In a chat assistant, that’s just a bit of personality. In a product whose deliverable is a board-ready financial memo, an emoji on a debt-to-equity line item is unacceptable. The instruction to avoid them was still in the prompt. The model was just less inclined to follow it.

This is the second-order version of the failure mode the Trust post described: the model drifting toward “things it simply knows” rather than things it can cite. Earlier it was prompt-following degrading inside a session. Here it’s prompt-following degrading across releases. Both have the same shape: a behavior that used to be reliable is now slightly less so, and nothing on the surface tells you, until a customer does.

Emojis are the easy case. The hard case is when the model gets the numbers right but takes a shortcut on the reasoning: pads a citation with common knowledge, declines to flag uncertainty, recalls a figure it “semi-remembers” from training rather than retrieving it. The output looks the same. The product is quietly worse. This is also why smarter models aren’t automatically better: the more internalized knowledge they have, the more likely they are to use it. While this is the right trade-off when latency is critical (e.g. when chatting), it really isn’t when it comes to sourcing.²

You can’t grade essays with a unit test

The standard testing playbook breaks down here. Software tests work because the right answer is a single value and the wrong answer is anything else. An LLM-generated report is not a value. It’s a paragraph, and “is this paragraph good” is not a function you can write.

Some of it is, though. Is the revenue number correct. Is it cited. Does the cited page actually contain that figure. Are there other documents in the corpus that should have been cited and weren’t. Those are functions you can write. They’re boring. They’re also the load-bearing ones.
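
To make "functions you can write" concrete, here is a minimal sketch of one such check. It is illustrative, not the production code: the Claim shape, the corpus layout (document id mapped to page number mapped to page text), and the failure strings are all assumptions made for the example.

```python
import re
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Claim:
    """Hypothetical shape of one figure in a generated report."""
    label: str
    value: str              # the figure as it appears in the report
    doc_id: Optional[str]   # cited document, or None if uncited
    page: Optional[int]     # cited page, or None if uncited


def digits_only(text: str) -> str:
    """Drop currency symbols, commas, and spaces: '$1,234.5' -> '1234.5'."""
    return re.sub(r"[^\d.\-]", "", text)


def check_claim(claim: Claim, corpus: Dict[str, Dict[int, str]]) -> List[str]:
    """Return the reasons a claim fails; an empty list means it passes."""
    if claim.doc_id is None or claim.page is None:
        return ["uncited"]
    pages = corpus.get(claim.doc_id)
    if pages is None or claim.page not in pages:
        return ["citation does not resolve"]
    # Collect every number-like token on the cited page and compare normalized forms.
    found = re.findall(r"-?\$?\d[\d,]*(?:\.\d+)?", pages[claim.page])
    if digits_only(claim.value) not in {digits_only(f) for f in found}:
        return ["cited page does not contain the figure"]
    return []
```

A check like this has no opinion about whether the paragraph is good. It only answers whether the figure is cited and whether the cited page contains it.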

The rest: is the explanation well-connected, does the answer actually address the question, did we wander off into speculation the user didn’t ask for, are we emitting jargon nobody outside the SEC cares about³. Those aren’t expressible as code. They require judgment. They look like the kind of thing only a human can grade.

For a while, that’s what we did. It didn’t scale.

Double-entry, applied

The trick of double-entry bookkeeping isn’t that the numbers are right. It’s that wrongness can’t hide. Every transaction is recorded twice, in two independent ledgers, and the two have to agree. If they don’t, you know something is wrong before you know what.

Our eval system is double-entry. Every output runs through two independent paths, and the two have to agree.

The first path is hard checks: deterministic code asking deterministic questions. Does the spreadsheet contain the right numbers, extracted formatting-agnostically and matched against a known-correct fixture using normalization and stable matching, so a layout tweak doesn’t blow up the comparison? Are the load-bearing figures present at all? Do the citations resolve to documents that actually contain the claimed values? Are there documents in the corpus we should have cited and didn’t? This is code. It’s not opinionated. It either passes or it fails.
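
As a rough illustration of formatting-agnostic comparison against a fixture, the sketch below normalizes row labels and number formats before matching. The function names, the tolerance, and the label-normalization rule are assumptions for the example; the real matching logic is more involved than this.

```python
import re
from typing import Dict, List


def normalize_label(label: str) -> str:
    """Lowercase and collapse punctuation/whitespace: ' Net  Income:' -> 'net income'."""
    return re.sub(r"[^a-z0-9]+", " ", label.lower()).strip()


def parse_number(cell: str) -> float:
    """Parse '$1,234.5' or '(87.0)'; accounting-style parentheses mean negative."""
    text = cell.strip()
    negative = text.startswith("(") and text.endswith(")")
    value = float(re.sub(r"[^\d.\-]", "", text))
    return -value if negative else value


def compare_to_fixture(
    extracted: Dict[str, str],
    fixture: Dict[str, float],
    tol: float = 0.005,  # assumed tolerance for rounding differences
) -> List[str]:
    """Return one message per missing or mismatched fixture figure."""
    got = {normalize_label(k): parse_number(v) for k, v in extracted.items()}
    failures = []
    for label, expected in fixture.items():
        key = normalize_label(label)
        if key not in got:
            failures.append(f"missing: {label}")
        elif abs(got[key] - expected) > tol:
            failures.append(f"mismatch: {label} expected {expected}, got {got[key]}")
    return failures
```

Because both sides are normalized first, a change in capitalization, currency symbols, or thousands separators leaves the result unchanged, while a missing load-bearing figure still fails.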

The second path is soft checks: an LLM grading another LLM’s output. The recursion is the obvious objection, and it would be a strong one if we used soft checks to grade the numbers. We don’t. Soft checks grade things code can’t see: is the prose actually addressing the question the user asked, are the steps logically connected, does the response stay inside the scope the user defined, is the tone professional, is the model leaking jargon the user shouldn’t have to learn. Tone and relevance are exactly the things LLMs are good at evaluating. Numbers are exactly the things they’re bad at evaluating. We assign each layer the job it’s actually competent at.
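
A soft check can be sketched as a rubric plus a judge. Everything named here is an assumption for illustration: the rubric keys, the prompt wording, and the `judge` callable, which stands in for whatever model client is used and is expected to return a JSON string.

```python
import json
from typing import Callable, Dict

# Rubric drawn from the qualities listed above; the key names are hypothetical.
RUBRIC = {
    "addresses_question": "Does the prose address the question the user asked?",
    "steps_connected": "Are the steps logically connected?",
    "stays_in_scope": "Does the response stay inside the scope the user defined?",
    "professional_tone": "Is the tone professional?",
    "no_jargon": "Is the response free of jargon the user should not have to learn?",
}


def build_prompt(question: str, answer: str) -> str:
    """Assemble the grading prompt; numeric accuracy is deliberately excluded."""
    criteria = "\n".join(f"- {key}: {text}" for key, text in RUBRIC.items())
    return (
        "Grade the answer against each criterion. Do not judge numeric accuracy.\n"
        f"Criteria:\n{criteria}\n"
        f"Question: {question}\nAnswer: {answer}\n"
        "Reply with a JSON object mapping each criterion key to true or false."
    )


def soft_check(question: str, answer: str, judge: Callable[[str], str]) -> Dict[str, bool]:
    """Run the judge once; a criterion the judge omits counts as a failure."""
    verdict = json.loads(judge(build_prompt(question, answer)))
    return {key: verdict.get(key) is True for key in RUBRIC}
```

Passing the judge in as a plain callable keeps the check testable with a stub and makes it easy to swap the grading model without touching the rubric.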

That’s the same principle the product is built on, one floor up. Code handles the data. AI handles the language. At runtime, and at test time.

When both paths agree, we have something close to certainty. When they disagree, we have a finding. Most of the value is in the disagreements.
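
One way to read that rule in code, assuming each path has already been reduced to a single pass/fail verdict (a simplification; the names below are hypothetical):

```python
from enum import Enum


class Outcome(Enum):
    PASS = "pass"        # both paths pass
    FAIL = "fail"        # both paths fail
    FINDING = "finding"  # the paths disagree and a human should look


def reconcile(hard_ok: bool, soft_ok: bool) -> Outcome:
    """Combine the hard-check and soft-check verdicts for one output."""
    if hard_ok and soft_ok:
        return Outcome.PASS
    if not hard_ok and not soft_ok:
        return Outcome.FAIL
    return Outcome.FINDING
```

The point of the third state is that a disagreement is routed to a person rather than averaged away.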

There’s a third leg to the stool, which the previous paragraphs have been quietly assuming: ground truth itself. Both paths are checking work against fixtures, and the fixtures don’t build themselves: domain experts assemble them, case by case. The interesting wrinkle is that the relationship runs in both directions. Experts build the fixtures, and the system, in turn, flags back to them when the world has moved underneath those fixtures. Restatements, amended filings, numbers that quietly stop being canonical. The next story is one of those.

A representative disagreement

A real one, lightly anonymized.

One of the companies we use for our deeper internal evaluations changed an accounting standard mid-year and restated several years of prior figures to match. The latest version of a long-running report we generated picked up the new filing and appended its column, exactly as designed. The earlier columns were the old, pre-restatement numbers. On the surface, the report looked clean: every cell traced to a real filing, every citation resolved, every check at the level of an individual claim passed.

The hard checks flagged it anyway. The eval had ingested both the original and the restated filings, and the cross-document consistency check noticed that we had two different revenue figures for the same fiscal year coming from the same issuer. The report didn’t contradict any individual source. It contradicted the universe of sources, taken together.
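
The idea behind a cross-document consistency check can be sketched in a few lines. This is an illustration under assumptions, not the actual check: the tuple layout of a fact, the document ids, and the zero tolerance are invented for the example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Fact = Tuple[str, str, int, float, str]  # (issuer, metric, fiscal_year, value, source_doc)


def find_conflicts(facts: Iterable[Fact], tol: float = 0.0) -> Dict[tuple, Dict[str, float]]:
    """Group facts by (issuer, metric, fiscal_year) and flag groups whose sources disagree."""
    by_key = defaultdict(dict)
    for issuer, metric, year, value, doc in facts:
        by_key[(issuer, metric, year)][doc] = value
    conflicts = {}
    for key, by_doc in by_key.items():
        values = list(by_doc.values())
        if max(values) - min(values) > tol:
            conflicts[key] = by_doc
    return conflicts


facts = [
    ("ACME", "revenue", 2022, 1000.0, "10-K-2022"),  # as originally filed
    ("ACME", "revenue", 2022, 950.0, "10-K-2023"),   # restated in the later filing
    ("ACME", "revenue", 2023, 1100.0, "10-K-2023"),
]
conflicts = find_conflicts(facts)
```

Here `conflicts` holds a single entry, for ACME's 2022 revenue, listing both source documents and their figures. Each figure is individually grounded; only the grouping exposes the disagreement.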

The flag did two things at once. It kept the report from going out, and it told the analyst who maintains that issuer’s fixtures that there was a decision to make: restate the historical column to match the new standard, or preserve the original figures as a record of what was known at the time. Neither answer is automatic. Both need a human.

This is the class of failure no per-cell test would catch and no model is reliably going to surface on its own, because each individual claim is grounded. The error lives between the claims, in the implicit assumption that they belong in the same row.

Making evals load-bearing

A passing test that nobody ever reads is the same as no test. A lot of the work in building an eval system is the unglamorous part: making sure the results show up where the work is happening.

Ours emit JUnit XML, which means CI tooling treats them like any other test suite: failures block merges, regressions surface in the same place ordinary flakiness does. They integrate with our internal debugger so when a check fails, an engineer can step backward through the run and inspect the model’s intermediate states. And every run generates a portable report: a self-contained artifact you can attach to a ticket, send to a reviewer, or archive against a release. Evals that live in a separate dashboard nobody opens are evals that don’t exist.
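
For readers who haven't produced JUnit XML by hand, a minimal sketch using Python's standard library follows. The suite name, check names, and result shape are assumptions, and real reports usually carry more attributes (timings, for instance) than this example emits.

```python
import xml.etree.ElementTree as ET
from typing import List, Optional, Tuple


def to_junit_xml(suite_name: str, results: List[Tuple[str, Optional[str]]]) -> str:
    """Render (check_name, failure_message_or_None) pairs as a JUnit-style testsuite."""
    failures = sum(1 for _, message in results if message is not None)
    suite = ET.Element(
        "testsuite",
        name=suite_name,
        tests=str(len(results)),
        failures=str(failures),
    )
    for name, message in results:
        case = ET.SubElement(suite, "testcase", classname=suite_name, name=name)
        if message is not None:
            # A <failure> child is what CI tooling reads as a failed test.
            failure = ET.SubElement(case, "failure", message=message)
            failure.text = message
    return ET.tostring(suite, encoding="unicode")


report = to_junit_xml(
    "evals",
    [
        ("revenue_matches_fixture", None),
        ("cross_document_consistency", "two revenue figures for FY2022"),
    ],
)
```

Writing `report` to a file that the CI system is configured to collect is enough for an eval failure to show up alongside ordinary test failures.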

What this buys

Two things, mostly.

The first is that we can change models without holding our breath. When a new frontier model lands, the question isn’t “does it feel better” (a question with no good answer) but “does the suite get worse.” If it doesn’t, we ship the upgrade. If it does, we know exactly where, and we can fix the scaffolding before our users see it. That’s how we’ve moved across model generations without any of our customers noticing, except that things got faster.
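
"Does the suite get worse" can be made mechanical. The sketch below assumes each check has been run several times per model and summarized as a pass rate, since single runs are noisy; the check names and the `max_drop` threshold are hypothetical.

```python
from typing import Dict, Tuple


def regressions(
    baseline: Dict[str, float],
    candidate: Dict[str, float],
    max_drop: float = 0.0,  # assumed: any drop in pass rate counts as a regression
) -> Dict[str, Tuple[float, float]]:
    """Return checks whose pass rate fell by more than max_drop, as (baseline, candidate)."""
    worse = {}
    for check, base_rate in baseline.items():
        # A check missing from the candidate run is treated as failing outright.
        cand_rate = candidate.get(check, 0.0)
        if base_rate - cand_rate > max_drop:
            worse[check] = (base_rate, cand_rate)
    return worse


baseline = {"no_emoji": 1.0, "citations_resolve": 0.98}
candidate = {"no_emoji": 0.8, "citations_resolve": 0.99}
worse = regressions(baseline, candidate)
```

In this example only `no_emoji` is reported, which is the kind of pointer that says where the scaffolding needs work before an upgrade ships.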

The second is that we can refactor with the same confidence. Large changes to our agent architecture used to be terrifying. Now they’re routine, which doesn’t mean they’re safe by default. It means we have a reliable way of finding out whether they are.

The yardstick problem isn’t fully solved. You don’t solve probability; you build infrastructure around it. But the difference between a system you can measure and one you can’t is the difference between an engineering discipline and an art form. We are building a product that has to stand up in front of regulators, auditors, and MDs about to walk into a board meeting. Art form isn’t an option.

If this is the kind of problem you want to work on, we’re hiring.

¹ Temperature can be turned down. In practice it doesn’t go to zero, and even if it did, plenty of the model’s run-to-run variability comes from sources upstream of sampling.
² The Trust post has a simple example: Netflix’s November 2025 stock split was not within the knowledge boundary. In our testing, the smarter the model the more steering/prompting it required to not just “remember” the stock price/count.
³ XBRL is the SEC’s structured-data format for financial filings. We use it internally because it’s the most reliable way to pull numbers out of US filings. We don’t use it in user-facing copy. The model occasionally needs reminding.⁴
⁴ Yes, our own website mentions XBRL. The website is talking to a different audience than the product is.