Evaluation harnesses: the LLM engineering practice nobody talks about.

Every production LLM system we operate has an evaluation harness. A production LLM system without one is either about to have an incident or is already failing silently, because without a harness regressions are invisible. If you build one thing alongside your LLM system, build this.

What an evaluation harness is

An evaluation harness is a curated set of inputs, expected behaviours, and automated scoring that runs whenever the system changes — prompt, model, tool, retriever, or downstream API. It is the test suite for probabilistic systems.

Three components:

The golden dataset. Inputs and expected outputs, covering the distribution of real-world usage, including the adversarial edges.
The scorers. The functions that turn a model output into a pass/fail or a quality score.
The runner. The infrastructure that executes the dataset against the system and aggregates the scores.

The golden dataset

Most teams under-invest here. 100 examples is not enough. 30 is definitely not enough. Our baseline for production systems is 500–2,000 examples, stratified across:

The common cases (80% of traffic)
The long-tail cases (real queries that were hard)
Adversarial cases (prompt injection attempts, out-of-scope queries)
Sensitive cases (PII, regulated content, escalation triggers)
Historical failures (every incident becomes an eval case)

Datasets are versioned. Every addition is reviewed. Every case has a provenance — who added it, why, what it tests.
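
As an illustration of what a versioned case with provenance might look like, here is a minimal sketch. The field names, the stratum labels and the validate helper are our own assumptions for the example, not a prescribed schema; adapt them to whatever your dataset store uses.

```python
from collections import Counter
from dataclasses import dataclass

# Hypothetical stratum labels mirroring the five categories listed above.
STRATA = {"common", "long_tail", "adversarial", "sensitive", "historical_failure"}


@dataclass(frozen=True)
class EvalCase:
    case_id: str
    stratum: str      # one of STRATA
    input_text: str
    expected: str
    added_by: str     # provenance: who added it
    reason: str       # provenance: why it was added
    tests: str        # provenance: what behaviour it tests


def validate(cases: list[EvalCase]) -> Counter:
    """Reject unknown strata, duplicate ids and missing provenance.

    Returns the number of cases per stratum so coverage can be reviewed.
    """
    seen: set[str] = set()
    for case in cases:
        if case.stratum not in STRATA:
            raise ValueError(f"{case.case_id}: unknown stratum {case.stratum!r}")
        if case.case_id in seen:
            raise ValueError(f"{case.case_id}: duplicate case id")
        seen.add(case.case_id)
        if not (case.added_by and case.reason and case.tests):
            raise ValueError(f"{case.case_id}: incomplete provenance")
    return Counter(case.stratum for case in cases)
```

Running a check like this in review makes gaps visible early: a stratum with a count of zero is a part of the distribution the harness cannot see.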

Scorers — the honest hierarchy

Different tasks need different scoring strategies. In ascending order of cost and fidelity:

Deterministic scorers

Exact match, regex match, JSON schema validation, SQL query equivalence. Cheapest, most reliable, works only when the expected output is structured. Our first choice whenever possible.
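
A minimal sketch of three deterministic scorers using only the Python standard library. The function names are illustrative. The last one checks required fields and their types only, as a stand-in for full JSON Schema validation, which in practice you would delegate to a dedicated validator.

```python
import json
import re


def exact_match(output: str, expected: str) -> bool:
    """Pass if the output equals the expected text, ignoring outer whitespace."""
    return output.strip() == expected.strip()


def regex_match(output: str, pattern: str) -> bool:
    """Pass if the pattern occurs anywhere in the output."""
    return re.search(pattern, output) is not None


def json_has_fields(output: str, required: dict[str, type]) -> bool:
    """Pass if the output parses as a JSON object with the required typed fields.

    Simplified stand-in for schema validation: no nesting, no optional fields.
    """
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict):
        return False
    return all(isinstance(data.get(key), kind) for key, kind in required.items())
```

Each scorer returns a plain boolean, so the runner can aggregate them into a pass rate without any scorer-specific logic.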

Similarity scorers

BLEU, ROUGE, embedding cosine similarity. Useful for summarisation and paraphrase tasks, but easy to fool: high similarity can coexist with wrong content. Never use one as the sole scorer.
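
To make the "easy to fool" point concrete, here is a small sketch of a unigram-overlap F1 score. It is written in the spirit of ROUGE-1 but is not the official ROUGE implementation, and the example sentences are invented for illustration.

```python
from collections import Counter


def unigram_f1(candidate: str, reference: str) -> float:
    """F1 over lowercased whitespace tokens shared by candidate and reference."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # multiset intersection
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


reference = "the recommended dose is 10 mg daily"
wrong = "the recommended dose is 100 mg daily"

# Six of seven tokens match, so the score is high although the dose is wrong.
print(round(unigram_f1(wrong, reference), 3))  # prints 0.857
```

A tenfold dosing error costs one token out of seven. That is why a similarity score belongs next to a deterministic or rubric-based check rather than in place of one.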

LLM-as-judge

A stronger model grades the output against a rubric. Effective for open-ended tasks. Expensive. Biased toward verbose answers unless carefully prompted. We recommend double-scoring — two models, disagreements reviewed by a human.
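
One way to wire up double-scoring is sketched below. The judges are passed in as plain callables, so no particular model API is assumed; the 1 to 5 scale, the tolerance parameter and the result keys are our own choices for the example.

```python
from typing import Callable, Optional

# A judge takes (question, answer, rubric) and returns an integer score.
# In a real harness each judge would wrap a call to a different model.
Judge = Callable[[str, str, str], int]


def double_score(
    question: str,
    answer: str,
    rubric: str,
    judge_a: Judge,
    judge_b: Judge,
    tolerance: int = 1,
) -> dict:
    """Score with two judges; route to a human when they disagree too much."""
    score_a = judge_a(question, answer, rubric)
    score_b = judge_b(question, answer, rubric)
    needs_human = abs(score_a - score_b) > tolerance
    # Only emit an automated score when the judges roughly agree.
    score: Optional[float] = None if needs_human else (score_a + score_b) / 2
    return {
        "score_a": score_a,
        "score_b": score_b,
        "score": score,
        "needs_human_review": needs_human,
    }
```

Leaving the score empty on disagreement is deliberate in this sketch: an averaged score would hide exactly the cases a reviewer most needs to see.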

Human scoring

Still the gold standard for subjective quality. Used on a sampled fraction (5–10%) of eval runs, to calibrate automated scorers and catch drift in LLM-as-judge.
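
A rough sketch of the calibration loop, under two assumptions of ours: the sample is drawn per case with a fixed seed so it is reproducible, and both human and automated verdicts are pass/fail. The function names are illustrative.

```python
import random


def sample_for_human_review(
    case_ids: list[str], fraction: float = 0.05, seed: int = 0
) -> list[str]:
    """Pick a reproducible random sample of case ids for human scoring."""
    if not case_ids:
        return []
    rng = random.Random(seed)  # seeded so the same run yields the same sample
    k = max(1, round(len(case_ids) * fraction))
    return sorted(rng.sample(case_ids, k))


def agreement_rate(human: dict[str, bool], automated: dict[str, bool]) -> float:
    """Share of human-scored cases where the automated verdict matches."""
    shared = human.keys() & automated.keys()
    if not shared:
        raise ValueError("no cases were scored by both human and automated scorers")
    return sum(human[c] == automated[c] for c in shared) / len(shared)
```

Tracking the agreement rate across runs gives an early signal of judge drift: if it falls while the system under test has not changed, the scorer is the thing that moved.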

The runner

Keep it simple. A CLI that takes a dataset version and a system version and emits a score report. Run it in CI. Block merges on regressions beyond a threshold. Store every run, so you can diff behaviour across months.
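
The merge gate can be very small. The sketch below assumes each run is stored as a JSON file mapping a suite name to a score between 0 and 1; that file layout, the function names and the default threshold of 0.02 are assumptions for illustration, not a description of any particular tool.

```python
import argparse
import json


def find_regressions(
    baseline: dict[str, float], current: dict[str, float], threshold: float = 0.02
) -> dict[str, tuple[float, float]]:
    """Return suites whose score dropped by more than the threshold.

    A suite missing from the current run is treated as a score of 0.0.
    """
    regressions = {}
    for name, base in baseline.items():
        cur = current.get(name, 0.0)
        if base - cur > threshold:
            regressions[name] = (base, cur)
    return regressions


def main(argv: list[str]) -> int:
    """Compare two stored runs; a non-zero return value should fail the CI job."""
    parser = argparse.ArgumentParser(description="Gate merges on eval regressions")
    parser.add_argument("baseline", help="path to the baseline run (JSON)")
    parser.add_argument("current", help="path to the current run (JSON)")
    parser.add_argument("--threshold", type=float, default=0.02)
    args = parser.parse_args(argv)

    with open(args.baseline) as f:
        baseline = json.load(f)
    with open(args.current) as f:
        current = json.load(f)

    regressions = find_regressions(baseline, current, args.threshold)
    for name, (base, cur) in sorted(regressions.items()):
        print(f"REGRESSION {name}: {base:.3f} -> {cur:.3f}")
    return 1 if regressions else 0
```

In a CI step this would be invoked as sys.exit(main(sys.argv[1:])), so any regression beyond the threshold blocks the merge. Keeping the run files around is what later makes month-over-month diffs possible.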

We use a light internal harness on most engagements; LangSmith, Braintrust and Weights & Biases are all reasonable off-the-shelf choices. The specific tool matters less than the discipline.

The cadences that matter

Every prompt change. Run the full harness. No exceptions.
Every model upgrade. Before swapping models — even "just" a minor version — run the harness. Small version bumps have surprised us before.
Weekly in production. Run a sample of live traffic through the harness. Compare against baselines.
Monthly dataset review. Add cases from any incidents. Retire cases that no longer reflect usage.

Anti-patterns

Eyeballing. "Looks good to me" is not evaluation.
Single-pass scoring. Sampling makes outputs vary from run to run, so a single pass gives a noisy score. Run the dataset several times and report the average and the spread.
Dataset leakage. Never evaluate on examples that may have leaked into the model's training data.
Improvement theatre. Tuning the system until the eval passes, without reviewing whether the eval still reflects reality.
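
For the single-pass anti-pattern, a minimal sketch of repeated scoring. Here run_once is a hypothetical callable that executes the whole dataset once and returns an aggregate score; the default of five runs is an arbitrary choice for the example.

```python
import statistics
from typing import Callable


def repeated_score(run_once: Callable[[], float], n_runs: int = 5) -> dict[str, float]:
    """Run the evaluation several times and summarise the spread of scores."""
    if n_runs < 2:
        raise ValueError("need at least two runs to estimate variability")
    scores = [run_once() for _ in range(n_runs)]
    return {
        "mean": statistics.mean(scores),
        "stdev": statistics.stdev(scores),  # sample standard deviation
        "min": min(scores),
        "max": max(scores),
    }
```

Reporting the standard deviation alongside the mean also helps set the regression threshold: a drop smaller than the run-to-run spread is not yet evidence of a regression.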

What it buys you

Three things. First, the confidence to change models, prompts, retrievers and tools without breaking the system. Second, a factual conversation with stakeholders: "we moved from 87% to 91% on the finance evals, and from 72% to 79% on the clinical evals". Third, a basis for regulated deployments: an ICO, FCA or PRA reviewer asking how you know the system is fit for purpose can be given a dataset, a scoring method, and a twelve-month run history. Nothing else has that property.