Golden datasets for AI testing: 10 practical rules

Every team that ships AI eventually hits the same wall. A new model comes out, a prompt gets tweaked, a retrieval step gets rewritten, and nobody can answer the question that matters: did this make the product better or worse? The vibes-based answer ("feels about the same") is the expensive one, because the cost of every regression lands on users and on the on-call engineer who has to explain it next week.

A golden dataset is the simplest fix. It is a small, curated set of inputs and expected behaviours that you run on every meaningful change. It does not replace user feedback, A/B testing, or production telemetry. It catches the regressions those mechanisms catch too late, and it does it in minutes rather than days.

Building one is not easy. A dataset that is too small does not catch much; one that is too big becomes a chore to maintain. One brainstormed in a workshop tests the inputs the team imagined rather than the ones users send. And a dataset scored in a way that does not correlate with what users actually care about can pass while the product gets worse. Below are the ten rules we apply when we build one, in roughly the order you should think about them.

Build it from production traffic, not from a brainstorm

The most common failure mode is a golden dataset written by the engineering team in a workshop. It is fast to produce and tests the inputs the team finds interesting. It almost never tests the inputs users actually send.

Users are weirder and more specific than any brainstorm. They type half-sentences, mix languages, paste a screenshot's worth of error, ask for things the product was never designed to do, and use feature names that have not been on the roadmap for three quarters. If your golden set is full of well-formed prompts, your evals will look great while your users churn.

Sample from real logs from week one, with provenance, consent, retention, and PII handling tracked from the start. If the product is pre-launch, sample from internal usage or a closed beta. If you must seed with synthetic examples, label them as synthetic and replace them with real examples as soon as you have any. The goal is for the dataset to look like a small, representative slice of last month's traffic, not a list of things the team hoped users would ask.
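As a minimal sketch of what "sampled with provenance" can look like, the snippet below draws a reproducible random sample from log rows and records where each example came from. The row fields (`request_id`, `text`, `consent`), the `GoldenExample` record, and the batch name are all hypothetical; adapt them to whatever your logging pipeline actually stores.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class GoldenExample:
    example_id: str
    input_text: str
    source: str        # "production", "beta", or "synthetic"
    sampled_from: str  # provenance: which log batch the row came from
    pii_reviewed: bool


def sample_from_logs(log_rows, k, batch_name, seed=0):
    """Draw up to k rows uniformly at random from rows that carry consent."""
    eligible = [row for row in log_rows if row.get("consent")]
    rng = random.Random(seed)  # fixed seed so the draw can be reproduced
    picked = rng.sample(eligible, min(k, len(eligible)))
    return [
        GoldenExample(
            example_id=f"{batch_name}-{row['request_id']}",
            input_text=row["text"],
            source="production",
            sampled_from=batch_name,
            pii_reviewed=False,  # flip only after a human has scrubbed the row
        )
        for row in picked
    ]


# Hypothetical log rows, including one without consent that must be skipped.
logs = [
    {"request_id": "a1", "text": "refund pls order 4471", "consent": True},
    {"request_id": "a2", "text": "why is my invoice in EUR??", "consent": True},
    {"request_id": "a3", "text": "cancel", "consent": False},
]
examples = sample_from_logs(logs, k=2, batch_name="logs-batch-01")
```

Synthetic seed examples would use the same record with `source="synthetic"`, which makes them easy to find and replace later.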

Size for review, not for benchmarks

Academic eval sets are large because they need statistical separation between models that are already close. Your product eval set is doing a different job: catching regressions in a specific feature against your own baseline. The numbers that matter are very different.

A practical rule: 50 to 300 examples per distinct behaviour, where a behaviour is something like "summarise a meeting transcript" or "answer a billing question". Below 50, expect more noise because individual examples dominate the score: at 20 examples, a single flipped result moves the score by five percentage points. Above 300, nobody reviews the failures and the dataset becomes write-only. The right size is the largest one your team will actually review when something regresses.
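To put rough numbers on the noise, here is a small sketch that prints how much one example weighs and the standard error of a pass rate at different dataset sizes. It uses the binomial standard error, which assumes examples pass or fail independently; treat the result as an order-of-magnitude guide rather than a guarantee.

```python
import math


def pass_rate_standard_error(pass_rate, n):
    """Binomial standard error of a pass rate measured on n examples."""
    return math.sqrt(pass_rate * (1 - pass_rate) / n)


for n in (20, 50, 300):
    one_example = 1 / n  # how far a single flipped example moves the score
    se = pass_rate_standard_error(0.8, n)
    print(f"n={n}: one example = {one_example:.1%}, standard error = {se:.1%}")
```

At a pass rate of 80%, 50 examples give a standard error of roughly 5.7 percentage points and 300 examples roughly 2.3, so a three-point move on a 50-example set is well within the noise.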

If you have several behaviours, build several small datasets rather than one big one. The aggregate "overall score" is rarely the question you want answered. "Did summarisation get worse this week?" almost always is.

Stratify by failure mode, not by happy path coverage

The instinct is to cover the happy path: one example for each supported input type, tidy and balanced. That gives you a dataset that any sensible model passes, which means it tells you almost nothing on the day a model change matters.

Build the dataset around failure modes. Sit with two or three people who answer support tickets or read user feedback, and list the ways the feature fails today: ambiguous inputs, multi-step questions, inputs in the wrong language, prompts that try to push past the system prompt, edge cases the product was never meant to handle but users send anyway, and the long tail of "this should have worked but did not".

Then weight the dataset toward those. A useful split is roughly 30% happy path, 50% known failure modes, and 20% adversarial or edge inputs. Treat these ratios as targets, not gates, especially below 100 examples. The exact ratios matter less than the principle: a dataset that is mostly happy path tells you the model is polite. One weighted toward failure modes shows you whether the product is improving.
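The split can be turned into target counts per stratum with a few lines of code. This sketch uses largest-remainder rounding so the counts always add up to the dataset size; the stratum names are ours, and the weights are the rough 30/50/20 targets from the text.

```python
def stratum_targets(total, weights):
    """Split `total` examples across strata using largest-remainder rounding."""
    raw = {name: total * weight for name, weight in weights.items()}
    counts = {name: int(value) for name, value in raw.items()}
    leftover = total - sum(counts.values())
    # Hand any remaining slots to the strata with the largest fractional parts.
    by_remainder = sorted(raw, key=lambda name: raw[name] - counts[name], reverse=True)
    for name in by_remainder[:leftover]:
        counts[name] += 1
    return counts


# Hypothetical stratum names; weights are targets, not gates.
WEIGHTS = {"happy_path": 0.30, "known_failure": 0.50, "adversarial_or_edge": 0.20}
print(stratum_targets(80, WEIGHTS))
```

Comparing these targets with the actual counts in the dataset shows at a glance whether it has drifted back toward the happy path.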

Freeze the inputs, version the expected outputs

Treat the dataset like code. Inputs go in version control, expected outputs go in version control next to them, and every change is a reviewable pull request. No spreadsheets, no shared docs, no "the latest version is in my Notion". If you cannot git diff two versions of the dataset, it is not a golden dataset.

Two practical conventions help here:

Inputs are immutable once they are in. If an input is wrong (bad PII, broken encoding, no longer representative), remove it and add a new one rather than editing in place. This keeps historical eval runs comparable.
Expected outputs are versioned, with a reason. When you change an expected output, the PR description says why: "product now supports multi-currency, so the expected answer mentions both". Six months later, this is the only way to understand why the score moved.
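The first convention is easy to enforce mechanically. As a sketch, assuming each example is a dict with `id` and `input` keys (our assumption about the file layout), a CI check can compare two versions of the dataset and flag any input that was edited in place:

```python
def edited_inputs(old_examples, new_examples):
    """Return ids whose input text changed between two dataset versions.

    Removed ids and newly added ids are allowed; only in-place edits are flagged.
    """
    old_inputs = {ex["id"]: ex["input"] for ex in old_examples}
    return sorted(
        ex["id"]
        for ex in new_examples
        if ex["id"] in old_inputs and ex["input"] != old_inputs[ex["id"]]
    )


old = [
    {"id": "refund-001", "input": "I want my money back for order 4471"},
    {"id": "refund-002", "input": "wheres my refund"},
]
new = [
    {"id": "refund-001", "input": "I want a refund for order 4471"},  # edited in place
    {"id": "refund-003", "input": "refund status?"},                   # new id: allowed
]
print(edited_inputs(old, new))  # prints ['refund-001']
```

A non-empty result would fail the pull request and prompt the author to retire the old id and add a new one instead.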

Keep a hold-out set you never look at

If the team sees every example, the team will (consciously or not) tune prompts and pipelines to pass those examples. The dataset ends up measuring memorisation, not quality.

Split the dataset 80/20 or 90/10 into a dev set and a hold-out set. The dev set is fair game: people read it, debug against it, iterate on it. The hold-out set is run by CI or by one person on the team, and the individual examples are not shown in dashboards or shared in Slack. If the dev score improves but the hold-out score does not, you are overfitting to your own evals, which is one of the more embarrassing failure modes because it is invisible from the inside.

Rotate a fraction of the hold-out set into the dev set every quarter, where those examples become fair game to review, and refill the hold-out set with fresh production samples.
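One way to make the split stable without keeping a separate list is to derive it from a hash of the example id. This is a sketch, not the only option: the function name, the salt, and the 20% default are assumptions, and quarterly rotation would still need to be recorded explicitly, since changing the salt reshuffles every example.

```python
import hashlib


def assign_split(example_id, holdout_percent=20, salt="golden-v1"):
    """Deterministically assign an example to "dev" or "holdout" by id."""
    digest = hashlib.sha256(f"{salt}:{example_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # stable bucket in 0..99
    return "holdout" if bucket < holdout_percent else "dev"


split = assign_split("refund-001")
print(split)
```

Because the assignment depends only on the id, adding new examples never moves existing ones between the two sets.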

Label outcomes, not just answers

"Expected output: the exact string X" works for classification, extraction, and a small set of deterministic tasks. For everything else (summarisation, chat, code generation, anything with a paragraph of output) exact match is the wrong target. Two correct answers to the same question can be word-for-word different.

What you actually want to assert is the outcome: did the response contain the key fact, in the right currency, without making up the customer's name, while staying under 200 words. So label each example with a small set of checks:

```yaml
- id: refund-001
  input: "I want my money back for order 4471"
  checks:
    must_contain: ["refund policy", "order 4471"]
    must_not_contain: ["sorry, I can't help"]
    must_call_tool: "lookup_order"
    max_tokens: 300
    tone: "empathetic, not defensive"
```

Some of those checks are mechanical (substring, tool call, token count). Some are subjective ("tone"). Both belong in the dataset. The mechanical ones run for free in CI; the subjective ones get scored by a model-as-judge or a human reviewer, depending on how much you trust the judge for that behaviour.
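The mechanical checks can be run by a small function like the sketch below. It reads the check names from the YAML example above; how you obtain the response text, the list of tools called, and the token count depends on your stack, so those arrive here as plain arguments. Case-insensitive matching is our assumption, and `tone` is deliberately ignored because it belongs to the judge or the human reviewer.

```python
def run_mechanical_checks(checks, response_text, tools_called, token_count):
    """Return a list of human-readable failures; an empty list means pass."""
    failures = []
    lowered = response_text.lower()  # assumption: matching is case-insensitive
    for phrase in checks.get("must_contain", []):
        if phrase.lower() not in lowered:
            failures.append(f"missing required phrase: {phrase!r}")
    for phrase in checks.get("must_not_contain", []):
        if phrase.lower() in lowered:
            failures.append(f"contains forbidden phrase: {phrase!r}")
    required_tool = checks.get("must_call_tool")
    if required_tool and required_tool not in tools_called:
        failures.append(f"did not call tool: {required_tool!r}")
    max_tokens = checks.get("max_tokens")
    if max_tokens is not None and token_count > max_tokens:
        failures.append(f"too long: {token_count} tokens > {max_tokens}")
    return failures


checks = {
    "must_contain": ["refund policy", "order 4471"],
    "must_not_contain": ["sorry, I can't help"],
    "must_call_tool": "lookup_order",
    "max_tokens": 300,
    "tone": "empathetic, not defensive",  # subjective: not scored here
}
good = run_mechanical_checks(
    checks,
    "I've looked up order 4471. Under our refund policy you get a full refund.",
    tools_called=["lookup_order"],
    token_count=24,
)
bad = run_mechanical_checks(
    checks, "Sorry, I can't help with that.", tools_called=[], token_count=9
)
print(good)
print(bad)
```

Returning the list of failures rather than a single boolean means the failure view can say why an example failed, not only that it did.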

Score with the cheapest metric that catches the regression

Teams routinely overspend on evaluation. They run a frontier model as a judge over every example, on every PR, costing more per eval run than the feature earns in a day. Most of that cost is wasted, because most regressions are caught by far cheaper checks.

Build the scoring stack in layers, cheapest first:

Deterministic checks (substring, regex, schema validation, tool-call presence, latency, token budget). These catch the majority of regressions, run in milliseconds, and cost nothing. Run them on every PR.
Lightweight model-as-judge (a small, cheap model scoring a short rubric). Useful for tone, faithfulness to source, and basic correctness. Run it on every PR for the dev set and on a schedule for the hold-out set, and calibrate it periodically against human review or a stronger judge.
Strong model-as-judge or human review. Reserved for the subset of examples where the cheaper layers disagree, or for the quarterly deep review. Run them sparingly.

The cost structure should look like a pyramid. If your CI bill for evals is bigger than your inference bill for users, you have inverted it.
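The layering can be sketched as a function that only pays for the expensive judge when the cheap layers disagree. The three scorers below are stand-ins: in a real pipeline the cheap and strong judges would be model calls, and the rule for what counts as a disagreement is something you would tune per behaviour.

```python
def layered_verdict(input_text, response, deterministic, cheap_judge, strong_judge):
    """Return (passed, reason), escalating only when the cheap layers disagree."""
    det_pass = deterministic(input_text, response)
    cheap_pass = cheap_judge(input_text, response)
    if det_pass == cheap_pass:
        return det_pass, "cheap layers agree"
    return strong_judge(input_text, response), "escalated to strong judge"


# Stand-in scorers for illustration only.
def contains_refund(input_text, response):
    return "refund" in response.lower()


def stub_cheap_judge(input_text, response):
    # Placeholder for a small-model rubric call.
    return not response.lower().startswith("sorry")


strong_calls = []  # records how often the expensive layer was used


def stub_strong_judge(input_text, response):
    # Placeholder for a strong-model call or a human review queue.
    strong_calls.append(response)
    return False


question = "I want my money back for order 4471"
agreed = layered_verdict(
    question, "Your refund for order 4471 is on its way.",
    contains_refund, stub_cheap_judge, stub_strong_judge,
)
escalated = layered_verdict(
    question, "Sorry, refunds are not something I can discuss.",
    contains_refund, stub_cheap_judge, stub_strong_judge,
)
print(agreed, escalated, len(strong_calls))
```

Counting calls to the strong layer per run is a quick way to check that the pyramid has not inverted.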

Tie every regression to a specific example, not an aggregate

A dashboard that says "summarisation accuracy dropped from 84% to 81%" is useless without the next click. The team needs to see, in one screen, which specific examples regressed, what the old output was, what the new output is, and a diff between the two. Without that, the score is just noise on a chart, and the team will eventually stop looking at it.

Two practical things make this work:

Store every run's outputs alongside the dataset version and the system version. A regression is the difference between two stored runs. If you do not store outputs, you cannot debug regressions, only re-run them.
Make the failure view the default view. When the eval run finishes, the link in Slack should land on the list of examples that got worse since the last run, not on a green-vs-red summary card.
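As a sketch of "a regression is the difference between two stored runs", the function below compares two runs and returns the examples that passed before and fail now, each with a diff of the outputs. The run format (a dict from example id to `passed` and `output`) is an assumption about how you store results.

```python
import difflib


def regressions(previous_run, current_run):
    """List examples that passed in previous_run and fail in current_run."""
    report = []
    for example_id, before in sorted(previous_run.items()):
        after = current_run.get(example_id)
        if after is None:
            continue  # example was retired between runs
        if before["passed"] and not after["passed"]:
            diff = "\n".join(
                difflib.unified_diff(
                    before["output"].splitlines(),
                    after["output"].splitlines(),
                    fromfile="previous",
                    tofile="current",
                    lineterm="",
                )
            )
            report.append({"id": example_id, "diff": diff})
    return report


# Hypothetical stored runs for two summarisation examples.
previous = {
    "sum-001": {"passed": True, "output": "Meeting moved to Friday."},
    "sum-002": {"passed": True, "output": "Budget approved."},
}
current = {
    "sum-001": {"passed": False, "output": "Meeting cancelled."},
    "sum-002": {"passed": True, "output": "The budget was approved."},
}
for item in regressions(previous, current):
    print(item["id"])
    print(item["diff"])
```

The list this returns is what the Slack link should open on; the aggregate score can be computed from the same two runs.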

Refresh the dataset on a schedule, and after every meaningful change

Products move. New features ship, the user base shifts, a partner integration changes the kinds of inputs that arrive. A golden dataset that is twelve months old is measuring a product that no longer exists.

Two refresh triggers cover almost every case:

A quarterly refresh. Replace 10–20% of the dataset with new production samples. Retire examples that no longer reflect the product. Re-review expected outputs against current product behaviour.
An event-driven refresh. Whenever you ship a feature that changes what the model is expected to do (new tool, new policy, new tone of voice, new supported language), add examples that exercise it before the feature goes live. The dataset should change in the same PR as the feature, not three sprints later.

Treat the refresh as a maintenance task with a named owner, not an aspiration. Datasets that are "refreshed when we get time" are never refreshed.

Make one person own it

Datasets that belong to "the team" rot. The labels drift, the expected outputs go out of sync with the product, and within six months the eval score is uncorrelated with quality and everyone quietly stops trusting it.

Name one person as the dataset owner. They do not have to write every example, but they do have to make sure the dataset reflects current product behaviour, that the refresh cadence happens, that new failure modes get added when support sees them, and that the scoring pipeline keeps running. Rotate the role every couple of quarters so the knowledge spreads, but never leave it unowned.

Unowned datasets are a common problem on engineering teams, and assigning an owner tends to do more for eval quality than any tooling decision.

Common mistakes, in one table

Mistake	Why it happens	What it costs you
Dataset is too big to review	Felt safer at the time	Nobody looks at failures; the score becomes wallpaper
Examples written by the team, not sampled from users	Faster on day one	Evals pass while users complain
Exact-match scoring on open-ended outputs	Easy to implement	Real regressions hide behind cosmetic rewrites
Single shared dev set, no hold-out	Simpler workflow	Slow overfitting to your own evals
Frontier model as judge on every example, every PR	Looked rigorous	Eval bill exceeds product bill; team stops running it
Dataset not version-controlled	"It is just test data"	Cannot reproduce last quarter's number; arguments instead of answers
No owner	Felt like a shared responsibility	Dataset rots in two quarters and gets quietly abandoned

How to start when you have nothing

If you are starting from zero and want to be useful by the end of the week, the shape we recommend is:

Step one. Pick the single most important behaviour the AI feature is meant to do. Pull 30–50 real inputs from logs (or internal usage, if pre-launch). Stick them in a YAML file in the repo.
Step two. For each input, write down what a good response must contain and must not contain. Aim for 2–4 checks per example. Get one other person to review.
Step three. Wire it into CI. Deterministic checks only, no model-as-judge yet. Make it run on every PR and fail loudly when the score drops.
Step four. Add a second behaviour, repeat the process. Hold out 10% of each dataset and do not show those examples in any dashboard.
Step five. Name the owner. Put the quarterly refresh on the calendar.
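For step three, the CI gate can be as small as the sketch below. It assumes you already have a dict of example id to pass/fail from the deterministic checks and a stored baseline pass rate; the function name and the baseline value are hypothetical.

```python
def ci_exit_code(results, baseline_pass_rate, tolerance=0.0):
    """Return 0 if the pass rate holds against the baseline, 1 otherwise."""
    if not results:
        raise ValueError("no eval results to score")
    pass_rate = sum(results.values()) / len(results)
    failed = sorted(example_id for example_id, passed in results.items() if not passed)
    print(f"pass rate {pass_rate:.1%} (baseline {baseline_pass_rate:.1%})")
    for example_id in failed:
        print(f"  FAILED: {example_id}")  # fail loudly, one line per example
    return 0 if pass_rate >= baseline_pass_rate - tolerance else 1


# Hypothetical results from one run of the deterministic checks.
results = {"refund-001": True, "refund-002": True, "refund-003": False}
code = ci_exit_code(results, baseline_pass_rate=0.9)
print(code)
```

In a CI job, the script would pass the return value to `sys.exit` so that a drop in score fails the build.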

That is enough to catch most regressions for most AI features. Everything else (model-as-judge, multi-turn evals, agent trajectory scoring, production replay, online evaluation) builds on the same skeleton once the basics work.

If you are building AI features in production and want help designing the eval stack, picking the right scoring layers, or auditing an existing golden dataset, get in touch. We will review what you have today and tell you whether it is doing the job you think it is.
