Reference
Glossary
Key terms in AI evaluation, defined for technical leaders — without unnecessary jargon.
Eval (Evaluation)
An automated check that tests whether an AI system's output meets a defined standard. The AI equivalent of a unit test — except instead of checking whether code runs correctly, it checks whether the model's response is actually good. Evals can test factual accuracy, adherence to instructions, tone, cost, latency, or any other measurable property of AI output.
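A minimal sketch of what "the AI equivalent of a unit test" looks like in practice. The `summarize` function is a hypothetical stand-in for a model call; the eval itself is just a pass/fail check on a measurable property of the output.

```python
def summarize(email: str) -> str:
    # Placeholder for the feature under test: a real system would call a
    # language model here.
    return email.split(".")[0] + "."

def eval_summary_length(output: str, max_words: int = 50) -> bool:
    """Eval: the summary must stay within a word budget."""
    return len(output.split()) <= max_words

output = summarize("Customer asks for a refund. Order #1234 arrived damaged.")
assert eval_summary_length(output)
```

The same shape applies to any measurable property: swap the check inside the eval function for accuracy, tone, or latency, and the surrounding structure stays identical.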
Dataset
A collection of inputs used to run evals. Each entry typically includes a prompt (what gets sent to the model) and an expected output or grading criteria (what a correct response looks like). A good dataset covers representative cases, known failure modes, and edge cases that have caused problems in the past.
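A common way to structure dataset entries, sketched below. The field names are illustrative, not a standard: some entries carry an exact expected output, others carry grading criteria for a judge.

```python
# Each entry pairs an input with either an expected output (for exact
# grading) or criteria (for judged grading). Field names are illustrative.
dataset = [
    {"input": "What year was the transistor invented?",
     "expected": "1947"},                        # known-answer case
    {"input": "Summarize this support email: ...",
     "criteria": "mentions the refund request"},  # judged case
]

# Representative cases, failure modes, and edge cases all live in the
# same list; the grader decides how each entry is scored.
assert len(dataset) == 2
```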
Grader
The component that decides whether a model's output passes or fails an eval. Graders range from simple (exact string match, regex) to sophisticated (another language model that judges the output). The right grader depends on the task: factual recall can use exact match; summarization quality requires a more nuanced judge.
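The two simple graders mentioned above, as runnable sketches. Exact match handles factual recall; a regex grader tolerates surrounding prose while still checking for the required fact.

```python
import re

def exact_match(output: str, expected: str) -> bool:
    """Simplest grader: normalized string equality."""
    return output.strip().lower() == expected.strip().lower()

def regex_grader(output: str, pattern: str) -> bool:
    """Pattern grader: passes if the output matches a regex anywhere."""
    return re.search(pattern, output) is not None

assert exact_match(" 1947 ", "1947")
assert regex_grader("The transistor was invented in 1947.", r"\b1947\b")
```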
LLM-as-judge
A grading technique where a separate language model evaluates the output of the model under test. Useful for subjective qualities — tone, helpfulness, instruction-following — that can't be checked programmatically. Requires its own validation to ensure the judge model's assessments are reliable and not subject to position bias or sycophancy.
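A sketch of the LLM-as-judge pattern, assuming a hypothetical `call_model(prompt)` function that returns the judge model's text; no real API is used here. The stub judge in the usage line stands in for a separate model call.

```python
# Judge prompt: constrain the judge to a parseable response format.
JUDGE_PROMPT = """Rate the RESPONSE for helpfulness on a 1-5 scale.
Reply with only the number.

QUESTION: {question}
RESPONSE: {response}"""

def judge(question: str, response: str, call_model) -> int:
    """Ask a separate judge model for a score and parse it."""
    reply = call_model(JUDGE_PROMPT.format(question=question, response=response))
    return int(reply.strip())

# Stub judge for demonstration; a real judge is a separate model call,
# itself validated against human-labeled examples.
score = judge("How do I reset my password?",
              "Click 'Forgot password' on the login page.",
              call_model=lambda prompt: "4")
assert 1 <= score <= 5
```

Validating the judge typically means running it on outputs with known human ratings and checking agreement, including with answer order swapped to surface position bias.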
Regression
A drop in quality, accuracy, or expected behavior compared to a previous baseline. In traditional software, regressions usually have an obvious cause (a code change broke something). In AI systems, regressions can be silent: a model provider updates their model, behavior subtly changes, and the product gets worse without any change to your codebase. Catching regressions is one of the primary purposes of an eval suite.
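Catching a regression against a baseline can be sketched as a per-case comparison between two eval runs. Case names, scores, and the tolerance below are illustrative.

```python
def find_regressions(baseline: dict, current: dict, tolerance: float = 0.05):
    """Return case IDs whose score dropped more than `tolerance`."""
    return [case for case, score in current.items()
            if score < baseline.get(case, 0.0) - tolerance]

baseline = {"refund-email": 0.92, "angry-customer": 0.88}
current  = {"refund-email": 0.93, "angry-customer": 0.71}  # silent drop
assert find_regressions(baseline, current) == ["angry-customer"]
```

Because the baseline is stored, this check catches the "silent" case: the scores move even when your codebase did not.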
Threshold
The minimum acceptable score for an eval to be considered passing. If your eval suite scores each response 0–100, a threshold of 80 means the feature fails if the average score drops below 80. Setting thresholds requires judgment: too strict and you're blocked by noise, too loose and you don't catch real problems. Thresholds should be set based on what actually matters to users, not arbitrary round numbers.
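The 0–100 scoring example above, expressed as a gate check. The scores are illustrative.

```python
def passes_threshold(scores: list[float], threshold: float = 80.0) -> bool:
    """Feature fails if the average score drops below the threshold."""
    return sum(scores) / len(scores) >= threshold

assert passes_threshold([85, 90, 78])      # average 84.3: passes
assert not passes_threshold([85, 90, 60])  # average 78.3: blocked
```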
Hallucination
When a language model generates confident, plausible-sounding output that is factually incorrect. The term is widely used but imprecise — models don't "hallucinate" in any meaningful cognitive sense; they generate statistically likely continuations of their context. More useful in practice: defining the specific factual accuracy requirements for your use case and evaluating against them, rather than hoping to eliminate hallucination generally.
Drift
Gradual degradation in AI system performance over time without an identifiable single cause. Drift can be caused by changes to the underlying model, changes in the distribution of user inputs, or changes in the world that make previously correct outputs incorrect. Detecting drift requires continuous evaluation against a consistent baseline.
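Continuous evaluation against a consistent baseline can be sketched as a scheduled check on the eval suite's average score over time. The window, tolerance, and weekly scores below are illustrative.

```python
def detect_drift(history: list[float], baseline: float,
                 window: int = 3, tolerance: float = 0.05) -> bool:
    """Flag drift when the recent average falls below baseline - tolerance."""
    recent = history[-window:]
    return sum(recent) / len(recent) < baseline - tolerance

# Weekly average eval scores: a slow slide with no single breaking change.
weekly_scores = [0.91, 0.90, 0.89, 0.87, 0.85, 0.83]
assert detect_drift(weekly_scores, baseline=0.91)
```

Averaging over a window distinguishes a sustained slide from run-to-run noise, which is what makes drift different from a single-run regression.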
Harness
The infrastructure that runs evals: takes a dataset, sends each input to the model, collects outputs, passes them to a grader, and produces a score. A harness handles the mechanics so that defining new evals requires only specifying inputs and grading criteria, not rebuilding the execution infrastructure each time.
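The loop described above, as a minimal sketch. `model` and `grader` are injected, so defining a new eval means supplying only a dataset and a grading function; the names and toy data are illustrative.

```python
def run_evals(dataset, model, grader) -> float:
    """Send each input to the model, grade each output, return pass rate."""
    passed = 0
    for case in dataset:
        output = model(case["input"])
        if grader(output, case["expected"]):
            passed += 1
    return passed / len(dataset)

# Toy stand-ins: a lookup-table "model" and an equality grader.
dataset = [{"input": "2+2", "expected": "4"},
           {"input": "capital of France", "expected": "Paris"}]
toy_model = lambda p: {"2+2": "4", "capital of France": "Paris"}[p]
grader = lambda out, expected: out == expected

assert run_evals(dataset, toy_model, grader) == 1.0
```

A production harness adds concurrency, retries, logging, and score breakdowns, but the shape of the loop stays the same.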
Acceptance criteria (for AI features)
A pre-defined statement of what the AI feature must do to be considered shippable. Borrowed from software product management, but harder to write for AI: "summarizes correctly" is not an acceptance criterion. A useful one specifies the input class, the expected output property, and how it will be measured — for example, "on a sample of 100 customer support emails, the summary must include the primary request with no hallucinated details in at least 90% of cases."
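The example criterion above, made measurable as a pass-rate check. The per-email grading results here are illustrative booleans; in practice each would come from a grader or judge.

```python
def meets_criterion(results: list[bool], required_rate: float = 0.90) -> bool:
    """Shippable only if the pass rate meets the stated rate."""
    return sum(results) / len(results) >= required_rate

# Per-email grading results from a sample of 100 support emails
# (92 summaries included the primary request with no hallucinated details).
results = [True] * 92 + [False] * 8
assert meets_criterion(results)
assert not meets_criterion([True] * 89 + [False] * 11)
```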