The AI Eval Maturity Model
Five levels from vibes-based QA to eval-driven development. Where is your organization, and what does the path forward look like?
Most engineering teams know their AI features could be better tested. Few know specifically what "better" looks like. This model gives you a vocabulary for where you are and a concrete description of what the next level requires.
The five levels are not aspirational — they describe real patterns that appear across organizations at different stages of AI maturity. Most teams are at Level 1 or 2. Very few operate consistently at Level 4.
Level 0: No evals
Quality is assessed through intuition. Developers eyeball outputs before pushing. There are no automated checks. When something breaks, users find it first. This is the default state for teams that have shipped their first AI feature without stopping to ask how they'll know if it regresses.
Level 1: Manual spot-checks
Someone on the team runs through a set of test cases before each release. The process is inconsistent — dependent on individual effort and tribal knowledge about which cases matter. It catches obvious failures but misses subtle regressions and doesn't scale as the feature surface grows.
Level 2: Automated evals in CI
An eval suite runs on every pull request. The team has defined, in code, what "correct" looks like for the AI feature. Regressions get caught before they reach users. This is the most important transition in the model — it changes quality from a manual activity into a system property.
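A Level 2 suite can start very small: a script CI runs on every pull request that fails the build when the pass rate drops. A minimal sketch, assuming a hypothetical `summarize()` feature and hand-written pass/fail criteria (real suites would call the model and use graded metrics):

```python
# Minimal CI eval sketch. summarize() is a stand-in for the AI feature
# under test; in practice this would be an LLM call.
def summarize(text: str) -> str:
    return text.split(".")[0] + "."

# Each case pairs an input with a predicate defining "correct" for it.
EVAL_CASES = [
    ("Latency rose 40% after the deploy. Details follow.",
     lambda out: "40%" in out),           # must preserve the key figure
    ("The outage lasted two hours. Root cause was DNS.",
     lambda out: len(out) <= 80),         # must stay concise
]

def run_evals() -> float:
    """Run every case and return the fraction that passed."""
    passed = sum(check(summarize(text)) for text, check in EVAL_CASES)
    return passed / len(EVAL_CASES)

if __name__ == "__main__":
    score = run_evals()
    # A nonzero exit here is what blocks the pull request.
    assert score >= 0.9, f"Eval pass rate {score:.0%} below threshold"
```

The point is not the specific checks but that "correct" now lives in code, so a regression fails the build instead of reaching users.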
Level 3: Continuous evals in production
Evals run continuously against live traffic samples. The team receives alerts when quality or cost drifts outside acceptable thresholds. Model updates by providers no longer surprise anyone — the team knows within hours when behavior has changed, not after a customer complaint.
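The production loop reduces to a periodic job that scores a traffic sample against thresholds. A sketch with assumed threshold values and a fabricated sample shape (real systems would pull logged requests and their graded scores):

```python
QUALITY_FLOOR = 0.85      # alert if the sampled pass rate falls below this
COST_CEILING_USD = 0.02   # alert if mean cost per request exceeds this

def check_drift(samples: list[dict]) -> list[str]:
    """Return alert messages for any metric outside its threshold.

    Each sample is assumed to carry a graded 'passed' flag and the
    request's 'cost_usd'.
    """
    alerts = []
    pass_rate = sum(s["passed"] for s in samples) / len(samples)
    mean_cost = sum(s["cost_usd"] for s in samples) / len(samples)
    if pass_rate < QUALITY_FLOOR:
        alerts.append(f"quality drift: pass rate {pass_rate:.0%}")
    if mean_cost > COST_CEILING_USD:
        alerts.append(f"cost drift: ${mean_cost:.4f}/request")
    return alerts
```

Run hourly, a check like this is what turns a silent provider-side model update into a same-day alert rather than a customer complaint.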
Level 4: Eval-driven development
Evals are written before features, not after. Product requirements are expressed as measurable criteria. The team knows what "done" means for an AI feature before anyone writes a line of implementation code. Quality is designed in, not inspected for.
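Concretely, "requirements as measurable criteria" means the acceptance checks exist as code before the feature does. A sketch with illustrative requirement names and thresholds, not a prescribed format:

```python
# Product requirements written as executable criteria, before any
# implementation exists. Names and thresholds are illustrative.
REQUIREMENTS = {
    "answers cite a source": lambda out: "[source:" in out,
    "stays under 100 words": lambda out: len(out.split()) <= 100,
    "never echoes the phrase 'system prompt'":
        lambda out: "system prompt" not in out.lower(),
}

def feature_is_done(generate) -> bool:
    """'Done' means every requirement holds on a reference question."""
    out = generate("What caused the 2024 outage?")
    return all(check(out) for check in REQUIREMENTS.values())

# Before implementation, a stub fails the evals by design:
def stub(question: str) -> str:
    return "Not implemented yet."
```

As with test-driven development, the first run fails, and the implementation work is finished exactly when `feature_is_done` flips to true.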
Why the transition from Level 1 to Level 2 is the hardest
Moving from Level 0 to Level 1 requires only discipline. Moving from Level 1 to Level 2 requires investment: defining correctness criteria, building eval infrastructure, and wiring the suite into the CI pipeline. It is a one-time cost that teams consistently underestimate.
The teams that make this transition successfully usually do it in response to a specific incident — a silent regression that reached production, a model update that changed behavior without anyone noticing, a cost spike that went undetected for weeks. The incident provides the organizational justification that abstract arguments about quality rarely do.
What Level 4 actually requires
Eval-driven development is a cultural change more than a technical one. It requires product managers who can express requirements as measurable criteria. It requires engineers who treat eval coverage as a first-class deliverable. It requires leadership that has decided AI quality is worth investing in systematically.
Most organizations that reach Level 4 do so incrementally. They start with a single high-stakes AI feature, build eval coverage for it, demonstrate that the investment paid off, and expand from there.
Where most organizations are
Based on observable patterns across the industry: the majority of teams that have shipped AI features in production are at Level 1. A meaningful minority have reached Level 2. Level 3 and Level 4 are uncommon.
This means the biggest opportunity for most organizations is not optimizing an existing eval program — it’s building the first one. The move from manual spot-checks to automated evals in CI is the single change with the highest return on investment in AI quality.