The Problem: AI Quality Isn't Testable Yet
You've built a Claude Opus 4 agent that routes customer support tickets. It works brilliantly in your notebook. You ship it to staging. Everything looks good. You deploy to production on Tuesday morning.
By Wednesday, you're fielding complaints. The agent is occasionally hallucinating ticket classifications. It's missing context in 3% of cases. You roll back. You tweak the prompt. You test manually again. You deploy again.
This cycle repeats because you're treating AI quality like a feature you ship once and hope works. You're not treating it like code.
In traditional software engineering, we test relentlessly. We run unit tests, integration tests, regression tests—all automatically, every commit. We gate deployments on test results. A broken test blocks the build. That's non-negotiable.
AI quality gates don't exist in most organisations yet. There's no automated mechanism that says: "This agent's accuracy dropped 2 percentage points compared to the baseline. Fail the deployment. Don't ship it."
That's what evals as code solves.
Evals as code means treating AI quality measurement exactly like you treat code quality: write tests declaratively, run them automatically on every change, gate deployments on results, and iterate continuously. No manual spot-checking. No hope-driven shipping. Just measurable, reproducible quality gates.
At Brightlume, we've shipped 85%+ of our AI pilots directly to production because we treat evals as first-class CI/CD citizens from day one. This article walks you through the engineering patterns that make that possible.
What Evals Are, and Why They're Different From Traditional Tests
An eval is a test for AI behaviour. But it's fundamentally different from a unit test.
A unit test is deterministic: given input X, you always get output Y. You assert Y == expected_Y. Pass or fail, done.
An AI eval is probabilistic. Given input X, you might get output Y1, Y2, or Y3—all reasonable. You can't hardcode assertions. You need a way to measure whether the output is good enough.
That's where evals diverge from traditional tests:
Traditional Test: assert classify_ticket("urgent bug") == "critical"
AI Eval: "Did the agent correctly classify this ticket? Measure semantic similarity, check if the classification aligns with the intent, score it on a scale of 0–10."
The second requires a judgment call. That judgment can come from a human, a rule-based scorer, or another AI model (an LLM-as-judge). The key insight is that you're measuring quality dimensions—accuracy, latency, cost, safety, clarity—rather than exact outputs.
Evals as code formalises this measurement. You write evals declaratively (like tests), run them in CI/CD pipelines, aggregate results into a score, and gate deployments on that score. If your accuracy drops below 92%, the build fails. If latency exceeds 800ms, the build fails. You don't deploy regressions.
This is production discipline applied to AI quality.
The Architecture: Evals in Your CI/CD Pipeline
Let's build the mental model. Your CI/CD pipeline today looks something like this:
- Developer commits code.
- Pipeline runs unit tests, linting, security scans.
- Tests pass → build succeeds → deploy to staging → manual QA → deploy to production.
With evals as code, it looks like this:
- Developer commits code and updates the agent prompt, model config, or retrieval logic.
- Pipeline runs unit tests, linting, security scans, and AI evals.
- Evals run against a test dataset (e.g., 100 past customer tickets).
- Results are scored (accuracy, latency, cost) and compared to baseline.
- If any metric regresses beyond threshold, the build fails. Developer must fix it.
- Tests pass → build succeeds → deploy to staging → evals run again on staging data → deploy to production.
The critical difference: you're catching quality regressions before they hit production, the same way a failing unit test blocks a deploy.
Here's the concrete architecture:
Eval Dataset: A curated set of test cases representing real production scenarios. For a customer support agent, this might be 100 actual tickets with known correct classifications. For a clinical workflow agent, it's 50 patient intake forms with expected outcomes.
Eval Harness: Code that runs your agent against the eval dataset, captures outputs, and scores them. This runs in your CI/CD pipeline (GitHub Actions, GitLab CI, Jenkins—whatever you use).
Scorers: Functions that measure quality. A scorer might be:
- Rule-based: "Did the agent's output contain the required fields?" (deterministic)
- LLM-as-judge: "Is this response factually accurate?" (Claude Opus 4 scores it 0–10)
- Metric-based: "Latency < 500ms?" (deterministic)
Baseline & Thresholds: Your eval results from the last production deployment. New results are compared to baseline. If accuracy drops more than 2 percentage points, or latency increases more than 100ms, the build fails.
Results Storage: Eval results are logged, versioned, and tracked over time. You can see how agent quality evolves across commits, deployments, and model updates.
This architecture mirrors how automated quality gates work in traditional CI/CD: you define policies, run automated checks, and gate deployments on policy compliance. The difference is the policy is "agent accuracy must stay above 92%" instead of "code coverage must stay above 80%."
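The baseline-comparison gate at the heart of this architecture can be sketched in a few lines. The function name and the threshold values here are illustrative, not a specific tool's API:

```python
# Sketch of a baseline regression gate. Thresholds mirror the example
# policy above: fail if accuracy drops >2 points or latency rises >100ms.
def check_regressions(current: dict, baseline: dict,
                      max_accuracy_drop: float = 2.0,
                      max_latency_increase_ms: float = 100.0) -> list:
    """Return a list of regression messages; an empty list means the gate passes."""
    failures = []
    acc_drop = baseline["accuracy_pct"] - current["accuracy_pct"]
    if acc_drop > max_accuracy_drop:
        failures.append(f"accuracy dropped {acc_drop:.1f} points")
    lat_increase = current["latency_ms"] - baseline["latency_ms"]
    if lat_increase > max_latency_increase_ms:
        failures.append(f"latency increased {lat_increase:.0f}ms")
    return failures

baseline = {"accuracy_pct": 94.0, "latency_ms": 1200}
current = {"accuracy_pct": 91.5, "latency_ms": 1250}
print(check_regressions(current, baseline))  # → ['accuracy dropped 2.5 points']
```

In CI, a non-empty failure list becomes a non-zero exit code, which is what actually blocks the build.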
Building Your First Eval: A Concrete Example
Let's make this tangible. Say you're building an AI agent that summarises clinical notes for a health system. The agent reads a doctor's notes and produces a structured summary for the EHR.
Here's your eval:
eval_name: clinical_note_summarisation
dataset:
  - input: "Patient presented with persistent cough for 3 weeks, fever 38.5C, chest pain on deep breath. CXR shows right lower lobe infiltrate. Started on amoxicillin-clavulanate."
    expected_summary:
      chief_complaint: "Persistent cough, fever, chest pain"
      findings: "RLL infiltrate on CXR"
      assessment: "Likely community-acquired pneumonia"
      plan: "Antibiotic therapy initiated"
  - input: "..."
    expected_summary: "..."
  # 48 more test cases
scorers:
  - name: structural_completeness
    type: rule_based
    checks:
      - "summary contains chief_complaint"
      - "summary contains findings"
      - "summary contains assessment"
      - "summary contains plan"
  - name: semantic_accuracy
    type: llm_as_judge
    prompt: "Does this summary accurately capture the clinical note? Score 0-10."
    model: claude-opus-4
  - name: latency
    type: metric
    threshold_ms: 2000
thresholds:
  structural_completeness: '>= 95%'
  semantic_accuracy: '>= 8.5'
  latency: '<= 2000ms'
You run this eval in CI/CD. The harness:
- Loads the dataset (50 clinical notes).
- Calls your agent on each note.
- Runs each scorer (checks structure, asks Claude to score accuracy, measures latency).
- Aggregates results: 48/50 notes have complete structure (96%), average semantic score 8.7, median latency 1,400ms.
- Compares to baseline: structure was 96% (no change ✓), semantic was 8.5 (improved +0.2 ✓), latency was 1,200ms (increased +200ms ⚠️).
- Decision: latency increased but still under threshold. Build passes. Deploy.
If semantic accuracy had dropped to 7.8, the build would fail. Developer must investigate: Did the prompt change? Did the model change? Is the retrieval broken? Fix it, push again, re-run evals.
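The harness loop those steps describe can be sketched generically. Here `run_agent` and the scorer are stand-ins for your own agent and scoring functions, not a real framework's API:

```python
# Minimal eval-harness loop: run the agent on every case, score each
# output with every scorer, and aggregate per-scorer averages.
import statistics

def run_eval(dataset, run_agent, scorers):
    per_scorer = {name: [] for name in scorers}
    for case in dataset:
        output = run_agent(case["input"])
        for name, score_fn in scorers.items():
            per_scorer[name].append(score_fn(output, case))
    return {name: statistics.mean(scores) for name, scores in per_scorer.items()}

# Usage with a stub agent and one trivial scorer:
dataset = [{"input": "note 1", "expected": "a"}, {"input": "note 2", "expected": "b"}]
scorers = {"exact_match": lambda out, case: 1.0 if out == case["expected"] else 0.0}
print(run_eval(dataset, lambda text: "a", scorers))  # → {'exact_match': 0.5}
```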
This is evals as code. It's discipline.
Practical Implementation: Tools and Patterns
You don't build this from scratch. There's a growing ecosystem of tools designed specifically for AI eval pipelines.
Braintrust's eval tools provide a comprehensive survey of the landscape. The core categories are:
Eval Frameworks: Libraries that make it easy to define evals, run them, and collect results. Examples include LangSmith (LangChain's eval framework), Arize Evals, and Inspectability. These let you write evals declaratively and integrate them into CI/CD.
LLM-as-Judge Platforms: Services that handle the "judge" model for you. You define a scoring prompt, they run Claude/GPT-4/Gemini against your outputs, you get scores. Services like Evals.ai and LangChain's evaluators abstract away the model management.
CI/CD Integration Tools: GitHub Actions workflows and GitLab CI templates that run evals automatically on every commit. Some tools (like LaunchDarkly) provide pre-built CI/CD pipelines for AI configs with quality gates baked in.
Eval Dashboards: Platforms that visualise eval results over time, track regressions, and alert when thresholds are breached. Weights & Biases, Arize, and Humanloop all offer this.
The pattern most teams follow (and the practical guide from Dev.to outlines this well) is:
- Define evals in code (YAML or Python).
- Store eval datasets in version control or a data store (S3, GCS).
- Run evals in CI/CD on every commit (GitHub Actions, etc.).
- Compare to baseline automatically.
- Gate deployments on eval results (fail the build if thresholds breach).
- Log results to a dashboard for visibility.
A concrete GitHub Actions workflow looks like:
name: AI Agent Evals
on: [push, pull_request]
jobs:
evals:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3
- uses: actions/setup-python@v4
with:
python-version: '3.11'
- run: pip install -r requirements.txt
- run: python run_evals.py --dataset eval_dataset.json --model claude-opus-4
- name: Compare to baseline
run: python compare_evals.py --current eval_results.json --baseline baseline.json
- name: Fail if regressions
run: |
if grep -q "REGRESSION" eval_report.txt; then
echo "Eval regressions detected. Build failed."
exit 1
fi
- name: Upload results
uses: actions/upload-artifact@v3
with:
name: eval_results
path: eval_results.json
This is the simplest form. You run evals, check for regressions, fail the build if found. No agent ships with degraded quality.
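For completeness, here's a sketch of what a compare_evals.py-style script might do: write a report with REGRESSION lines for the grep step to find. This simplified version checks absolute thresholds only, and the file names and threshold table are illustrative:

```python
# Sketch of a report-writing comparison script. ("min", x) means the
# metric must stay >= x; ("max", x) means it must stay <= x.
import json

THRESHOLDS = {"accuracy_pct": ("min", 92.0), "latency_ms": ("max", 2000)}

def write_report(current_path: str, report_path: str = "eval_report.txt") -> bool:
    with open(current_path) as f:
        current = json.load(f)
    lines, passed = [], True
    for metric, (kind, limit) in THRESHOLDS.items():
        value = current[metric]
        breached = value < limit if kind == "min" else value > limit
        lines.append(f"{'REGRESSION' if breached else 'OK'} {metric}={value} (limit {limit})")
        passed = passed and not breached
    with open(report_path, "w") as f:
        f.write("\n".join(lines) + "\n")
    return passed
```

A real script would also load the baseline file and apply relative thresholds, as described earlier.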
Scoring Strategies: Rule-Based, LLM-as-Judge, and Hybrid
The scorer is the heart of your eval. It's the mechanism that turns "Is this good?" into a number. Different scorers suit different scenarios.
Rule-Based Scorers are deterministic. They check for the presence of required fields, format compliance, or simple heuristics.
Example: For a customer support ticket classifier, you might score on:
- "Did the agent assign a category?" (0 or 1)
- "Is the category in the allowed list?" (0 or 1)
- "Did the agent provide a confidence score?" (0 or 1)
Rule-based scorers are fast, reproducible, and cheap. But they're brittle. They can't judge semantic quality. A ticket might be classified correctly but with a poor explanation, and a rule-based scorer won't catch it.
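A minimal rule-based scorer for the ticket classifier above might look like this. The allowed-category list and field names are hypothetical:

```python
# Deterministic rule-based scorer: each check contributes equally to a 0-1 score.
ALLOWED_CATEGORIES = {"critical", "bug", "billing", "how_to", "feedback"}

def score_ticket_output(output: dict) -> float:
    checks = [
        "category" in output,                              # assigned a category?
        output.get("category") in ALLOWED_CATEGORIES,      # category in allowed list?
        isinstance(output.get("confidence"), (int, float)),  # provided a confidence?
    ]
    return sum(checks) / len(checks)
```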
LLM-as-Judge Scorers delegate judgment to an LLM (typically Claude Opus 4 or GPT-4). You define a scoring prompt, the LLM reads the agent's output and the ground truth, and scores on a scale (0–10, or binary pass/fail).
Example scoring prompt:
You are evaluating a clinical summary generated by an AI agent.
Original Note:
{original_note}
Agent Summary:
{agent_summary}
Expected Summary:
{expected_summary}
Score the agent's summary on these dimensions:
1. Completeness: Does it capture all key clinical findings? (0-10)
2. Accuracy: Is everything factually correct relative to the original note? (0-10)
3. Clarity: Is it written in clear, structured clinical language? (0-10)
Provide a single overall score (0-10) and brief justification.
LLM-as-judge is powerful. It can evaluate semantic correctness, tone, safety, and nuance. But it's slower (API calls to Claude/GPT-4) and more expensive. A typical LLM-as-judge eval on 100 test cases costs $2–10 depending on model and prompt length.
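Here's a sketch of an LLM-as-judge scorer. The API call assumes the official anthropic Python SDK, and the prompt asks the judge to end its reply with an "Overall: n/10" line so the score can be parsed deterministically; everything here is illustrative rather than a fixed recipe:

```python
# Sketch of an LLM-as-judge scorer: a scoring prompt, a deterministic
# parser for the judge's reply, and an (assumed) Anthropic API call.
import re

JUDGE_PROMPT = """You are evaluating a clinical summary generated by an AI agent.
Original Note:
{original_note}
Agent Summary:
{agent_summary}
Score completeness, accuracy, and clarity, then end with "Overall: <n>/10"."""

def parse_score(judge_reply: str) -> float:
    match = re.search(r"Overall:\s*(\d+(?:\.\d+)?)\s*/\s*10", judge_reply)
    if not match:
        raise ValueError("judge reply missing 'Overall: n/10' line")
    return float(match.group(1))

def judge(original_note: str, agent_summary: str) -> float:
    import anthropic  # assumes the official SDK is installed and a key is set
    client = anthropic.Anthropic()
    reply = client.messages.create(
        model="claude-opus-4",
        max_tokens=300,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            original_note=original_note, agent_summary=agent_summary)}],
    )
    return parse_score(reply.content[0].text)
```

Forcing a machine-parseable final line is what makes an LLM-as-judge scorer usable in a pipeline; free-form judgments are hard to gate on.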
Hybrid Scorers combine both. You run fast rule-based checks first (structural validation), then LLM-as-judge for semantic quality only if rule-based checks pass. This saves cost and latency while maintaining quality gates.
Example flow:
- Rule check: "Does the summary have all required sections?" If no, score 0. Done.
- If yes, LLM-as-judge: "How accurate is the summary?" Score 0–10.
- Final score: (rule_check_pass * 10) * 0.3 + llm_score * 0.7
This hybrid approach is what most production teams use. It's cost-effective and catches both structural and semantic regressions.
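The hybrid flow above translates directly into code. Here `llm_judge` stands in for the LLM-as-judge call, the required sections come from the clinical example, and the weights mirror the formula in the flow:

```python
# Hybrid scorer sketch: fail fast on the cheap structural check, and only
# spend an LLM-as-judge call when the structure passes.
def hybrid_score(summary: dict, llm_judge) -> float:
    required = ("chief_complaint", "findings", "assessment", "plan")
    if not all(section in summary for section in required):
        return 0.0  # rule check failed; skip the expensive judge call
    llm_score = llm_judge(summary)  # 0-10
    return 10 * 0.3 + llm_score * 0.7  # rule_check_pass * 10 * 0.3 + llm * 0.7
```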
The key insight: choose scorers that measure what matters to your business. For a customer support agent, accuracy and resolution time matter. For a clinical agent, safety and completeness matter. Your evals should reflect those priorities.
Building Eval Datasets That Represent Reality
Your eval is only as good as your dataset. A 50-test-case eval that doesn't represent production is useless. You'll pass evals and fail in production.
Here's how to build representative eval datasets:
Start With Production Data: Sample 100–500 real examples from production. For a support agent, grab real tickets. For a clinical agent, grab real notes (anonymised). These are your ground truth.
Annotate Carefully: For each example, define the correct output. This is manual work. A clinical note might require a human doctor to label the correct summary. A support ticket might require a domain expert to verify the correct classification. Budget 5–10 minutes per example. 100 examples = 8–16 hours of annotation.
Stratify by Scenario: Don't just grab random examples. Ensure your dataset covers edge cases:
- Simple, straightforward cases (80% of production)
- Ambiguous or borderline cases (15%)
- Hard, unusual cases (5%)
For a clinical agent, this might mean: 40 routine intake forms, 8 complex multi-condition cases, 2 rare disease presentations.
Version Your Dataset: Store it in version control. When you update annotations or add new examples, commit the change. You want to know which eval dataset was used for each deployment.
Refresh Periodically: Every quarter, add 20–30 new examples from production. Your agent's blind spots change as it encounters new patterns. Your evals should evolve with them.
Dataset quality is often the bottleneck. It's unglamorous work, but it's foundational. A mediocre eval pipeline with a high-quality dataset beats a sophisticated pipeline with a weak dataset.
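The 80/15/5 stratification above can be sketched as a sampling step over annotated production examples. The `tier` field is an assumption (your examples need difficulty labels first), and integer quotas with the remainder going to hard cases keep the split deterministic:

```python
# Stratified sampling sketch for building an eval dataset with the
# 80/15/5 simple/ambiguous/hard split described above.
import random

def stratified_sample(examples, n, seed=42):
    """examples carry a 'tier' field: 'simple', 'ambiguous', or 'hard'."""
    rng = random.Random(seed)  # fixed seed keeps the dataset reproducible
    quotas = {"simple": n * 80 // 100, "ambiguous": n * 15 // 100}
    quotas["hard"] = n - sum(quotas.values())  # remainder goes to hard cases
    sample = []
    for tier, k in quotas.items():
        pool = [e for e in examples if e["tier"] == tier]
        sample.extend(rng.sample(pool, min(k, len(pool))))
    return sample
```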
Threshold Setting and Regression Detection
Once you're running evals, you need to decide: what's a pass? What's a fail?
Thresholds are the gate. They're the boundary between "safe to deploy" and "not safe to deploy."
Setting thresholds is both science and art:
Science: Look at your baseline metrics. If your agent achieves 94% accuracy on your eval dataset, and you've validated that 94% accuracy maps to acceptable production quality, then set your threshold at 93%. A 1 percentage point regression is acceptable; a 2 percentage point regression fails the build.
Art: You need to know what "good enough" means for your business. A clinical agent might require 98% accuracy (safety-critical). A hotel concierge agent might accept 85% (user can always escalate). These aren't technical decisions; they're business decisions.
Common threshold patterns:
Absolute Thresholds: "Accuracy must be >= 92%." Simple, clear, easy to enforce.
Relative Thresholds: "Accuracy must not drop more than 2 percentage points from baseline." Useful when you're improving the agent incrementally and don't want to block all progress.
Composite Thresholds: "Accuracy >= 92% AND latency <= 500ms AND cost <= $0.10 per request." You're gating on multiple dimensions simultaneously.
Percentile Thresholds: "95th percentile latency must be <= 1000ms." Useful for tail latency concerns.
At Brightlume, we typically recommend:
- Set absolute thresholds based on production requirements (safety, cost, latency).
- Set relative thresholds to catch regressions (e.g., "don't degrade by more than 1%").
- Composite thresholds that reflect your SLO (service level objective).
Once thresholds are set, regression detection is automatic. Your CI/CD pipeline compares current results to baseline, checks against thresholds, and gates the deployment. Autonomous quality gates enforce these policies consistently, the same way a linter enforces code style.
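A composite gate combining the absolute, relative, and percentile patterns might look like this sketch. All limits are illustrative, and the metric dicts are assumed shapes rather than a real tool's schema:

```python
# Composite threshold gate: absolute accuracy floor, relative regression
# limit against baseline, and a 95th-percentile tail-latency cap.
import statistics

def composite_gate(current, baseline, latencies_ms):
    """True only when every threshold in the composite policy holds."""
    p95 = statistics.quantiles(latencies_ms, n=20)[18]  # 95th percentile
    return all([
        current["accuracy_pct"] >= 92.0,                            # absolute
        baseline["accuracy_pct"] - current["accuracy_pct"] <= 1.0,  # relative
        p95 <= 1000.0,                                              # percentile
    ])
```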
From Evals to Continuous Improvement
Evals as code isn't just about gating deployments. It's about creating a feedback loop that drives continuous improvement.
Here's the loop:
- Deploy with evals: You ship an agent with a passing eval score.
- Monitor in production: You track real-world performance (accuracy, latency, cost, user satisfaction).
- Compare eval to production: You discover a gap. Evals said 94% accuracy, but production shows 89%. Why?
- Debug the gap: You investigate. Maybe your eval dataset didn't cover a common production pattern. Maybe the model behaves differently under load. Maybe user expectations are higher than you thought.
- Update evals: You add new test cases to your eval dataset that cover the gap. You adjust thresholds if needed.
- Iterate: You improve the agent (better prompt, different model, updated retrieval), run evals, deploy.
- Repeat: The loop continues. Each iteration, your evals get more representative, your thresholds more accurate, your deployments more confident.
This is how you move from 50% pilot-to-production rates (typical for most teams) to 85%+ (Brightlume's rate). You're not hoping agents work; you're measuring them systematically.
Curated "awesome AI evals" resource lists collect frameworks and benchmarks that support this loop. They're tools for continuous improvement, not one-time validation.
Scaling Evals: Multi-Agent Systems and Complex Workflows
So far, we've focused on single agents. But production AI systems are often more complex: multi-step workflows, agent-to-agent communication, human-in-the-loop handoffs.
Evals scale to these scenarios, but you need to think about it differently.
End-to-End Evals: You test the entire workflow, not just individual agents. For a clinical intake workflow, you might have: intake agent → triage agent → scheduling agent → confirmation agent. Your eval dataset includes 50 complete patient journeys, and you measure success as "patient was correctly triaged, scheduled, and received confirmation."
Component Evals: You also test individual agents in isolation. The intake agent should extract patient information with 98% accuracy. The triage agent should assign the correct priority. These component evals catch regressions early.
Integration Evals: You test the handoff between agents. Does the triage agent receive the intake agent's output correctly? Does it handle missing data gracefully? These evals catch integration bugs.
For complex workflows, your eval harness becomes more sophisticated. You're orchestrating multiple agents, tracking state across steps, and measuring success at multiple levels. But the principle is the same: treat each level as a quality gate in your CI/CD pipeline.
Cost becomes a consideration at scale. Running evals on 50 end-to-end workflows, each involving 4 agents and 3 LLM-as-judge calls, can cost $20–50. You might run evals only on commits that touch agent code, not on every commit. You might run expensive evals only before production deployment, and cheaper rule-based evals on every commit.
This is where eval infrastructure becomes important. Tools like Braintrust and Weights & Biases let you define eval pipelines with conditional steps, caching, and cost optimisation. You run fast evals first, expensive evals only if fast evals pass.
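The staged pattern (cheap deterministic evals first, expensive LLM-as-judge evals only when the cheap tier passes) can be sketched without any particular tool; the tier structure here is illustrative:

```python
# Staged eval execution: stop at the first failure, and never pay for
# the expensive tier unless the cheap tier passes.
def run_staged(cheap_evals, expensive_evals):
    """Each tier is a list of (name, zero-arg eval fn returning bool)."""
    for tier, evals in (("cheap", cheap_evals), ("expensive", expensive_evals)):
        for name, passes in evals:
            if not passes():
                return {"passed": False, "tier": tier, "failed_at": name}
    return {"passed": True}
```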
Safety and Governance: Evals as a Control Mechanism
In regulated industries (healthcare, finance, insurance), evals aren't just about quality—they're about compliance and risk management.
Evals can encode safety policies:
Hallucination Detection: An eval scorer checks whether the agent's output contains claims not supported by the source data. If hallucination rate > 2%, fail the deployment.
Bias Detection: An eval dataset includes diverse demographic scenarios. You measure whether the agent treats different groups equally. If treatment differs significantly, fail the deployment.
Drift Detection: You measure whether the agent's behaviour has changed over time. If the latest model is significantly different from the baseline, flag it for review.
Explainability: An eval scorer checks whether the agent provides reasoning for its decisions. If explanation quality drops, fail the deployment.
These safety evals are governance mechanisms. They're how you prove to regulators (and to your board) that you're deploying AI responsibly. When an audit asks, "How do you ensure this agent doesn't make biased decisions?" you can point to your eval results.
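As one concrete example, the hallucination policy above reduces to a simple aggregate gate, assuming a per-output hallucination flag has already been produced by an upstream scorer:

```python
# Hallucination-rate gate: fail the deployment if more than 2% of
# outputs contained claims unsupported by the source data.
def hallucination_gate(flags, max_rate: float = 0.02) -> bool:
    """flags[i] is True when output i contained an unsupported claim."""
    rate = sum(flags) / len(flags)
    return rate <= max_rate
```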
At Brightlume, safety evals are non-negotiable for healthcare and financial services clients. We define them upfront, embed them in CI/CD, and treat them with the same rigour as production code tests.
Common Pitfalls and How to Avoid Them
Teams implementing evals as code often hit the same obstacles. Here's how to avoid them:
Pitfall 1: Evals Don't Match Production
You build evals on clean, well-formatted data. Production data is messy. Your agent passes evals and fails in production.
Solution: Use real production data for your eval dataset. If that's not possible (e.g., you're building a new agent), manually create test cases that reflect production messiness: typos, ambiguity, edge cases.
Pitfall 2: Thresholds Are Too Loose
You set thresholds that are easy to pass. Every deployment passes evals. Evals become meaningless.
Solution: Set thresholds based on production requirements, not on current performance. If your agent needs to achieve 95% accuracy to be useful, set the threshold at 95%, not 80%.
Pitfall 3: Eval Datasets Become Stale
You create a dataset once, use it for 6 months, never update it. Your agent learns to game the eval. It passes evals but fails in production.
Solution: Refresh your eval dataset every quarter. Add new examples from production. Remove examples that are no longer representative.
Pitfall 4: LLM-as-Judge Is Inconsistent
You use Claude to score your evals, but Claude's behaviour changes between API calls. Your evals are flaky.
Solution: Pin a fixed model version (e.g., claude-opus-4, not claude-opus-4-latest). Set temperature to 0, and a fixed seed where the API supports one, for reproducibility. Log the LLM's scoring rationale so you can debug inconsistencies.
Pitfall 5: Evals Are Too Expensive
You run evals on every commit, but evals cost $50 per run. Your CI/CD bill explodes.
Solution: Stratify evals by cost. Run fast rule-based evals on every commit. Run expensive LLM-as-judge evals only on pull requests and before deployment. Cache eval results where possible.
Pitfall 6: No One Owns the Evals
Evals are written, deployed, then neglected. They break silently. No one updates them when production patterns change.
Solution: Assign ownership. Someone on the team owns the eval dataset, the eval harness, and the thresholds. They review eval results weekly and update evals based on production insights.
Putting It All Together: A 90-Day Roadmap
If you're starting from scratch, here's a realistic 90-day plan to implement evals as code:
Weeks 1–2: Define Success
- Identify the quality dimensions that matter (accuracy, latency, cost, safety).
- Set target thresholds based on business requirements.
- Define what "production-ready" means for your agent.
Weeks 3–4: Build Your Eval Dataset
- Collect 100–200 representative examples from production (or create them manually).
- Annotate each with ground truth.
- Store in version control.
Weeks 5–6: Implement Scorers
- Start with rule-based scorers (structural checks, format validation).
- Add LLM-as-judge scorers for semantic quality.
- Test scorers on your dataset. Iterate until they're reliable.
Weeks 7–8: Build the Eval Harness
- Write code that loads your dataset, runs your agent, collects outputs, runs scorers, aggregates results.
- Test it locally. Ensure reproducibility.
- Log results in a format you can track over time.
Weeks 9–10: Integrate into CI/CD
- Add eval harness to your GitHub Actions / GitLab CI / Jenkins pipeline.
- Gate deployments on eval results (fail the build if thresholds breach).
- Test the pipeline on a few commits. Ensure it works.
Weeks 11–12: Monitor and Iterate
- Deploy to production with evals active.
- Track production performance vs. eval predictions.
- Update evals based on gaps you discover.
- Refine thresholds based on real-world data.
By week 12, you have a working evals-as-code system. You're gating deployments on quality, catching regressions before they hit production, and iterating confidently.
This is the foundation for moving pilots to production at scale.
Conclusion: Evals as Code Is How You Scale AI Safely
AI agents are powerful. But they're unpredictable. Without systematic measurement, you're flying blind.
Evals as code changes that. It brings discipline to AI quality. You measure systematically. You gate deployments on measurements. You iterate based on data, not hope.
This is why Brightlume achieves 85%+ pilot-to-production rates. We treat evals like code from day one. Every agent ships with passing evals. Every deployment is backed by measurable quality gates. Every regression is caught before it hits production.
The tools and patterns are mature. The frameworks exist. The cost is manageable. The barrier to entry is low.
If you're shipping AI agents to production, evals as code isn't optional. It's how you prove your agents are safe, reliable, and ready.
Start with a single agent. Build a simple eval dataset. Write a few scorers. Integrate into CI/CD. Gate deployments. Iterate. That's it. You're now treating AI quality like code quality.
The rest is scaling that discipline across your organisation.