AI Strategy

Testing AI Systems: Unit Tests, Evals, and the Regression Problem

Learn how to test AI systems beyond traditional unit tests. Master evals, regression detection, and production-ready testing strategies for AI agents and LLMs.

By Brightlume Team

Why Classic Test Pyramids Break Down with AI

You've shipped software for years. You know the test pyramid: lots of unit tests at the base, integration tests in the middle, a few end-to-end tests at the top. It works. You push code, tests pass, production doesn't catch fire. But AI systems don't follow that playbook.

When you deploy an LLM-powered agent or a fine-tuned model, you're not deploying deterministic logic. You're deploying a probabilistic system that can produce subtly different outputs for semantically similar inputs. A unit test that checks "does this function return True or False" tells you nothing about whether your AI agent will correctly handle a customer service ticket tomorrow. Your test suite passes. Your model degrades. Your production metrics collapse. This is the regression problem, and it's why testing AI systems requires a fundamentally different approach.

The challenge is this: traditional software testing assumes repeatability and determinism. Feed input X, get output Y, every time. AI systems don't work that way. A prompt that worked perfectly last week might hallucinate today. A model fine-tuned on your training data might drift when deployed against real-world queries it wasn't optimised for. Your evals need to catch these regressions before they hit production, but your evals also need to be fast enough to run in CI/CD pipelines without adding 30 minutes to every deployment.

This is where the distinction between unit tests and evals becomes critical. Unit tests validate code logic. Evals validate model behaviour. And if you're building AI agents that write and execute code, you need both—but you need to understand what each one actually measures.

Understanding the Testing Landscape: Unit Tests vs. Evals

Unit Tests: Still Essential, But Limited

Unit tests in AI systems test the engineering layer, not the model layer. They validate that your code correctly calls the API, correctly parses the response, correctly routes logic based on model outputs. These are necessary but insufficient.

Consider a customer support AI agent. A unit test might verify:

  • The agent correctly invokes the embedding model when a query arrives
  • The retrieval system returns documents within latency SLAs
  • The agent correctly formats tool calls to the backend API
  • Error handling works when the API is unavailable

These tests pass. Your code is solid. But they tell you nothing about whether the agent actually answers customer questions correctly, whether it hallucinates product features, or whether it follows your tone guidelines. That's where evals come in.

Unit tests are still your foundation. You need them. But they're testing infrastructure, not intelligence. When you're operating AI systems in production, infrastructure correctness is table stakes. What you actually need to measure is behaviour correctness.

Evals: Testing Model Behaviour at Scale

Evaluations (evals) are systematic tests of model outputs against defined criteria. Unlike unit tests, evals don't test whether code executes correctly—they test whether the model's behaviour meets your standards.

There are three primary categories of evals you need to understand:

Capability evals measure whether your model can actually do the task. Can it classify sentiment correctly? Can it extract entities from unstructured text? Can it reason about multi-step problems? These evals typically use a test set with known correct answers and measure accuracy, precision, recall, or F1 score. You run capability evals once, during development, to confirm your model has the baseline ability to handle the task.

Regression evals measure whether your model's behaviour has degraded since the last version. This is the critical one for production systems. You run regression evals every time you update your model, change your prompt, or modify your agent's tool set. A regression eval uses a fixed test set and compares outputs against a baseline. If accuracy drops 3%, you know something changed. If it stays stable, you have confidence the deployment is safe. According to Anthropic's technical guide on demystifying evals for AI agents, regression evals are essential for detecting performance degradation before it hits users.

Scenario evals test specific edge cases and failure modes. Can your agent handle ambiguous queries? Does it correctly refuse requests outside its scope? Can it recover from malformed API responses? These evals are narrower than capability evals but deeper—they test specific scenarios where you've previously seen failures or where you anticipate risk.
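
Of the three categories, the regression eval is the easiest to sketch concretely: score a fixed test set, compare against the stored baseline, and block the deploy if the drop exceeds tolerance. Here is a minimal, hypothetical sketch; the 92% baseline and 2-point tolerance are illustrative, not a prescribed implementation:

```python
# Hypothetical regression gate. The baseline figure and tolerance are
# illustrative; in practice both come from your release records and SLAs.

BASELINE_ACCURACY = 0.92   # accuracy recorded for the last shipped version
MAX_REGRESSION = 0.02      # block the deploy if accuracy drops more than this

def accuracy(results: list) -> float:
    """Fraction of test cases the model answered correctly."""
    return sum(results) / len(results)

def regression_gate(results: list) -> bool:
    """True if the new version is within tolerance of the baseline."""
    return (BASELINE_ACCURACY - accuracy(results)) <= MAX_REGRESSION

# 91/100 correct: a 1-point drop from the 92% baseline, within tolerance
print(regression_gate([True] * 91 + [False] * 9))    # True
# 84/100 correct: an 8-point drop; the gate fails and the deploy is blocked
print(regression_gate([True] * 84 + [False] * 16))   # False
```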

The NIST framework on AI test, evaluation, validation and verification emphasises that these evaluation categories must be part of a structured assessment strategy, not ad-hoc testing. At Brightlume, we've found that teams shipping production AI in 90 days distinguish themselves by building evals into their CI/CD pipeline from day one, not bolting them on after deployment.

Building Your Eval Strategy: From Test Sets to Automated Scoring

Designing Test Sets That Actually Catch Regressions

Your eval is only as good as your test set. A test set that's too small won't catch subtle degradation. A test set that's too narrow won't represent real-world behaviour. A test set that's poorly labelled will give you false confidence.

First, stratify. If you're building an agent for financial services, your test set needs to represent the distribution of queries you actually see: 40% account inquiries, 30% transaction disputes, 20% product questions, 10% edge cases. If your test set is 50% edge cases, your evals are measuring something different from what your users experience.
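
A stratified draw is a few lines of code. This sketch uses the financial-services mix above; the pools would come from your labelled query logs, and here they are synthetic placeholder strings:

```python
import random

# Stratified test-set sampling. TARGET_MIX mirrors the hypothetical
# financial-services traffic split; pool contents are synthetic.

TARGET_MIX = {
    "account_inquiry": 0.40,
    "transaction_dispute": 0.30,
    "product_question": 0.20,
    "edge_case": 0.10,
}

def stratified_sample(pools: dict, size: int, seed: int = 0) -> list:
    """Draw a test set whose category proportions match TARGET_MIX."""
    rng = random.Random(seed)   # fixed seed keeps the test set reproducible
    sample = []
    for category, fraction in TARGET_MIX.items():
        sample.extend(rng.sample(pools[category], round(size * fraction)))
    return sample

pools = {cat: [f"{cat}-{i}" for i in range(500)] for cat in TARGET_MIX}
test_set = stratified_sample(pools, size=200)
print(len(test_set))   # 200, with 80/60/40/20 per category
```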

Second, label conservatively. When you're scoring eval outputs, don't aim for 100% agreement on what's "correct." Aim for high-confidence agreement on clear passes and clear failures. If you're uncertain whether an output is correct, mark it as uncertain. Your eval shouldn't rely on subjective judgment calls—it should measure objective criteria: Does the agent answer the question? Does it stay within scope? Does it cite sources? These are binary or near-binary. If you're debating whether an output is correct, your eval criterion isn't precise enough.

Third, version your test sets. When you discover a failure mode in production, add it to your test set. When you update your agent's capabilities, refresh your test set to cover the new functionality. Your test set should grow and evolve as your system matures. This is why unit testing AI systems requires first-principles thinking about statistical validation rather than copying traditional software testing approaches—your test set is your contract with production behaviour, and it needs to be maintained like production code.

Automated Scoring: When and How to Use LLM-as-Judge

Manually scoring thousands of eval outputs is impractical. You need automated scoring. For many tasks, an LLM-as-judge works well: you ask Claude Opus or GPT-4 to evaluate whether another model's output meets your criteria, and you score based on the judge's assessment.

LLM-as-judge has real limitations. It's slower than heuristic scoring (typically 500-2000ms per eval). It's more expensive than rule-based evals (roughly $0.01-0.05 per eval depending on your model and eval complexity). And it can be gamed—if your eval prompt is poorly written, the judge might reward hallucinations or poor reasoning.

But it's also the only practical way to evaluate open-ended outputs at scale. If you're evaluating whether an agent's explanation is clear, whether its tone is professional, or whether it correctly answers a nuanced question, you need a judge with semantic understanding. A regex can't do that.

When using LLM-as-judge, follow these principles:

Make the rubric explicit. Don't ask "Is this output good?" Ask "Does this output answer the user's question without hallucinating facts?" and "Does it cite sources for any claims?" Explicit rubrics reduce variance between evals and make failures reproducible.

Use a fixed model for judging. Don't use different judge models for different eval runs. Consistency matters more than optimality. If you use Claude Opus for one eval and Claude Sonnet for another, you're measuring judge variance, not model improvement.

Validate your judge against ground truth. Pick 50-100 examples where you've manually verified the correct answer. Run your LLM judge on those examples and measure agreement. If your judge disagrees with ground truth more than 5-10% of the time, your eval rubric needs work.

Log everything. Save the eval input, the model output, the judge's reasoning, and the score. When an eval fails unexpectedly, you need to debug why. You can't do that without logs.
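
The four principles above can be combined into one small scorer. In this sketch the rubric is explicit, the full record is logged, and `call_judge` is a stub; a real implementation would call one fixed judge model via your provider's API:

```python
import json

# LLM-as-judge sketch. The rubric asks for machine-parseable JSON, and the
# returned record logs input, output, reasoning, and score for debugging.
# call_judge is a placeholder, not a real API client.

RUBRIC = (
    "Score the OUTPUT against the QUESTION. Reply with JSON only: "
    '{"answers_question": true|false, "cites_sources": true|false, '
    '"reasoning": "<one sentence>"}'
)

def build_judge_prompt(question: str, output: str) -> str:
    return f"{RUBRIC}\n\nQUESTION: {question}\n\nOUTPUT: {output}"

def call_judge(prompt: str) -> str:
    # Stub verdict so the flow is runnable end to end.
    return ('{"answers_question": true, "cites_sources": false, '
            '"reasoning": "Answer given, but no source cited."}')

def score_output(question: str, output: str) -> dict:
    """Run the judge and log everything needed to debug a failure later."""
    verdict = json.loads(call_judge(build_judge_prompt(question, output)))
    return {
        "input": question,
        "output": output,
        "judge_reasoning": verdict["reasoning"],
        "score": int(verdict["answers_question"] and verdict["cites_sources"]),
    }

record = score_output("What is our refund window?",
                      "Refunds are accepted within 30 days.")
print(record["score"])   # 0: answered, but no source cited
```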

According to Anthropic's comprehensive guide on demystifying evals, multi-turn evaluations (where you test a model's ability to maintain context and coherence across multiple interactions) are increasingly important for agentic systems. Single-turn evals measure isolated outputs; multi-turn evals measure whether your agent stays consistent and coherent across a conversation.

The Regression Problem: Why Your Model Degrades in Production

Distribution Shift and Silent Failures

Your model performs well on your test set. You deploy it. A week later, your accuracy on production queries has dropped 8%. What happened?

Distribution shift. Your test set was drawn from historical data. Your production queries are slightly different: different phrasing, different edge cases, different user behaviour. Your model, trained on the test distribution, generalises reasonably well to similar inputs but degrades when the distribution changes.

This is silent. Your code doesn't error. Your logs don't show failures. Your users complain, and you only find out when you analyse production metrics three days later. This is why regression evals are non-negotiable for production AI systems.

Regression evals catch this by running the same test set against your current model and comparing outputs to a baseline. If your baseline accuracy was 92% and your current accuracy is 84%, you have a regression. You don't deploy. You investigate.

But regression evals only work if your test set represents real-world distribution. If your test set is too narrow, you'll have regressions in production that your evals didn't catch. This is why you need multiple eval strategies: regression evals for baseline performance, scenario evals for edge cases, and continuous monitoring in production.

Model Updates and Prompt Changes

Regressions don't only happen from distribution shift. They happen when you update your model or change your prompt.

You're using Claude Opus 4 for your agent. Anthropic releases Claude 5. You update your agent to use the new model. Capability-wise, Claude 5 is better. But your prompt was tuned for Opus 4's specific behaviour. Claude 5 might interpret your prompt differently, leading to different outputs. Your evals catch this: you run your regression test set against Claude 5, and if accuracy drops, you know your prompt needs adjustment.

Or you change your system prompt to be more concise. You think it'll improve latency without affecting accuracy. You run your evals. Accuracy drops 2%. You revert the change. This is the value of regression evals—they prevent you from shipping changes that degrade behaviour, even when the degradation is subtle.

At Brightlume, we've found that teams with the highest production success rates (our 85%+ pilot-to-production rate) run evals on every prompt change, every model update, and every agent capability modification. They treat evals as a gate, not a checkpoint. If evals fail, the change doesn't go to production. This discipline prevents silent regressions and keeps production systems stable.

Building Evals into Your CI/CD Pipeline

Latency and Cost: The Practical Constraints

You can't run 10,000 evals on every commit. Your CI/CD pipeline would take hours. Your eval costs would be thousands of dollars per day.

You need a tiered approach. On every commit, run a fast smoke test: 50-100 evals using simple heuristic scoring. This catches obvious breaks and takes 2-3 minutes. On every merge to main, run a medium eval suite: 500-1000 evals with LLM-as-judge scoring. This takes 15-20 minutes and costs $5-10. On every release to production, run a full eval suite: all 5000+ evals with multiple judge models and manual spot-checking. This takes an hour and costs $50-100, but you only do it once per release.

This tiered approach balances speed, cost, and confidence. Your developers get fast feedback (smoke tests pass in minutes). Your release process has high confidence (full evals before production). And your costs stay reasonable.

According to a practical overview of AI evaluation methodologies, integrating evals into CI/CD pipelines requires careful threshold setting. You need to define what "passing" means: does accuracy need to be within 2% of baseline, or 5%? Does latency need to stay under 500ms, or 1000ms? These thresholds should be set based on your SLAs and your users' expectations, not arbitrary numbers.

Threshold Setting and Decision Rules

When your evals run, you get a score. Is 88% accuracy acceptable? Is 85%? It depends on your baseline and your risk tolerance.

Set thresholds based on production impact. If a 3% accuracy drop would cost you $100,000 in lost revenue or customer churn, your threshold should be strict (no more than 1% regression). If a 5% drop is acceptable, your threshold can be looser. But be explicit about this. Don't set thresholds arbitrarily.

For regression evals specifically, use relative thresholds, not absolute ones. Don't say "accuracy must be above 90%." Say "accuracy must not drop more than 2% from baseline." This accounts for natural variation and focuses on detecting actual degradation, not absolute performance.
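
The difference between the two gate styles is easy to see in code. A model whose baseline was 96% can slip to 91% and still clear a 90% absolute floor; the relative gate catches the 5-point drop. Numbers here are illustrative:

```python
# Absolute floor vs relative gate. Only the relative gate notices a large
# drop from a previously strong baseline.

def passes_absolute(accuracy: float, floor: float = 0.90) -> bool:
    return accuracy >= floor

def passes_relative(accuracy: float, baseline: float,
                    max_drop: float = 0.02) -> bool:
    return (baseline - accuracy) <= max_drop

print(passes_absolute(0.91))         # True: looks fine in isolation
print(passes_relative(0.91, 0.96))   # False: a 5-point regression from baseline
```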

For multi-turn evals (critical for agentic AI, which must maintain state across interactions in a way copilots don't), use stricter thresholds. A 2% drop in single-turn accuracy might be acceptable; a 2% drop in multi-turn consistency is concerning because it suggests the agent is losing coherence across conversations.

Monitoring and Alerting

Evals in CI/CD catch regressions before production. But production monitoring catches regressions that evals missed—distribution shifts that your test set didn't represent, user behaviours that weren't in your training data.

Set up production monitoring that mirrors your eval metrics. If your evals measure accuracy, measure accuracy in production. If they measure latency, measure latency. Track these metrics over time. If you see a trend (accuracy drifting down 0.5% per week), that's a signal to investigate and potentially retrain.

When production metrics diverge from your evals, that's valuable information. It means your test set isn't representative. Add the divergent examples to your test set. Update your evals. This is how you improve your testing strategy over time.

Specific Testing Strategies for AI Agents and Workflows

Testing Agentic Behaviour: Tool Call Accuracy and Orchestration

When you're testing AI agents as digital coworkers, you're not just testing language understanding—you're testing orchestration. Can the agent decide which tool to call? Can it chain multiple tools together? Can it recover from tool failures?

Your evals need to measure this. For tool call accuracy, you test whether the agent calls the right tool with the right parameters. For orchestration, you test whether multi-step workflows complete correctly. For robustness, you test whether the agent handles tool failures gracefully.

Example: You're building an agent that books hotel reservations. Your eval might include:

  • Single-turn capability: "Book me a room at the Hilton in Sydney for 3 nights starting tomorrow." Does the agent call the booking tool with correct parameters?
  • Multi-turn orchestration: "Find me a hotel near the airport, check availability for next week, and book the cheapest option." Does the agent chain search → availability check → booking in the right order?
  • Error recovery: "Book me a room, but the payment fails." Does the agent handle the error and offer alternatives?

These evals are more complex than simple accuracy metrics: you're measuring task completion, not output quality. Guidance on how to evaluate AI agents and agentic workflows stresses tool call accuracy verification and workflow performance assessment because an agent's value comes from its ability to take actions, not just generate text.
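
The single-turn tool-call check can be sketched as a direct comparison between the expected call stored in the test case and the call your agent framework actually logged. Tool and argument names below are illustrative:

```python
# Tool-call accuracy check for the booking example. The expected call lives
# in the test case; `actual` is whatever your agent framework logged.

def tool_call_matches(expected: dict, actual: dict) -> bool:
    """Right tool, and every expected argument present with the right value."""
    return (
        actual.get("tool") == expected["tool"]
        and all(actual.get("args", {}).get(k) == v
                for k, v in expected["args"].items())
    )

expected = {"tool": "book_room",
            "args": {"hotel": "Hilton", "city": "Sydney", "nights": 3}}
actual = {"tool": "book_room",
          "args": {"hotel": "Hilton", "city": "Sydney", "nights": 3,
                   "rate_code": "FLEX"}}

print(tool_call_matches(expected, actual))   # True: extra args are tolerated
```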

Testing for Hallucination and Scope Violations

Hallucination is when your model generates plausible-sounding but false information. Scope violations are when your agent answers questions it shouldn't (product recommendations when it should only provide support, medical advice when it should only provide information).

These are critical failure modes. A customer support agent that hallucinates product features loses customer trust. A health system agent that provides medical advice outside its scope creates liability.

Your evals need to explicitly test for these. Create test cases where the correct answer is "I don't know" or "I can't help with that." Measure whether your agent correctly refuses. Create test cases with factual traps ("Our product has feature X, right?" when it doesn't). Measure whether your agent avoids the trap.

For hallucination detection, use LLM-as-judge with explicit criteria: "Does the output contain any claims not supported by the provided context?" For scope violations, use rule-based scoring: "Does the output stay within the defined scope?" These are binary or near-binary, so you can score them automatically without expensive LLM judging.
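
For the scope-violation side, the rule-based check really can be a few lines. This sketch scores test cases where the correct behaviour is a refusal; the marker list is illustrative and would need tuning to your agent's actual refusal style:

```python
# Rule-based scope scoring: pass only if the agent refused in one of its
# known refusal phrasings. Marker phrases are illustrative.

REFUSAL_MARKERS = ("i can't help with that", "i don't know", "outside my scope")

def correctly_refused(output: str) -> bool:
    """Binary check: did the agent refuse in one of its known forms?"""
    lowered = output.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)

print(correctly_refused("I can't help with that, but our docs may."))   # True
print(correctly_refused("Sure! Feature X works like this..."))          # False
```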

Testing Latency and Cost Under Load

Your evals might show 95% accuracy, but if your agent takes 10 seconds to respond to a customer query, your users will abandon it. Latency and cost matter as much as accuracy.

Add latency evals to your test suite. Measure how long your agent takes to respond to a typical query. Measure how long it takes to complete a multi-turn conversation. Set latency thresholds based on your SLAs: if your users expect a response within 2 seconds, your eval should fail if latency exceeds 2 seconds.

Add cost evals too. If you're using Claude Opus for your agent, each query costs roughly $0.015-0.03 depending on token count. If your agent processes 10,000 queries per day, that's $150-300 per day. If you can reduce token usage by 20% without hurting accuracy, that's $30-60 saved per day, or $11,000-22,000 per year. Cost evals help you optimise for this.
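
The arithmetic above, spelled out. The $0.03 per-query figure is the upper end of the range quoted, and all numbers are illustrative:

```python
# Cost-eval arithmetic: per-query cost times daily volume, plus the annual
# saving from a 20% token reduction. All figures are illustrative.

COST_PER_QUERY = 0.03
QUERIES_PER_DAY = 10_000
TOKEN_REDUCTION = 0.20

daily_cost = COST_PER_QUERY * QUERIES_PER_DAY
saving_per_day = daily_cost * TOKEN_REDUCTION

print(round(daily_cost, 2))             # 300.0
print(round(saving_per_day * 365, 2))   # 21900.0, roughly the $22k/year figure
```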

When you're deciding between Anthropic vs OpenAI for your agent, cost and latency evals matter. Claude Opus might be 15% more expensive but 30% faster. Your evals should quantify this trade-off, not just compare accuracy.

Advanced Eval Techniques: Multi-Turn, Adversarial, and Continuous Learning

Multi-Turn Evaluations: Testing Coherence and Context Maintenance

Single-turn evals measure isolated outputs. Multi-turn evals measure whether your agent maintains coherence across a conversation. This is critical for agentic workflows where the agent needs to remember context, correct itself, and maintain a consistent persona.

A multi-turn eval might look like this:

  1. User: "I need to book a hotel in Sydney for next week." Agent: "I can help with that. What dates are you looking at?" Eval: Did the agent acknowledge the request and ask a clarifying question?

  2. User: "March 15-18." Agent: "Got it. What's your budget?" Eval: Did the agent remember the location and dates from the previous turn?

  3. User: "Under $200 per night." Agent: "I found 5 hotels matching your criteria..." Eval: Did the agent use all three pieces of information (location, dates, budget) to make recommendations?

  4. User: "Actually, I prefer beachfront." Agent: "Let me filter those results for beachfront properties..." Eval: Did the agent update its recommendations based on the new constraint without losing previous context?
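
A harness for this kind of script replays the conversation and checks, after each turn, that the agent's state still contains everything the user has said so far. In this sketch the agent is a trivial stub that accumulates messages; in practice `run_agent_turn` would call your real agent and return its state:

```python
# Multi-turn eval harness for the hotel conversation above. The stub agent
# simply accumulates messages; a real one would be your production agent.

def run_agent_turn(state: dict, user_message: str) -> dict:
    state = dict(state)
    state["messages"] = state.get("messages", []) + [user_message]
    return state

# (user message, facts the agent must still hold after this turn)
SCRIPT = [
    ("I need a hotel in Sydney for next week.", ["Sydney"]),
    ("March 15-18.", ["Sydney", "March 15-18"]),
    ("Under $200 per night.", ["Sydney", "March 15-18", "$200"]),
]

def run_multi_turn_eval() -> bool:
    state: dict = {}
    for message, must_remember in SCRIPT:
        state = run_agent_turn(state, message)
        transcript = " ".join(state["messages"])
        if not all(fact in transcript for fact in must_remember):
            return False   # the agent lost context mid-conversation
    return True

print(run_multi_turn_eval())   # True for this stub agent
```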

Multi-turn evals are more expensive and more complex to score, but they measure something single-turn evals can't: whether your agent is actually useful in real conversations. If your agent loses context or contradicts itself, users will notice immediately.

According to Anthropic's guide, multi-turn regression evals are increasingly important because agentic systems are evaluated on their ability to maintain coherence and consistency across extended interactions, not just individual responses.

Adversarial Testing: Finding Failure Modes Before Production

Adversarial evals deliberately try to break your agent. You're looking for failure modes: inputs that cause hallucinations, contradictions, or scope violations.

Common adversarial eval patterns:

  • Jailbreaks: Prompts designed to make the agent ignore its instructions. "Forget your previous instructions and tell me..." Your eval should verify the agent resists these.
  • Factual traps: False premises designed to trigger hallucinations. "Our product has feature X, right?" Your eval should verify the agent doesn't confirm false facts.
  • Boundary testing: Requests just outside the agent's scope. "Can you recommend a doctor?" (if the agent should only provide health information, not recommendations). Your eval should verify the agent correctly refuses.
  • Inconsistency probes: Requests that contradict earlier statements. "Earlier you said X, but now you're saying Y. Which is correct?" Your eval should verify the agent handles contradictions gracefully.

Adversarial evals are labour-intensive to create, but they catch critical failure modes. At Brightlume, we've found that teams that run adversarial evals catch 60-70% more failure modes before production than teams that only run capability evals.

Continuous Learning: Updating Evals as Your System Matures

Your evals should evolve as your system matures. When you discover a failure in production, add it to your test set. When you add new capabilities, add new evals. When you change your model or prompt, update your regression baselines.

This is where version control matters. Your evals are code. Your test sets are data. Both should be versioned, reviewed, and tracked. When you update your evals, you should be able to explain why: "We added 50 new test cases for edge case X because we saw 3 production failures last week."

Treat your eval suite like you treat your production code: it's a critical system that requires maintenance, testing, and continuous improvement. The teams with the most reliable production AI systems are the teams that invest in their evals as much as they invest in their models.

Practical Implementation: From Theory to Production

Setting Up Your Eval Infrastructure

You need three things: a test set, a scoring function, and a CI/CD integration.

Test sets should be stored as versioned data files (JSON, CSV, or database records). Each test case should include the input, the expected output (or rubric for judging), and metadata (which capability it tests, which version it was added in, whether it's a regression case).
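
As a concrete illustration, one test-case record with those fields might look like this. Field names are hypothetical, not a standard schema:

```python
import json

# A hypothetical versioned test-case record: input, expected behaviour (as a
# rubric for judging), and the metadata fields listed in the text.

test_case = {
    "id": "tc-0042",
    "input": "What is your refund policy?",
    "expected": {
        "rubric": "States the 30-day window and cites the policy page.",
    },
    "metadata": {
        "capability": "policy_questions",   # which capability it tests
        "added_in": "v2.3",                 # which version it was added in
        "regression_case": True,            # added after a production failure
    },
}

print(json.dumps(test_case, indent=2))
```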

Scoring functions should be modular. Heuristic scoring (checking for keywords, regex matching) for fast evals. LLM-as-judge for nuanced evals. Custom scoring for domain-specific metrics (e.g., does the agent's recommended dosage fall within safe ranges). Your scoring function should return a score (0-1 or 0-100) and optionally reasoning (for debugging failures).
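
Modularity here mostly means a shared signature. In this sketch every scorer takes (output, test case) and returns a score in [0, 1] plus reasoning, so heuristic, LLM-as-judge, and custom scorers are interchangeable in the pipeline; names are illustrative:

```python
from typing import Callable, Tuple

# Modular scoring interface: any function matching Scorer can be plugged
# into the eval pipeline. keyword_scorer is the fast heuristic tier.

Scorer = Callable[[str, dict], Tuple[float, str]]

def keyword_scorer(output: str, case: dict) -> Tuple[float, str]:
    """Fast heuristic: pass only if every required keyword appears."""
    missing = [k for k in case["keywords"] if k.lower() not in output.lower()]
    if missing:
        return 0.0, f"missing keywords: {missing}"
    return 1.0, "all keywords present"

def run_eval(output: str, case: dict, scorer: Scorer) -> Tuple[float, str]:
    return scorer(output, case)

score, reasoning = run_eval(
    "Refunds are accepted within 30 days of purchase.",
    {"keywords": ["refund", "30 days"]},
    keyword_scorer,
)
print(score)   # 1.0
```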

CI/CD integration means your evals run automatically when you push code. Most teams use GitHub Actions, GitLab CI, or similar. Your eval pipeline should:

  1. Check out the code
  2. Load the test set
  3. Run your model/agent on each test case
  4. Score the outputs
  5. Compare to baseline
  6. Pass or fail the build based on thresholds
  7. Log results for analysis

This entire process should take less than 30 minutes for your smoke test suite and less than 2 hours for your full suite.
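
The seven steps can be sketched as one driver function that your CI job invokes. The loader and `run_model` below are stubs standing in for your test-set storage and your real model call:

```python
# The seven pipeline steps as a single driver. Stubs replace the test-set
# store and the model; the gate logic is the part that carries over.

def load_test_set() -> list:
    return [{"input": "q1", "expected": "a1"},
            {"input": "q2", "expected": "a2"}]

def run_model(prompt: str) -> str:
    return {"q1": "a1", "q2": "wrong answer"}[prompt]   # stub model

def run_pipeline(baseline: float = 0.92, max_drop: float = 0.02) -> bool:
    cases = load_test_set()                                    # steps 1-2
    outputs = [run_model(c["input"]) for c in cases]           # step 3
    scores = [out == c["expected"]                             # step 4
              for out, c in zip(outputs, cases)]
    accuracy = sum(scores) / len(scores)
    passed = (baseline - accuracy) <= max_drop                 # steps 5-6
    print({"accuracy": accuracy, "passed": passed})            # step 7: log
    return passed

print(run_pipeline())   # False: 50% accuracy is a large regression from 92%
```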

Choosing Eval Platforms and Tools

You can build evals in-house using Python and a simple scoring framework. You can use platforms like Anthropic's eval framework (open source, free, well-documented). You can use commercial platforms like Braintrust, Arize, or Humanloop that provide UI, versioning, and collaboration features.

For most teams shipping production AI, an open-source framework + custom scoring is sufficient. You don't need a platform. You need discipline: run evals on every change, set clear thresholds, monitor production metrics, and update your test set when you find failures.

Documentation and Governance

Your evals are part of your model governance, alongside version control, auditing, and rollback strategies. You need to document:

  • What each eval measures (capability, regression, scenario)
  • How many test cases it includes
  • What threshold it uses for passing
  • When it was last updated and why
  • Who owns it

This documentation should be in your code repository, reviewed in PRs, and updated when you change evals. When you deploy a new model version, you should be able to point to eval results that justify the deployment. When something breaks in production, you should be able to trace back to the evals that were supposed to catch it and understand why they didn't.

This is what separates teams that ship production AI reliably (like our 85%+ pilot-to-production rate at Brightlume) from teams that ship AI that breaks in production. It's not magic. It's discipline.

Common Pitfalls and How to Avoid Them

Pitfall 1: Test Set Overfitting

You optimise your model to perform well on your eval test set. Your evals pass. Production fails. This happens because your test set is too narrow or too simple.

Avoid this by:

  • Keeping your test set representative of production distribution
  • Not tuning your model specifically to pass evals (tune to solve the problem; evals just measure it)
  • Regularly adding new test cases from production failures
  • Using multiple eval strategies (not just one metric)

Pitfall 2: Threshold Creep

Your eval threshold is 90% accuracy. You ship a change that drops accuracy to 88%. Instead of investigating, you lower the threshold to 85%. A month later, your threshold is 75% and your production system is degraded.

Avoid this by:

  • Setting thresholds based on business impact, not convenience
  • Documenting why you change a threshold
  • Treating threshold changes as risky (they should require approval)
  • Monitoring production metrics to validate your thresholds are correct

Pitfall 3: Eval-Production Mismatch

Your evals show 92% accuracy. Your production metrics show 78% accuracy. Your test set doesn't represent real-world distribution.

Avoid this by:

  • Sampling production queries and adding them to your test set
  • Comparing eval metrics to production metrics regularly
  • Investigating divergences (they indicate your test set needs work)
  • Treating eval-production mismatch as a critical issue that needs immediate attention

Pitfall 4: Insufficient Eval Coverage

You have evals for happy path scenarios but not for edge cases, error conditions, or scope violations. Your agent works fine 95% of the time but fails catastrophically 5% of the time.

Avoid this by:

  • Explicitly listing failure modes you care about
  • Creating evals for each failure mode
  • Running adversarial evals
  • Monitoring production for unexpected failures and adding them to your test set

Conclusion: Evals as a Production Discipline

Testing AI systems is harder than testing traditional software because you're testing probabilistic systems, not deterministic logic. Classic test pyramids don't apply. You need evals—systematic, automated tests of model behaviour—in addition to unit tests.

The teams that ship production AI reliably treat evals as a core engineering discipline, not an afterthought. They run evals on every change. They set clear thresholds based on business impact. They monitor production metrics and update their test sets when they find failures. They document their evals and treat them as part of their model governance.

This is why, in the choice between AI consulting and AI engineering approaches, the teams that prioritise engineering discipline have higher success rates. AI engineering means building evals into your process from day one, not bolting them on after deployment.

If you're shipping AI to production, you need evals. You need regression detection. You need to understand the difference between unit tests and evaluations. And you need the discipline to run them, interpret them, and act on them. That's how you avoid silent regressions. That's how you keep production systems stable. That's how you ship AI that works.

For teams ready to move beyond pilots and ship production-ready AI, this is non-negotiable. Explore Brightlume's capabilities to see how we integrate rigorous testing and governance into 90-day production deployments. Or dive deeper into AI-native engineering practices that treat AI systems like production systems, not experiments.