
Why Most AI Agents Fail at Structured Output and How to Fix It

Learn why AI agents fail at structured output and master Pydantic, JSON schemas, and retry strategies for production-ready agentic workflows.

By Brightlume Team

The Structured Output Problem Nobody Talks About

You've built an AI agent that works flawlessly in your notebook. It reasons correctly, calls the right tools, and produces sensible outputs when you're testing it manually. Then you ship it to production, and within hours, downstream systems start rejecting its outputs. The agent returns malformed JSON. It includes fields that shouldn't exist. It omits required keys. The reasoning was sound, but the execution failed at the point where it matters most: delivering structured data that your systems can actually consume.

This isn't a rare edge case. It's the norm. According to research on AI agent harness failures and anti-patterns, inconsistent structured output is one of the primary reasons AI agents fail in production environments. The gap between what an agent understands and what it actually outputs—what some engineers call the "reasoning-action disconnect"—is the silent killer of agentic workflows.

The fundamental issue is this: large language models (LLMs) are text completion engines. They predict the next token based on probability distributions. When you ask Claude Opus 4 or GPT-4 to return JSON, you're asking a probabilistic text generator to conform to a rigid schema. Without explicit constraints, the model will happily generate plausible-looking but invalid output because it has no hard guarantee that what it produces matches your requirements.

This article walks you through why structured output fails, and more importantly, how to build agentic systems that don't. We'll cover the architectural patterns that actually work in production, the evaluation strategies that catch these failures before they hit your users, and the specific tools—Pydantic, JSON schemas, retry logic—that turn unreliable agents into reliable ones.

Why LLMs Struggle With Structured Output

To understand the fix, you need to understand the problem at the model level. When you prompt an LLM to return structured data, you're asking it to do something that doesn't align with its core training objective.

Large language models are trained on next-token prediction. Given a sequence of tokens, they output a probability distribution over the next token. They don't have a notion of "this output must be valid JSON" or "this field is required." They have learned patterns from their training data, and when those patterns suggest that a particular token is likely, they emit it. If the training data contained malformed JSON examples, or if the model's probability distribution slightly favours an invalid token, it will happily generate it.

Consider a simple example. You ask an agent to return a customer record with fields: id (integer), name (string), and email (string). The model understands these requirements from your prompt. But during token generation, after outputting "email": ", the model might assign high probability to tokens like null, undefined, or even "john@example.com", "secondary_email": "jane@example.com". The model isn't malicious; it's following probability distributions learned from training data that included all sorts of JSON variations.

This is the core insight: prompting alone is insufficient. No matter how precisely you write your prompt, you cannot guarantee valid structured output from an unconstrained LLM. The model might understand your intent, but understanding and guaranteeing are different things.

Research on structured outputs as the contract for reliable AI agents emphasises this point: structured outputs provide a hard contract between the agent and downstream systems. Without that contract, you're relying on probabilistic behaviour, which is inherently unreliable at scale.

The Cost of Unstructured Outputs in Production

When your AI agent fails to produce valid structured output, the consequences cascade through your system. Let's be concrete about what actually happens.

Downstream system failures: Your agent returns JSON that's syntactically invalid or missing required fields. Your database insertion fails. Your API call returns a 400 error. A transaction that should have completed hangs in a queue waiting for retry logic to kick in. In a healthcare setting, a clinical decision support agent returns a malformed patient record, and a nurse has to manually validate the data before proceeding. In hospitality, a guest experience automation agent returns a booking confirmation with missing guest details, and the reservation system rejects it.

Silent data corruption: Sometimes the output is close enough to valid that it parses, but with wrong values. An agent returns {"booking_status": "confirmed", "room_type": null} when room_type is required. Your system accepts it, stores it, and now you have a database record that violates your schema invariants. Weeks later, when you try to generate a guest itinerary, the system crashes because it assumes room_type exists.

Latency and cost explosion: When output validation fails, you implement retry logic. The agent tries again. And again. Each retry adds latency and increases your LLM API costs. If your agent has a 10% failure rate on structured output, and each failure triggers 2–3 retries on average, you're paying 20–30% more for the same work, and the affected requests take 3–4x as long end to end. At scale, this is not acceptable.

Evaluation blindness: You test your agent with a small set of examples, and it works. You ship it, and it fails on production data you didn't anticipate. Without proper evaluation of structured output compliance, you don't know whether your agent is actually reliable until users are affected.

At Brightlume, we've built AI systems that ship in 90 days with an 85%+ pilot-to-production success rate. That rate exists because we don't rely on prompting alone. We enforce structured output at the architectural level.

Pydantic: Your First Line of Defence

Pydantic is a Python library that provides runtime type checking and validation. It's not a silver bullet—it doesn't prevent the LLM from generating invalid output—but it does catch invalid output before it reaches your systems, and it provides a clean interface for defining what valid output looks like.

Here's the basic pattern:

from pydantic import BaseModel, Field
from typing import Optional

class BookingConfirmation(BaseModel):
    booking_id: str = Field(..., description="Unique booking identifier")
    guest_name: str = Field(..., description="Full name of the guest")
    room_type: str = Field(..., description="Type of room booked")
    check_in_date: str = Field(..., description="ISO 8601 format")
    check_in_time: str = Field(default="15:00", description="Check-in time in HH:MM format")
    total_price: float = Field(..., description="Total price in AUD")
    special_requests: Optional[str] = Field(default=None, description="Any special requests")

You define your data model as a Pydantic class. Each field has a type annotation. You can add validation constraints—minimum values, regex patterns, custom validators—and Pydantic will enforce them.
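For instance, here is a minimal sketch of those constraint types in Pydantic v2 (the GuestRecord model and its fields are hypothetical, not part of the booking example):

```python
from pydantic import BaseModel, Field, field_validator

class GuestRecord(BaseModel):
    name: str
    # Regex constraint: a rough email shape check
    email: str = Field(..., pattern=r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
    # Numeric constraint: points must be non-negative
    loyalty_points: int = Field(default=0, ge=0)

    @field_validator("name")
    @classmethod
    def strip_and_require_name(cls, v: str) -> str:
        # Custom validator: normalise whitespace and reject blank names
        if not v.strip():
            raise ValueError("name must not be blank")
        return v.strip()

record = GuestRecord(name="  Jane Doe  ", email="jane@example.com")
print(record.name)  # Jane Doe
```

Constructing the model with an invalid email or negative loyalty_points raises a ValidationError with a message pinpointing the offending field.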

When your agent generates output, you attempt to parse it into your Pydantic model:

import json
from pydantic import ValidationError

def validate_agent_output(raw_output: str, model_class):
    try:
        parsed = json.loads(raw_output)
        validated = model_class(**parsed)
        return validated, None
    except json.JSONDecodeError as e:
        return None, f"Invalid JSON: {e}"
    except ValidationError as e:
        return None, f"Validation failed: {e}"

If the output is invalid—wrong type, missing field, value outside allowed range—Pydantic will raise a ValidationError with a detailed message explaining exactly what's wrong. You can then decide whether to retry the agent, log the failure, or escalate to a human.
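To see both failure branches in action, here is the helper restated so the snippet runs standalone, with a minimal hypothetical Customer model as a stand-in:

```python
import json
from pydantic import BaseModel, ValidationError

class Customer(BaseModel):
    id: int
    name: str
    email: str

def validate_agent_output(raw_output: str, model_class):
    try:
        parsed = json.loads(raw_output)
        return model_class(**parsed), None
    except json.JSONDecodeError as e:
        return None, f"Invalid JSON: {e}"
    except ValidationError as e:
        return None, f"Validation failed: {e}"

# Parses as JSON, but the required "email" field is missing
ok, err = validate_agent_output('{"id": 1, "name": "John"}', Customer)
print(err)  # Validation failed: ...

# Not JSON at all (a chatty model prefixed its answer with prose)
ok2, err2 = validate_agent_output("Sure! Here is the record: {...}", Customer)
print(err2)  # Invalid JSON: ...
```

The two error strings are distinguishable, which matters later when you categorise failures for retry decisions and evaluation.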

But here's the critical limitation: Pydantic validates after the LLM has generated output. By that point, you've already paid for tokens, and you've already waited for the model to complete. If the output is invalid, you need to retry, which doubles your latency and cost.

This is why Pydantic alone isn't enough. It's a safety net, not a preventative measure. You need to constrain the LLM itself.

JSON Schema: Constraining the Model

JSON Schema is a formal specification for describing the structure of JSON documents. It's been around for years, but only recently have LLM providers begun using it to constrain model outputs.

The idea is simple: instead of just telling the model "return JSON," you provide a formal schema that describes exactly what valid JSON looks like. The model then uses this schema during token generation to avoid producing invalid output.

Here's a JSON Schema representation of our booking confirmation:

{
  "type": "object",
  "properties": {
    "booking_id": {
      "type": "string",
      "description": "Unique booking identifier"
    },
    "guest_name": {
      "type": "string",
      "description": "Full name of the guest"
    },
    "room_type": {
      "type": "string",
      "enum": ["single", "double", "suite", "penthouse"]
    },
    "check_in_date": {
      "type": "string",
      "pattern": "^\\d{4}-\\d{2}-\\d{2}$"
    },
    "check_in_time": {
      "type": "string",
      "default": "15:00"
    },
    "total_price": {
      "type": "number",
      "minimum": 0
    },
    "special_requests": {
      "type": ["string", "null"]
    }
  },
  "required": ["booking_id", "guest_name", "room_type", "check_in_date", "total_price"]
}

This schema defines the exact structure of valid output. room_type must be one of the specified enum values. check_in_date must match the ISO 8601 pattern. total_price must be a non-negative number. All required fields must be present.
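You can exercise these constraints directly with the jsonschema library (a trimmed copy of the schema above, for brevity):

```python
import jsonschema

# Trimmed version of the booking confirmation schema
schema = {
    "type": "object",
    "properties": {
        "room_type": {"type": "string",
                      "enum": ["single", "double", "suite", "penthouse"]},
        "check_in_date": {"type": "string", "pattern": "^\\d{4}-\\d{2}-\\d{2}$"},
        "total_price": {"type": "number", "minimum": 0},
    },
    "required": ["room_type", "check_in_date", "total_price"],
}

valid = {"room_type": "suite", "check_in_date": "2024-12-20", "total_price": 450.0}
jsonschema.validate(valid, schema)  # no exception: the document conforms

invalid = {"room_type": "fancy deluxe suite",
           "check_in_date": "2024-12-20", "total_price": 450.0}
try:
    jsonschema.validate(invalid, schema)
except jsonschema.ValidationError as e:
    print(f"rejected: {e.message}")  # the enum violation, with the offending value
```

The same schema object can be handed to the LLM API and to your own validation layer, so there is exactly one definition of "valid".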

When you pass this schema to an LLM API that supports constrained decoding—OpenAI's Structured Outputs is the clearest example, and Anthropic's tool use accepts the same JSON Schema format—the model uses it to guide token generation. Under strict enforcement, the model cannot emit a token that would violate the schema, because the constraint is applied at the token level.

This is fundamentally different from prompting. Prompting says "please return valid JSON." Schema enforcement says "you cannot generate invalid JSON; the system won't allow it."

OpenAI's documentation on structured outputs explicitly addresses this: constrained outputs guarantee that the model's response will always be valid JSON that conforms to your schema. There's no retry logic needed, no validation failures, no downstream system rejections.

The challenge is that not all models support schema enforcement equally. Older models like GPT-3.5 don't support it at all. Even newer models have limitations on schema complexity. And in some cases, enforcing a very strict schema can degrade model reasoning, because the model is constrained in how it can express intermediate steps.

Tool Use and Function Calling: The Agentic Layer

When you're building an AI agent—a system that takes actions, calls external tools, and reasons about outcomes—structured output becomes even more critical. The agent needs to call tools with the right arguments, and those arguments must be valid.

This is where function calling comes in. Instead of asking the model to return JSON, you define a set of functions (or tools) that the agent can call, and you specify the exact arguments each function accepts. The model then generates function calls, not free-text output.

Here's a simplified example:

from anthropic import Anthropic

client = Anthropic()

tools = [
    {
        "name": "create_booking",
        "description": "Create a new hotel booking",
        "input_schema": {
            "type": "object",
            "properties": {
                "guest_name": {
                    "type": "string",
                    "description": "Full name of the guest"
                },
                "room_type": {
                    "type": "string",
                    "enum": ["single", "double", "suite"]
                },
                "check_in_date": {
                    "type": "string",
                    "description": "ISO 8601 date"
                },
                "nights": {
                    "type": "integer",
                    "minimum": 1,
                    "maximum": 30
                }
            },
            "required": ["guest_name", "room_type", "check_in_date", "nights"]
        }
    }
]

message = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    tools=tools,
    messages=[
        {"role": "user", "content": "Book a double room for John Smith from 2024-12-20 for 3 nights"}
    ]
)

When the model generates a response, it doesn't return free-text JSON. Instead, it returns a tool_use block that specifies which tool to call and what arguments to pass, shaped by the input_schema you supplied. You should still check those arguments on receipt: if the model produced invalid input—a missing required field, a value outside the enum—you catch it at the boundary instead of deep inside your execution layer.

Research on tool argument rot as a failure mode in AI agents highlights this as a critical issue: agents gradually degrade in their ability to call tools correctly over long conversations. By enforcing schemas at the tool level, you prevent this degradation from happening in the first place.

But function calling alone isn't sufficient. You still need to handle cases where the model's reasoning is sound but the tool call is invalid. This is where retry logic comes in.

Retry Strategies: Handling Failures Gracefully

No matter how well you constrain your outputs, edge cases will occur. The model might misunderstand the context. It might attempt to call a tool that doesn't exist. It might provide an argument that's technically valid according to the schema but semantically invalid in your domain.

A robust agentic system needs retry logic that's smart about how it handles failures.

Basic retry with exponential backoff: When a tool call fails, retry the agent with additional context about the failure. Increase the delay between retries exponentially to avoid hammering the API.

import time

def call_agent_with_retry(messages, tools, max_retries=3):
    for attempt in range(max_retries):
        try:
            response = client.messages.create(
                model="claude-3-5-sonnet-20241022",
                max_tokens=1024,
                tools=tools,
                messages=messages
            )
            return response
        except Exception as e:
            if attempt < max_retries - 1:
                wait_time = 2 ** attempt  # 1s, then 2s, doubling each retry
                print(f"Attempt {attempt + 1} failed. Retrying in {wait_time}s...")
                time.sleep(wait_time)
            else:
                raise

Failure context injection: When a tool call fails, don't just retry blindly. Feed the error message back to the agent so it can learn from the failure.

def agentic_loop(user_input, tools, max_iterations=10):
    messages = [{"role": "user", "content": user_input}]
    
    for iteration in range(max_iterations):
        response = client.messages.create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=1024,
            tools=tools,
            messages=messages
        )
        
        # Check if the model wants to use a tool
        tool_calls = [block for block in response.content if block.type == "tool_use"]
        
        if not tool_calls:
            # Model is done
            return response
        
        # Process each tool call
        messages.append({"role": "assistant", "content": response.content})
        
        tool_results = []
        for tool_call in tool_calls:
            try:
                result = execute_tool(tool_call.name, tool_call.input)
                tool_results.append({
                    "type": "tool_result",
                    "tool_use_id": tool_call.id,
                    "content": json.dumps(result)
                })
            except Exception as e:
                # Feed the error back to the agent
                tool_results.append({
                    "type": "tool_result",
                    "tool_use_id": tool_call.id,
                    "content": f"Error: {str(e)}",
                    "is_error": True
                })
        
        messages.append({"role": "user", "content": tool_results})
    
    raise Exception("Max iterations reached")

This pattern is crucial. When a tool call fails, you tell the agent exactly what went wrong. The agent can then reason about the failure and try a different approach. This is fundamentally different from just retrying the same request.

Validation before execution: Before executing a tool, validate that its arguments are semantically valid in your domain, not just structurally valid according to the schema.

from datetime import datetime, timedelta

def validate_booking_args(args):
    # Structural validation happens at the API level;
    # semantic validation happens here
    
    check_in = datetime.fromisoformat(args["check_in_date"])
    if check_in < datetime.now():
        raise ValueError("Check-in date cannot be in the past")
    
    checkout = check_in + timedelta(days=args["nights"])
    if checkout > datetime.now() + timedelta(days=365):
        raise ValueError("Booking cannot extend more than 365 days in the future")
    
    if args["room_type"] == "penthouse" and args["nights"] < 3:
        raise ValueError("Penthouse requires minimum 3-night stay")
    
    return True

Research on AI agent failure modes where agents know the answer but say the wrong thing identifies this pattern as critical: the agent's reasoning might be sound, but the structured output it produces might violate domain-specific constraints that aren't captured in the JSON schema.

Evaluation: Measuring Structured Output Reliability

You can't improve what you don't measure. Structured output reliability must be part of your evaluation strategy from day one.

Structural validation: Test that your agent's output is valid JSON and conforms to your schema.

import json
import jsonschema

def evaluate_structural_validity(agent_outputs, schema):
    valid_count = 0
    for output in agent_outputs:
        try:
            parsed = json.loads(output)
            jsonschema.validate(parsed, schema)
            valid_count += 1
        except (json.JSONDecodeError, jsonschema.ValidationError):
            pass
    
    return valid_count / len(agent_outputs)

Semantic validation: Test that the output is not just structurally valid, but semantically correct. Does the booking confirmation contain the details the user actually requested?

def evaluate_semantic_correctness(agent_output, user_input, expected_output):
    # Parse the output
    parsed = json.loads(agent_output)
    
    # Check that key fields match expectations
    errors = []
    if parsed.get("guest_name") != expected_output["guest_name"]:
        errors.append(f"Guest name mismatch: {parsed.get('guest_name')} vs {expected_output['guest_name']}")
    
    if parsed.get("room_type") != expected_output["room_type"]:
        errors.append(f"Room type mismatch: {parsed.get('room_type')} vs {expected_output['room_type']}")
    
    # ... more checks
    
    return len(errors) == 0, errors

According to research on evaluation challenges for AI agents, structured tasks like policy-compliant actions are particularly prone to failure. Your evaluation suite needs to cover not just happy paths, but edge cases where the agent's reasoning might be correct but the output violates constraints.

Failure mode analysis: Categorise failures by type. Are most failures due to missing fields? Wrong data types? Values outside allowed ranges? This tells you where to focus your effort.

def categorise_failure(validation_error):
    error_str = str(validation_error)
    if "required property" in error_str:
        return "missing_field"
    elif "is not of type" in error_str:
        return "type_mismatch"
    elif "is not one of" in error_str:
        return "enum_violation"
    elif "is less than" in error_str or "is greater than" in error_str:
        return "range_violation"
    else:
        return "other"
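Aggregating those categories across an evaluation run turns raw failures into a prioritised worklist (categorise_failure is restated so the snippet runs standalone; the error strings are hypothetical examples of jsonschema messages):

```python
from collections import Counter

def categorise_failure(validation_error) -> str:
    error_str = str(validation_error)
    if "required property" in error_str:
        return "missing_field"
    if "is not of type" in error_str:
        return "type_mismatch"
    if "is not one of" in error_str:
        return "enum_violation"
    return "other"

# Hypothetical errors collected from one evaluation run
errors = [
    "'guest_name' is a required property",
    "'guest_name' is a required property",
    "'three' is not of type 'integer'",
    "'hammock' is not one of ['single', 'double', 'suite']",
]
report = Counter(categorise_failure(e) for e in errors)
print(report.most_common())
# [('missing_field', 2), ('type_mismatch', 1), ('enum_violation', 1)]
```

Here missing fields dominate, which usually points at prompt or schema fixes rather than retry tuning.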

Research on inconsistent output formatting as a major AI agent failure emphasises that structured generation techniques to constrain outputs are essential. Your evaluation strategy should quantify how often your agent produces inconsistent or malformed output, and track whether your improvements are actually reducing that rate.

Putting It Together: A Production-Ready Pattern

Here's how all these pieces fit together in a real agentic system:

from pydantic import BaseModel, Field, ValidationError
from typing import Optional
import json
import jsonschema
from anthropic import Anthropic

# 1. Define your data model with Pydantic
class BookingConfirmation(BaseModel):
    booking_id: str = Field(..., description="Unique booking identifier")
    guest_name: str = Field(..., description="Full name of the guest")
    room_type: str = Field(..., description="Type of room booked")
    check_in_date: str = Field(..., description="ISO 8601 format")
    nights: int = Field(..., description="Number of nights", ge=1, le=30)
    total_price: float = Field(..., description="Total price in AUD", ge=0)

# 2. Generate JSON Schema from Pydantic
schema = BookingConfirmation.model_json_schema()

# 3. Define tools with the schema
tools = [
    {
        "name": "create_booking",
        "description": "Create a new hotel booking",
        "input_schema": schema
    }
]

# 4. Implement the agent loop with retry logic
def run_booking_agent(user_input, max_retries=3):
    client = Anthropic()
    messages = [{"role": "user", "content": user_input}]
    
    for attempt in range(max_retries):
        try:
            response = client.messages.create(
                model="claude-3-5-sonnet-20241022",
                max_tokens=1024,
                tools=tools,
                messages=messages
            )
            
            # Extract tool use block
            tool_use = next(
                (block for block in response.content if block.type == "tool_use"),
                None
            )
            
            if not tool_use:
                return None, "No tool call generated"
            
            # 5. Validate with Pydantic before execution
            try:
                booking = BookingConfirmation(**tool_use.input)
                # Semantic validation (the helper expects a plain dict)
                validate_booking_args(booking.model_dump())
                # Execute the tool
                result = execute_booking(booking)
                return booking, None
            except (ValidationError, ValueError) as e:
                error_msg = f"Validation failed: {e}"
                messages.append({"role": "assistant", "content": response.content})
                messages.append({
                    "role": "user",
                    "content": f"Tool call failed: {error_msg}. Please try again with valid arguments."
                })
        
        except Exception as e:
            # `response` may not exist here (the API call itself can fail),
            # so don't try to append assistant content from this branch
            if attempt < max_retries - 1:
                continue
            return None, str(e)
    
    return None, "Max retries exceeded"

This pattern combines all the pieces: schema enforcement at the API level, Pydantic validation before execution, semantic validation for domain constraints, and intelligent retry logic that feeds errors back to the agent.

Common Pitfalls and How to Avoid Them

Overly permissive schemas: A schema that's too loose defeats the purpose. If you define room_type as just a string, the model might return "fancy deluxe suite" when you need it to be one of ["single", "double", "suite"]. Use enums, patterns, and constraints aggressively.
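In Pydantic, Literal gives you that tightening, and the constraint survives schema generation as a JSON Schema enum (RoomSelection is a hypothetical model):

```python
from typing import Literal
from pydantic import BaseModel, ValidationError

class RoomSelection(BaseModel):
    # Literal restricts the field to an exact set of values
    room_type: Literal["single", "double", "suite"]

# The generated JSON Schema carries the enum with it
schema = RoomSelection.model_json_schema()
print(schema["properties"]["room_type"]["enum"])  # ['single', 'double', 'suite']

try:
    RoomSelection(room_type="fancy deluxe suite")
except ValidationError:
    print("rejected")
```

Because the enum lives in the schema, the same definition constrains the model at generation time and your validation at parse time.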

Ignoring latency in retry logic: Every retry adds a full model call to your critical path. If your agent has a 20% failure rate and failed requests retry twice on average, your mean latency is roughly 1.4x what it would be if you got it right the first time, and the worst-case requests take 3x. Invest in getting the schema right first; treat retries as a backstop, not your primary strategy.

Mixing structured and unstructured output: Some agents need to return both structured data (for downstream systems) and human-readable explanations. Use separate fields or separate API calls. Don't try to embed explanations in structured fields; it breaks parsing.
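One way to keep the two apart: give the prose its own field and strip it before the payload reaches anything that parses (AgentReply is a hypothetical model):

```python
from pydantic import BaseModel, Field

class AgentReply(BaseModel):
    booking_id: str
    room_type: str
    # Human-facing prose lives here and only here; nothing downstream parses it
    explanation: str = Field(..., description="Summary for the guest, not for machines")

reply = AgentReply(
    booking_id="BK-1042",
    room_type="double",
    explanation="Booked a double room as requested; breakfast is not included.",
)

# Downstream systems receive only the machine-facing fields
payload = reply.model_dump(exclude={"explanation"})
print(payload)  # {'booking_id': 'BK-1042', 'room_type': 'double'}
```

The explanation is still available for chat UIs and audit logs, but it can never corrupt a structured field.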

Not testing with production-like data: Your agent might work perfectly with clean, well-formatted test data. Then it encounters production data with edge cases—unusual names, dates in different formats, special characters—and fails. Test with realistic data from day one.

Assuming the model understands your domain: Just because you understand that "penthouse" is a room type doesn't mean the model will reliably use it. If you're seeing the agent generate values outside your enum, it's not because the model is broken; it's because your schema isn't constraining it enough, or your prompt isn't clear enough about the domain.

Scaling Structured Output Reliability

When you're building agentic systems at Brightlume, scaling reliability means thinking beyond individual tool calls. You need to consider:

Composition of agents: When you have multiple agents working together, each agent's output becomes the next agent's input. If agent A produces slightly malformed output, agent B might fail to parse it. Design your agents so that each one's output is guaranteed to be valid input for the next one.
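A sketch of that guarantee: both agents share one Pydantic contract, and validation happens at the hand-off, so the consumer never sees an unvetted payload (the agent functions here are hypothetical stand-ins for real LLM calls):

```python
from pydantic import BaseModel, ValidationError

class ResearchFindings(BaseModel):
    # The shared contract: agent A's output type IS agent B's input type
    topic: str
    summary: str
    confidence: float

def agent_a() -> dict:
    # Stand-in for agent A's raw (untrusted) output
    return {"topic": "late checkout", "summary": "Allowed until 1pm.", "confidence": 0.9}

def agent_b(findings: ResearchFindings) -> str:
    # Agent B can rely on every field existing with the right type
    return f"Drafting guest reply about {findings.topic}"

try:
    findings = ResearchFindings(**agent_a())  # validate at the boundary
except ValidationError as e:
    raise RuntimeError(f"Agent A broke the contract: {e}")

print(agent_b(findings))  # Drafting guest reply about late checkout
```

If agent A's behaviour drifts, the failure surfaces at the boundary with a precise error, not as a confusing downstream crash inside agent B.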

Versioning and evolution: Your schema will evolve. You'll add fields, change constraints, deprecate old fields. Plan for backward compatibility. Use optional fields for new additions. Provide clear migration paths for downstream systems.
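Concretely, additive changes go in as optional fields with defaults, so payloads produced against the older schema keep validating (BookingV2 and its loyalty_tier field are hypothetical):

```python
from typing import Optional
from pydantic import BaseModel

class BookingV2(BaseModel):
    booking_id: str
    guest_name: str
    # Added in v2: optional with a default, so v1 payloads still validate
    loyalty_tier: Optional[str] = None

v1_payload = {"booking_id": "BK-1", "guest_name": "John Smith"}
v2_payload = {"booking_id": "BK-2", "guest_name": "Jane Doe", "loyalty_tier": "gold"}

print(BookingV2(**v1_payload).loyalty_tier)  # None
print(BookingV2(**v2_payload).loyalty_tier)  # gold
```

Breaking changes—renaming a field, tightening a type—warrant a new schema version and a migration window rather than an in-place edit.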

Cost optimisation: Structured output enforcement reduces failures, which reduces retries, which reduces costs. But it also reduces the model's flexibility. Sometimes a slightly less constrained schema with a retry loop is cheaper than a very strict schema that requires a more capable (and more expensive) model. Measure both dimensions.

Monitoring and alerting: Track your structured output failure rate in production. Set alerts if it exceeds your threshold. When failures happen, log the raw output and the validation error so you can debug.
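A minimal in-process sketch of that tracking (the window size and threshold are illustrative; in production you would emit these numbers to your metrics system rather than hold them in memory):

```python
from collections import deque

class OutputFailureMonitor:
    """Rolling failure rate over the last `window` structured outputs."""

    def __init__(self, window: int = 100, threshold: float = 0.05, min_samples: int = 20):
        self.results = deque(maxlen=window)  # True = valid output, False = failure
        self.threshold = threshold
        self.min_samples = min_samples

    def record(self, valid: bool) -> None:
        self.results.append(valid)

    @property
    def failure_rate(self) -> float:
        if not self.results:
            return 0.0
        return 1 - sum(self.results) / len(self.results)

    def should_alert(self) -> bool:
        # Avoid alerting on tiny samples
        return len(self.results) >= self.min_samples and self.failure_rate > self.threshold

monitor = OutputFailureMonitor(window=50, threshold=0.10)
for _ in range(15):
    monitor.record(True)
for _ in range(5):
    monitor.record(False)
print(monitor.failure_rate)    # 0.25
print(monitor.should_alert())  # True
```

Pair each alert with the raw output and validation error you logged, and debugging becomes a lookup rather than a reproduction exercise.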

According to research on JSON mode for enforcing structured outputs in LLMs, this is an active area of development. Newer models and APIs are improving their support for constrained outputs, and the failure rates are decreasing. But the fundamental principle remains: you cannot rely on prompting alone.

The Path Forward

Structured output reliability is not optional. It's the foundation of production-ready agentic systems. Whether you're building a clinical decision support agent for a health system, a guest experience automation system for a hotel group, or an intelligent automation workflow for an enterprise operations team, the same principles apply.

Start with Pydantic to define your data models. Use JSON Schema to constrain your LLM outputs. Implement function calling to structure your tool interactions. Build intelligent retry logic that feeds errors back to the agent. Evaluate ruthlessly, measuring both structural and semantic correctness.

The agents that fail are the ones built on prompting alone. The ones that succeed are built on architecture: clear contracts between components, hard constraints on outputs, and evaluation that catches failures before they reach users.

If you're moving AI pilots to production and want to ensure your agentic workflows are reliable, that's exactly what Brightlume specialises in. We ship production-ready AI systems in 90 days, with an 85%+ pilot-to-production rate, because we don't compromise on structured output reliability. We're AI engineers, not advisors. We care about latency, cost, evals, and governance. We build systems that actually work.