
Long-Running AI Agents: Scheduling, Durability, and Recovery Patterns

Build production AI agents that run reliably for hours or days. Master scheduling, durability, and recovery patterns for enterprise agentic workflows.

By Brightlume Team

The Challenge of Long-Running AI Agents in Production

Most AI agent tutorials assume synchronous, request-response execution: user asks a question, agent thinks for a few seconds, returns an answer. That model breaks the moment your agent needs to execute over hours or days—orchestrating multi-step workflows, polling external systems, handling intermittent failures, or coordinating across teams.

Long-running AI agents are fundamentally different from chatbots. They must:

  • Persist state across process restarts, network failures, and infrastructure changes
  • Schedule work that executes asynchronously, resuming from checkpoints rather than restarting from scratch
  • Handle partial failures gracefully—a flaky API call shouldn't blow up the entire workflow
  • Provide observability so you know exactly where execution stalled and why
  • Enforce governance around tool access, spend, and compliance across extended runs

At Brightlume, we've shipped production AI agents for healthcare systems orchestrating patient workflows, financial services automating reconciliation across days, and hospitality operators running guest experience automation 24/7. The difference between a prototype that works once and a system running reliably in production comes down to three things: how you schedule work, how you guarantee durability, and how you recover from failure.

This article walks you through the architectural patterns, specific technologies, and decision trees that engineering leaders use to build agents that execute reliably at scale.

Understanding Agent Execution Models

Before diving into durability patterns, you need to understand the fundamental difference between synchronous and asynchronous agent execution.

Synchronous execution is what you get with most LLM APIs and agent frameworks out of the box. You call the agent, it runs to completion (or timeout), and returns a result. The entire process lives in a single request context. If the process crashes, you lose everything. If it takes longer than your infrastructure allows (Lambda timeouts, API gateways, browser sessions), it fails.

Asynchronous execution decouples the trigger from the result. You submit work to a queue, the agent runs in the background, and you poll or subscribe for results. This is what enables long-running workflows. The agent can be interrupted, restarted, or paused without losing its place.

Most production AI agents need a hybrid model: synchronous entry points (a user clicks "start workflow") that immediately kick off asynchronous background execution. The user gets a job ID and can check status later, while the agent runs to completion in its own lifecycle.

Consider a clinical operations agent at a health system. A clinician submits a patient discharge workflow. The agent immediately returns a job ID (synchronous). In the background, it orchestrates multiple steps over 30 minutes: pulling patient records, checking insurance eligibility, coordinating with pharmacy, scheduling follow-up appointments. Each step may involve waiting for external APIs, retrying on transient failures, and escalating to humans when uncertainty exceeds a threshold. Only when all steps complete does the workflow finish.

Without asynchronous execution, the clinician's browser tab would hang for 30 minutes. With it, they continue working while the agent operates independently.
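
A minimal sketch of this hybrid pattern, using an in-memory job store and queue. The `JobStore` and `run_pending` names are illustrative; a production system would back these with a real database and message queue.

```python
import uuid

class JobStore:
    """Toy job store: a real system would back this with a database."""

    def __init__(self):
        self.jobs = {}    # job_id -> {"status", "payload", "result"}
        self.queue = []   # pending job IDs, stand-in for a message queue

    def submit(self, payload):
        """Synchronous entry point: enqueue work and return a job ID at once."""
        job_id = str(uuid.uuid4())
        self.jobs[job_id] = {"status": "queued", "payload": payload, "result": None}
        self.queue.append(job_id)
        return job_id

    def status(self, job_id):
        return self.jobs[job_id]["status"]

def run_pending(store, handler):
    """Background worker: drain the queue, executing each job to completion."""
    while store.queue:
        job_id = store.queue.pop(0)
        job = store.jobs[job_id]
        job["status"] = "running"
        job["result"] = handler(job["payload"])
        job["status"] = "completed"

store = JobStore()
job_id = store.submit({"workflow": "patient_discharge"})
print(store.status(job_id))    # "queued" -- the caller is not blocked
run_pending(store, lambda p: f"done:{p['workflow']}")
print(store.status(job_id))    # "completed"
```

The caller holds only the job ID; the worker can run in a separate process, crash, and be restarted without the caller noticing anything except a longer wait.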

Scheduling Patterns: When and How Your Agent Runs

Scheduling is about deciding when work executes and what triggers it. There are three primary patterns, each suited to different use cases.

Event-Driven Scheduling

The agent runs immediately in response to a trigger: a user action, a webhook, a message in a queue. This is the most responsive pattern and works well for workflows that should start as soon as possible.

Example: A hotel guest requests a late checkout. The request hits a webhook, which triggers an agent that checks room availability, applies pricing rules, notifies housekeeping, and confirms the request back to the guest—all within seconds.

Event-driven scheduling requires:

  • A reliable message queue (AWS SQS, Google Cloud Pub/Sub, RabbitMQ) so events don't get lost if your agent is down
  • Idempotency keys to handle duplicate events (if the queue retries a message your agent already processed, you don't want to execute twice)
  • Dead-letter queues for events that fail repeatedly, so you can debug and replay them later

The latency is low, but you're responsible for scaling your agent infrastructure to handle traffic spikes. If 1,000 guests request late checkout simultaneously, you need 1,000 concurrent agent instances or a queue that buffers the load.
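
The three requirements above can be sketched in a single consumer loop. The `consume` function and event shape are illustrative, not a real queue client's API.

```python
MAX_ATTEMPTS = 3

processed_keys = set()   # idempotency keys already handled
dead_letter = []         # events that exhausted their retries

def consume(event, handler):
    key = event["idempotency_key"]
    if key in processed_keys:
        return "skipped_duplicate"          # queue redelivered it: do nothing
    for attempt in range(1, MAX_ATTEMPTS + 1):
        try:
            handler(event)
            processed_keys.add(key)
            return "processed"
        except Exception:
            if attempt == MAX_ATTEMPTS:
                dead_letter.append(event)   # exhausted: park for later replay
                return "dead_lettered"

event = {"idempotency_key": "evt-001", "type": "late_checkout"}
print(consume(event, lambda e: None))   # "processed"
print(consume(event, lambda e: None))   # "skipped_duplicate"
```

Managed queues (SQS, Pub/Sub) handle redelivery and dead-lettering for you, but the dedup-by-key check is still your responsibility in the consumer.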

Time-Based Scheduling

The agent runs on a fixed schedule: every hour, daily at 2 AM, every 15 minutes. This pattern works for batch jobs and polling tasks.

Example: A financial services firm runs an agent every hour to reconcile transactions across payment processors, detect anomalies, and flag discrepancies for manual review. If the agent finds a mismatch, it escalates; otherwise, it logs success and exits.

Time-based scheduling is simple to implement (cron jobs, cloud schedulers) but less responsive. If you need results in 5 minutes and your agent runs hourly, you'll wait. It's also prone to thundering-herd problems if multiple agents start simultaneously and hammer the same backend systems.

Better approach: stagger start times, add jitter to prevent synchronized retries, and use exponential backoff when polling external systems.
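
Jitter and backoff are a few lines each. This sketch shows one common form: symmetric jitter for scheduled start times, full jitter for polling retries.

```python
import random

def jittered_delay(base_seconds, jitter_fraction=0.2):
    """Shift a nominal start time by up to +/- jitter_fraction so many
    agents scheduled for the same instant don't all wake at once."""
    jitter = base_seconds * jitter_fraction
    return base_seconds + random.uniform(-jitter, jitter)

def poll_backoff(attempt, base=1.0, cap=60.0):
    """Exponential backoff with full jitter, capped at `cap` seconds."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))
```

An hourly job with 20% jitter starts anywhere in a 24-minute window around the hour, which spreads load on shared backends without changing the average cadence.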

Hybrid Scheduling

Combine event-driven and time-based triggers. The agent starts immediately on an event but also runs periodically to catch any missed events or clean up stale state.

Example: A healthcare provider's patient experience agent starts when a discharge order is placed (event-driven) but also runs every 6 hours to find any discharges that weren't picked up by the event stream and process them retroactively. This catches edge cases where webhooks failed or events were dropped.

Hybrid scheduling adds complexity but dramatically improves reliability. You're no longer betting everything on a single trigger mechanism.

Durability: Making Your Agent Survive Failure

Durability means your agent continues executing even when things break. There are three layers to durability.

Layer 1: Checkpoint-Based State Management

Every time your agent makes a decision or completes a step, it writes its state to durable storage (a database, object store, or distributed cache). If the agent crashes, it resumes from the last checkpoint rather than restarting from scratch.

Think of checkpoints like save points in a video game. You progress through a level, hit a save point, and if you die later, you respawn at the save point rather than the beginning.

For AI agents, checkpoints must capture:

  • Execution state: which step the agent is on, what decisions it's made so far
  • Tool results: outputs from external API calls (don't re-fetch unless necessary)
  • Agent memory: context the agent has accumulated (conversation history, extracted data, reasoning notes)
  • Timestamps: when each step completed, for debugging and SLA tracking

Checkpointing adds latency (you're writing to a database after each step) but is non-negotiable for production. Without it, a single network blip forces you to restart the entire workflow.

A robust checkpoint design looks like this:

Before calling external API:
  - Write checkpoint with "awaiting_api_response" status
  - Call API
  - Write checkpoint with result and "api_complete" status

If agent crashes between steps:
  - Restart from last checkpoint
  - Detect that API was already called (from result in checkpoint)
  - Skip the duplicate API call
  - Continue from next step

This is exactly what durable execution frameworks like Temporal implement. Rather than writing checkpoint logic yourself, you use a framework that handles it automatically. Your agent code reads like normal synchronous code, but the framework transparently persists state and handles recovery.
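
A hand-rolled version of this resume logic, with a dict standing in for the checkpoint database, might look like the following. The step names are from the discharge example; in a real system `checkpoint_store` would be a durable table keyed by workflow ID.

```python
checkpoint_store = {}   # stand-in for a database: workflow_id -> {step: result}

def run_workflow(workflow_id, steps):
    state = checkpoint_store.setdefault(workflow_id, {})
    for name, fn in steps:
        if name in state:
            continue           # done in a previous run: skip the duplicate call
        state[name] = fn()     # execute, then checkpoint immediately
    return state

calls = []
def fetch():
    calls.append("fetch")
    return "record"
def verify():
    calls.append("verify")
    raise RuntimeError("crash mid-run")

try:
    run_workflow("discharge_12345", [("fetch", fetch), ("verify", verify)])
except RuntimeError:
    pass                       # the process "crashed" between steps

def verify_ok():
    calls.append("verify")
    return "eligible"

# Restart: the completed "fetch" step is skipped, only "verify" reruns.
state = run_workflow("discharge_12345", [("fetch", fetch), ("verify", verify_ok)])
print(calls)   # "fetch" appears once; the restart did not re-fetch
```

Note that the failed step's result is never checkpointed (the assignment happens only on success), which is what makes the retry safe.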

Layer 2: Idempotent Tool Calls

Your agent will retry failed operations. If you're not careful, a retry can have unintended side effects. Imagine an agent retrying a payment processing call—you don't want to charge the customer twice.

Idempotency means that calling the same operation multiple times with the same inputs produces the same result as calling it once: a retry changes nothing.

For tools and external APIs:

  • Use idempotency keys: when calling an API, include a unique key (UUID) that identifies this specific operation. The API stores the key and result. If you retry with the same key, the API returns the cached result instead of executing again.
  • Design idempotent operations: if you're building your own tools, structure them to be idempotent. For example, "set user status to active" is idempotent (calling it twice doesn't change the outcome), but "increment user login count" is not.
  • Verify state before acting: before executing an operation, check if it's already been done. If an agent is retrying a "send email" operation, check if the email was already sent before sending again.

Most modern APIs (Stripe, AWS, Google Cloud) support idempotency keys. Use them. If a tool doesn't support idempotency, wrap it with a layer that does.
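
Where a tool lacks native idempotency support, a thin wrapper can cache results by key, as this sketch shows. The `IdempotentTool` name is illustrative, and in production the cache would live in durable storage, not process memory.

```python
import uuid

class IdempotentTool:
    """Wrap a non-idempotent function: retries with the same key
    return the stored result instead of re-executing."""

    def __init__(self, fn):
        self.fn = fn
        self.results = {}   # idempotency_key -> result

    def call(self, key, *args, **kwargs):
        if key not in self.results:
            self.results[key] = self.fn(*args, **kwargs)
        return self.results[key]

charges = []
def charge_card(amount):
    charges.append(amount)      # side effect we must not repeat
    return f"charged {amount}"

tool = IdempotentTool(charge_card)
key = str(uuid.uuid4())         # one key per logical operation, reused on retry
tool.call(key, 100)
tool.call(key, 100)             # retry: served from cache, no second charge
print(len(charges))             # 1
```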

Layer 3: Graceful Degradation and Escalation

Some failures can't be retried. An API might be permanently down, a tool might return an unexpected format, or the agent might exceed its decision budget. In these cases, your agent must fail gracefully.

Graceful degradation means the agent continues with reduced functionality rather than crashing. Escalation means it hands off to a human.

Example from healthcare: a clinical agent is orchestrating patient discharge. It successfully completes 8 of 10 steps but can't reach the pharmacy system to confirm medication orders. Instead of failing the entire workflow, it:

  1. Marks the pharmacy step as "pending human review"
  2. Escalates to a pharmacist with a summary of what's been completed
  3. Logs the failure with full context
  4. Returns a partial result: "Discharge 80% complete, awaiting pharmacy confirmation"

The patient can still go home; the pharmacist handles the remaining piece. The workflow doesn't hang waiting for the pharmacy system to come back online.

To implement graceful degradation:

  • Define failure modes: for each tool, decide what happens if it fails. Is it critical (escalate to human) or optional (skip and continue)?
  • Set timeouts aggressively: don't wait indefinitely for an external system. Use short timeouts (5–10 seconds) and escalate if exceeded.
  • Provide fallbacks: if a tool fails, is there an alternative? A backup API, a cached result, a simplified version of the operation?
  • Log everything: when you degrade or escalate, log the full context. You'll need it for debugging and compliance.
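
The first bullet, per-tool failure modes, can be sketched as a small runner that consults each step's declared policy. The step names and the `critical` flag are illustrative.

```python
def run_with_policies(steps):
    """Execute steps in order; critical failures escalate to a human,
    optional failures are skipped so the workflow keeps moving."""
    completed, escalations = [], []
    for name, fn, critical in steps:
        try:
            fn()
            completed.append(name)
        except Exception as exc:
            if critical:
                escalations.append((name, str(exc)))   # hand off to a human
            # optional steps are simply skipped

    return completed, escalations

def ok():
    pass
def down():
    raise ConnectionError("pharmacy system unreachable")

completed, escalations = run_with_policies([
    ("fetch_record", ok, True),
    ("pharmacy_confirm", down, True),    # critical failure -> escalate
    ("courtesy_sms", down, False),       # optional failure -> skip quietly
    ("schedule_followup", ok, True),
])
```

The workflow finishes with a partial result and a precise escalation record, which is exactly the "80% complete, awaiting pharmacy confirmation" behavior described above.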

Research on AI agent reliability emphasizes that consistency and robustness depend heavily on how agents handle failure modes. The most reliable agents aren't those that never fail—they're those that fail predictably and recover gracefully.

Recovery Patterns: Getting Back on Track

Recovery is what happens after failure. There are three primary strategies.

Automatic Retry with Exponential Backoff

When a tool call fails, automatically retry, waiting longer between each attempt. This handles transient failures (temporary network issues, rate limiting, momentary service outages).

Exponential backoff looks like:

  • Attempt 1: immediate
  • Attempt 2: wait 1 second, then retry
  • Attempt 3: wait 2 seconds, then retry
  • Attempt 4: wait 4 seconds, then retry
  • Attempt 5: wait 8 seconds, then retry
  • After 5 attempts: give up and escalate

The wait time doubles each time, which prevents overwhelming a struggling backend system. If the system is overloaded, hammering it with immediate retries only makes it worse. Exponential backoff gives it time to recover.

Set a maximum number of retries (usually 3–5) and a maximum wait time (usually 60 seconds). After that, escalate to a human or a different strategy.
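
A generic retry helper implementing this schedule might look like the following. The sleep function is injectable so the example runs instantly; real code would pass `time.sleep`.

```python
import time

def retry_with_backoff(fn, max_attempts=5, base_delay=1.0,
                       max_delay=60.0, sleep=time.sleep):
    """Attempt 1 is immediate; subsequent attempts wait 1, 2, 4, 8...
    seconds (capped). After max_attempts, re-raise so the caller can
    escalate."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise                    # retries exhausted: escalate
            sleep(min(max_delay, base_delay * (2 ** attempt)))

attempts = []
def flaky():
    attempts.append(1)
    if len(attempts) < 3:
        raise TimeoutError("transient")
    return "ok"

waits = []
result = retry_with_backoff(flaky, sleep=waits.append)
print(result, waits)    # ok [1.0, 2.0]
```

In production you would also add jitter to the delay (as discussed in the scheduling section) and restrict the `except` clause to retryable error types.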

Circuit Breaker Pattern

If a tool fails repeatedly, stop trying and fail fast instead of wasting time on retries that will fail anyway.

A circuit breaker has three states:

  • Closed: normal operation, all requests go through
  • Open: tool is failing, requests are rejected immediately without trying
  • Half-open: after a cooldown period, try one request to see if the tool has recovered

Example: an agent is calling a pricing API. The first three calls fail. The circuit breaker opens. The next 50 calls fail immediately without even trying (because the circuit is open). After 30 seconds (cooldown), the circuit breaker tries one call (half-open state). If it succeeds, the circuit closes and normal operation resumes. If it fails, the circuit opens again.

Circuit breakers prevent cascading failures. Instead of your agent spending 5 minutes retrying a dead API, it detects the failure immediately and escalates.
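
A minimal circuit breaker along these lines, with an injectable clock so the cooldown can be exercised without waiting. Names and thresholds are illustrative.

```python
import time

class CircuitBreaker:
    def __init__(self, failure_threshold=3, cooldown=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown = cooldown
        self.clock = clock
        self.failures = 0
        self.opened_at = None    # None means the circuit is closed

    def call(self, fn):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.cooldown:
                raise RuntimeError("circuit open: failing fast")
            # cooldown elapsed: half-open, let one trial call through
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()   # trip (or re-trip) the breaker
            raise
        self.failures = 0
        self.opened_at = None                   # success closes the circuit
        return result

now = [0.0]
cb = CircuitBreaker(failure_threshold=3, cooldown=30.0, clock=lambda: now[0])

def down():
    raise ConnectionError("pricing API down")

for _ in range(3):                   # three failures trip the breaker
    try:
        cb.call(down)
    except ConnectionError:
        pass
```

After this point, further calls raise immediately with "circuit open" until the simulated clock passes the 30-second cooldown, at which point a single successful trial call closes the circuit again.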

Dead-Letter Queues and Replay

When an agent can't process an event (all retries exhausted, escalation failed), put it in a dead-letter queue. Later, when you've fixed the underlying issue, replay the event and the agent processes it.

Example: an event-driven agent processes hotel booking requests. A request arrives for a guest whose profile is corrupted in the database. The agent can't look up the guest, retries fail, and escalation times out. Instead of losing the booking, it's moved to a dead-letter queue.

You fix the database corruption. You replay the dead-lettered event. The agent processes it successfully.

Dead-letter queues are essential for production systems. They're your safety net for edge cases and bugs you didn't anticipate.

Building Reliable Long-Running Agents: The Architecture

Now let's put this together into a concrete architecture. Here's what a production long-running agent system looks like:

Core Components

Agent Runtime: the engine that executes your agent logic. This could be a custom Python script, an LLM framework like LangGraph for building stateful agents, or a durable execution platform like Temporal. The runtime must support:

  • Checkpointing state to durable storage
  • Retrying failed operations
  • Timeout handling
  • Logging and observability

State Store: a database (PostgreSQL, DynamoDB, MongoDB) where checkpoints are persisted. Every time your agent completes a step, it writes state here. On restart, it reads the last checkpoint and resumes.

Message Queue: for event-driven triggers and async task distribution. Messages sit in the queue until a worker (your agent runtime) picks them up. If the worker crashes, the message stays in the queue and another worker picks it up.

Tool/Skill Registry: a catalog of tools your agent can use. Each tool is versioned, has documented inputs/outputs, and includes error handling. The awesome-agent-skills repository provides examples of well-built agent skills compatible with major frameworks.

Observability Layer: logging, tracing, and metrics that let you see what your agent is doing. You need to know:

  • How long each step takes
  • Which tools are called and how often
  • Where failures occur
  • How many workflows complete successfully vs. fail

Execution Flow

Here's how a request flows through the system:

  1. Trigger: event arrives (webhook, scheduled job, user request)
  2. Enqueue: message is placed in the queue with a unique ID
  3. Dequeue and Resume: a worker picks up the message, checks if there's an existing checkpoint (from a previous run), and resumes from there
  4. Execute Step: agent runs one step of the workflow
  5. Checkpoint: state is written to the state store
  6. Retry Loop: if the step fails, retry with exponential backoff
  7. Escalate: if retries exhausted, escalate to human or dead-letter queue
  8. Complete: when all steps finish, mark the workflow as complete and notify the user

The entire flow is idempotent. If a worker crashes mid-step, another worker picks up the message and resumes from the last checkpoint. The user sees a single unified view of the workflow status.

Real-World Example: Healthcare Patient Discharge

Let's ground these patterns in a real scenario. A health system uses an AI agent to orchestrate patient discharge workflows.

Workflow Steps:

  1. Fetch patient record and discharge order
  2. Check insurance eligibility and coverage
  3. Coordinate with pharmacy for discharge medications
  4. Schedule follow-up appointments
  5. Generate discharge summary
  6. Notify patient and providers

Failure Modes and Recovery:

  • Step 2 (Insurance): insurance API is flaky. Use exponential backoff. If it fails after 5 retries, escalate to billing team with a note.
  • Step 3 (Pharmacy): pharmacy system is slow. Set a 10-second timeout. If it times out, mark as pending and escalate to pharmacist. Don't block the entire discharge.
  • Step 4 (Appointments): appointment system is down. Use circuit breaker. After 3 failures, stop trying and escalate. Retry later when the system is back up.
  • Step 5 (Summary): LLM call to generate summary. Use the client SDK's retry logic. If it still fails, fall back to a template-based summary.

Checkpointing:

After each step completes, the agent writes:

{
  "workflow_id": "discharge_12345",
  "patient_id": "P98765",
  "current_step": 3,
  "step_status": "completed",
  "results": {
    "patient_record": {...},
    "insurance_eligible": true,
    "pharmacy_meds": [...]
  },
  "timestamp": "2024-01-15T10:30:45Z",
  "next_step": 4
}

If the agent crashes while generating the discharge summary (step 5), it restarts, reads this checkpoint, and skips steps 1–4 (they're already done). It goes straight to step 5.

Observability:

The system logs:

  • Step 1: "Fetched patient record in 150ms"
  • Step 2: "Insurance check: attempt 1 failed (timeout), retrying"
  • Step 2: "Insurance check: attempt 2 succeeded in 3200ms"
  • Step 3: "Pharmacy API timed out after 10s, escalating to pharmacist"
  • Step 4: "Scheduled follow-up appointment in 200ms"
  • Step 5: "Generated discharge summary in 2100ms"
  • Step 6: "Notified patient and providers in 500ms"
  • Total: the automated steps took under a minute of wall-clock time, including retries and the 10-second pharmacy timeout; the workflow stays open pending pharmacist review

Clinicians see: "Discharge in progress: 5/6 steps complete. Pharmacy coordination pending pharmacist review. ETA 5 minutes."

Governance and Control in Long-Running Agents

As agents run longer and handle more critical work, governance becomes essential. You need to control:

Spend Limits

Long-running agents call LLMs multiple times. If an agent is stuck in a loop, it could rack up significant costs. Implement spend limits:

  • Per-workflow budget: "this discharge workflow has a budget of $0.50 for LLM calls"
  • Per-step budget: "each step can spend no more than $0.05"
  • Per-agent budget: "all agents running today can spend no more than $100"

When a budget is exceeded, the agent stops and escalates. You decide whether to increase the budget, optimize the agent, or fail the workflow.
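
A per-workflow spend guard might look like the following sketch. The `SpendTracker` name and dollar figures are illustrative; real cost accounting would come from the LLM provider's usage metadata.

```python
class BudgetExceeded(Exception):
    pass

class SpendTracker:
    """Refuse any charge that would push a workflow past its budget,
    so a looping agent escalates instead of silently accruing cost."""

    def __init__(self, workflow_budget):
        self.budget = workflow_budget
        self.spent = 0.0

    def charge(self, cost):
        if self.spent + cost > self.budget:
            raise BudgetExceeded(
                f"spent {self.spent:.2f}, call of {cost:.2f} exceeds "
                f"budget {self.budget:.2f}")
        self.spent += cost

tracker = SpendTracker(workflow_budget=0.50)
tracker.charge(0.30)      # fine: within budget
try:
    tracker.charge(0.30)  # would exceed $0.50: stop and escalate
except BudgetExceeded as e:
    print("escalating:", e)
```

The same class can be instantiated per step or shared across all of a day's agents to enforce the per-step and per-fleet limits listed above.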

Tool Access Control

Your agent has access to powerful tools: database writes, API calls, payment processing. Implement fine-grained access control:

  • Which tools can this agent use? (a discharge agent shouldn't be able to write to payroll systems)
  • What parameters can it use? (an agent can read patient data but only for the current patient, not other patients)
  • What's the rate limit? (an agent can send 10 emails per minute, not 1,000)

Use role-based access control (RBAC) or attribute-based access control (ABAC) to enforce these policies. Every tool call should be validated against the agent's permissions before execution.
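
A simplified RBAC check along these lines validates the tool name and a per-tool rate limit before execution. The policy contents are hypothetical.

```python
POLICIES = {
    "discharge_agent": {
        "allowed_tools": {"read_patient_record", "send_email"},
        "rate_limits": {"send_email": 10},   # max calls per window
    },
}

call_counts = {}   # (agent, tool) -> calls this window

def authorize(agent, tool):
    """Raise PermissionError unless this agent may call this tool now."""
    policy = POLICIES.get(agent, {})
    if tool not in policy.get("allowed_tools", set()):
        raise PermissionError(f"{agent} may not call {tool}")
    limit = policy.get("rate_limits", {}).get(tool)
    count = call_counts.get((agent, tool), 0)
    if limit is not None and count >= limit:
        raise PermissionError(f"{agent} exceeded rate limit for {tool}")
    call_counts[(agent, tool)] = count + 1

authorize("discharge_agent", "read_patient_record")   # allowed
try:
    authorize("discharge_agent", "write_payroll")     # outside the policy
except PermissionError as e:
    print(e)
```

In a real deployment the check runs in the tool-execution layer, not in the agent's own code, so a misbehaving agent cannot bypass it; parameter-level constraints (only the current patient's records) would be validated the same way.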

Audit and Compliance

For regulated industries (healthcare, finance), you need a complete audit trail:

  • Who triggered the workflow?
  • What decisions did the agent make and why?
  • What tools did it call and with what parameters?
  • What data did it access?
  • Were there any policy violations?

Log everything to an immutable audit store. Timestamp every action. Include the agent's reasoning (why did it decide to escalate?) so auditors understand the decision-making process.

Research on security and vulnerabilities in deployed AI agents highlights that as agents gain access to more tools and infrastructure, security becomes critical. A compromised agent or malicious prompt injection could cause real damage. Governance controls are your defense.

Evaluating Agent Reliability

Before deploying a long-running agent to production, you need to know it's reliable. How do you test something that runs for hours?

Synthetic Workload Testing

Create test workflows that exercise all failure modes:

  • Happy path: everything works, agent completes successfully
  • Transient failures: one tool fails once, then succeeds (tests retry logic)
  • Persistent failures: a tool fails every time (tests escalation)
  • Timeout: a tool takes longer than the timeout (tests timeout handling)
  • Partial success: some steps succeed, some fail (tests graceful degradation)
  • State corruption: checkpoint is corrupted or missing (tests recovery)

Run each test 100+ times and measure:

  • Success rate: what percentage of workflows complete successfully?
  • Latency: how long does each step take? (including retries)
  • Cost: how much do LLM calls and tool calls cost per workflow?
  • Escalation rate: what percentage of workflows escalate to humans?

Chaos Engineering

Intentionally break things in a controlled way:

  • Kill the agent process mid-workflow and restart it. Does it resume correctly?
  • Corrupt the checkpoint (flip a bit in the database). Does the agent detect it and recover?
  • Rate-limit a tool (simulate high latency). Does the agent timeout and escalate?
  • Inject random failures (10% of API calls fail). Does the agent handle it gracefully?

Chaos engineering reveals weaknesses before they hit production. If your agent can't survive a random 10% failure rate in testing, it won't survive real-world conditions.
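
One way to sketch the "inject random failures" test: wrap a tool so a fraction of calls fail, then verify the retry logic still completes every workflow. This version fails every tenth call rather than a random 10% so the example is reproducible, but the retry logic under test is the same.

```python
def make_flaky(fn, fail_every=10):
    """Chaos wrapper: every Nth call raises, simulating a ~10% fault rate."""
    state = {"calls": 0}
    def wrapped(*args, **kwargs):
        state["calls"] += 1
        if state["calls"] % fail_every == 0:
            raise ConnectionError("injected fault")
        return fn(*args, **kwargs)
    return wrapped

def retry(fn, max_attempts=3):
    """The logic under test: a simple bounded retry."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise

flaky_fetch = make_flaky(lambda: "record", fail_every=10)
results = [retry(flaky_fetch) for _ in range(100)]
print(results.count("record"))   # all 100 workflows survive the injected faults
```

The same wrapper can be pointed at real tool clients in a staging environment, with randomized faults and seeds logged so failing runs can be replayed.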

Production Monitoring

Once deployed, monitor continuously:

  • Success rate: are workflows completing? If it drops below 95%, something's wrong.
  • Latency: is the 95th percentile latency acceptable? If it's increasing, you might have a bottleneck.
  • Escalation rate: are you escalating more often than expected? This might indicate a bug or a tool that's unreliable.
  • Error patterns: are certain steps failing more often than others? This points to specific tools or decision points that need attention.

Set up alerts for anomalies. If success rate drops below 90%, page someone. If a single tool is failing 50% of the time, disable it and use a fallback.

Choosing a Framework or Building Custom

You have three options:

Option 1: Use a Dedicated Framework

Frameworks like Temporal or Prefect handle durability, checkpointing, and recovery automatically. You write agent code that looks synchronous, and the framework handles persistence.

Pros:

  • Durability and recovery are built-in
  • Less code to write
  • Battle-tested in production
  • Strong observability

Cons:

  • Learning curve
  • Operational overhead (you're running a distributed system)
  • Less flexibility for custom logic

Best for: teams with engineering bandwidth who want reliability out of the box. At Brightlume, we often start here for 90-day production deployments where reliability is non-negotiable.

Option 2: Use an LLM Framework with Durability Extensions

Frameworks like LangGraph have plugins for durability. You get the flexibility of LangGraph with durability added on.

Pros:

  • Familiar API if you're already using LangGraph
  • Good balance of flexibility and reliability
  • Growing ecosystem

Cons:

  • Durability might not be as battle-tested as dedicated frameworks
  • You're responsible for operational complexity

Best for: teams already invested in LangGraph who need durability.

Option 3: Build Custom with Careful Design

If your needs are simple or your use case is unique, you might build custom. But be disciplined:

  • Implement checkpointing from day one (not as an afterthought)
  • Use a battle-tested message queue (SQS, RabbitMQ)
  • Implement circuit breakers and exponential backoff
  • Log everything
  • Test thoroughly

Pros:

  • Total flexibility
  • No framework lock-in

Cons:

  • You're responsible for all the hard parts (durability, recovery, observability)
  • Easy to get wrong
  • More code to maintain

Best for: teams with deep infrastructure expertise building mission-critical systems where the custom logic is worth the investment.

Key Takeaways for Engineering Leaders

If you're building long-running AI agents, here's what matters:

  1. Durability is non-negotiable. Checkpoint after every step. If you skip this, you'll lose work when things fail.

  2. Assume failure. Design for it. Use exponential backoff, circuit breakers, and escalation. The most reliable agents aren't those that never fail—they're those that fail gracefully.

  3. Idempotency is your friend. Use idempotency keys. Design tools to be idempotent. This lets you retry safely.

  4. Observability from day one. Log everything. You'll need it for debugging, compliance, and understanding where time and money are spent.

  5. Test failure modes. Don't just test the happy path. Test timeouts, transient failures, persistent failures, and state corruption. Chaos engineering reveals weaknesses early.

  6. Governance scales with capability. As agents gain access to more tools and run longer, implement spend limits, tool access control, and audit trails. This isn't optional for regulated industries.

  7. Choose the right tool. For 90-day production deployments, a framework like Temporal or LangGraph with durability extensions saves time and reduces risk. For unique use cases, custom might be justified, but only if you have the engineering discipline.

Long-running agents are where AI moves from demos to real value. They orchestrate complex workflows, handle exceptions gracefully, and deliver measurable ROI. The patterns and technologies covered here are proven in production across healthcare, finance, and hospitality. Master them, and you'll build agents that don't just work—they work reliably, at scale, even when things break.

At Brightlume, we've deployed agents running these patterns across 90-day engagements for enterprises and mid-market organisations. The engineering-first approach—concrete architectures, specific technologies, measurable outcomes—is what separates production agents from prototypes. Apply these patterns, test thoroughly, and you'll have systems that execute reliably for hours, days, or weeks, delivering the business outcomes you built them for.

Moving from Pilot to Production

The gap between a working prototype and a production agent is exactly what we've outlined: scheduling, durability, recovery, governance, and testing. Many organisations build agents that work in controlled environments but fail when deployed to production. The difference is almost always durability and error handling.

If you're running AI pilots and need to move to production, the first question isn't "which model should we use?" It's "how will this agent survive failure?" Answer that question with the patterns in this article, and you're on the path to reliable, production-ready agents.

Research on AI agent reliability emphasizes that as agents take on more autonomous work, reliability and predictability become critical. The patterns here—checkpointing, graceful degradation, escalation—aren't optional extras. They're the foundation of trustworthy autonomous systems.

For engineering leaders building agentic workflows in healthcare, financial services, or hospitality, the stakes are high. Patient care, regulatory compliance, customer satisfaction—these depend on agents that work reliably. The architectural patterns, frameworks, and testing strategies in this article are proven in production. Use them, and you'll build agents worthy of that responsibility.