
Voice Agents for Enterprise: Architecture, Latency, and the ElevenLabs Stack

Deep-dive on production voice agent architecture for CTOs: telephony integration, latency optimisation, ElevenLabs stack, and 90-day deployment patterns.

By Brightlume Team

Understanding Voice Agents in Production

Voice agents aren't chatbots with speakers. They're a fundamentally different architecture—one that requires rethinking how you orchestrate speech-to-text (STT), language models, and text-to-speech (TTS) under hard latency constraints. If you're a CTO shipping voice agents into production within 90 days, you need to understand the difference between a prototype that works in a demo and a system that handles 10,000 concurrent calls without degrading.

At Brightlume, we've shipped voice agents across healthcare, hospitality, and financial services. The pattern is always the same: teams underestimate latency, overestimate model capability, and deploy without proper telephony integration. This article walks you through the architecture decisions that separate production systems from experiments.

A voice agent is a system that listens to spoken input, processes it through an LLM, and responds with synthesised speech—all in real time. The critical constraint is latency: users expect responses within 1–2 seconds. Exceed that, and the interaction feels broken. In healthcare, where a clinical AI agent might handle patient intake, that latency directly impacts throughput. In hospitality, it determines whether a guest abandons the call.

The Three-Layer Voice Agent Stack

Production voice agents operate across three tightly coupled layers: speech recognition, reasoning, and speech synthesis. Each layer has distinct latency characteristics, and your architecture must account for all three.

Layer 1: Speech-to-Text (STT) and Voice Activity Detection

STT is where most teams fail. They assume "just use Whisper" or "just use Google Speech-to-Text" and move on. In production, STT latency isn't just about model inference—it's about buffering, streaming, and confidence thresholds.

There are two approaches: streaming STT and batch STT. Streaming STT (like Deepgram's real-time API) processes audio as it arrives, giving you partial results within 100–300ms. Batch STT waits for the user to stop speaking, then processes the entire utterance—faster per word, but higher latency for the user experience.

For enterprise voice agents, streaming is non-negotiable. Users shouldn't wait for silence detection before the system starts thinking. Deepgram, Google Cloud Speech-to-Text, and Azure Speech Services all support streaming, but they differ in accuracy, latency, and cost.

Voice activity detection (VAD) is the gating mechanism. It determines when the user has stopped speaking and the system should process. Poor VAD means either cutting off the user mid-sentence or waiting unnecessarily. Open-source models like Silero VAD run on-device with sub-10ms latency, which is critical for real-time systems.

Here's the production decision: use streaming STT with on-device VAD. This gives you partial transcripts in real time while the user is still speaking, and you can start LLM processing the moment VAD detects silence. This pattern cuts end-to-end latency by 500ms compared to waiting for full transcription.

Layer 2: Language Model Reasoning

This is where the agent logic lives. Your LLM receives the transcribed text and generates a response. The model choice matters enormously for latency.

Claude Opus 4 is the gold standard for reasoning-heavy tasks—multi-step workflows, complex context, clinical decision support. But it's slower. GPT-4 Turbo is faster with acceptable reasoning. Smaller models like Claude Haiku or GPT-4o mini are fast but sacrifice reasoning depth. In production, you'll often use a cascade: Haiku for simple queries, Opus for complex ones.

ElevenLabs' voice agent platform supports a "dual brain" architecture—a lightweight model for routing and intent classification, then a heavier model for actual reasoning. This pattern cuts latency by 40% because you're not running Opus on every request.

Latency targets for the LLM layer: 200–500ms for simple queries, up to 2 seconds for complex reasoning. If you're exceeding 2 seconds, you've either got the wrong model or you're doing too much work in a single call.

Context is another latency killer. If your agent needs to fetch customer data, check inventory, or query a knowledge base before generating a response, that's network I/O on top of model latency. The solution: pre-load context where possible, or use retrieval-augmented generation (RAG) with sub-100ms retrieval.

Layer 3: Text-to-Speech (TTS) and Voice Quality

TTS is the final layer, and it's where ElevenLabs excels. Traditional TTS (Google, Azure) produces robotic speech with 500ms+ latency. ElevenLabs uses neural models that sound natural and respond in 200–400ms.

But here's the production reality: TTS latency is often hidden by streaming. While the LLM is generating tokens, you can stream those tokens to ElevenLabs' API and start playing audio before the full response is generated. This is called token-level streaming, and it cuts perceived latency dramatically.

For enterprise use cases, voice quality matters. In healthcare, a clinical AI agent needs to sound trustworthy. In hospitality, a guest-facing agent needs personality. ElevenLabs supports custom voice cloning, which is critical for brand consistency. But voice cloning adds 5–10ms per token, so it's a trade-off.

Telephony Integration and Real-World Constraints

Building a voice agent is one thing. Connecting it to a phone system is another. This is where most pilots fail.

Telephony integration means handling SIP (Session Initiation Protocol), RTP (Real-time Transport Protocol), and DTMF (Dual-Tone Multi-Frequency) signalling. You're not just receiving audio—you're managing call state, handling transfers, detecting hangups, and integrating with legacy PBX systems.

There are three approaches:

Approach 1: Carrier-Grade Telephony Integration

You build a SIP endpoint that connects directly to your carrier. This requires infrastructure: SIP servers, RTP gateways, and call state management. Latency is minimal (50ms round-trip), but operational complexity is high. You're managing infrastructure, handling failover, and dealing with carrier-specific quirks.

This is appropriate only if you're deploying hundreds of concurrent calls and need carrier-grade reliability.

Approach 2: Telephony API (Twilio, Bandwidth, Vonage)

You use a telephony API that abstracts away the SIP complexity. Twilio's Voice API, for example, handles all the carrier integration. You receive audio as WebSocket streams and send responses back the same way. Latency is 100–200ms higher than direct SIP, but operational burden drops dramatically.

For most enterprise deployments, this is the right choice. You pay per minute (typically $0.01–0.05 per minute), but you avoid building infrastructure.

Approach 3: Hybrid (Telephony API + Direct SIP for High Volume)

Start with a telephony API for pilot and MVP. Once you hit volume constraints (typically 500+ concurrent calls), migrate to direct SIP integration. This is the pattern we recommend at Brightlume: ship fast with APIs, optimise infrastructure later.

When choosing a telephony provider, evaluate:

  • Latency: Measure round-trip time from your application to the carrier. Anything under 150ms is acceptable.
  • Codec support: Ensure they support your preferred codec (OPUS is standard for low-bandwidth, high-quality audio).
  • Reliability: Check their SLA. 99.99% uptime is the baseline for enterprise.
  • Cost: Calculate per-minute costs at your projected scale. A 5-minute call across 1,000 concurrent lines is 5,000 call-minutes, which is $50–250 for that batch of calls at $0.01–0.05/minute—and your carrier bill compounds every time those lines refill.
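A quick sanity check on the cost line above. Carriers bill per call-minute, so the sum is just total calls times average duration times rate; the rates here are the illustrative $0.01–0.05/minute range, not a quote from any specific provider.

```python
def telephony_cost(total_calls: int, avg_call_minutes: float,
                   per_minute_rate: float) -> float:
    """Carrier-style billing: total call-minutes multiplied by the rate."""
    return total_calls * avg_call_minutes * per_minute_rate

# 1,000 simultaneous 5-minute calls = 5,000 call-minutes:
low = telephony_cost(1_000, 5, 0.01)    # cheap-end rate
high = telephony_cost(1_000, 5, 0.05)   # expensive-end rate
```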

Latency Breakdown and Optimisation Patterns

Let's build a latency model. A typical voice agent call looks like this:

  1. User speaks: 2–5 seconds (variable)
  2. STT processing: 100–300ms (streaming)
  3. VAD detection: 50–200ms
  4. LLM inference: 200–2000ms (depends on complexity)
  5. TTS generation: 200–400ms (with streaming, parallelised)
  6. Audio playback: 0–500ms (network + jitter)

Total perceived latency (after user stops speaking): 550ms–3.4 seconds

In production, users tolerate up to 2 seconds. Anything beyond that feels broken. So you need to optimise aggressively.
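The latency model above can be tallied programmatically, which keeps the budget honest as you swap components. The component names and ranges below simply mirror the list; adjust them to your own measurements.

```python
# Per-component latency ranges in ms (from the model above).
BUDGET_MS = {
    "stt_streaming": (100, 300),
    "vad": (50, 200),
    "llm": (200, 2000),
    "tts": (200, 400),
    "playback": (0, 500),
}

def perceived_latency_range(budget: dict) -> tuple:
    """Best- and worst-case perceived latency after the user stops speaking."""
    best = sum(lo for lo, hi in budget.values())
    worst = sum(hi for lo, hi in budget.values())
    return best, worst
```

Run this in CI against your measured p95s per component and fail the build when the worst case drifts past your 2-second target.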

Optimisation Pattern 1: Token-Level Streaming

Don't wait for the LLM to finish generating. Stream tokens to TTS as they arrive. ElevenLabs supports this natively. You can start playing audio 100ms after the LLM starts generating, cutting perceived latency by 50%.
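A minimal sketch of the chunking side of token-level streaming: group LLM tokens into speakable fragments at clause boundaries, so each fragment can be handed to the TTS API while the rest of the response is still generating. The boundary heuristic (punctuation plus a minimum length) is illustrative; production systems tune it per voice.

```python
def chunk_for_tts(token_stream, min_chars: int = 40):
    """Yield speakable fragments from an LLM token stream as soon as a
    clause boundary appears, so TTS synthesis starts before the full
    response exists."""
    buf = ""
    for token in token_stream:
        buf += token
        if len(buf) >= min_chars and buf.rstrip().endswith((".", "!", "?", ",", ";")):
            yield buf.strip()
            buf = ""
    if buf.strip():
        yield buf.strip()  # flush whatever remains at end of stream
```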

Optimisation Pattern 2: Parallel Processing

While the LLM is reasoning, start preparing TTS. Pre-load voice settings, warm up the TTS connection, and buffer audio packets. This parallelisation saves 200–300ms.
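This parallelisation is a natural fit for asyncio: kick off the LLM call and the TTS warm-up concurrently, then synthesise once both are done. The `llm_call`, `tts_warmup`, and `tts_synthesise` callables are hypothetical async wrappers around your real clients.

```python
import asyncio

async def respond(transcript: str, llm_call, tts_warmup, tts_synthesise):
    """Warm the TTS connection while the LLM is still reasoning, instead
    of doing the two steps back to back."""
    llm_task = asyncio.create_task(llm_call(transcript))
    warm_task = asyncio.create_task(tts_warmup())  # runs concurrently
    reply, _ = await asyncio.gather(llm_task, warm_task)
    return await tts_synthesise(reply)
```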

Optimisation Pattern 3: Model Cascades

Use a lightweight model (Haiku) for 80% of requests. Reserve Opus for complex cases. This cuts average latency by 40% while maintaining reasoning quality.
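The cascade reduces to a router: classify with the cheap model, escalate only the intents you've marked as complex. Everything here is a stand-in for real API clients, and the intent names are made up for illustration.

```python
def cascade(query: str, classify, answer_light, answer_heavy,
            complex_intents=frozenset({"dispute", "clinical", "multi_step"})):
    """Route with a cheap classifier; send only complex intents to the
    expensive model."""
    intent = classify(query)
    if intent in complex_intents:
        return answer_heavy(query)   # e.g. the Opus-class model
    return answer_light(query)       # e.g. the Haiku-class model
```

In practice you'd also log which branch fired, because the 80/20 split is an assumption you should verify against real traffic.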

Optimisation Pattern 4: Context Pre-Loading

If your agent needs customer data, fetch it before the call starts. Store it in memory, not in a database. A 10ms memory lookup beats a 500ms database query every time.
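A sketch of that pre-loading, assuming you can identify the caller before the call connects. `fetch_record` stands in for your slow path (a DB or CRM query); mid-call reads then become dictionary lookups.

```python
import time

class CallContextCache:
    """Warm per-caller context into process memory before the call
    connects, so mid-call lookups are dict reads rather than DB queries."""
    def __init__(self, fetch_record, ttl_seconds: float = 300.0):
        self._fetch = fetch_record    # slow path, e.g. a CRM/DB query
        self._ttl = ttl_seconds
        self._store = {}

    def preload(self, caller_id: str) -> None:
        self._store[caller_id] = (time.monotonic(), self._fetch(caller_id))

    def get(self, caller_id: str):
        entry = self._store.get(caller_id)
        if entry and time.monotonic() - entry[0] < self._ttl:
            return entry[1]           # fast path: in-memory hit
        self.preload(caller_id)       # miss or stale: refetch once
        return self._store[caller_id][1]
```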

Optimisation Pattern 5: Codec Optimisation

Use OPUS codec for audio transmission, not G.711. OPUS compresses audio 4x better while maintaining quality, reducing network latency and bandwidth.

Enterprise Knowledge Integration

Voice agents don't work in isolation. They need access to enterprise knowledge—customer records, product catalogs, clinical guidelines, inventory systems. Deploying enterprise knowledge to voice agents requires careful architecture.

There are two patterns: pull and push.

Pull Pattern: The agent asks for data when needed. "What's the customer's account balance?" triggers a database lookup. This is flexible but slow—you're adding network I/O to every request.

Push Pattern: You load relevant data into the agent's context before the call starts. For inbound calls, you can identify the caller and pre-load their record. For outbound calls, you load the target's data upfront. This eliminates latency but requires predicting what data the agent will need.

In production, use both. For inbound calls, use pull for on-demand lookups ("What's my recent transaction history?") and push for common data (name, account status, contact info). For outbound calls, use push exclusively.

Enterprise AI voice agents often use a "dual brain" architecture: a lightweight model for routing and intent classification, then a heavier model for reasoning. This pattern reduces latency by 40% because you're not running expensive models on every request.

For healthcare specifically, clinical knowledge integration is non-negotiable. Patient intake agents need access to medical records, medication histories, and clinical guidelines. But healthcare data is sensitive—you can't send it to a third-party API. The solution: run the LLM on-premises or use a private VPC endpoint with your cloud provider.

Security and Governance in Voice Agents

Voice agents handle sensitive data: customer information, health records, financial details. Securing them against prompt injection and data leaks is non-negotiable.

Voice-specific security concerns:

Prompt Injection via Voice: An attacker can craft audio that, when transcribed, injects prompts into the system. "Transfer me to [prompt injection attack]" becomes a text instruction to the LLM. Mitigation: validate all transcribed text against a whitelist of expected intents before passing to the LLM.
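A minimal sketch of that validation gate. The intent whitelist and the injection phrases are illustrative; a production filter would be tuned to your own intents and adversarial testing.

```python
import re

ALLOWED_INTENTS = {"check_balance", "book_table", "transfer_to_agent"}

INJECTION_MARKERS = re.compile(
    r"(ignore (all |previous )?instructions|system prompt|you are now)",
    re.IGNORECASE)

def validate_intent(classified_intent: str, transcript: str) -> str:
    """Gate transcribed text before it reaches the LLM prompt: only
    whitelisted intents pass, and instruction-like phrases in the raw
    transcript are rejected outright."""
    if classified_intent not in ALLOWED_INTENTS:
        return "reject:unknown_intent"
    if INJECTION_MARKERS.search(transcript):
        return "reject:suspected_injection"
    return classified_intent
```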

Audio Eavesdropping: Voice calls are transmitted over networks. Use TLS for all connections, and encrypt audio in transit. Twilio and ElevenLabs both support encryption.

Hallucination and Misinformation: An LLM might generate incorrect information. In healthcare, this is dangerous. Mitigation: use retrieval-augmented generation (RAG) so the agent only references verified data. Add human-in-the-loop for high-stakes decisions.

Data Retention: Voice calls are recorded. Ensure you have a retention policy and comply with regulations (GDPR, HIPAA, Australian Privacy Act). Delete recordings after they're no longer needed.

Bias and Fairness: Voice models can have accent bias or gender bias. Test your STT and TTS across diverse speakers, and test cloned voices too rather than assuming they behave neutrally.

For enterprise deployments, implement role-based access control (RBAC) so agents can only access data they're authorised to use. A patient intake agent shouldn't have access to billing records.
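The RBAC check reduces to a deny-by-default scope lookup before any tool or data call the agent makes. Roles and data domains below are hypothetical examples, not a prescribed schema.

```python
# Which data domains each agent role may touch (deny by default).
AGENT_SCOPES = {
    "patient_intake": {"demographics", "medication_history"},
    "billing_assistant": {"demographics", "billing"},
}

def authorised(agent_role: str, data_domain: str) -> bool:
    """Return True only if the role's scope explicitly includes the domain."""
    return data_domain in AGENT_SCOPES.get(agent_role, set())
```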

Architectural Patterns for Scale

Once you've built a working voice agent, the next challenge is scaling it. Here's how production systems handle volume.

Pattern 1: Cascaded Architecture (STT → LLM → TTS)

This is the standard pattern. Each layer is a separate microservice. You send audio to STT, get text back, send text to LLM, get response back, send response to TTS, get audio back.

Pros: Simple to understand, easy to debug, can optimise each layer independently.

Cons: Latency accumulates across layers. If STT takes 200ms, LLM takes 500ms, and TTS takes 300ms, total latency is 1 second just for processing.

When to use: Pilot and MVP. This is what you'll build in 90 days.

Pattern 2: End-to-End Models (Speech → Speech)

Rather than cascading separate models, use a single model that takes audio and outputs audio. This eliminates intermediate latency.

Pros: Lower latency (200–400ms total), more natural interactions (the model understands prosody and emotion).

Cons: Expensive to train, limited customisation, requires specialised infrastructure.

When to use: Only if you have 10,000+ concurrent calls and latency is the bottleneck. This is a scaling problem, not a pilot problem.

For most enterprises, cascaded architecture with token-level streaming is the sweet spot.

Pattern 3: Multi-Region Deployment

If you're serving customers across geographies, deploy agents in multiple regions. Route calls to the closest region to minimise network latency.

Latency impact: Can save 50–100ms per call.

Cost impact: 2–3x infrastructure cost.

When to use: When you have 1,000+ concurrent calls and latency SLAs are strict.

Evaluating and Monitoring Voice Agents

You can't optimise what you don't measure. Here are the key metrics for production voice agents.

Latency Metrics

  • STT latency: Time from audio arrival to transcription. Target: <300ms.
  • LLM latency: Time from transcription to response generation. Target: <1s for simple queries, <2s for complex.
  • TTS latency: Time from text to audio. Target: <400ms.
  • End-to-end latency: Total time from user finishing speaking to agent starting to speak. Target: <2s.
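When you alert on these targets, alert on tail latency rather than the mean, because the tail is what individual callers actually experience. A minimal nearest-rank p95 check, as one reasonable choice of percentile method:

```python
import math

def p95(samples_ms):
    """Nearest-rank 95th percentile of a list of latency samples."""
    ordered = sorted(samples_ms)
    rank = math.ceil(0.95 * len(ordered)) - 1
    return ordered[rank]

def within_target(samples_ms, target_ms):
    """True if the p95 latency meets the target."""
    return p95(samples_ms) <= target_ms
```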

Quality Metrics

  • Word error rate (WER): Percentage of words transcribed incorrectly. Target: <5% for clean audio, <10% for noisy environments.
  • Intent accuracy: Percentage of requests where the agent understood the user's intent correctly. Target: >90%.
  • Response quality: Manual evaluation of response appropriateness. This requires listening to calls and rating them 1–5. Target: >4/5 for 80%+ of calls.

Business Metrics

  • Call completion rate: Percentage of calls that completed successfully. Target: >95%.
  • Escalation rate: Percentage of calls escalated to human agents. Target: <10% (varies by use case).
  • Cost per call: Total cost (infrastructure, API calls, labour) divided by number of calls. Target: <$0.50 for simple calls, <$2.00 for complex.
  • Customer satisfaction: CSAT or NPS for calls handled by the agent. Target: >70%.

Resources such as Building Enterprise Realtime Voice Agents from Scratch provide detailed benchmarks for cascaded versus end-to-end architectures. Use these as baselines for your own systems.

Implementation Roadmap: 90-Day Deployment

Here's how to ship a production voice agent in 90 days.

Weeks 1–2: Architecture and Proof of Concept

Choose your stack:

  • Telephony: Twilio Voice API
  • STT: Deepgram or Google Cloud Speech-to-Text
  • LLM: Claude Opus 4 (reasoning) and Haiku (routing)
  • TTS: ElevenLabs

Build a simple proof of concept: phone call → transcription → LLM response → speech. Measure end-to-end latency. If it's >3 seconds, optimise before moving forward.

Weeks 3–4: Enterprise Integration

Integrate with your knowledge base and customer data. If you're in healthcare, integrate with EHR systems. If you're in hospitality, integrate with reservation systems.

Implement security controls against prompt injection and data leaks: input validation, output filtering, and audit logging.

Weeks 5–6: Optimisation and Testing

Optimise latency using the patterns above. Run load tests: 10 concurrent calls, then 50, then 100. Measure CPU, memory, and API costs at each scale.

Test edge cases: accents, background noise, unclear audio, interruptions, silence. Measure WER and intent accuracy. If either is <90%, retrain or adjust model selection.

Weeks 7–8: Pilot Deployment

Deploy to a small cohort: 1,000 calls across a week. Monitor all metrics. Collect user feedback. Identify failure modes.

Common failures: agent doesn't understand regional accents, agent gives wrong information, latency spikes during peak hours. Address each one before scaling.

Weeks 9–12: Full Deployment

Scale to production volume. Implement monitoring and alerting. Set up on-call rotations. Document runbooks for common issues.

At Brightlume, we follow this pattern religiously. It's why we hit 85%+ pilot-to-production rates.

Why Most Voice Agent Projects Fail

Before we wrap up, let's be direct about why most voice agent projects fail.

Failure 1: Underestimating Latency

Teams build a prototype that works in a lab, then deploy it to production and discover latency is 4–5 seconds. Users hate it. The project gets killed.

Solution: measure latency obsessively from day one. Use the latency model above to set targets and hit them.

Failure 2: Overestimating Model Capability

Teams assume Claude Opus or GPT-4 can handle any task. In production, the model hallucinates, gives wrong information, or misunderstands context.

Solution: use RAG so the model only references verified data. Add human-in-the-loop for high-stakes decisions. Test thoroughly before deployment.

Failure 3: Ignoring Telephony Complexity

Teams build a voice agent but don't integrate it with a phone system. They try to use a SIP library they found on GitHub. It breaks under load.

Solution: use a telephony API (Twilio, Bandwidth, Vonage). Let them handle the complexity. You focus on the agent logic.

Failure 4: Lack of Security Planning

Teams deploy voice agents without considering data privacy, prompt injection, or compliance. They get hacked or sued.

Solution: plan security from day one. Implement RBAC, encrypt data in transit, validate inputs, audit outputs.

Failure 5: No Clear Success Metrics

Teams deploy without defining what success looks like. They can't tell if the agent is actually helping the business.

Solution: define metrics upfront. Call completion rate, escalation rate, cost per call, CSAT. Measure them obsessively.

The ElevenLabs Advantage for Enterprise Voice

Why ElevenLabs specifically? Because they've solved the voice quality and latency problem at scale.

Traditional TTS (Google, Azure) sounds robotic. ElevenLabs uses neural models trained on real human speech. The result sounds natural. This matters in enterprise because users trust natural-sounding voices more.

Latency: ElevenLabs' API responds in 200–400ms. With token-level streaming, perceived latency drops to 100–200ms. That's production-grade.

Customisation: ElevenLabs supports voice cloning. You can clone a brand voice or a specific person's voice. This is critical for enterprise because consistency builds trust.

Reliability: ElevenLabs has 99.99% uptime SLA. They handle millions of API calls daily. Their infrastructure is battle-tested.

Cost: ElevenLabs charges per character synthesised, not per API call. A 10-second spoken response runs to roughly 150 characters and costs fractions of a cent. At scale, this is cheaper than alternatives.

Bringing It Together: A Real-World Example

Let's walk through a real deployment. We're building a voice agent for a hospitality group: 50 hotels, 10,000 guest calls per week.

Architecture:

  • Telephony: Twilio Voice API (handles inbound calls from guests)
  • STT: Deepgram (streaming, 250ms latency)
  • Intent routing: Claude Haiku (100ms latency)
  • Reasoning: Claude Opus 4 (500ms latency for complex queries)
  • TTS: ElevenLabs (300ms latency with token-level streaming)
  • Knowledge: Pre-loaded guest records (reservation, preferences, loyalty status)

Latency Breakdown:

  • Guest speaks: 3 seconds (variable)
  • STT + VAD: 250ms
  • Intent routing (Haiku classifies simple vs complex): 100ms
  • If simple (80% of queries): Haiku generates the response (100ms) + TTS (300ms) = 750ms perceived
  • If complex (20% of queries): Opus generates the response (500ms) + TTS (300ms) = 1,150ms perceived
  • Average perceived latency: ~830ms ✓ Well under the 2-second target

First Month Results:

  • 2,500 calls handled
  • 92% completion rate (8% escalated to human)
  • WER: 3.5% (excellent)
  • Intent accuracy: 94%
  • Cost per call: $0.35
  • Guest CSAT: 4.2/5

ROI: At $0.35 per call, 10,000 calls per week costs $3,500/week or $182,000/year. The group was paying $500,000/year for human agents. Payback: 4 months. After that, pure margin.

This is what production looks like. Not perfect, but good enough. And shipped in 90 days.

Next Steps: Moving from Pilot to Production

If you're a CTO or engineering leader evaluating voice agents, here's what to do next.

First, audit your current state. Do you have voice infrastructure? What's your call volume? What's your latency tolerance? What data do agents need to access?

Second, run a 2-week proof of concept. Build the simplest possible agent: phone call → transcription → fixed response. Measure latency. If it's <2 seconds, you're on the right track.

Third, think about integration. Where does the agent fit in your workflow? Is it inbound (customer calls in) or outbound (you call the customer)? Does it need to integrate with your CRM, EHR, or inventory system?

Fourth, plan for security. What data will the agent access? How will you protect it? What compliance requirements apply (GDPR, HIPAA, Australian Privacy Act)?

Fifth, measure obsessively. Define success metrics upfront. Call completion rate, escalation rate, latency, cost, CSAT. Track them from day one.

At Brightlume, we help teams move from pilot to production in 90 days. We've done it across healthcare, hospitality, financial services, and more. The pattern is always the same: start with a clear architecture, optimise relentlessly, and ship with confidence.

Voice agents are the future of customer interaction. The teams that ship them first win. The teams that get bogged down in perfectionism lose. Build fast, measure everything, and iterate based on real-world data.

If you're ready to ship a production voice agent, let's talk. We'll help you avoid the pitfalls and hit your 90-day target.