Understanding the Caching Problem in Production LLM Applications
You're running Claude Opus 4 or GPT-4 at scale. Your token costs are climbing. Your latency is creeping up. Users are seeing 3–5 second response times when they should see 500 milliseconds. This isn't a model problem—it's an architecture problem.
Caching in large language model applications is the single highest-impact performance lever you can pull before throwing infrastructure at the problem. We've seen teams reduce their LLM API spend by 40–80% and cut response latency by 250x through deliberate caching strategy, not by switching models or optimising prompts.
The challenge is that traditional caching—the kind that works for databases and HTTP responses—breaks down with LLMs. You can't just hash a prompt and store the response. Users ask semantically identical questions in different ways. A user asking "What's the weather in Sydney?" and "Tell me about Sydney's climate today" should hit the same cached result, but exact-match caching won't catch it. You need a layered approach: exact-match prompt caching, response-level caching, and semantic caching working together.
This article walks you through the three core caching patterns that production teams use to ship cost-efficient, low-latency LLM applications. We'll cover the mechanics, the trade-offs, and the gotchas that catch teams in production.
The Three Caching Layers: A Mental Model
Think of caching for LLMs as three nested layers, each handling a different problem:
Prompt-level caching stops redundant API calls before they happen. If you're processing the same document 100 times, you cache the embeddings or the tokenised prompt itself. This is the fastest, cheapest layer—a cache hit here costs you almost nothing.
Response-level caching stores the final output of an LLM call. You ask Claude the same question twice; the second time, you return the cached response without calling the API. This saves the full token cost of the output, but not the input.
Semantic caching is the sophisticated layer. It recognises that "What's the capital of Australia?" and "Name Australia's capital city" are the same question, even though the text is different. It uses embeddings to match query intent rather than exact strings. Research from NeurIPS on LLM-enabled semantic caching for affordable web access demonstrates how semantic matching can dramatically improve cache hit rates in production systems.
Most production teams use all three in combination. You start with prompt caching to handle deterministic inputs, layer response caching on top for common queries, and add semantic caching for user-facing applications where the same intent arrives in many forms.
Prompt-Level Caching: The Foundation
Prompt-level caching is the simplest and most reliable layer. You cache the input before it reaches the LLM API.
How Prompt Caching Works
When you send a request to Claude Opus 4, you're sending tokens. Those tokens are processed by the model, and you're charged for every single one. If you send the same 10,000 tokens to the API 100 times, you're paying for 1 million tokens of input, even though the computation is identical.
Prompt caching intercepts that input. You hash the prompt (or a canonical representation of it), check if you've seen it before, and if you have, you skip the API call entirely or use cached embeddings of that prompt.
There are two flavours:
Exact-match prompt caching: You hash the exact prompt text. If the hash matches a previous request, you reuse the cached result. This works brilliantly for deterministic workflows—batch processing, document analysis, system prompts that never change.
Embedding-based prompt caching: You convert the prompt to an embedding (a vector representation), store that embedding, and on future requests check whether a close-enough embedding already exists in your cache. This is slightly more flexible than exact matching but requires an embedding model.
Practical Implementation
The simplest implementation uses a key-value store (Redis, DynamoDB, or even an in-memory hash map for small deployments):
1. Hash the prompt (or embed it)
2. Check the cache for that hash/embedding
3. If hit: return the cached response
4. If miss: call the LLM API, store the response, return it
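A minimal in-memory sketch of those four steps, using SHA-256 for hashing. The dict stands in for Redis or DynamoDB, and `call_llm` is a hypothetical stand-in for your API client:

```python
import hashlib

# In-memory dict stands in for Redis/DynamoDB; call_llm is a hypothetical API client.
_cache: dict[str, str] = {}

def hash_prompt(prompt: str) -> str:
    # Canonicalise (strip whitespace) before hashing so trivial differences still hit
    return hashlib.sha256(prompt.strip().encode("utf-8")).hexdigest()

def cached_call(prompt: str, call_llm) -> str:
    key = hash_prompt(prompt)
    if key in _cache:
        return _cache[key]       # hit: skip the API entirely
    response = call_llm(prompt)  # miss: pay for the call once
    _cache[key] = response
    return response
```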
For example, if you're processing 100 customer support tickets and 40 of them ask variations of the same question, exact-match caching won't catch those 40, because only identical text produces the same hash. But if you're running a batch job that processes the same system prompt + document structure 1,000 times, exact-match caching is gold. You pay for the API call once and serve 999 requests from cache.
When Prompt Caching Wins
- Batch processing: Same prompt, different data (e.g., "Summarise this document").
- System prompts: Your instruction set never changes; only the user input varies.
- RAG workflows: You're embedding and caching the retrieval context before sending to the LLM.
- Multi-turn conversations with fixed context: First turn is expensive; subsequent turns reuse the cached context.
The Gotcha: Staleness and Invalidation
Prompt caching assumes your cached response is still valid. If your system prompt changes, your cached responses become wrong. You need an invalidation strategy:
- Time-based expiry (TTL): Cache entries expire after N minutes or hours.
- Event-based invalidation: When a system prompt updates, you clear the cache.
- Versioning: You version your prompts and include the version in the cache key, so v1 and v2 prompts don't collide.
For production systems, version-based invalidation is most reliable. Tag every cached entry with the prompt version, and when you deploy a new prompt, the old cached entries are automatically orphaned.
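A version-tagged cache key can be as simple as a prefix. The `v2` tag here is illustrative; in practice it would come from your deployment config:

```python
import hashlib

PROMPT_VERSION = "v2"  # bump this on every prompt deployment

def versioned_key(prompt: str, version: str = PROMPT_VERSION) -> str:
    # Entries keyed under v1 are orphaned automatically once v2 ships
    digest = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    return f"{version}:{digest}"
```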
Response-Level Caching: Capturing the Output
Response-level caching is simpler conceptually: you cache the final output of an LLM call. Same question, same answer—serve it from cache.
Mechanics
You create a cache key from the prompt (usually a hash), call the LLM, and store the response with that key. On the next identical request, you return the cached response without calling the API.
This saves the full cost of the output tokens. If your LLM call generates 500 output tokens, and you serve 100 requests from cache, you save 50,000 output tokens—a meaningful cost reduction for high-volume applications.
Where Response Caching Shines
Q&A applications: Users ask common questions repeatedly. "What's the refund policy?" gets asked 1,000 times a day. Cache the response, serve it 999 times from cache.
Content generation: You generate a product description once, serve it to every user who views that product.
Classification and tagging: You classify a document once; if the same document is reprocessed, return the cached classification.
Implementation Patterns
LangChain's official documentation on LLM caching provides production-ready implementations. You can use Redis, DynamoDB, or any key-value store:
import hashlib

def hash_prompt(prompt: str) -> str:
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()

class ResponseCache:
    def __init__(self, backend):
        self.backend = backend  # Redis, DynamoDB, etc.

    def get(self, prompt_hash):
        return self.backend.get(prompt_hash)

    def set(self, prompt_hash, response, ttl_seconds):
        self.backend.set(prompt_hash, response, ex=ttl_seconds)

    def call_llm_with_cache(self, prompt, model, ttl=3600):
        hash_key = hash_prompt(prompt)
        cached = self.get(hash_key)
        if cached:
            return cached
        response = call_llm(prompt, model)  # call_llm: your LLM client wrapper (not shown)
        self.set(hash_key, response, ttl)
        return response
For high-volume applications, you'll want a distributed cache (Redis cluster) to avoid cache coherence issues across multiple servers.
The Cost-Latency Trade-off
Response caching saves output tokens but doesn't reduce input token costs. If you're calling GPT-4 with a 2,000-token prompt and getting a 500-token response, caching saves you the 500 tokens on cache hits but not the 2,000. For input-heavy applications, response caching alone isn't enough—you need prompt or semantic caching.
Semantic Caching: Intent-Based Matching
Semantic caching is where the magic happens. Instead of matching exact strings, you match intent. "What's the weather in Sydney?" and "Tell me about today's weather in Sydney" are semantically identical, even though the text differs.
How Semantic Caching Works
You convert each prompt to an embedding—a vector representation of its meaning. You store that embedding in a vector database (Pinecone, Weaviate, Qdrant) or use a library like GPTCache, which specialises in semantic caching for LLM applications.
When a new prompt arrives, you embed it and search your vector database for similar embeddings. If you find a match above a similarity threshold (e.g., cosine similarity > 0.95), you return the cached response without calling the LLM.
The Architecture
1. New prompt arrives
2. Embed the prompt (e.g., using text-embedding-3-small)
3. Query vector DB: "Find embeddings similar to this"
4. If similarity > threshold: return cached response
5. If no match: call LLM, embed the prompt, store in vector DB
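The five steps above can be sketched with an in-memory store and a linear scan standing in for the vector database. `embed` and `call_llm` are hypothetical stand-ins for your embedding model and LLM client:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

class SemanticCache:
    """In-memory sketch; production systems would use a vector DB instead."""

    def __init__(self, threshold: float = 0.95):
        self.threshold = threshold
        self.entries: list[tuple[np.ndarray, str]] = []

    def lookup(self, embedding: np.ndarray):
        best_response, best_sim = None, -1.0
        for emb, response in self.entries:  # linear scan; a vector DB does ANN search
            sim = cosine_similarity(embedding, emb)
            if sim > best_sim:
                best_response, best_sim = response, sim
        return best_response if best_sim >= self.threshold else None

    def query(self, prompt: str, embed, call_llm) -> str:
        emb = embed(prompt)
        cached = self.lookup(emb)
        if cached is not None:
            return cached                    # semantic hit: skip the LLM
        response = call_llm(prompt)          # miss: pay once, then store
        self.entries.append((emb, response))
        return response
```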
This is more expensive than exact-match caching (you're doing embedding lookups), but it catches far more cache hits. Research on semantic caching for LLM apps shows cost reductions of 40–80% and latency improvements of 250x in production systems.
Similarity Threshold: The Critical Dial
Your similarity threshold controls the trade-off between cache hits and correctness:
- Threshold 0.99: Very conservative. Only near-identical queries hit the cache. Fewer false positives, lower hit rate.
- Threshold 0.95: Moderate. Catches paraphrases and minor variations. Most production systems sit here.
- Threshold 0.90: Aggressive. High hit rate, but risk of returning slightly wrong answers for queries that are semantically similar but contextually different.
For customer-facing applications, we recommend starting at 0.95 and measuring cache hit rate and user satisfaction. If you're seeing complaints about "wrong" cached answers, tighten the threshold to 0.97.
Semantic Caching in the Wild
A detailed explainer on the full caching stack for production LLM apps covers semantic caching alongside TTL invalidation and the cache stampede problem. The stampede problem occurs when multiple requests arrive simultaneously for a cache miss—they all hit the LLM API at once, creating a spike. You prevent this with request coalescing: the first request calls the LLM; subsequent requests wait for that result and share the response.
Tools and Libraries
You don't have to build semantic caching from scratch. Open-source tools handle the heavy lifting:
- GPTCache: Specialised for LLM caching, supports multiple vector backends.
- LiteLLM: Caching layer that works with any LLM provider (OpenAI, Anthropic, etc.).
- LangChain: Integrates caching backends, including semantic caching via GPTCache.
- Bifrost: Focuses on prompt routing and caching for multi-model deployments.
A guide to open-source semantic caching tools provides implementation examples for each.
Combining the Three Layers: A Production Stack
The most effective production systems don't use just one caching pattern—they layer them:
Layer 1: Prompt-Level Caching (Fastest, Cheapest)
Check for exact prompt matches first. If the user is asking the exact same question as someone else, serve the cached response immediately. This is a Redis lookup—sub-millisecond latency.
Layer 2: Semantic Caching (High Hit Rate)
If no exact match, check semantic similarity. Embed the prompt, query your vector database, and if you find a match above threshold, return the cached response. This adds embedding latency (10–50ms) but catches paraphrases and variations.
Layer 3: Response Caching (Fallback)
After calling the LLM, cache the response for future exact matches. This is your safety net—even if semantic caching misses, you're building a cache for the next time someone asks the same thing.
Practical Example: Customer Support Bot
A customer asks, "How do I reset my password?" Your system:
- Prompt cache check: Hash the prompt, query Redis. Miss.
- Semantic cache check: Embed the prompt, query Pinecone. Hit! A previous user asked "How do I change my password?" with 0.96 similarity. Return cached response (50ms).
- If semantic cache missed, call Claude, cache the response.
Result: Most requests served in 50–100ms from semantic cache. No API calls. Cost per request approaches zero.
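The three-layer lookup for a bot like this can be sketched as one function. All stores and the `embed`/`call_llm` callables are hypothetical stand-ins for your own clients:

```python
import hashlib

def hash_prompt(prompt: str) -> str:
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()

def answer(prompt, exact_cache, semantic_find, semantic_add, embed, call_llm):
    # Layer 1: exact match (sub-millisecond key-value lookup)
    key = hash_prompt(prompt)
    if key in exact_cache:
        return exact_cache[key]
    # Layer 2: semantic match (embedding + vector search, tens of ms)
    emb = embed(prompt)
    cached = semantic_find(emb)
    if cached is not None:
        exact_cache[key] = cached   # promote: next identical prompt hits layer 1
        return cached
    # Layer 3: full miss; call the model and populate both caches
    response = call_llm(prompt)
    exact_cache[key] = response
    semantic_add(emb, response)
    return response
```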
Latency and Cost Optimisation: The Numbers
Cost Reduction
Let's model a real scenario. You're running a knowledge base chatbot with Claude Opus 4:
- Input tokens per request: 2,000 (prompt + context)
- Output tokens per request: 300
- Daily requests: 10,000
- Cost per 1M input tokens: $15
- Cost per 1M output tokens: $60
Without caching:
- Daily input cost: 10,000 × 2,000 / 1,000,000 × $15 = $300
- Daily output cost: 10,000 × 300 / 1,000,000 × $60 = $180
- Total daily cost: $480
With 60% cache hit rate (semantic + response caching):
- Requests hitting cache: 6,000 (no API cost)
- Requests calling API: 4,000
- Daily input cost: 4,000 × 2,000 / 1,000,000 × $15 = $120
- Daily output cost: 4,000 × 300 / 1,000,000 × $60 = $72
- Total daily cost: $192
- Savings: $288 per day, a 60% reduction
Over a year, that's roughly $105,000 saved. And that's before optimising embedding costs or vector database queries; those are negligible compared to LLM API costs.
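The arithmetic generalises to a small cost model you can plug your own traffic numbers into:

```python
def daily_cost(requests, in_tokens, out_tokens, in_price, out_price, hit_rate=0.0):
    """Daily API spend in dollars; cache hits cost nothing. Prices are per 1M tokens."""
    billable = requests * (1 - hit_rate)  # only misses reach the API
    input_cost = billable * in_tokens / 1_000_000 * in_price
    output_cost = billable * out_tokens / 1_000_000 * out_price
    return input_cost + output_cost
```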
Latency Reduction
Cached responses are orders of magnitude faster:
- API call: 500ms–2s (network round-trip + model inference)
- Prompt cache hit: <1ms (Redis lookup)
- Semantic cache hit: 50–100ms (embedding + vector DB lookup)
- Response cache hit: 10–50ms (cache lookup)
For user-facing applications, this is the difference between "snappy" and "slow". A 2-second response feels sluggish; a 100ms response feels instant.
Advanced Patterns: Stampede Prevention and TTL Strategies
The Cache Stampede Problem
Imagine this: A cached response expires. 50 requests arrive simultaneously, all missing the cache. They all call the LLM at once, creating a sudden spike in API usage and latency. This is the cache stampede.
You prevent it with request coalescing: when multiple requests arrive for the same cache miss, only the first one calls the LLM. The others wait for that result and share it.
import threading
from concurrent import futures

class CoalescedCache:
    def __init__(self):
        self.in_flight = {}            # key -> Future for the request in flight
        self.lock = threading.Lock()   # guards in_flight across threads

    def get_with_coalesce(self, key, fetch_fn):
        with self.lock:
            future = self.in_flight.get(key)
            if future is not None:
                waiting = True         # another request is already fetching this key
            else:
                waiting = False
                future = futures.Future()
                self.in_flight[key] = future
        if waiting:
            return future.result()     # block until the first request finishes
        try:
            result = fetch_fn()
            future.set_result(result)  # wake all waiters with the shared result
            return result
        except Exception as exc:
            future.set_exception(exc)  # propagate failure to waiters too
            raise
        finally:
            with self.lock:
                del self.in_flight[key]
This is critical in high-traffic systems. Without coalescing, a stampede can double or triple your API costs in seconds.
TTL Strategies
How long should you cache responses? It depends on your data freshness requirements:
- Static content (e.g., "What's the capital of France?"): Cache forever (or until you update your knowledge base).
- Slowly changing content (e.g., product descriptions): 24 hours.
- Real-time content (e.g., stock prices): 1 minute or don't cache at all.
- User-specific content (e.g., personalised recommendations): Don't cache across users; cache per-user with 1-hour TTL.
For most LLM applications, a 24-hour TTL is a sensible default. Pair it with event-based invalidation: when you update your knowledge base, clear the cache for affected queries.
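Those freshness tiers fit in a small policy table. A minimal sketch, with illustrative category names:

```python
from typing import Optional

# Illustrative freshness tiers; category names are hypothetical.
TTL_POLICY: dict[str, Optional[int]] = {
    "static": None,              # cache until explicitly invalidated
    "slow_changing": 24 * 3600,  # e.g. product descriptions
    "real_time": 60,             # e.g. stock prices
    "user_specific": 3600,       # keyed per user, never shared
}

def ttl_for(category: str) -> Optional[int]:
    return TTL_POLICY.get(category, 24 * 3600)  # sensible default: 24 hours
```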
Semantic Caching Deep Dive: Embedding Models and Similarity Metrics
Choosing an Embedding Model
Your semantic caching quality depends entirely on your embedding model. A poor embedding model will miss semantic similarities or match unrelated queries.
For production systems, use text-embedding-3-small (OpenAI) or text-embedding-3-large. These are state-of-the-art and have been trained on diverse data:
- text-embedding-3-small: 1,536 dimensions by default, fast, good for most applications. ~$0.02 per 1M tokens.
- text-embedding-3-large: 3,072 dimensions, more expressive, better for nuanced similarity. ~$0.13 per 1M tokens.
For cost-sensitive applications, consider open-source models like all-MiniLM-L6-v2 (run locally, zero API cost) or all-mpnet-base-v2 (larger, better quality).
Similarity Metrics
You'll measure similarity between embeddings using one of three metrics:
Cosine similarity: Most common. Measures the angle between vectors. Range: -1 to 1 (1 = identical direction).
Euclidean distance: Measures straight-line distance between vectors. Smaller = more similar.
Dot product: Fast but sensitive to vector magnitude. Use only if vectors are normalised.
For LLM caching, cosine similarity is the standard. It's robust and interpretable.
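Each of the three metrics is a line of NumPy; note that for unit-normalised vectors, dot product and cosine similarity coincide:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Angle-based: invariant to vector magnitude
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def euclidean_distance(a: np.ndarray, b: np.ndarray) -> float:
    # Straight-line distance: smaller means more similar
    return float(np.linalg.norm(a - b))

def dot_product(a: np.ndarray, b: np.ndarray) -> float:
    # Fast, but sensitive to magnitude; safe only on normalised vectors
    return float(np.dot(a, b))
```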
Threshold Tuning
A practical guide on building LLM caching strategies covers threshold tuning in depth. The process:
- Set threshold to 0.95.
- Monitor cache hit rate and user satisfaction.
- If hit rate is low (<40%), lower threshold to 0.93.
- If users report incorrect cached answers, raise threshold to 0.97.
- Iterate until balanced.
Most teams converge on 0.94–0.96 for general-purpose applications.
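One iteration of that tuning loop, sketched as a function. The step size and bounds are illustrative, not prescriptive:

```python
def adjust_threshold(threshold: float, hit_rate: float, false_positive_rate: float) -> float:
    """One step of the tuning loop; returns the threshold to try next."""
    if false_positive_rate > 0.02:            # users seeing wrong cached answers
        return round(min(threshold + 0.01, 0.99), 2)
    if hit_rate < 0.40:                       # cache barely earning its keep
        return round(max(threshold - 0.01, 0.90), 2)
    return threshold                          # balanced; leave it alone
```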
Enterprise Considerations: Governance and Compliance
At Brightlume, we work with enterprises that need more than just cost savings. They need governance, auditability, and compliance.
Cache Governance
Data residency: Ensure cached data (especially sensitive customer data) stays within your jurisdiction. Use self-hosted vector databases (Qdrant, Weaviate) rather than cloud-managed services if compliance requires it.
Access control: Who can read from the cache? For multi-tenant systems, implement tenant isolation in your cache keys. A user from Company A should never see cached responses from Company B.
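Tenant isolation is cheapest to enforce in the cache key itself. A hypothetical sketch:

```python
import hashlib

def tenant_cache_key(tenant_id: str, prompt: str) -> str:
    # Prefixing with the tenant ID guarantees Company A can never read Company B's entries
    digest = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    return f"tenant:{tenant_id}:{digest}"
```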
Audit logging: Log every cache hit and miss. This is critical for compliance audits and debugging.
Security
Your cache contains sensitive data—customer queries, LLM responses, potentially PII. Encrypt it:
- In transit: Use TLS/SSL for Redis connections.
- At rest: Enable encryption in your vector database and cache backend.
- Key rotation: Rotate encryption keys regularly.
Compliance
For healthcare, financial services, or other regulated industries, caching introduces compliance questions:
- Data retention: How long do you keep cached responses? Set TTLs accordingly.
- Right to be forgotten: If a customer requests data deletion, you need to purge their cached responses. Implement this with customer-tagged cache keys.
- Explainability: If you return a cached response, can you explain why? For regulated use cases, you may need to log "served from cache" in your audit trail.
Brightlume's AI strategy and governance expertise helps teams navigate these challenges at scale.
Position-Independent Caching and Advanced Research
Caching research is moving fast. Recent work explores more sophisticated patterns:
MEPIC: Memory-Efficient Position-Independent Caching
MEPIC research proposes position-independent caching for LLM inference. Traditional caching assumes cached tokens must appear in the same position in the context. MEPIC enables chunk-level reuse even if chunks appear in different positions, with selective recomputation.
This is still research territory, but it hints at where caching will go: more granular, more flexible, better cache hit rates on the same infrastructure.
AI Traffic and CDN Caching
Recent work on rethinking web cache design for the AI era analyses how AI traffic (LLM requests) differs from traditional web traffic. AI requests are larger, less repetitive, and more compute-intensive. This changes optimal caching strategies. For example, edge caching (serving cached responses from geographically close servers) becomes more valuable for latency-sensitive AI applications.
Implementation Checklist: From Theory to Production
Ready to implement caching? Here's a checklist:
Phase 1: Exact-Match Prompt Caching (Week 1)
- [ ] Set up Redis or DynamoDB
- [ ] Implement prompt hashing (use SHA-256)
- [ ] Add cache checks before LLM calls
- [ ] Set TTL to 24 hours
- [ ] Measure cache hit rate
Phase 2: Response-Level Caching (Week 2)
- [ ] Extend cache to store full responses
- [ ] Monitor cache size (don't let it grow unbounded)
- [ ] Implement cache eviction policy (LRU or LFU)
- [ ] Measure cost savings
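The eviction item in Phase 2 can be prototyped with an OrderedDict before you reach for a dedicated cache. A minimal LRU sketch, not production code:

```python
from collections import OrderedDict

class LRUCache:
    """Minimal LRU eviction sketch for a response cache."""

    def __init__(self, max_entries: int = 10_000):
        self.max_entries = max_entries
        self.data = OrderedDict()

    def get(self, key):
        if key not in self.data:
            return None
        self.data.move_to_end(key)         # mark as recently used
        return self.data[key]

    def set(self, key, value):
        self.data[key] = value
        self.data.move_to_end(key)
        if len(self.data) > self.max_entries:
            self.data.popitem(last=False)  # evict the least-recently-used entry
```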
Phase 3: Semantic Caching (Week 3–4)
- [ ] Choose embedding model (text-embedding-3-small)
- [ ] Set up vector database (Pinecone or self-hosted Qdrant)
- [ ] Implement similarity search
- [ ] Tune threshold (start at 0.95)
- [ ] Implement request coalescing to prevent stampedes
- [ ] Measure latency improvement
Phase 4: Monitoring and Tuning (Ongoing)
- [ ] Log cache hits, misses, and latencies
- [ ] Monitor API costs and compare to baseline
- [ ] Set up alerts for cache hit rate drops (may indicate data drift)
- [ ] Quarterly threshold tuning based on user feedback
Common Pitfalls and How to Avoid Them
Pitfall 1: Caching Stale Data
Problem: You cache a response, but the underlying data changes. Users see outdated information.
Solution: Use event-based invalidation. When your knowledge base updates, clear affected cache entries. Tag cache entries with version numbers.
Pitfall 2: Threshold Too Low
Problem: Your semantic cache threshold is 0.90, so "What's the weather?" returns a cached response about stock prices (both are financial queries, high similarity).
Solution: Start conservative (0.97), measure false positives, and lower gradually. For critical applications, use 0.96 or higher.
Pitfall 3: Cache Stampede
Problem: A popular cached response expires. 100 requests arrive. 100 API calls go out.
Solution: Implement request coalescing. Only the first request calls the API; others wait and share the result.
Pitfall 4: Unbounded Cache Growth
Problem: You cache everything. Your vector database grows to 10GB. Lookups slow down.
Solution: Set TTLs, implement cache eviction (LRU), and monitor cache size. For semantic caching, periodically prune old entries.
Pitfall 5: Ignoring Embedding Costs
Problem: You save $300/day on LLM API calls but spend $200/day on embedding API calls. Net savings: $100/day.
Solution: Use local embedding models (all-MiniLM-L6-v2) for non-critical applications, or batch embeddings to amortise costs. For high-volume systems, the embedding cost is negligible compared to LLM costs.
Measuring Success: Metrics That Matter
When you deploy caching, measure these metrics:
Cache hit rate: Percentage of requests served from cache. Target: 50–70% for semantic caching in mature systems.
Cost per request: Track this before and after caching. For our earlier example, it dropped from $0.048 to $0.019 per request (60% reduction).
P95 latency: The latency at the 95th percentile. Caching should cut this dramatically. From 1,500ms to 100ms is typical.
False positive rate: Percentage of cached responses that were semantically incorrect. Target: <1% for production systems. If this rises above 2%, tighten your threshold.
Cache size: Total size of cached data. Monitor this to catch unbounded growth.
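A hit-rate counter is a few lines; something like this minimal sketch, wired into every cache lookup, gives you the first metric for free:

```python
from dataclasses import dataclass

@dataclass
class CacheMetrics:
    hits: int = 0
    misses: int = 0

    def record(self, hit: bool) -> None:
        if hit:
            self.hits += 1
        else:
            self.misses += 1

    @property
    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0
```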
Conclusion: Caching as a First-Class Performance Tool
Caching for LLM applications is not an afterthought—it's a first-class performance lever. Teams that ship production AI systems treat caching as a core architectural decision, not an optimisation to bolt on later.
The three-layer approach—prompt caching for deterministic inputs, response caching for common queries, and semantic caching for intent-based matching—gives you 40–80% cost reductions and 250x latency improvements. These aren't theoretical numbers; they're what production teams see in the field.
Start with exact-match prompt caching (simple, immediate wins), layer on response caching (low complexity), and add semantic caching when you need to catch paraphrases and variations. Implement request coalescing to prevent stampedes, tune your similarity threshold based on user feedback, and monitor cache hit rate as a key metric.
At Brightlume, we've shipped caching strategies for teams moving AI pilots to production. The teams that get caching right early—before they scale to 10,000 requests per day—save months of optimisation work later. If you're building production LLM applications and want to move fast without burning through API budgets, caching is where you start.
The engineering is straightforward. The impact is massive. Get it right.