
Vector Databases in 2026: Pinecone, Weaviate, pgvector, and the New Contenders

Compare Pinecone, Weaviate, pgvector, Milvus, Qdrant, and Redis for production RAG. Architectures, latency, cost, and deployment patterns for AI engineers.

By Brightlume Team


If you're shipping a retrieval-augmented generation (RAG) system or any AI agent that needs to search semantic embeddings at scale, you're making a vector database decision right now. The landscape has shifted dramatically. What was a greenfield choice in 2023—Pinecone or nothing—is now a crowded field with real trade-offs: managed convenience versus operational control, latency versus cost, SQL integration versus purpose-built speed.

This isn't a vendor comparison. This is an engineering guide. We'll walk through the architectures, production constraints, and specific trade-offs that matter when you're moving from pilot to production. At Brightlume, we ship production AI in 90 days. Vector database selection is one of the first technical decisions we make with engineering teams, and it cascades through your entire data pipeline, inference latency, and operational overhead.

Let's get concrete about what's actually changed in 2026, why it matters for your deployment, and how to pick the right tool without betting your production reliability on hype.

What Vector Databases Actually Do

A vector database is a specialised system that stores numerical representations of text, images, or other data—embeddings—and retrieves them based on semantic similarity. Unlike traditional databases that return exact matches, vector databases find the closest neighbours in high-dimensional space.

When you embed a user query into a 1,536-dimensional space (OpenAI's text-embedding-3-small), you need to find the closest vectors from a corpus of millions. Computing the exact distance to every vector at query time is prohibitively expensive. Vector databases use approximate nearest neighbour (ANN) algorithms—hierarchical navigable small world (HNSW) graphs, product quantisation, or locality-sensitive hashing—to return results in milliseconds instead of seconds.
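To make the cost concrete, here is a minimal exact-search baseline in NumPy—the brute-force scan that ANN indices exist to avoid. The corpus size and dimensionality are illustrative; every query pays O(n·d):

```python
import numpy as np

def exact_top_k(query: np.ndarray, corpus: np.ndarray, k: int) -> np.ndarray:
    """Exact nearest neighbours by cosine similarity: O(n*d) per query."""
    q = query / np.linalg.norm(query)
    c = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    sims = c @ q                      # one dot product per corpus vector
    return np.argsort(-sims)[:k]     # indices of the k most similar vectors

rng = np.random.default_rng(0)
corpus = rng.standard_normal((100_000, 128)).astype(np.float32)
query = corpus[42] + 0.01 * rng.standard_normal(128).astype(np.float32)
top = exact_top_k(query, corpus, k=5)   # vector 42 should rank first
```

At 100k vectors this is fine; at hundreds of millions, the full scan per query is exactly what ANN indices amortise away, at the cost of approximate results.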

In a typical RAG pipeline:

  1. The user's query gets embedded (5–50ms with modern APIs)
  2. That embedding searches your vector database (10–100ms depending on scale and infrastructure)
  3. The top-k results are retrieved and sent as context to your LLM
  4. The LLM generates a response using both the query and retrieved context

The vector database is the bottleneck. If it's slow, your entire agent latency suffers. If it's expensive, your cost-per-inference scales with every search.
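The four steps above can be sketched as a pipeline skeleton. The `embed`, `search`, and `generate` callables stand in for your embedding API, vector database client, and LLM; the toy stand-ins below exist only so the skeleton runs without any external service:

```python
from typing import Callable, List

def rag_answer(
    question: str,
    embed: Callable[[str], List[float]],               # step 1: query -> embedding
    search: Callable[[List[float], int], List[str]],   # step 2: vector DB top-k
    generate: Callable[[str, List[str]], str],         # steps 3-4: LLM with context
    k: int = 3,
) -> str:
    vector = embed(question)             # 5-50ms with a hosted embedding API
    passages = search(vector, k)         # 10-100ms: the step this article is about
    return generate(question, passages)  # answer grounded in retrieved context

# Toy stand-ins so the skeleton runs end to end:
docs = {"billing": [1.0, 0.0], "shipping": [0.0, 1.0]}
answer = rag_answer(
    "Where is my parcel?",
    embed=lambda q: [0.1, 0.9],
    search=lambda v, k: sorted(
        docs, key=lambda d: -sum(a * b for a, b in zip(docs[d], v))
    )[:k],
    generate=lambda q, ps: f"Based on {ps[0]}: ...",
)
```

Whatever database you choose plugs in behind `search`; the rest of the pipeline doesn't change, which is what makes the selection decision isolable.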

Pinecone: Managed Simplicity with Trade-Offs

Pinecone remains the default choice for teams that want to avoid infrastructure entirely. It's a fully managed service: you send embeddings, you query, Pinecone handles scaling, indexing, and availability.

Strengths:

  • Zero operations overhead: No infrastructure provisioning, no index tuning, no replication management. This matters for teams with no database expertise.
  • Metadata filtering: Pinecone supports filtering on metadata fields alongside vector similarity, reducing post-query filtering overhead.
  • Predictable pricing: Per-index pricing with a monthly fee, not per-query. If you're running high-volume inference, the cost per search is fixed.
  • Hybrid search: Pinecone's Hybrid Search combines dense vector retrieval with sparse (keyword-based) retrieval, improving recall for queries where exact terminology matters.

Constraints:

  • Latency floor: Pinecone's latency is bounded by network round-trip time. Even with optimal indexing, you're looking at 50–150ms per query depending on geography and load. For sub-50ms latency requirements, Pinecone is off the table.
  • Data residency and compliance: Pinecone's infrastructure is primarily US-based. If you're in healthcare or finance with strict data locality requirements, this is a blocker. Australian organisations processing sensitive data often can't use it.
  • Vendor lock-in: Your embeddings and indices live in Pinecone's system. Migrating to another platform requires exporting, re-embedding, and re-indexing—a non-trivial operation at scale.
  • Cost at scale: Pinecone's per-index pricing becomes expensive when you're running dozens of indices or high-dimensional embeddings. A single index with 10M vectors can cost $1,000–$2,000/month depending on dimensionality and SLA tier.

When to use Pinecone: You're a startup or small team without database infrastructure expertise, you need to ship fast, latency tolerance is >100ms, and data residency isn't a constraint.
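As a sketch of the metadata-filtering strength above, here is roughly the shape of a filtered Pinecone query. The index name and metadata fields (`tenant_id`, `doc_type`) are hypothetical; the payload is built as plain data so the structure is visible:

```python
# Hypothetical metadata schema, for illustration only.
query = {
    "vector": [0.1] * 1536,            # stand-in for a real query embedding
    "top_k": 5,
    "include_metadata": True,
    "filter": {                        # evaluated alongside the vector search
        "tenant_id": {"$eq": "acme"},
        "doc_type": {"$in": ["faq", "policy"]},
    },
}
# With the pinecone client, this would be sent roughly as:
#   index = Pinecone(api_key="...").Index("docs")
#   results = index.query(**query)
```

Filtering server-side like this avoids pulling back irrelevant matches and filtering in your application, which matters at high query volumes.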

Weaviate: Open Source with Managed Options

Weaviate is an open-source vector database that can run on your infrastructure or as a managed cloud service. It's gained significant traction because it bridges the gap between fully managed and fully self-hosted.

Strengths:

  • Hybrid search out of the box: Weaviate combines vector search with keyword search, BM25 ranking, and semantic re-ranking in a single query. This is more sophisticated than Pinecone's approach and reduces the need for external search infrastructure.
  • GraphQL API: Weaviate uses GraphQL for queries, which is powerful for complex retrieval patterns but has a steeper learning curve than REST.
  • Schema flexibility: You define data types, relationships, and cross-references explicitly. This is verbose but gives you control over data structure.
  • Self-hosted option: Run Weaviate on Kubernetes, Docker, or bare metal. You control the infrastructure, latency, and data location.
  • Generative search: Weaviate can integrate with LLMs to generate results on-the-fly, reducing the need for separate orchestration layers.

Constraints:

  • Operational complexity: Self-hosting Weaviate requires Kubernetes expertise, backup strategies, and index management. The managed cloud option exists but is less mature than Pinecone.
  • Memory overhead: Weaviate holds indices in memory for speed. A 50M vector index with metadata can require 100GB+ of RAM. Scaling horizontally is possible but operationally complex.
  • Query language learning curve: GraphQL is powerful but unfamiliar to teams used to SQL or REST APIs. Your application code needs to construct GraphQL queries, which adds complexity.
  • Latency variability: On self-hosted Weaviate, latency depends entirely on your infrastructure. Poor resource allocation or network contention can spike query times unpredictably.

When to use Weaviate: You need hybrid search, you have Kubernetes infrastructure already, you want to avoid vendor lock-in, or you have strict data residency requirements and can self-host.
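Weaviate's hybrid search is exposed through GraphQL. A minimal sketch, assuming a hypothetical `Article` class; `alpha` blends the keyword and vector signals:

```python
# Hypothetical class name (Article) and query text, for illustration.
hybrid_query = """
{
  Get {
    Article(
      hybrid: { query: "refund policy", alpha: 0.5 }
      limit: 5
    ) {
      title
      _additional { score }
    }
  }
}
"""
# alpha blends the two signals: 0 = pure BM25 keyword search, 1 = pure vector search.
```

This is the learning-curve trade-off in miniature: a single query expresses what would take two systems and a fusion step elsewhere, but your application code has to build GraphQL strings.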

pgvector: SQL Integration Without Separate Infrastructure

pgvector is a PostgreSQL extension that adds vector similarity search directly to your existing Postgres database. This is a fundamentally different approach: instead of a separate vector database, you embed vector search into your relational database.

Strengths:

  • ACID compliance: Your vectors and relational data stay transactionally consistent. No separate sync between a vector database and your primary database.
  • Existing infrastructure: If you're already running Postgres (and most teams are), pgvector requires no new infrastructure. It's an extension install.
  • SQL simplicity: Vector queries are SQL queries. No new query language, no API wrapper. Your engineers already know how to write this.
  • Cost efficiency: You're not paying for a separate vector database service. Postgres licensing and hosting are often already budgeted.
  • Full-text and vector in one query: Combine vector similarity with relational filtering, full-text search, and joins in a single SQL statement.

Constraints:

  • Latency not optimised for vectors: Postgres is optimised for transactional workloads, not similarity search at massive scale. A billion-vector search will be slower on pgvector than on a purpose-built vector database.
  • Index memory: pgvector uses HNSW or IVFFlat indices. These live in memory and require careful tuning. For very large indices, memory becomes a constraint.
  • Scaling complexity: Scaling Postgres for vector search is non-trivial. Read replicas help, but you're still bounded by single-node index memory.
  • No distributed search: pgvector doesn't distribute vectors across multiple nodes. Scaling means vertical scaling (bigger machines) or sharding at the application layer.

When to use pgvector: You have modest vector volumes (<100M vectors), you're already running Postgres, you need transactional consistency between vectors and relational data, or you want to minimise operational complexity and cost.
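The "full-text and vector in one query" strength is easiest to see in SQL. A sketch assuming a hypothetical `docs` table with a `tenant_id` column and an HNSW index; in pgvector, `<=>` is cosine distance:

```python
# Hypothetical schema, for illustration:
#   CREATE EXTENSION IF NOT EXISTS vector;
#   CREATE TABLE docs (id bigserial PRIMARY KEY, tenant_id text,
#                      body text, embedding vector(1536));
#   CREATE INDEX ON docs USING hnsw (embedding vector_cosine_ops);
sql = """
SELECT id, body, embedding <=> %(q)s AS distance   -- <=> is cosine distance
FROM docs
WHERE tenant_id = %(tenant)s                       -- relational filter, same query
ORDER BY embedding <=> %(q)s
LIMIT 10;
"""
# With psycopg this runs as roughly:
#   cur.execute(sql, {"q": query_embedding, "tenant": "acme"})
```

Because the filter and the similarity search sit in one planner, there is no cross-system consistency problem—the trade-off is that the planner wasn't built for billion-vector workloads.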

Milvus: Open Source, Distributed, Production-Grade

Milvus is an open-source vector database designed from the ground up for distributed, large-scale deployments. It's used in production by major tech companies and is often the choice when you need both performance and operational control.

Strengths:

  • Distributed architecture: Milvus separates compute, storage, and coordination layers. You can scale each independently, making it genuinely scalable to billions of vectors.
  • Multiple index types: HNSW, IVF_FLAT, ScaNN, and DiskANN. You choose the index based on your latency and memory constraints.
  • Kubernetes-native: Milvus is designed for Kubernetes. Helm charts, operators, and cloud-native deployments are first-class citizens.
  • Cost-effective at scale: Because it's open source and doesn't require proprietary infrastructure, Milvus is cheaper than managed services at high volumes.
  • Flexible storage backends: Milvus supports MinIO, S3, and GCS for object storage. Vectors don't need to live in memory; they can be paged from cheaper storage.

Constraints:

  • Operational overhead: Running Milvus requires Kubernetes, distributed systems knowledge, and ongoing maintenance. You're responsible for backups, upgrades, and monitoring.
  • Ecosystem maturity: Milvus is mature but smaller than Postgres or Elasticsearch. Fewer third-party integrations, fewer Stack Overflow answers.
  • Latency tuning complexity: With multiple index options and distributed nodes, tuning Milvus for your latency requirements is non-trivial. Wrong configuration can cause 10x latency differences.
  • Metadata filtering performance: Milvus's metadata filtering isn't as optimised as purpose-built vector databases. Complex filters can slow queries significantly.

When to use Milvus: You're deploying billions of vectors, you have Kubernetes infrastructure, you want open-source control, or you're willing to invest in operational complexity for cost savings.
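As a sketch of the index-type flexibility above: Milvus exposes HNSW tuning knobs (`M` and `efConstruction` at build time, `ef` at query time). The collection name and parameter values below are illustrative, not recommendations:

```python
# Parameter values are illustrative; tune against your own latency/recall targets.
index_params = {
    "index_type": "HNSW",            # alternatives: IVF_FLAT, SCANN, DISKANN
    "metric_type": "COSINE",
    "params": {"M": 16, "efConstruction": 200},  # graph degree / build-time effort
}
search_params = {"metric_type": "COSINE", "params": {"ef": 64}}  # query-time recall knob
# With pymilvus this would be used roughly as:
#   client = MilvusClient(uri="http://localhost:19530")
#   hits = client.search("docs", data=[query_vec], limit=5, search_params=search_params)
```

These knobs are exactly where the "10x latency differences" mentioned above come from: a too-low `ef` is fast but misses neighbours, a too-high `M` inflates memory and build time.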

Qdrant: Rust Performance with Payload Filtering

Qdrant is a relatively new vector database written in Rust, designed for high-performance similarity search with sophisticated filtering. It's gaining traction in production deployments because it combines speed with usability.

Strengths:

  • Rust performance: Qdrant's Rust core keeps single-node latency at 10–30ms for large-scale searches, better than many alternatives.
  • Payload filtering: Qdrant's filtering is more sophisticated than most competitors. You can filter on nested structures, ranges, and complex conditions without post-query filtering.
  • Point-in-time snapshots: Qdrant supports snapshots for disaster recovery and point-in-time rollback. Useful for production reliability.
  • Distributed clustering: Qdrant supports clustering for high availability and scaling, with fewer operational complexities than Milvus.
  • Binary quantisation: Qdrant supports binary quantisation, reducing index memory by 32x while maintaining reasonable accuracy. This is crucial for cost efficiency at scale.

Constraints:

  • Smaller ecosystem: Qdrant is newer. Fewer integrations, smaller community, fewer production case studies.
  • Single-node scaling limitations: While clustering exists, Qdrant's scaling story is less mature than Milvus's. Very large deployments (billions of vectors) may hit scaling limits.
  • Memory overhead: Like most vector databases, Qdrant holds indices in memory. Binary quantisation helps, but it's still a constraint.

When to use Qdrant: You need low latency (<50ms), you want sophisticated filtering, you prefer Rust reliability, or you're evaluating newer technologies with strong fundamentals.
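Qdrant's payload filtering composes exact-match and range conditions in the search itself, without post-query filtering. A sketch of the REST-style request body, with hypothetical field names (`tenant_id`, `published_at`):

```python
# REST-style request body; field names are hypothetical.
search_body = {
    "vector": [0.1] * 768,             # stand-in for a real query embedding
    "limit": 5,
    "filter": {
        "must": [
            {"key": "tenant_id", "match": {"value": "acme"}},       # exact match
            {"key": "published_at", "range": {"gte": 1700000000}},  # range condition
        ]
    },
}
# POSTed roughly to a collection's points/search endpoint,
# or built with qdrant_client's Filter/FieldCondition models.
```

Because the filter is applied during index traversal rather than after retrieval, filtered queries don't need an inflated `top_k` to compensate for discarded results.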

Elasticsearch Vector Search: Search Engine Convergence

Elasticsearch has added vector search capabilities, blurring the line between search engines and vector databases. If you're already running Elasticsearch for full-text search, adding vector search is now a native option.

Strengths:

  • Unified search: Combine full-text, keyword, and vector search in one query. No separate vector database needed.
  • Existing infrastructure: Teams already running Elasticsearch for logs and metrics can reuse infrastructure.
  • Hybrid ranking: Elasticsearch's reciprocal rank fusion (RRF) combines vector and keyword ranking, often improving retrieval quality.
  • Operational familiarity: Your ops team already knows Elasticsearch. Vector search is just another index type.

Constraints:

  • Not optimised for pure vector workloads: Elasticsearch is a search engine first, vector database second. Pure vector search latency is typically slower than dedicated vector databases.
  • Cost: Elasticsearch's pricing is based on compute and storage. Vector search adds overhead. For pure vector workloads, a dedicated vector database is often cheaper.
  • Index memory: Like other systems, Elasticsearch holds indices in memory. Scaling to billions of vectors requires significant infrastructure.

When to use Elasticsearch: You're already running Elasticsearch, you need hybrid search, or you want to consolidate search and vector infrastructure.
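A sketch of a hybrid kNN-plus-keyword search body for Elasticsearch 8.x. Index and field names are hypothetical, and the exact RRF syntax varies by version:

```python
# Field names are hypothetical; RRF syntax differs across ES 8.x releases.
body = {
    "knn": {
        "field": "embedding",
        "query_vector": [0.1] * 384,   # stand-in for a real query embedding
        "k": 10,
        "num_candidates": 100,         # candidates per shard; raise for better recall
    },
    "query": {"match": {"body": "refund policy"}},  # keyword leg of the hybrid query
    "rank": {"rrf": {}},               # reciprocal rank fusion of the two legs
}
# Sent with the Elasticsearch Python client's search API.
```

`num_candidates` is the main recall/latency dial: it bounds how many vectors each shard examines before the top-k are fused with the keyword results.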

Redis as a Vector Cache: Low-Latency Semantic Caching

Redis has added vector search through its query engine (with RedisVL as the Python client layer), positioning itself as a low-latency vector cache rather than a primary vector database. This is a different use case: caching frequently accessed embeddings and search results.

Strengths:

  • Sub-millisecond latency: Redis is in-memory and optimised for speed. Vector searches return in 1–10ms, orders of magnitude faster than network-bound alternatives.
  • Semantic caching: Cache LLM responses based on embedding similarity. If a new query is semantically close to a cached query, return the cached response without calling the LLM.
  • Existing infrastructure: Teams already running Redis for caching can reuse infrastructure.
  • Cost-effective for warm data: Caching hot vectors in Redis is cheaper than querying a primary vector database repeatedly.

Constraints:

  • Not a primary database: Redis is memory-first. RDB snapshots and AOF persistence exist, but Redis isn't designed as a durable primary store. Use it for caching, not as your sole vector store.
  • Memory-bound scale: Redis runs in-memory. Caching billions of vectors is impractical. Use it for your hottest data.
  • No distributed querying: Redis clusters exist, but vector search across a cluster is complex. Single-node Redis is the typical deployment.

When to use Redis: You need semantic caching, you want sub-millisecond latency for frequently accessed queries, or you're building a two-tier system (Redis cache + primary vector database).
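The semantic-caching idea above reduces to a similarity-threshold check. A dependency-free sketch of the logic—in production the entries would live in Redis rather than a Python list, and the 0.95 threshold is an assumption you'd tune against false-hit rates:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

class SemanticCache:
    """If a new query embedding is close enough to a cached one, reuse the answer."""
    def __init__(self, threshold: float = 0.95):   # threshold is an assumption to tune
        self.threshold = threshold
        self.entries = []  # [(embedding, response)]; in production this lives in Redis

    def get(self, embedding):
        best = max(self.entries, key=lambda e: cosine(e[0], embedding), default=None)
        if best and cosine(best[0], embedding) >= self.threshold:
            return best[1]   # cache hit: skip the vector DB and LLM call entirely
        return None

    def put(self, embedding, response):
        self.entries.append((embedding, response))

cache = SemanticCache()
cache.put([1.0, 0.0], "Your parcel ships in 2 days.")
hit = cache.get([0.99, 0.05])   # semantically close -> hit
miss = cache.get([0.0, 1.0])    # unrelated -> miss (falls through to the primary store)
```

Set the threshold too low and users get answers to questions they didn't ask; too high and the cache never hits. Measure hit rate and answer quality together.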

TiDB Vector Search: SQL with Vector Integration

TiDB's vector search capabilities extend SQL-based databases with vector similarity search, similar to pgvector but at distributed scale. TiDB is a distributed SQL database, so vector search is integrated into a horizontally scalable system.

Strengths:

  • Distributed SQL: Scale vectors across multiple nodes while maintaining SQL compatibility and ACID guarantees.
  • Cost-effective scaling: Pay only for the compute and storage you use. Scaling is horizontal, not vertical.
  • Hybrid queries: Combine vector search with relational queries, aggregations, and joins in SQL.
  • Operational simplicity for teams with SQL expertise: If your team knows SQL, TiDB vector search is familiar.

Constraints:

  • Latency not optimised for pure vector workloads: TiDB is a SQL database first. Vector search latency is good but not as low as dedicated vector databases.
  • Smaller ecosystem: TiDB is less widely deployed than Postgres or Elasticsearch. Fewer integrations and community resources.
  • Learning curve for distributed systems: Operating TiDB requires understanding distributed SQL concepts. It's more complex than single-node Postgres.

When to use TiDB: You need distributed SQL with vector search, you're already using TiDB, or you want horizontal scaling without separate vector infrastructure.

Production Trade-Offs: Latency, Cost, and Operational Burden

Choosing a vector database isn't about picking the "best"—it's about optimising for your constraints. Here's how to think about the trade-offs:

Latency Requirements

If you need sub-50ms end-to-end latency for user-facing queries, your options narrow significantly:

  • Redis: 1–10ms (but only for cached data)
  • Qdrant: 10–50ms (single-node, optimised configuration)
  • pgvector: 20–100ms (depends on index size and hardware)
  • Milvus: 30–200ms (depends on configuration and scale)
  • Weaviate: 50–500ms (self-hosted, depends on infrastructure)
  • Elasticsearch: 50–500ms (depends on index size)
  • Pinecone: 50–150ms (network-bound, geographic latency)

For sub-50ms requirements, you're choosing between Redis (caching layer), Qdrant, or pgvector. Everything else is too slow.

Cost Structure

Vector database costs break down into:

  1. Query cost model: Pinecone charges per index, not per query. Elasticsearch and managed Weaviate charge by compute. Self-hosted options have infrastructure costs.
  2. Storage cost: Managed services charge by vector count. Self-hosted systems pay for infrastructure (servers, storage).
  3. Operational cost: Self-hosted systems require engineering time for maintenance, monitoring, and upgrades.

For a typical RAG system running 100 queries/second:

  • Pinecone: ~$1,000–$2,000/month (per-index pricing)
  • Managed Weaviate: ~$500–$1,500/month (compute-based)
  • pgvector on RDS: ~$300–$800/month (database instance cost)
  • Milvus on Kubernetes: ~$200–$500/month (infrastructure) + 100 hours/year operational overhead
  • Qdrant (self-hosted): ~$200–$400/month (infrastructure) + 50 hours/year operational overhead

At scale (1000 queries/second), self-hosted options become significantly cheaper, but operational overhead increases.

Operational Burden

Operational burden includes:

  • Deployment complexity: How many steps to get running?
  • Monitoring and alerting: What observability is built-in?
  • Scaling: How do you add capacity?
  • Backup and disaster recovery: How do you protect data?
  • Upgrades: How disruptive are new versions?

  • Lowest burden: Pinecone, managed Weaviate (you pay for convenience)
  • Medium burden: pgvector, Elasticsearch (familiar infrastructure)
  • Highest burden: self-hosted Milvus and Qdrant (full operational responsibility)

Picking Your Vector Database: A Decision Framework

Here's how we approach vector database selection at Brightlume when shipping production AI in 90 days:

Step 1: Latency requirements

If you need <50ms latency, start with Qdrant or pgvector. If you can tolerate 100–200ms, expand to Milvus or Weaviate.

Step 2: Data residency and compliance

If you need data in Australia or have strict locality requirements, self-hosted options (Milvus, Qdrant, Weaviate, pgvector) are mandatory. Pinecone is off the table.

Step 3: Operational capability

If you have no Kubernetes expertise and no database team, Pinecone or pgvector are realistic. If you have infrastructure expertise, Milvus or Qdrant are viable.

Step 4: Cost sensitivity

If you're cost-constrained at scale, self-hosted Milvus or Qdrant + infrastructure costs are cheaper than Pinecone. If you need predictable costs with no operational overhead, Pinecone wins.

Step 5: Hybrid search requirements

If you need keyword + vector search, Weaviate or Elasticsearch are stronger choices. If you need pure vector search, any option works.
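The five steps above can be encoded as a quick shortlist function. This is a deliberately crude sketch of the framework, not an exhaustive selector—step 4 (cost) acts as a tiebreaker rather than a filter:

```python
def shortlist(latency_ms: int, data_residency: bool, has_infra_team: bool, hybrid: bool):
    """Narrow the field using steps 1-5 above. Illustrative, not exhaustive."""
    candidates = {"Pinecone", "Weaviate", "pgvector", "Milvus", "Qdrant", "Elasticsearch"}
    if latency_ms < 50:
        candidates &= {"Qdrant", "pgvector"}               # step 1: latency floor
    if data_residency:
        candidates.discard("Pinecone")                     # step 2: must self-host
    if not has_infra_team:
        candidates -= {"Milvus", "Qdrant", "Weaviate"}     # step 3: ops capability
    if hybrid:
        candidates &= {"Weaviate", "Elasticsearch"}        # step 5: hybrid search
    # step 4 (cost) breaks ties among whatever survives
    return sorted(candidates)

fast_and_local = shortlist(latency_ms=40, data_residency=True,
                           has_infra_team=True, hybrid=False)
small_team_hybrid = shortlist(latency_ms=200, data_residency=False,
                              has_infra_team=False, hybrid=True)
```

The value of writing it down this way is that it makes disagreements explicit: if two stakeholders get different shortlists, they disagree about an input, not about databases.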

Architecture Patterns: Coupling Vector Databases to Your AI Pipeline

Your vector database doesn't exist in isolation. It's part of a larger system. Here are common production patterns:

Pattern 1: Embedded Vector Cache (Redis)

Use Redis for frequently accessed vectors, backed by a primary vector database.

  • Query flow: Check Redis → if miss, query primary database → cache in Redis
  • Cost: Redis holds hot data; primary database holds all data
  • Latency: Sub-millisecond for cached queries, 50–200ms for uncached
  • When to use: High query volume, skewed access patterns (80/20 rule), sub-millisecond latency required

Pattern 2: Transactional Vector Store (pgvector)

Use pgvector when vectors and relational data must stay in sync.

  • Query flow: Single SQL query combining vector search and relational filtering
  • Cost: Single database instance
  • Latency: 20–100ms depending on scale
  • When to use: Moderate vector volumes, strong consistency requirements, existing Postgres infrastructure

Pattern 3: Dedicated Vector Database (Qdrant, Milvus)

Use a dedicated vector database with separate relational storage.

  • Query flow: Vector search in dedicated database → join with relational data in application layer
  • Cost: Two systems to operate
  • Latency: 10–100ms for vector search, additional latency for joins
  • When to use: High vector volumes, need for independent scaling, latency-sensitive workloads

Pattern 4: Managed Simplicity (Pinecone)

Use Pinecone when operational overhead is the constraint.

  • Query flow: Send embedding to Pinecone → retrieve results → join with relational data
  • Cost: Predictable monthly fee
  • Latency: 50–150ms
  • When to use: Small team, no infrastructure expertise, data residency not a constraint

Evaluating Vector Databases: What to Test

Before committing to a vector database in production, test these metrics:

Latency at your scale: Run 1,000 queries against your actual dataset size. Measure p50, p95, and p99 latency. Latency distribution matters more than average.
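A minimal harness for the percentile measurement described above. The lognormal stand-in exists only so the snippet runs; point `run_query` at your real client:

```python
import random
import time

def percentiles(latencies_ms, ps=(50, 95, 99)):
    """Tail latency is what users feel; never report averages alone."""
    s = sorted(latencies_ms)
    return {p: s[min(len(s) - 1, int(len(s) * p / 100))] for p in ps}

def measure(run_query, n=1000):
    """Time n calls of run_query and return p50/p95/p99 in milliseconds."""
    samples = []
    for _ in range(n):
        t0 = time.perf_counter()
        run_query()                    # e.g. one search against your real index
        samples.append((time.perf_counter() - t0) * 1000)
    return percentiles(samples)

# Stand-in workload (skewed, like real latency); replace with measure(real_query):
random.seed(1)
stats = percentiles([random.lognormvariate(3.0, 0.5) for _ in range(1000)])
```

Run this against your production-sized dataset, not a toy index: ANN latency depends heavily on index size, so a 10k-vector benchmark tells you nothing about 10M.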

Index memory overhead: Calculate the memory required for your vectors. For 10M vectors with 1,536 dimensions: 10M × 1,536 × 4 bytes = ~60GB raw. With index overhead, expect 80–150GB depending on the system.
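The arithmetic above generalises to a one-liner. The 1.5× overhead multiplier is a rough assumption—actual HNSW overhead varies with graph parameters and metadata size:

```python
def index_memory_gb(n_vectors: int, dims: int, bytes_per_dim: int = 4,
                    overhead: float = 1.5) -> float:
    """Raw float32 storage plus a rough multiplier for index/graph overhead."""
    raw = n_vectors * dims * bytes_per_dim
    return raw * overhead / 1e9

raw_gb = index_memory_gb(10_000_000, 1536, overhead=1.0)   # raw vectors only
hnsw_gb = index_memory_gb(10_000_000, 1536, overhead=1.5)  # with assumed overhead
```

Halving dimensionality or quantising to int8 moves this number linearly, which is why quantisation support shows up repeatedly in the cost discussions above.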

Filtering performance: If you're filtering on metadata, test filtering performance. Some systems degrade significantly with complex filters.

Scaling behaviour: Test scaling. Add 10x more vectors and re-measure latency. Scaling should be linear or sub-linear, not exponential.

Operational overhead: Deploy the system, set up monitoring, simulate a failure, and measure recovery time. This reveals operational reality.

The 2026 Landscape: What's Changed

Compared to 2023, the vector database landscape has matured significantly:

  1. Specialisation is winning: Purpose-built vector databases (Qdrant, Milvus) are outperforming general-purpose systems (Elasticsearch) for pure vector workloads.

  2. Open source is viable in production: Milvus and Qdrant are production-ready alternatives to managed services, with lower total cost of ownership at scale.

  3. SQL integration is practical: pgvector and TiDB show that vector search in SQL databases is viable for moderate scales, simplifying architecture.

  4. Hybrid search is standard: Vector-only retrieval is being replaced by hybrid approaches combining vectors, keyword search, and metadata filtering.

  5. Latency is the new constraint: As RAG systems move from batch to real-time, sub-50ms vector search is becoming mandatory, not optional.

Conclusion: Make the Right Trade-Off

There's no universally "best" vector database. The right choice depends on your latency requirements, operational capability, cost constraints, and data residency needs.

  • If you need sub-50ms latency and have infrastructure expertise: Qdrant or pgvector
  • If you need distributed scale and can operate Kubernetes: Milvus
  • If you need simplicity and can tolerate 100ms+ latency: Pinecone
  • If you need hybrid search and have Elasticsearch: Extend it with vector capabilities
  • If you need sub-millisecond latency for hot data: Redis caching layer
  • If you need distributed SQL with vectors: TiDB

At Brightlume, we've deployed all of these in production. The decision isn't made in isolation—it cascades through your embedding pipeline, inference orchestration, and operational model. Get it right early, because migrating vector databases in production is painful.

Start with a 90-day proof of concept. Test your actual latency, cost, and operational requirements. Then commit. The vector database you choose today will shape your AI infrastructure for years.