Hybrid RAG
Dense + Sparse Retrieval for Domain-Specific LLMs
TL;DR
- Standard vector search fails on domain-specific terminology, exact product codes, and acronyms your embedding model has never seen.
- Hybrid retrieval runs dense (vector) and sparse (BM25) search in parallel and merges results via Reciprocal Rank Fusion.
- Typical accuracy improvement over pure vector search: 15–35% on domain-specific benchmarks.
- Works natively in Weaviate and Elasticsearch; in pgvector, pair dense search with PostgreSQL full-text search and merge in application code.
Why Standard Vector Search Fails in Enterprise RAG
Standard RAG architecture uses a single retrieval method: dense vector similarity search. You embed the query with a neural model, find the nearest document chunks in the vector space, and pass them to the LLM as context. This works well for general knowledge questions, but breaks down in two common enterprise scenarios:
- Domain-specific terminology — technical terms, acronyms, and product codes that the embedding model has seen infrequently or never during pretraining are poorly represented in vector space. A query for "FHIR R4 QuestionnaireResponse resource validation" may retrieve conceptually related chunks about healthcare data standards rather than the specific document about FHIR R4 schema validation.
- Exact match requirements — contract clauses, regulatory article references, part numbers, and version identifiers require precise keyword matching. Embedding models optimise for semantic similarity, not exact string matching. A query for "Section 4.2.1(b) of the Master Services Agreement" should return that exact clause — not conceptually similar contract language.
Hybrid retrieval solves both failure modes by running two complementary retrieval strategies in parallel and intelligently merging their results.
Dense vs Sparse Retrieval: The Core Distinction
| Dimension | Dense Retrieval | Sparse Retrieval (BM25) |
|---|---|---|
| Representation | High-dimensional float vectors (768–3072 dims) | Sparse term frequency vectors |
| Similarity measure | Cosine / dot product | BM25 term frequency–inverse document frequency |
| Strengths | Semantic understanding, synonym handling, concept matching | Exact term matching, rare words, proper nouns, codes |
| Weaknesses | Poor on OOV terms, exact matches, domain jargon | No semantic understanding; misses paraphrases |
| Storage | Vector database (Pinecone, Weaviate, pgvector) | Inverted index (Elasticsearch, OpenSearch, BM25Okapi) |
| Best models | OpenAI text-embedding-3, Cohere embed-v3, BGE-M3 | BM25Okapi, SPLADE, BM25+ |
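The sparse side of the table can be sketched in a few lines of pure Python. Whitespace tokenization and the parameter defaults k1 = 1.5, b = 0.75 are simplifying assumptions here, not a production implementation — libraries such as rank_bm25 or Elasticsearch handle this in practice:

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Score each tokenized doc against query_terms with Okapi-style BM25."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    # document frequency: number of docs containing each term
    df = Counter()
    for d in docs:
        for t in set(d):
            df[t] += 1
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for t in query_terms:
            if t not in tf:
                continue
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            score += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(score)
    return scores

docs = [
    "the fhir r4 questionnaireresponse resource defines schema validation".split(),
    "healthcare data standards enable interoperability across systems".split(),
]
print(bm25_scores("fhir r4 questionnaireresponse".split(), docs))
```

Note how the document containing the exact rare terms scores well above the merely topical one — exactly the behaviour dense retrieval struggles to replicate for out-of-vocabulary terms.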
Hybrid RAG Architecture
The hybrid retrieval pipeline runs both systems on every query and merges results before passing context to the LLM:
- Query preprocessing — clean and normalise the query; optionally expand it with HyDE (Hypothetical Document Embeddings) for dense retrieval
- Parallel retrieval — run dense vector search (top-k = 20) and sparse BM25 search (top-k = 20) simultaneously
- Reciprocal Rank Fusion (RRF) — merge the two ranked lists into a single ranked list using RRF scoring
- Reranking — optionally apply a cross-encoder reranker (Cohere Rerank, BGE reranker) to the top-20 merged results to produce the final top-5 context chunks
- Context assembly — pass the top-5 chunks to the LLM with source citations for faithfulness tracking
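The pipeline above can be sketched as a single orchestration function. The `dense_search` and `sparse_search` callables stand in for your actual vector-database and BM25 clients (assumptions, not real APIs), and the two calls could run concurrently in production:

```python
def hybrid_retrieve(query, dense_search, sparse_search,
                    k=60, top_k=20, final_k=5, reranker=None):
    """Run dense and sparse retrieval, fuse with RRF, optionally rerank."""
    dense_hits = dense_search(query, top_k)    # ranked list of doc ids
    sparse_hits = sparse_search(query, top_k)  # ranked list of doc ids

    # Reciprocal Rank Fusion: each list contributes 1/(k + rank) per doc
    scores = {}
    for hits in (dense_hits, sparse_hits):
        for rank, doc_id in enumerate(hits, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)

    fused = sorted(scores, key=scores.get, reverse=True)[:top_k]
    if reranker is not None:
        fused = reranker(query, fused)
    return fused[:final_k]
```

Documents found by both retrievers accumulate two score contributions, so they naturally rise to the top of the fused list.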
Reciprocal Rank Fusion: Implementation Detail
RRF is the preferred fusion method because it is robust to score scale differences between retrieval systems and does not require calibration.
The RRF score for a document d across retrieval systems is:

RRF(d) = Σ_i 1 / (k + rank_i(d))

where k = 60 (the standard constant) and rank_i(d) is the rank of document d in retrieval system i. Documents not present in a system's results are assigned rank = ∞ (score contribution = 0).
- A document ranked #1 in dense and #1 in sparse: score = 1/61 + 1/61 = 0.0328
- A document ranked #1 in dense only: score = 1/61 = 0.0164
- A document ranked #5 in both: score = 1/65 + 1/65 = 0.0308
Documents that appear in both result sets consistently outscore documents that appear in only one — even if their individual ranks are lower. This is the mathematical property that makes hybrid retrieval more robust than either method alone.
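The worked examples above reduce to a one-line scoring function:

```python
def rrf_score(ranks, k=60):
    """RRF contribution: sum 1/(k + rank) over the systems where the doc appears."""
    return sum(1.0 / (k + r) for r in ranks)

print(round(rrf_score([1, 1]), 4))  # ranked #1 in dense and #1 in sparse: 0.0328
print(round(rrf_score([1]), 4))     # ranked #1 in dense only: 0.0164
print(round(rrf_score([5, 5]), 4))  # ranked #5 in both: 0.0308
```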
Implementation by Vector Database
- Weaviate: Native hybrid search via the `hybrid` query parameter with configurable alpha (0 = pure BM25, 1 = pure vector, 0.5 = balanced). Uses its own fusion algorithm but is equivalent to RRF in practice.
- Elasticsearch / OpenSearch: Use the `knn` query for dense retrieval combined with `match` or `multi_match` for sparse; merge with RRF using the `rrf` retriever (Elasticsearch 8.9+).
- pgvector: Run dense search via pgvector and sparse search via PostgreSQL full-text search (`tsvector`/`tsquery`); merge results in application code using RRF.
- Pinecone: Supports hybrid search natively with SPLADE sparse vectors; index both dense and sparse representations at document ingestion time.
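For Elasticsearch, the hybrid query is a single request body combining a `knn` retriever with a standard `match` query under the `rrf` retriever. This is an illustrative sketch — the field names `text` and `embedding`, the 768-dim zero vector, and the index layout are assumptions for your own mapping:

```python
# Sketch of an Elasticsearch 8.9+ hybrid search request body (assumed schema).
query_vector = [0.0] * 768  # placeholder: your query embedding goes here

es_request = {
    "retriever": {
        "rrf": {
            "retrievers": [
                # sparse leg: BM25 over the text field
                {"standard": {"query": {"match": {"text": "FHIR R4 QuestionnaireResponse validation"}}}},
                # dense leg: approximate kNN over the embedding field
                {"knn": {"field": "embedding", "query_vector": query_vector,
                         "k": 20, "num_candidates": 100}},
            ],
            "rank_constant": 60,      # the RRF k constant
            "rank_window_size": 20,   # how many hits each leg contributes to fusion
        }
    }
}
```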
When to Add a Reranker
Retrieval and reranking are separate stages optimising for different objectives. Retrieval optimises for recall — getting the relevant documents into the candidate set. Reranking optimises for precision — ordering that candidate set so the most relevant chunks appear first.
Add a cross-encoder reranker when: your context window is limited and you can only pass 3–5 chunks to the LLM; your domain requires fine-grained relevance judgments; or precision metrics (answer correctness, faithfulness) are consistently low despite good retrieval recall.
Reranking adds latency (typically 100–400ms for a top-20 candidate set). For latency-sensitive applications, consider caching reranked results for common queries or using a lighter distilled reranker model.
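Structurally, reranking is just scoring (query, chunk) pairs and reordering. The `score_pairs` callable below is a hypothetical stand-in for a real cross-encoder such as Cohere Rerank or a BGE reranker; the word-overlap stub exists only to make the sketch runnable:

```python
def rerank(query, chunks, score_pairs, final_k=5):
    """Reorder candidate chunks by cross-encoder relevance score; keep top final_k."""
    scores = score_pairs([(query, c) for c in chunks])
    ranked = sorted(zip(chunks, scores), key=lambda cs: cs[1], reverse=True)
    return [c for c, _ in ranked[:final_k]]

def overlap_scorer(pairs):
    """Toy scorer: query-chunk word overlap. A real system uses a neural cross-encoder."""
    return [len(set(q.lower().split()) & set(c.lower().split())) for q, c in pairs]
```

Swapping `overlap_scorer` for a batched model call is the only change needed for production; the retrieval stage stays untouched.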
Measuring Retrieval Quality
Before and after implementing hybrid retrieval, measure these metrics on a held-out evaluation set:
- Recall@k — what percentage of relevant documents appear in the top-k retrieved results
- MRR (Mean Reciprocal Rank) — average reciprocal rank of the first relevant document across queries
- RAGAS Faithfulness — what percentage of LLM claims are supported by the retrieved context
- RAGAS Answer Relevancy — how relevant the final answer is to the original question
Build this evaluation set from real user queries in your domain — at least 100 query–answer pairs annotated by domain experts. Synthetic evaluation sets consistently overestimate real-world performance.
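The two retrieval-level metrics are straightforward to compute yourself (the RAGAS metrics require an LLM judge and the ragas library). A minimal sketch:

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of relevant doc ids that appear in the top-k retrieved list."""
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)

def mean_reciprocal_rank(runs):
    """runs: list of (retrieved_list, relevant_set) pairs, one per query."""
    total = 0.0
    for retrieved, relevant in runs:
        for rank, doc in enumerate(retrieved, start=1):
            if doc in relevant:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(runs)
```

Run both metrics over the same evaluation set before and after switching to hybrid retrieval so the comparison isolates the retrieval change.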
Frequently Asked Questions
What is the difference between dense and sparse retrieval in RAG?
Dense retrieval uses neural embedding models to convert text into high-dimensional vectors, enabling semantic similarity search — it finds documents that mean the same thing even if they use different words. Sparse retrieval (BM25/TF-IDF) uses exact term frequency matching — it excels at finding documents containing specific technical terms, product codes, or proper nouns. Dense retrieval understands intent; sparse retrieval finds exact matches. Each has failure modes the other covers.
When should I use hybrid RAG instead of standard vector search?
Use hybrid RAG when: (1) your corpus contains domain-specific terminology, acronyms, or product codes that embedding models may not represent well; (2) users ask queries mixing semantic intent with exact term requirements (e.g. "find all contracts mentioning SOC2 compliance and data residency"); (3) retrieval precision is more important than recall; (4) you are seeing high hallucination rates in a standard RAG system. If standard vector search is returning irrelevant chunks, hybrid retrieval almost always improves accuracy.
What is Reciprocal Rank Fusion (RRF)?
Reciprocal Rank Fusion is a rank aggregation algorithm that combines ranked lists from multiple retrieval systems into a single merged ranking. For each document, RRF calculates a score based on its rank position in each list using the formula: RRF(d) = sum(1 / (k + rank(d))) where k is a constant (typically 60). Documents that rank highly in both dense and sparse retrieval receive the highest combined scores. RRF is preferred over score-based fusion because it is robust to score scale differences between retrieval systems.
What embedding models work best for domain-specific RAG?
General-purpose models like OpenAI text-embedding-3-large or Cohere embed-english-v3 work well as starting points. For domain-specific applications, consider: fine-tuned BERT variants for medical or legal text, domain-adapted sentence transformers trained on your specific corpus, or BGE-M3 which supports multi-lingual and multi-granularity retrieval. The key signal is retrieval recall on a held-out evaluation set from your actual domain — benchmark before committing to a model.