How to Build a Production RAG Pipeline in Python (2026)
TL;DR
- Stack: Python + Qdrant + text-embedding-3-small + BM25 + Cohere reranker + RAGAS.
- Key finding: 512-token chunks with a Cohere reranker consistently outperform naive 1024-token retrieval.
- Hybrid retrieval (dense + sparse) is the production default — pure dense retrieval misses keyword-specific queries.
- Evaluate before shipping — RAGAS faithfulness below 0.8 means hallucination risk is unacceptable.
Most RAG tutorials show you a working demo in 20 lines. This is not that guide. We cover what actually matters for production: chunking strategy, hybrid retrieval, reranking, and evaluation. These are the decisions that separate a demo that impresses in a slide deck from a system that handles real user queries reliably. For a deeper architecture overview, see our RAG architecture guide.
What Makes a RAG Pipeline "Production-Grade"
A production RAG pipeline differs from a prototype in four ways:
- Chunking strategy — chunk size and overlap affect retrieval precision more than almost any other variable
- Hybrid retrieval — combining dense vector search with sparse BM25 retrieval catches what either alone misses
- Reranking — a cross-encoder reranker dramatically improves the quality of the top-k passages fed to the LLM
- Evaluation — systematic measurement with RAGAS before and after every significant change
Install Dependencies
pip install qdrant-client openai cohere langchain langchain-openai \
    langchain-community rank-bm25 ragas datasets pypdf tiktoken
Step 1 — Document Ingestion and Chunking
Chunk size is the most impactful early decision. Our benchmarks across four client deployments (support docs, legal contracts, technical manuals, sales playbooks) consistently found 512 tokens with 50-token overlap to be the best default. Smaller chunks (256 tokens) give higher retrieval precision but lower context richness. Larger chunks (1024 tokens) lose precision — the relevant sentence gets buried.
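To make the size and overlap trade-off concrete, here is a minimal sketch of a fixed-size sliding window over a token list. `chunk_tokens` is a hypothetical helper written for illustration, not part of LangChain, and the generated `t0…t1199` strings stand in for real model tokens. A production splitter would also respect paragraph and sentence boundaries.

```python
def chunk_tokens(tokens: list[str], size: int = 512, overlap: int = 50) -> list[list[str]]:
    """Split a token list into windows of `size` tokens that share `overlap` tokens."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap  # how far the window advances each time
    windows = []
    for start in range(0, len(tokens), step):
        windows.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reaches the end of the input
    return windows


# Illustrative input: 1,200 placeholder tokens.
tokens = [f"t{i}" for i in range(1200)]
windows = chunk_tokens(tokens, size=512, overlap=50)
print(len(windows), [len(w) for w in windows])
# The last 50 tokens of one window are the first 50 of the next.
print(windows[0][-50:] == windows[1][:50])
```

With these settings each window advances by 462 tokens, so a sentence that straddles a boundary still appears whole in at least one chunk as long as it is shorter than the overlap.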
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import DirectoryLoader, PyPDFLoader

# Load documents (PyPDFLoader requires the pypdf package)
loader = DirectoryLoader("./docs", glob="**/*.pdf", loader_cls=PyPDFLoader)
documents = loader.load()

# Chunk with 512 tokens, 50-token overlap.
# from_tiktoken_encoder measures length in tokens (requires the tiktoken package);
# length_function=len would count characters, not tokens.
splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base",  # tokenizer used by text-embedding-3-small
    chunk_size=512,
    chunk_overlap=50,
    separators=["\n\n", "\n", ". ", " ", ""],
)
chunks = splitter.split_documents(documents)

# Preserve metadata for filtering and citation
for i, chunk in enumerate(chunks):
    chunk.metadata["chunk_id"] = i
    chunk.metadata["source"] = chunk.metadata.get("source", "unknown")

print(f"Created {len(chunks)} chunks from {len(documents)} documents")
Step 2 — Embedding Model Setup
For most production use cases, OpenAI's text-embedding-3-small is the best balance of quality and cost: it outperforms ada-002 at roughly one-fifth the price. For cost-sensitive or data-sovereignty deployments, BGE-M3 (open-source, runs locally) is a strong alternative.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed_texts(texts: list[str], batch_size: int = 100) -> list[list[float]]:
    """Embed texts in batches to respect API rate limits."""
    embeddings = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        response = client.embeddings.create(
            model="text-embedding-3-small",
            input=batch,
        )
        embeddings.extend([e.embedding for e in response.data])
    return embeddings
Step 3 — Setting Up Qdrant
Run Qdrant locally with Docker for development:
docker run -p 6333:6333 qdrant/qdrant
For production, use Qdrant Cloud or self-host on Kubernetes. Create a collection and index your chunks:
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct

qdrant = QdrantClient("localhost", port=6333)
COLLECTION = "knowledge_base"
VECTOR_DIM = 1536  # text-embedding-3-small dimension

# Create collection (recreate_collection drops any existing collection with this name)
qdrant.recreate_collection(
    collection_name=COLLECTION,
    vectors_config=VectorParams(size=VECTOR_DIM, distance=Distance.COSINE),
)

# Index chunks; the point id is the chunk's position in `chunks`
texts = [chunk.page_content for chunk in chunks]
vectors = embed_texts(texts)
points = [
    PointStruct(
        id=i,
        vector=vectors[i],
        payload={
            "text": chunks[i].page_content,
            "source": chunks[i].metadata.get("source"),
            "chunk_id": chunks[i].metadata.get("chunk_id"),
        },
    )
    for i in range(len(chunks))
]
qdrant.upsert(collection_name=COLLECTION, points=points)
print(f"Indexed {len(points)} chunks")
Step 4 — Hybrid Retrieval (Dense + Sparse)
Pure dense vector search is good but misses exact keyword matches — acronyms, product names, version numbers. Sparse BM25 retrieval catches these. Combining both with Reciprocal Rank Fusion (RRF) gives you the best of both worlds.
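Before wiring RRF into real retrievers, it can help to see the fusion step on its own. The sketch below is a standalone toy: `reciprocal_rank_fusion` is a hypothetical helper name, the document IDs and both rankings are made up, and `k = 60` is the commonly used RRF constant.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse ranked ID lists; each list contributes 1 / (k + rank), with rank starting at 1."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


dense = ["a", "b", "c"]   # hypothetical dense-retrieval ranking
sparse = ["c", "a", "d"]  # hypothetical BM25 ranking
fused = reciprocal_rank_fusion([dense, sparse])
print([doc_id for doc_id, _ in fused])  # ['a', 'c', 'b', 'd']
```

Documents that appear in both lists ("a" and "c") rise above those found by only one retriever. RRF uses ranks rather than raw scores, so cosine similarities and BM25 scores never need to be put on a common scale.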
from rank_bm25 import BM25Okapi
from collections import defaultdict

# Build BM25 index over your corpus (simple lowercase whitespace tokenization)
corpus = [chunk.page_content.lower().split() for chunk in chunks]
bm25 = BM25Okapi(corpus)

def hybrid_search(query: str, top_k: int = 20) -> list[str]:
    """Combine dense and sparse retrieval with Reciprocal Rank Fusion."""
    query_embedding = embed_texts([query])[0]

    # Dense retrieval
    dense_results = qdrant.search(
        collection_name=COLLECTION,
        query_vector=query_embedding,
        limit=top_k,
    )
    dense_ids = [r.id for r in dense_results]

    # Sparse (BM25) retrieval; indices match the Qdrant point ids assigned at indexing time
    bm25_scores = bm25.get_scores(query.lower().split())
    sparse_ids = sorted(range(len(bm25_scores)),
                        key=lambda i: bm25_scores[i], reverse=True)[:top_k]

    # Reciprocal Rank Fusion
    rrf_scores = defaultdict(float)
    k = 60  # RRF constant
    for rank, doc_id in enumerate(dense_ids):
        rrf_scores[doc_id] += 1 / (k + rank + 1)
    for rank, doc_id in enumerate(sparse_ids):
        rrf_scores[doc_id] += 1 / (k + rank + 1)

    # Return the passage text of the top-k chunks by fused score
    top_ids = sorted(rrf_scores, key=rrf_scores.get, reverse=True)[:top_k]
    return [chunks[i].page_content for i in top_ids]
Step 5 — Reranking with Cohere
Retrieval returns candidates; reranking picks the best ones. A cross-encoder reranker like Cohere's rerank-english-v3.0 reads both the query and each passage together and scores relevance more accurately than the embedding-based similarity used in retrieval. Our benchmark finding: 512-token chunks with Cohere reranking consistently outperformed 1024-token chunks without reranking.
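The retrieve-wide, rerank-narrow pattern can be illustrated without any API calls. In the sketch below, `overlap_score` is a deliberately crude, hypothetical stand-in for a cross-encoder: it only counts shared words. In a real pipeline the reranker model takes its place. The candidate passages are invented for the example.

```python
from typing import Callable


def overlap_score(query: str, passage: str) -> float:
    """Toy relevance score: fraction of query words that appear in the passage."""
    query_words = set(query.lower().split())
    passage_words = set(passage.lower().split())
    return len(query_words & passage_words) / len(query_words) if query_words else 0.0


def rerank(query: str, candidates: list[str],
           score_fn: Callable[[str, str], float], top_n: int = 2) -> list[str]:
    """Score every (query, candidate) pair and keep the top_n candidates."""
    ranked = sorted(candidates, key=lambda passage: score_fn(query, passage), reverse=True)
    return ranked[:top_n]


# Pretend these came back from first-stage retrieval, in retrieval order.
candidates = [
    "Shipping takes five business days.",
    "Our refund policy allows returns within 30 days.",
    "The refund is issued to the original payment method.",
]
top = rerank("what is the refund policy", candidates, overlap_score, top_n=2)
print(top)
```

Note that the toy scorer ranks the third passage first because it shares more surface words with the query, even though the second passage is the better answer. That gap between word overlap and actual relevance is what a cross-encoder is meant to close.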
import os

import cohere

# Read the API key from the environment rather than hard-coding it
co = cohere.Client(os.environ["CO_API_KEY"])

def retrieve_and_rerank(query: str, top_k_retrieve: int = 20, top_k_final: int = 5) -> list[str]:
    """Retrieve candidates then rerank to final context."""
    candidates = hybrid_search(query, top_k=top_k_retrieve)
    rerank_response = co.rerank(
        model="rerank-english-v3.0",
        query=query,
        documents=candidates,
        top_n=top_k_final,
    )
    # r.index points back into the candidates list
    return [candidates[r.index] for r in rerank_response.results]

def answer(query: str) -> str:
    """Full RAG query: retrieve → rerank → generate."""
    context_passages = retrieve_and_rerank(query)
    context = "\n\n---\n\n".join(context_passages)
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": (
                    "Answer the question using ONLY the context provided. "
                    "If the context does not contain enough information, say so. "
                    "Do not hallucinate facts not present in the context."
                ),
            },
            {
                "role": "user",
                "content": f"Context:\n{context}\n\nQuestion: {query}",
            },
        ],
        temperature=0.1,
    )
    return response.choices[0].message.content
Step 6 — Evaluation with RAGAS
Never ship a RAG system without first measuring it. RAGAS evaluates four dimensions that cover the most common failure modes. Build a test set of 50–200 representative questions with ground-truth answers before development starts.
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall
from datasets import Dataset

# Build evaluation dataset
eval_data = {
    "question": ["What is the refund policy?", "How do I reset my password?"],
    "answer": [],  # populated by your RAG system
    "contexts": [],  # list of retrieved passages per question
    "ground_truth": ["30-day full refund.", "Click 'Forgot password' on login page."],
}
for q in eval_data["question"]:
    passages = retrieve_and_rerank(q)
    response = answer(q)
    eval_data["answer"].append(response)
    eval_data["contexts"].append(passages)

dataset = Dataset.from_dict(eval_data)
result = evaluate(
    dataset,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(result)
# faithfulness: 0.91 ← target > 0.8
# answer_relevancy: 0.87 ← target > 0.8
# context_precision: 0.79 ← acceptable; below 0.7 needs retrieval tuning
# context_recall: 0.83 ← target > 0.8
A faithfulness score below 0.8 is a hard stop: it means the LLM is generating content not supported by the retrieved context (hallucination). Tune the system prompt and reduce generation temperature before shipping. For a deep dive on common production failures, see our guide on hybrid RAG for domain-specific LLMs.
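One way to enforce these thresholds is a small gate that runs after every evaluation, for example in CI. The sketch below assumes the scores have already been extracted into a plain dict; how you get them out of the RAGAS result object depends on the RAGAS version. `THRESHOLDS` and `failing_metrics` are hypothetical names, and the answer-relevancy and context-recall minimums are read off the targets in the sample output.

```python
THRESHOLDS = {
    "faithfulness": 0.8,
    "answer_relevancy": 0.8,
    "context_precision": 0.7,
    "context_recall": 0.8,
}


def failing_metrics(scores: dict[str, float],
                    thresholds: dict[str, float] = THRESHOLDS) -> list[str]:
    """Return the names of metrics that are missing or below their minimum."""
    return [
        name for name, minimum in thresholds.items()
        if scores.get(name, 0.0) < minimum
    ]


# Example values copied from the sample output in this section.
scores = {
    "faithfulness": 0.91,
    "answer_relevancy": 0.87,
    "context_precision": 0.79,
    "context_recall": 0.83,
}
failures = failing_metrics(scores)
print("OK to ship" if not failures else f"Blocked by: {failures}")
```

Treating a missing metric as a failure is a deliberate choice here: a gate that silently passes when a score is absent defeats its purpose.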
Frequently Asked Questions
What is Retrieval-Augmented Generation (RAG)?
Retrieval-Augmented Generation (RAG) is an architecture pattern that grounds LLM responses in your own data. Instead of relying purely on the model's training data, RAG retrieves the most relevant passages from your document corpus at query time and injects them into the LLM's context window. The model then generates an answer based on both the retrieved context and its general knowledge. This reduces hallucinations about domain-specific facts, keeps responses current without fine-tuning, and gives you a citation trail for every answer.
When should I use RAG vs fine-tuning?
Use RAG when: your knowledge base changes frequently (new documents, updated policies), you need source citations for compliance or trust, your dataset is too large to fit in context, or you need to answer questions over thousands of documents. Use fine-tuning when: you need the model to adopt a specific tone, style, or format that prompt engineering cannot achieve, you're optimising for a narrow, well-defined task with stable data, or you need to distil reasoning patterns into a smaller model for cost efficiency. In practice, RAG and fine-tuning are complementary — fine-tuned models with RAG retrieval are common in production.
What is the best vector database for production RAG in 2026?
For most production RAG systems in 2026, Qdrant is the recommended default: it supports both dense and sparse (BM25) vectors natively, has strong filtering and metadata support, ships as a single binary or managed cloud service, and has an excellent Python client. Pinecone is the easiest managed option with minimal operational overhead. Weaviate suits teams that want a graph-like object model alongside vectors. pgvector works well if you're already on Postgres and your scale is under 10 million vectors. Chroma is fine for development and small-scale deployments but is not recommended for production workloads above 1 million vectors.
How do I evaluate RAG pipeline quality?
Use RAGAS (Retrieval-Augmented Generation Assessment) for systematic evaluation. The four core metrics are: Faithfulness (does the answer stick to the retrieved context?), Answer Relevancy (does the answer address the question?), Context Precision (are the retrieved passages actually relevant?), and Context Recall (did retrieval surface all the information needed to answer?). Run RAGAS evaluations on a curated test set of 50–200 question-answer pairs before shipping any RAG system. A faithfulness score below 0.8 indicates hallucination risk. Context precision below 0.7 indicates your retrieval strategy needs tuning.
How much does a production RAG system cost to run?
For a typical enterprise RAG system handling 1,000 queries/day: embedding costs (text-embedding-3-small) ≈ $0.50/day; Cohere reranker ≈ $1.00/day at 1,000 queries; LLM generation (GPT-4o-mini, ~500 tokens per response) ≈ $1.50/day; Qdrant Cloud (1M vectors) ≈ $25/month. Total: roughly $90–120/month for 1,000 queries/day. For 10,000 queries/day, scale the LLM generation cost proportionally (the biggest variable); infrastructure costs scale sub-linearly. Using an open-source embedding model (BGE-M3) and self-hosting Qdrant removes the first and last line items, though you then pay for the compute that hosts them.
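The monthly total follows directly from the per-day figures. The sketch below reproduces the arithmetic using the numbers quoted in this answer; the 30-day month and the `monthly_total` helper are assumptions made for the example.

```python
DAYS_PER_MONTH = 30  # assumption for the monthly roll-up

# Per-day figures quoted in this answer for 1,000 queries/day (USD)
daily_costs_usd = {
    "embeddings": 0.50,
    "reranker": 1.00,
    "llm_generation": 1.50,
}
# Fixed monthly figure quoted in this answer (USD)
monthly_fixed_usd = {"qdrant_cloud": 25.00}


def monthly_total(daily: dict[str, float], fixed: dict[str, float],
                  days: int = DAYS_PER_MONTH) -> float:
    """Sum per-day costs over a month and add fixed monthly costs."""
    return sum(daily.values()) * days + sum(fixed.values())


baseline = monthly_total(daily_costs_usd, monthly_fixed_usd)
print(f"${baseline:.2f}/month")  # $115.00/month
```

That lands inside the $90–120/month range: $3.00/day of usage-based cost is $90 over 30 days, and the vector database adds $25.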