RAG Systems | April 2026 | 18 min read

How to Build a Production RAG Pipeline in Python (2026)

TL;DR

  • Stack: Python + Qdrant + text-embedding-3-small + BM25 + Cohere reranker + RAGAS.
  • Key finding: 512-token chunks + Cohere reranker outperforms naive 1024-token retrieval consistently.
  • Hybrid retrieval (dense + sparse) is the production default — pure dense retrieval misses keyword-specific queries.
  • Evaluate before shipping — RAGAS faithfulness below 0.8 means hallucination risk is unacceptable.

Most RAG tutorials show you a working demo in 20 lines. This is not that guide. We cover what actually matters for production: chunking strategy, hybrid retrieval, reranking, and evaluation. These are the decisions that separate a demo that impresses in a slide deck from a system that handles real user queries reliably. For a deeper architecture overview, see our RAG architecture guide.

What Makes a RAG Pipeline "Production-Grade"

A production RAG pipeline differs from a prototype in four ways:

  • Chunking strategy — chunk size and overlap affect retrieval precision more than almost any other variable
  • Hybrid retrieval — combining dense vector search with sparse BM25 retrieval catches what either alone misses
  • Reranking — a cross-encoder reranker dramatically improves the quality of the top-k passages fed to the LLM
  • Evaluation — systematic measurement with RAGAS before and after every significant change

Install Dependencies

pip install qdrant-client openai cohere langchain langchain-openai \
            langchain-community langchain-text-splitters pypdf tiktoken \
            rank-bm25 ragas datasets

Step 1 — Document Ingestion and Chunking

Chunk size is the most impactful early decision. Our benchmarks across four client deployments (support docs, legal contracts, technical manuals, sales playbooks) consistently found 512 tokens with 50-token overlap to be the best default. Smaller chunks (256 tokens) give higher retrieval precision but lower context richness. Larger chunks (1024 tokens) lose precision — the relevant sentence gets buried.

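# RecursiveCharacterTextSplitter lives in the langchain-text-splitters package
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import DirectoryLoader, PyPDFLoader

# Load documents (PyPDFLoader needs the pypdf package)
loader = DirectoryLoader("./docs", glob="**/*.pdf", loader_cls=PyPDFLoader)
documents = loader.load()

# Chunk with 512 tokens, 50-token overlap.
# length_function=len would measure characters, not tokens, so count
# tokens with tiktoken instead (cl100k_base is the encoding used by
# text-embedding-3-small).
splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base",
    chunk_size=512,
    chunk_overlap=50,
    separators=["\n\n", "\n", ". ", " ", ""],
)
chunks = splitter.split_documents(documents)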

# Preserve metadata for filtering and citation
for i, chunk in enumerate(chunks):
    chunk.metadata["chunk_id"] = i
    chunk.metadata["source"] = chunk.metadata.get("source", "unknown")

print(f"Created {len(chunks)} chunks from {len(documents)} documents")
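To make the effect of chunk size and overlap concrete, here is a minimal sketch of a sliding window over a token list. It is not how RecursiveCharacterTextSplitter works internally (that splitter also respects separators). `window_chunks` is a hypothetical helper, and the placeholder strings stand in for real tokens.

```python
def window_chunks(tokens: list[str], size: int, overlap: int) -> list[list[str]]:
    """Split tokens into windows of `size`; each shares `overlap` tokens with the previous one."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    windows = []
    for start in range(0, len(tokens), step):
        windows.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reaches the end of the document
    return windows


tokens = [f"t{i}" for i in range(1200)]  # stand-in for a 1,200-token document
chunks_512 = window_chunks(tokens, size=512, overlap=50)
print([len(c) for c in chunks_512])
```

With these settings the 1,200-token document yields three chunks, and the last 50 tokens of each chunk reappear at the start of the next. A sentence that straddles a boundary therefore stays intact in at least one chunk.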

Step 2 — Embedding Model Setup

For most production use cases, OpenAI's text-embedding-3-small offers the best balance of quality and cost: it scores higher than ada-002 on OpenAI's published benchmarks at roughly one-fifth of the list price. For cost-sensitive or data-sovereignty deployments, BGE-M3 (open-source, runs locally) is a strong alternative.

from openai import OpenAI

# Reads the API key from the OPENAI_API_KEY environment variable
client = OpenAI()

def embed_texts(texts: list[str], batch_size: int = 100) -> list[list[float]]:
    """Embed texts in batches to respect API rate limits."""
    embeddings = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        response = client.embeddings.create(
            model="text-embedding-3-small",
            input=batch,
        )
        embeddings.extend([e.embedding for e in response.data])
    return embeddings
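Batching keeps request counts down, but a large ingestion run can still hit rate limits. One common mitigation is retrying with exponential backoff. The sketch below is a generic helper and not part of the OpenAI SDK. `with_backoff` and its parameters are hypothetical names, and the `sleep` argument exists only so the delay can be stubbed out in tests.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_backoff(
    call: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    retry_on: tuple[type[BaseException], ...] = (Exception,),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run call(), retrying with exponential backoff plus jitter on failure."""
    for attempt in range(max_attempts):
        try:
            return call()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # Wait 1x, 2x, 4x ... the base delay, plus up to 10% random jitter
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay * 0.1))
    raise ValueError("max_attempts must be at least 1")
```

Inside `embed_texts`, the API call could then be wrapped as `with_backoff(lambda: client.embeddings.create(model="text-embedding-3-small", input=batch), retry_on=(openai.RateLimitError,))`. The OpenAI client also has its own `max_retries` setting, so check that first before layering a second retry loop on top.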

Step 3 — Setting Up Qdrant

Run Qdrant locally with Docker for development:

docker run -p 6333:6333 qdrant/qdrant

For production, use Qdrant Cloud or self-host on Kubernetes. Create a collection and index your chunks:

from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct

qdrant = QdrantClient("localhost", port=6333)

COLLECTION = "knowledge_base"
VECTOR_DIM = 1536  # text-embedding-3-small dimension

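# Create the collection only if it is missing. recreate_collection is
# deprecated and drops any data already stored under this name.
if not qdrant.collection_exists(COLLECTION):
    qdrant.create_collection(
        collection_name=COLLECTION,
        vectors_config=VectorParams(size=VECTOR_DIM, distance=Distance.COSINE),
    )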

# Index chunks
texts = [chunk.page_content for chunk in chunks]
vectors = embed_texts(texts)

points = [
    PointStruct(
        id=i,
        vector=vectors[i],
        payload={
            "text": chunks[i].page_content,
            "source": chunks[i].metadata.get("source"),
            "chunk_id": chunks[i].metadata.get("chunk_id"),
        },
    )
    for i in range(len(chunks))
]

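# Upsert in batches so a large corpus is not sent as one oversized request
BATCH_SIZE = 256
for start in range(0, len(points), BATCH_SIZE):
    qdrant.upsert(collection_name=COLLECTION, points=points[start:start + BATCH_SIZE])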
print(f"Indexed {len(points)} chunks")

Step 4 — Hybrid Retrieval (Dense + Sparse)

Pure dense vector search is good but misses exact keyword matches — acronyms, product names, version numbers. Sparse BM25 retrieval catches these. Combining both with Reciprocal Rank Fusion (RRF) gives you the best of both worlds.

from rank_bm25 import BM25Okapi
from collections import defaultdict

# Build BM25 index over your corpus
corpus = [chunk.page_content.lower().split() for chunk in chunks]
bm25 = BM25Okapi(corpus)

def hybrid_search(query: str, top_k: int = 20) -> list[str]:
    """Combine dense and sparse retrieval with Reciprocal Rank Fusion."""
    query_embedding = embed_texts([query])[0]

    # Dense retrieval (query_points replaces the deprecated search method)
    dense_results = qdrant.query_points(
        collection_name=COLLECTION,
        query=query_embedding,
        limit=top_k,
    ).points
    dense_ids = [r.id for r in dense_results]

    # Sparse (BM25) retrieval; drop chunks that share no term with the query
    bm25_scores = bm25.get_scores(query.lower().split())
    ranked = sorted(range(len(bm25_scores)),
                    key=lambda i: bm25_scores[i], reverse=True)
    sparse_ids = [i for i in ranked[:top_k] if bm25_scores[i] > 0]

    # Reciprocal Rank Fusion
    rrf_scores = defaultdict(float)
    k = 60  # RRF constant

    for rank, doc_id in enumerate(dense_ids):
        rrf_scores[doc_id] += 1 / (k + rank + 1)

    for rank, doc_id in enumerate(sparse_ids):
        rrf_scores[doc_id] += 1 / (k + rank + 1)

    # Return top-k by fused score. Qdrant point IDs were assigned as chunk
    # list indices in Step 3, so both ID lists index directly into chunks.
    top_ids = sorted(rrf_scores, key=rrf_scores.get, reverse=True)[:top_k]
    return [chunks[i].page_content for i in top_ids]
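To see how Reciprocal Rank Fusion behaves, it helps to run it on two tiny rankings by hand. The sketch below isolates the fusion step from the function above. `rrf_fuse` is a hypothetical helper name and the chunk IDs are made up for illustration.

```python
from collections import defaultdict


def rrf_fuse(rankings: list[list[int]], k: int = 60) -> list[tuple[int, float]]:
    """Fuse ranked ID lists; each list contributes 1 / (k + rank), with rank starting at 1."""
    scores: dict[int, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


dense = [7, 3, 9]   # hypothetical chunk IDs from vector search, best first
sparse = [3, 5, 7]  # hypothetical chunk IDs from BM25, best first

for doc_id, score in rrf_fuse([dense, sparse]):
    print(doc_id, round(score, 5))
```

Chunk 3 comes out on top because it sits near the top of both lists, followed by chunk 7, which appears in both but is ranked last by BM25. Chunks 5 and 9 each appear in only one list and trail behind. RRF uses only rank positions, so the raw cosine and BM25 scores never need to be put on a common scale.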

Step 5 — Reranking with Cohere

Retrieval returns candidates; reranking picks the best ones. A cross-encoder reranker like Cohere's rerank-english-v3.0 reads the query and each passage together and scores relevance more accurately than the embedding similarity used in retrieval. Our benchmark finding: 512-token chunks with Cohere reranking consistently outperformed 1024-token chunks without reranking.

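import os

import cohere

# Read the key from the environment rather than hard-coding it
# (COHERE_API_KEY is an assumed variable name)
co = cohere.Client(os.environ["COHERE_API_KEY"])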

def retrieve_and_rerank(query: str, top_k_retrieve: int = 20, top_k_final: int = 5) -> list[str]:
    """Retrieve candidates then rerank to final context."""
    candidates = hybrid_search(query, top_k=top_k_retrieve)

    rerank_response = co.rerank(
        model="rerank-english-v3.0",
        query=query,
        documents=candidates,
        top_n=top_k_final,
    )

    return [candidates[r.index] for r in rerank_response.results]


def answer(query: str) -> str:
    """Full RAG query: retrieve → rerank → generate."""
    context_passages = retrieve_and_rerank(query)
    context = "\n\n---\n\n".join(context_passages)

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": (
                    "Answer the question using ONLY the context provided. "
                    "If the context does not contain enough information, say so. "
                    "Do not hallucinate facts not present in the context."
                ),
            },
            {
                "role": "user",
                "content": f"Context:\n{context}\n\nQuestion: {query}",
            },
        ],
        temperature=0.1,
    )
    return response.choices[0].message.content
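The pipeline above passes bare passage text to the LLM, so the answer cannot point back to its sources. Since each point's payload already stores `source`, one option is to number the passages and label them before building the prompt. This is a minimal sketch under one assumption: passages arrive as dicts with `text` and `source` keys, which means retrieval would have to return payloads instead of strings. `format_context` is a hypothetical helper.

```python
def format_context(passages: list[dict]) -> str:
    """Number each passage and attach its source so the model can cite [n]."""
    blocks = []
    for n, passage in enumerate(passages, start=1):
        source = passage.get("source") or "unknown"
        blocks.append(f"[{n}] (source: {source})\n{passage['text']}")
    return "\n\n---\n\n".join(blocks)


example = [
    {"text": "Refunds are available within 30 days.", "source": "policies/refunds.pdf"},
    {"text": "Use 'Forgot password' on the login page.", "source": None},
]
print(format_context(example))
```

The system prompt can then ask the model to cite passage numbers such as [1], which makes it possible to map each claim in the answer back to a document.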

Step 6 — Evaluation with RAGAS

Never ship a RAG system without first measuring it. RAGAS evaluates four dimensions that cover the most common failure modes. Build a test set of 50–200 representative questions with ground-truth answers before development starts.

from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall
from datasets import Dataset

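# Build evaluation dataset. These column names follow the older ragas
# schema; newer ragas releases rename them, so check your installed version.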
eval_data = {
    "question": ["What is the refund policy?", "How do I reset my password?"],
    "answer": [],        # populated by your RAG system
    "contexts": [],      # list of retrieved passages per question
    "ground_truth": ["30-day full refund.", "Click 'Forgot password' on login page."],
}

for q in eval_data["question"]:
    passages = retrieve_and_rerank(q)
    response = answer(q)
    eval_data["answer"].append(response)
    eval_data["contexts"].append(passages)

dataset = Dataset.from_dict(eval_data)

result = evaluate(
    dataset,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)

print(result)
# Illustrative output (your scores will differ):
# faithfulness: 0.91  ← target > 0.8
# answer_relevancy: 0.87  ← target > 0.8
# context_precision: 0.79  ← acceptable; below 0.7 needs retrieval tuning
# context_recall: 0.83  ← target > 0.8

A faithfulness score below 0.8 is a hard stop — it means the LLM is generating content not supported by the retrieved context (hallucination). Tune the system prompt and reduce generation temperature before shipping. For a deep dive on common production failures, see our guide on hybrid RAG for domain-specific LLMs.
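To enforce these thresholds on every change, the scores can be checked by a small gate in CI. The sketch below takes a plain dict of metric scores, because how you extract them from the RAGAS result depends on the ragas version. `failing_metrics` and `THRESHOLDS` are hypothetical names, the minimums are the targets quoted in this guide, and the example scores are the illustrative numbers from the output above.

```python
import math

# Minimum acceptable score per metric, taken from the targets in this guide
THRESHOLDS = {
    "faithfulness": 0.8,
    "answer_relevancy": 0.8,
    "context_precision": 0.7,
    "context_recall": 0.8,
}


def failing_metrics(scores: dict[str, float],
                    thresholds: dict[str, float] = THRESHOLDS) -> list[str]:
    """Describe every metric that is missing, NaN, or below its minimum."""
    failures = []
    for name, minimum in thresholds.items():
        score = scores.get(name)
        if score is None or math.isnan(score) or score < minimum:
            failures.append(f"{name}: {score} (minimum {minimum})")
    return failures


example_scores = {
    "faithfulness": 0.91,
    "answer_relevancy": 0.87,
    "context_precision": 0.79,
    "context_recall": 0.83,
}
problems = failing_metrics(example_scores)
if problems:
    raise SystemExit("RAG quality gate failed: " + "; ".join(problems))
print("RAG quality gate passed")
```

Treating a missing or NaN score as a failure is deliberate: an evaluation run that silently skipped a metric should block the release just like a low score.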

Frequently Asked Questions

What is Retrieval-Augmented Generation (RAG)?

Retrieval-Augmented Generation (RAG) is an architecture pattern that grounds LLM responses in your own data. Instead of relying purely on the model's training data, RAG retrieves the most relevant passages from your document corpus at query time and injects them into the LLM's context window. The model then generates an answer based on both the retrieved context and its general knowledge. This reduces hallucinations about domain-specific facts, keeps responses current without fine-tuning, and gives you a citation trail for every answer.

When should I use RAG vs fine-tuning?

Use RAG when: your knowledge base changes frequently (new documents, updated policies), you need source citations for compliance or trust, your dataset is too large to fit in context, or you need to answer questions over thousands of documents. Use fine-tuning when: you need the model to adopt a specific tone, style, or format that prompt engineering cannot achieve, you're optimising for a narrow, well-defined task with stable data, or you need to distil reasoning patterns into a smaller model for cost efficiency. In practice, RAG and fine-tuning are complementary — fine-tuned models with RAG retrieval are common in production.

What is the best vector database for production RAG in 2026?

For most production RAG systems in 2026, Qdrant is the recommended default: it supports both dense and sparse (BM25) vectors natively, has strong filtering and metadata support, ships as a single binary or managed cloud service, and has an excellent Python client. Pinecone is the easiest managed option with minimal operational overhead. Weaviate suits teams that want a graph-like object model alongside vectors. pgvector works well if you're already on Postgres and your scale is under 10 million vectors. Chroma is fine for development and small-scale deployments but is not recommended for production workloads above 1 million vectors.

How do I evaluate RAG pipeline quality?

Use RAGAS (Retrieval-Augmented Generation Assessment) for systematic evaluation. The four core metrics are: Faithfulness (does the answer stick to the retrieved context?), Answer Relevancy (does the answer address the question?), Context Precision (are the retrieved passages actually relevant?), and Context Recall (did retrieval surface all the information needed to answer?). Run RAGAS evaluations on a curated test set of 50–200 question-answer pairs before shipping any RAG system. A faithfulness score below 0.8 indicates hallucination risk. Context precision below 0.7 indicates your retrieval strategy needs tuning.

How much does a production RAG system cost to run?

For a typical enterprise RAG system handling 1,000 queries/day: embedding costs (text-embedding-3-small) ≈ $0.50/day; Cohere reranker ≈ $1.00/day; LLM generation (GPT-4o-mini, ~500 tokens per response) ≈ $1.50/day; Qdrant Cloud (1M vectors) ≈ $25/month. That is about $3/day in API costs, or roughly $90–120/month in total. For 10,000 queries/day, scale the LLM generation cost proportionally (the biggest variable); infrastructure costs scale sub-linearly. Using an open-source embedding model (BGE-M3) and self-hosting Qdrant removes the first and last line items from the bill, though you then pay for the compute to run them.
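The monthly total follows from simple arithmetic, sketched below so you can plug in your own numbers. The figures are the estimates quoted in this answer, a 30-day month is assumed, and `monthly_cost` is a hypothetical helper.

```python
DAYS_PER_MONTH = 30  # assumption: a 30-day billing month

# Per-day API estimates at 1,000 queries/day, as quoted in this answer
daily_costs_usd = {"embeddings": 0.50, "reranker": 1.00, "llm_generation": 1.50}
vector_db_monthly_usd = 25.0


def monthly_cost(daily: dict[str, float], fixed_monthly: float,
                 days: int = DAYS_PER_MONTH) -> float:
    """Monthly total: per-day costs times days, plus fixed monthly fees."""
    return sum(daily.values()) * days + fixed_monthly


managed = monthly_cost(daily_costs_usd, vector_db_monthly_usd)

# Self-hosted variant: no embedding API or managed vector DB fees
# (this excludes the compute you would pay to run both yourself)
self_hosted_daily = {name: cost for name, cost in daily_costs_usd.items()
                     if name != "embeddings"}
self_hosted = monthly_cost(self_hosted_daily, fixed_monthly=0.0)

print(f"Managed stack: ${managed:.2f}/month")
print(f"Self-hosted (API costs only): ${self_hosted:.2f}/month")
```

With these inputs the managed stack comes to $115/month, inside the quoted range, and the API-only cost of the self-hosted variant is $75/month before adding your own infrastructure.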
