AI · February 2026 · 14 min read

RAG Architecture Explained
How to Build AI That Actually Knows Your Business

[Figure: RAG architecture diagram]

Introduction

Here's the fundamental problem with raw LLMs: they know a lot about the world in general, and almost nothing specific about your business. Ask GPT-4 about your company's refund policy, your product's technical specifications, or what happened in last quarter's board meeting — and you'll get either a confident hallucination or an honest "I don't know."

Retrieval-Augmented Generation (RAG) is the architecture that solves this. Instead of hoping the model memorizes your information during training (it won't), RAG retrieves relevant documents from your knowledge base at query time and feeds them directly into the context window before the model generates a response.

The concept is elegant. The execution — if you want it to actually work well — is considerably more nuanced than most introductions let on.

The Basic RAG Pipeline (And Why It's Just the Starting Point)

Every RAG system has the same core components: an indexing pipeline that processes and stores your documents, and a query pipeline that retrieves and generates at runtime.

The indexing pipeline looks like this: you take a document, split it into chunks, convert each chunk into a vector embedding using an embedding model, and store those vectors in a vector database. At query time, you embed the user's question using the same embedding model, find the chunks whose vectors are closest to the question vector, and pass those chunks to the LLM as context.
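The two pipelines can be sketched in a few lines of Python. This is a minimal illustration, not a production implementation: the `embed` function is a stand-in bag-of-words vectorizer, where a real system would call an actual embedding model, and the "vector database" is just a list.

```python
import math
import re
from collections import Counter

def embed(text: str) -> dict[str, float]:
    # Stand-in for a real embedding model: a normalized bag-of-words
    # vector. A production pipeline would call an embedding API here.
    counts = Counter(re.findall(r"[a-z0-9]+", text.lower()))
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {w: c / norm for w, c in counts.items()}

def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    return sum(v * b.get(w, 0.0) for w, v in a.items())

# Indexing pipeline: chunk (here, one chunk per document), embed, store.
documents = [
    "Refunds are issued within 14 days of purchase.",
    "Our premium plan includes priority support.",
    "The board approved the Q3 budget in September.",
]
index = [(doc, embed(doc)) for doc in documents]

# Query pipeline: embed the question with the SAME model, then return
# the chunks whose vectors are closest to the question vector.
def retrieve(question: str, k: int = 2) -> list[str]:
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]
```

The retrieved chunks would then be prepended to the LLM prompt as context before generation.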

That's the textbook version. In practice, every step in that pipeline has implementation choices that dramatically affect quality — and the default choices are usually not the right ones for production.

Chunking: The Step Everyone Gets Wrong

Chunking strategy has more impact on RAG quality than almost any other decision, and it's routinely done wrong.

The naive approach is fixed-size chunking: split every document into 512-token blocks with a 50-token overlap. This is quick to implement and consistently mediocre. The problem is that fixed-size chunks are structurally ignorant — they'll split a table in half, cut a sentence mid-thought, or lump unrelated paragraphs together.
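For concreteness, here is what that naive strategy looks like, using words as a rough proxy for tokens (a real pipeline would count with the embedding model's tokenizer):

```python
def fixed_size_chunks(words: list[str], size: int = 512,
                      overlap: int = 50) -> list[list[str]]:
    # Naive fixed-size chunking: slide a window of `size` tokens
    # forward, keeping `overlap` tokens of the previous chunk so a
    # sentence cut at a boundary appears whole in at least one chunk.
    step = size - overlap
    return [words[i:i + size] for i in range(0, len(words), step)]
```

Note that nothing here looks at sentence, paragraph, or table boundaries, which is exactly the structural ignorance described above.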

Better approaches depend on your content type. For structured documents like technical documentation or legal contracts, semantic chunking — splitting on paragraph and section boundaries — preserves meaning much better. For conversational logs or support transcripts, splitting by speaker turn or topic shift makes more sense.

The hierarchy that works best for most enterprise content is parent-document retrieval: create small chunks (128–256 tokens) for high-precision retrieval, but when you retrieve a match, return its parent section (512–1024 tokens) to give the LLM sufficient context. You get the precision of small chunks and the context richness of large ones.
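A sketch of parent-document retrieval, with a toy word-overlap scorer standing in for embedding similarity and invented example content; the structure — match on small chunks, return deduplicated parent sections — is the point:

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str       # small chunk (~128-256 tokens), used for matching
    parent_id: str  # the larger section this chunk was cut from

# Parent sections hold the fuller context handed to the LLM.
sections = {
    "s1": "Section 1: full text about refunds, timelines, and exceptions...",
    "s2": "Section 2: full text about shipping regions and carriers...",
}
chunks = [
    Chunk("refunds within 14 days", "s1"),
    Chunk("exceptions for digital goods", "s1"),
    Chunk("ships to EU and US", "s2"),
]

def score(query: str, chunk: Chunk) -> float:
    # Stand-in scorer: word overlap. A real system compares embeddings.
    q = set(query.lower().split())
    return len(q & set(chunk.text.lower().split()))

def retrieve_parents(query: str, k: int = 2) -> list[str]:
    hits = sorted(chunks, key=lambda c: score(query, c), reverse=True)[:k]
    seen: set[str] = set()
    parents = []
    for c in hits:
        if c.parent_id not in seen:  # dedupe: each section appears once
            seen.add(c.parent_id)
            parents.append(sections[c.parent_id])
    return parents
```

Deduplication matters: two small chunks from the same section should yield one copy of that section in the prompt, not two.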

Choosing the Right Embedding Model

Not all embedding models are equal, and the best one for your use case depends heavily on your domain.

OpenAI's text-embedding-3-large is excellent for general-purpose English content. It's fast, reliable, and the embedding space is well-calibrated. For most applications, it's a solid default.

But if you're working in a specialized domain — medical literature, legal documents, financial filings, code — you'll often get significantly better retrieval quality from a domain-specific model. BioBERT consistently outperforms general embedding models on clinical text. Legal-BERT does the same for contracts.

One thing that trips people up: the embedding model you use during indexing and the one you use at query time must be identical. Swap out your model after indexing and your entire vector index is junk. This sounds obvious until it burns you at 2 AM during a production incident.
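One cheap defense is to record the model name in the index metadata and fail loudly on mismatch. A minimal sketch (the class and its fields are illustrative, not any particular vector database's API):

```python
class VectorIndex:
    def __init__(self, embedding_model: str):
        # Record which embedding model produced the stored vectors.
        self.embedding_model = embedding_model
        self.vectors: list = []

    def query(self, query_model: str, query_vector: list[float]):
        # Fail loudly instead of silently returning garbage matches.
        if query_model != self.embedding_model:
            raise ValueError(
                f"index built with {self.embedding_model!r}, "
                f"queried with {query_model!r}"
            )
        ...  # nearest-neighbor lookup would go here
```

An exception at deploy time is far cheaper than debugging mysteriously bad retrieval in production.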

Hybrid Search: Why Vector-Only Retrieval Isn't Enough

Pure semantic search has a blind spot: exact matches. If a user asks about "Clause 14.3(b)" or a specific product SKU like "INV-2024-X7", semantic search might retrieve plausible-sounding documents while completely missing the exact one that contains that specific reference.

Hybrid search combines dense vector retrieval (semantic similarity) with sparse keyword retrieval (BM25 or TF-IDF) and merges the results using Reciprocal Rank Fusion. In our experience, switching from pure vector search to hybrid search improves retrieval precision on real-world enterprise queries by 15–25%. It's not complicated to implement — Weaviate, Pinecone, and Elasticsearch all support hybrid search out of the box — and the improvement is almost always worth it.
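Reciprocal Rank Fusion itself is only a few lines. Each retriever contributes 1/(k + rank) per document (the conventional constant is k = 60), so documents ranked well by both retrievers rise to the top:

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    # rankings: one ranked list of doc ids per retriever
    # (e.g. one from vector search, one from BM25).
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["doc_a", "doc_b", "doc_c"]
bm25_hits = ["doc_c", "doc_a", "doc_d"]
fused = reciprocal_rank_fusion([vector_hits, bm25_hits])
# doc_a (ranked 1st and 2nd) beats doc_c (ranked 3rd and 1st)
```

Because RRF works purely on ranks, it sidesteps the awkward problem of normalizing cosine similarities against BM25 scores.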

Re-ranking: The Quality Filter You're Probably Skipping

Vector similarity retrieves the top-k approximate matches. But approximate nearest neighbor search — which is how every vector database works at scale — trades some precision for speed. You might retrieve 20 results and have the most relevant one sitting at position 14.

A re-ranker is a second-pass model that takes your initial retrieval results and re-scores them using a more expensive cross-encoder — one that looks at the query and each document together, rather than comparing pre-computed vectors. Cohere's rerank API and cross-encoder models from sentence-transformers are the common options.

The pattern we use in production: retrieve top-20 candidates with vector + BM25 hybrid, re-rank to top-5 with a cross-encoder, pass top-5 to the LLM. The extra re-ranking latency is typically 100–200ms — a small price for a meaningful jump in answer quality.
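The re-ranking stage itself is simple; all the work is in the scorer. In this sketch, `cross_encoder_score` is a placeholder word-overlap function — in production it would be a call to a real cross-encoder such as Cohere's rerank endpoint or a sentence-transformers CrossEncoder:

```python
def cross_encoder_score(query: str, doc: str) -> float:
    # Placeholder: word overlap. A real cross-encoder scores the
    # (query, document) pair jointly with a transformer forward pass.
    q = set(query.lower().split())
    return float(len(q & set(doc.lower().split())))

def rerank(query: str, candidates: list[str], top_n: int = 5) -> list[str]:
    # Second pass over the hybrid-retrieval candidates: score each
    # pair, keep only the highest-scoring top_n for the LLM prompt.
    scored = [(cross_encoder_score(query, doc), doc) for doc in candidates]
    scored.sort(reverse=True)
    return [doc for _, doc in scored[:top_n]]
```

The retrieve-20 / re-rank-to-5 pattern is then just `rerank(query, hybrid_top_20, top_n=5)`.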

The Failure Modes That Will Bite You

Retrieval failure without generation failure

This is the insidious one. The LLM doesn't know when it has received irrelevant context — it will often answer confidently using whatever you give it. A bad retrieval that returns somewhat plausible but wrong documents produces a confident, wrong answer. You need evaluation mechanisms that independently score retrieval quality, not just generation quality. RAGAS is the framework we use for this.

Context window stuffing

More context is not always better. Overly long contexts dilute the relevant signal — the LLM's attention spreads across 5,000 tokens of loosely related content and misses the single sentence that actually answers the question. We've had cases where cutting context from 10 retrieved chunks to 4 improved answer accuracy measurably. Find the sweet spot for your use case empirically.
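A simple guard is to cap the context both by chunk count and by a token budget, keeping the best-ranked chunks first. A sketch, approximating tokens with `split()` where a real system would use the model's tokenizer:

```python
def fit_context(chunks: list[str], max_chunks: int = 4,
                max_tokens: int = 2000) -> list[str]:
    # `chunks` is assumed ranked best-first (e.g. by the re-ranker).
    # Keep chunks until either cap is hit, so the strongest signal
    # survives and the long tail of weak matches is dropped.
    kept, used = [], 0
    for chunk in chunks[:max_chunks]:
        cost = len(chunk.split())  # crude token estimate
        if used + cost > max_tokens:
            break
        kept.append(chunk)
        used += cost
    return kept
```

Both limits are knobs to tune against your evaluation set, not constants to copy.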

Stale indexes

Your documents change. If your vector index isn't updated when source documents are modified, your RAG system will confidently answer questions with outdated information. Implement a document change detection pipeline — webhooks from your CMS, file system watchers, or scheduled re-indexing jobs. And importantly: delete old document vectors when content is removed, or your system will keep retrieving documents that no longer exist.
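The scheduled-re-indexing variant reduces to a hash comparison: store a content hash per document at indexing time, then diff against current hashes to find what needs re-embedding and which vectors to delete. A minimal sketch:

```python
import hashlib

def content_hash(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def diff_index(indexed: dict[str, str], current: dict[str, str]):
    # indexed: doc_id -> hash recorded at last indexing run
    # current: doc_id -> hash of the document's content right now
    to_reindex = [d for d, h in current.items() if indexed.get(d) != h]
    to_delete = [d for d in indexed if d not in current]  # stale vectors
    return to_reindex, to_delete
```

`to_reindex` covers both modified and brand-new documents; `to_delete` is the step teams forget, and it is what keeps removed content from haunting your retrieval results.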

Agentic RAG: The Next Level

Standard RAG is a single retrieval step followed by generation. Agentic RAG uses an LLM agent to decide how to retrieve — and it changes the ceiling on what's possible.

Instead of one fixed query, the agent can reformulate questions, run multiple retrievals with different strategies, synthesize across sources, and ask clarifying questions when a query is ambiguous. For complex analytical questions that span multiple documents or require multi-hop reasoning, the improvement over single-step RAG is substantial.

The tradeoff is latency and complexity. Agentic RAG is significantly harder to debug and takes 3–5x longer to respond. We typically use it for backend analytical workflows where depth matters more than speed, and stick with optimized single-step RAG for real-time user-facing applications.
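The control loop behind agentic RAG can be sketched compactly. Everything below the loop is a stub: in a real system `retrieve` is the hybrid pipeline above, while `is_sufficient` and `reformulate` are LLM calls — here they are replaced with trivial hypothetical logic so the shape of the loop is visible:

```python
def retrieve(query: str) -> list[str]:
    # Stub retriever keyed on keywords; a real agent would call the
    # hybrid retrieval pipeline described earlier.
    corpus = {
        "revenue": "Q3 revenue grew 12% year over year.",
        "headcount": "Headcount rose from 80 to 95 in Q3.",
    }
    return [text for key, text in corpus.items() if key in query.lower()]

def is_sufficient(question: str, context: list[str]) -> bool:
    # Stub stopping rule; a real agent asks the LLM whether the
    # gathered context can answer the question yet.
    return len(context) >= 2

def reformulate(question: str, context: list[str]) -> str:
    # Stub rewrite; a real agent has the LLM rephrase the query to
    # target what is still missing from the context.
    return question + " headcount"

def agentic_retrieve(question: str, max_rounds: int = 3) -> list[str]:
    query, context = question, []
    for _ in range(max_rounds):
        context += [c for c in retrieve(query) if c not in context]
        if is_sufficient(question, context):
            break
        query = reformulate(question, context)  # try another angle
    return context
```

The `max_rounds` cap is where the 3-5x latency cost shows up: each extra round is another retrieval plus at least one more LLM call.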

The Practical Reality

RAG is genuinely transformative when you implement it well. A well-built RAG system can make an LLM behave like a domain expert — answering questions about your specific products, policies, and processes with accuracy that no amount of prompt engineering alone achieves.

But the gap between a toy RAG demo and a production RAG system is significant. Chunking strategy, embedding choice, hybrid search, re-ranking, evaluation, and index maintenance are all real engineering concerns that require deliberate design.

If you're building RAG for production, invest in evaluation first. Define what "good" looks like before you build, create a test set of representative questions with known correct answers, and measure against it continuously. That discipline is what separates RAG systems that stay good from ones that quietly degrade as your data changes.