How to Build a RAG System in Practice
A production RAG system requires more than connecting a vector database to an LLM. You need document preprocessing, retrieval tuning, fallback strategies, and quality monitoring. This guide shows how to design each component for reliability.
System Architecture
A production RAG system has five main components working together:
- Document Processing Pipeline - Ingests, chunks, and embeds knowledge base content
- Retrieval Layer - Searches for relevant documents using semantic and keyword matching
- Context Assembler - Ranks and formats retrieved documents for the LLM
- Generation Layer - Produces answers using retrieved context
- Evaluation System - Monitors quality and catches failures
```
User Query
    ↓
Query Processing (rewrite, expand)
    ↓
Retrieval (semantic + keyword)
    ↓
Score & Filter Results
    ↓
Assemble Context
    ↓
LLM Generation
    ↓
Quality Check
    ↓
Response to User
```
Document Processing Pipeline
Before retrieval can work, documents need proper preparation. Poor document processing is behind many RAG failures.
Chunking Strategy
Split documents into chunks small enough to be meaningful but large enough to contain complete thoughts. For technical documentation, 300-500 tokens per chunk works well.
Use semantic chunking where possible. Break at section headers, paragraph boundaries, or logical topic shifts rather than arbitrary character counts. Each chunk should be independently understandable.
Add metadata to chunks including document title, section, date, and any relevant tags. This metadata helps with filtering and ranking during retrieval.
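As a concrete illustration, here is a minimal header-aware chunker. The `Chunk` dataclass, the Markdown-style `#` header convention, and the word-count budget (a rough stand-in for tokens) are assumptions for the sketch, not requirements; adapt the split points to your document format.

```python
# Illustrative header-aware chunking sketch. Splits at section headers and
# paragraph boundaries, attaching title/section metadata to each chunk.
from dataclasses import dataclass, field

@dataclass
class Chunk:
    text: str
    metadata: dict = field(default_factory=dict)

def chunk_markdown(doc_text: str, title: str, max_words: int = 400) -> list[Chunk]:
    chunks, section, buffer = [], "", []

    def flush():
        if buffer:
            chunks.append(Chunk(
                text=" ".join(buffer),
                metadata={"title": title, "section": section},
            ))
            buffer.clear()

    for paragraph in doc_text.split("\n\n"):
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        if paragraph.startswith("#"):  # section header: start a new chunk
            flush()
            section = paragraph.lstrip("# ").strip()
            continue
        # Flush if adding this paragraph would exceed the size budget.
        if sum(len(p.split()) for p in buffer) + len(paragraph.split()) > max_words:
            flush()
        buffer.append(paragraph)
    flush()
    return chunks
```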
Embedding Generation
Generate embeddings for each chunk using a model optimized for your domain. For general technical content, models such as OpenAI's text-embedding-3 family or open-source sentence-transformers models work well.
Store both the embedding vector and the original text. You need the text for the LLM context and the vector for retrieval.
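A minimal sketch of this step, building on the `Chunk` objects from the chunking example and using the sentence-transformers library as one of the options mentioned above. The in-memory list of dicts stands in for whatever vector store you actually use.

```python
# Embed each chunk and keep both the vector and the original text.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def embed_chunks(chunks):
    vectors = model.encode([c.text for c in chunks], normalize_embeddings=True)
    return [
        {"text": c.text, "metadata": c.metadata, "embedding": vec}
        for c, vec in zip(chunks, vectors)
    ]
```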
Retrieval Layer Design
Retrieval makes or breaks your RAG system. Invest time here.
Hybrid Search
Combine semantic search with keyword search. Semantic search finds conceptually similar content. Keyword search catches specific terms, codes, or names that embeddings might miss.
Weight semantic results higher for conceptual questions. Weight keyword results higher for specific lookups like product codes or error messages.
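One simple way to combine the two is weighted score fusion, sketched below. `semantic_search` and `keyword_search` are placeholders for your vector store and keyword index (for example BM25), each assumed to return a `{doc_id: score}` map with scores already normalized to [0, 1].

```python
# Weighted fusion of semantic and keyword results.
def hybrid_search(query: str, semantic_search, keyword_search,
                  semantic_weight: float = 0.7) -> list[tuple[str, float]]:
    semantic = semantic_search(query)   # conceptual matches
    keyword = keyword_search(query)     # exact terms, codes, error strings

    combined: dict[str, float] = {}
    for doc_id, score in semantic.items():
        combined[doc_id] = combined.get(doc_id, 0.0) + semantic_weight * score
    for doc_id, score in keyword.items():
        combined[doc_id] = combined.get(doc_id, 0.0) + (1 - semantic_weight) * score

    return sorted(combined.items(), key=lambda item: item[1], reverse=True)
```

Lowering `semantic_weight` shifts the balance toward exact-match lookups such as product codes or error strings.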
Query Processing
Before retrieval, process the user query. Expand acronyms, correct typos, and identify key entities. This improves retrieval accuracy.
For ambiguous queries, consider generating multiple query variations and retrieving for each. Combine and deduplicate the results.
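A sketch of that multi-query pattern, with `generate_variations` (typically an LLM call) and `retrieve` as placeholders; duplicates are merged by keeping each document's best score.

```python
# Retrieve with several phrasings of an ambiguous query, then merge and
# deduplicate by document id.
def multi_query_retrieve(query: str, generate_variations, retrieve,
                         max_variations: int = 3) -> list[tuple[str, float]]:
    queries = [query] + generate_variations(query)[:max_variations]

    best: dict[str, float] = {}
    for q in queries:
        for doc_id, score in retrieve(q):
            best[doc_id] = max(best.get(doc_id, 0.0), score)

    return sorted(best.items(), key=lambda item: item[1], reverse=True)
```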
Filtering and Ranking
Set a minimum relevance score threshold. If no documents exceed this threshold, do not attempt to generate an answer. Return a message that the information is not available.
Retrieve 20-30 candidates, then re-rank the top 5-10 using a cross-encoder model for better accuracy. This two-stage approach balances speed and quality.
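A sketch of the two-stage approach using a cross-encoder from sentence-transformers. The model name, threshold, and candidate counts are illustrative, and `retrieve` is assumed to return `(doc_id, text, score)` tuples.

```python
# Two-stage retrieval: take ~20-30 candidates, drop anything under the
# relevance floor, then re-rank the survivors with a cross-encoder.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def retrieve_and_rerank(query: str, retrieve, min_score: float = 0.3,
                        n_candidates: int = 25, top_k: int = 8):
    candidates = [
        (doc_id, text) for doc_id, text, score in retrieve(query, limit=n_candidates)
        if score >= min_score
    ]
    if not candidates:
        return []  # caller should trigger the "not available" fallback

    rerank_scores = reranker.predict([(query, text) for _, text in candidates])
    ranked = sorted(zip(candidates, rerank_scores), key=lambda x: x[1], reverse=True)
    return [(doc_id, text, float(score)) for (doc_id, text), score in ranked[:top_k]]
```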
Context Assembly
You have retrieved relevant documents. Now prepare them for the LLM.
Format each retrieved chunk with its metadata. Include the source document, section, and date. This helps the LLM cite sources and understand freshness.
Order chunks by relevance score, but consider diversity. If the top 5 results are all from the same document, include a result from a different source to provide broader context.
Respect token limits. Count tokens in your prompt template, the user query, and retrieved context. Leave room for the generated response. Typical allocation: 30% prompt, 50% context, 20% response buffer.
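A sketch of budget-aware context assembly. tiktoken is one way to count tokens, the metadata fields match the chunking example earlier, and the budget number is illustrative: derive it from your model's context window and the allocation above.

```python
# Pack the highest-ranked chunks into a fixed context budget, formatting
# each with its source metadata so the LLM can cite it.
import tiktoken

encoder = tiktoken.get_encoding("cl100k_base")

def assemble_context(ranked_chunks, context_budget: int = 4000) -> str:
    parts, used = [], 0
    for chunk in ranked_chunks:
        block = (
            f"[Source: {chunk['metadata'].get('title', 'unknown')} | "
            f"Section: {chunk['metadata'].get('section', '-')} | "
            f"Date: {chunk['metadata'].get('date', '-')}]\n{chunk['text']}"
        )
        tokens = len(encoder.encode(block))
        if used + tokens > context_budget:
            break
        parts.append(block)
        used += tokens
    return "\n\n---\n\n".join(parts)
```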
Generation with Safeguards
The LLM generates the answer using retrieved context.
Prompt Design
Your prompt should instruct the model to only use the provided context, cite sources, and admit when it cannot answer from the given information.
Example prompt structure:
```
Context:
[Retrieved documents here]

User Question: [question]

Instructions:
- Answer directly and concisely
- Cite which document you used
- If unsure, say you don't have that information
```
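Filling that template in code is straightforward; the helper below is just one way to keep the wording in a single place so it can be versioned and iterated on.

```python
# Build the final prompt from the assembled context and the user question.
def build_prompt(context: str, question: str) -> str:
    return (
        "Context:\n"
        f"{context}\n\n"
        f"User Question: {question}\n\n"
        "Instructions:\n"
        "- Answer directly and concisely\n"
        "- Cite which document you used\n"
        "- If unsure, say you don't have that information\n"
    )
```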
Response Validation
Check the generated response before returning it to the user. Verify that it cites sources, does not contradict the retrieved context, and answers the question.
Flag responses that hedge excessively or provide generic answers. These often indicate retrieval failure.
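A lightweight version of those checks is sketched below. The hedge phrases and length cutoff are arbitrary starting points, and a real contradiction check against the retrieved context needs a stronger signal (for example an NLI model or an LLM-as-judge pass).

```python
# Lightweight post-generation checks: did the answer cite a retrieved
# source, and does it look like a generic non-answer?
HEDGE_PHRASES = ("it depends", "i'm not sure", "generally speaking", "it varies")

def validate_response(response: str, source_titles: list[str]) -> list[str]:
    issues = []
    if not any(title.lower() in response.lower() for title in source_titles):
        issues.append("missing_citation")
    if sum(phrase in response.lower() for phrase in HEDGE_PHRASES) >= 2:
        issues.append("excessive_hedging")
    if len(response.split()) < 15:
        issues.append("possibly_generic_or_truncated")
    return issues  # empty list means the response passes the checks
```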
Error Handling and Fallbacks
Production systems need fallback strategies for common failure modes.
No Relevant Documents Found
When retrieval confidence is low, skip generation. Return: "I could not find information about that in our knowledge base. Let me connect you with a human agent."
Retrieval Returns Contradictory Documents
If top-ranked documents contradict each other, either return the most recent document or escalate to a human. Do not let the LLM try to reconcile contradictions.
Generation Fails or Times Out
Implement retries with exponential backoff. If generation fails after retries, return the retrieved documents directly with a message like "Here are the relevant help articles."
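A sketch of that fallback path; `generate` is a placeholder for your LLM call, and the delays and retry count are illustrative.

```python
# Retry generation with exponential backoff; if it still fails, fall back
# to returning the retrieved articles directly.
import time

def generate_with_fallback(generate, prompt: str, retrieved_docs,
                           max_retries: int = 3, base_delay: float = 1.0):
    for attempt in range(max_retries):
        try:
            return {"type": "answer", "text": generate(prompt)}
        except Exception:
            if attempt < max_retries - 1:
                time.sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...
    titles = [doc["metadata"].get("title", "untitled") for doc in retrieved_docs]
    return {
        "type": "fallback",
        "text": "Here are the relevant help articles: " + ", ".join(titles),
    }
```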
User Question Outside Scope
Train a classifier to detect off-topic questions. Route these to a fallback flow rather than attempting RAG retrieval.
Evaluation and Monitoring
Measure system performance continuously. Track these metrics:
- Retrieval Precision - What percentage of retrieved documents are actually relevant?
- Retrieval Recall - Does the system find all relevant documents?
- Answer Quality - Human ratings on helpfulness and accuracy
- Citation Rate - Does the response cite sources?
- Fallback Rate - How often does the system admit it cannot answer?
Create a test set of questions with known correct answers. Run this test set weekly to catch regressions.
Log all queries, retrieved documents, and responses. Sample and manually review 50-100 interactions per week to find issues.
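Retrieval precision and recall can be computed from the test set once each test case lists the documents a correct retrieval should return; the test-case structure below is an assumption for illustration.

```python
# Compute retrieval precision@k and recall@k against a labeled test set.
# `retrieve` is assumed to return a ranked list of (doc_id, score) pairs.
def evaluate_retrieval(test_cases, retrieve, k: int = 10):
    precisions, recalls = [], []
    for case in test_cases:
        retrieved = {doc_id for doc_id, _score in retrieve(case["question"])[:k]}
        relevant = set(case["relevant_doc_ids"])
        hits = len(retrieved & relevant)
        precisions.append(hits / len(retrieved) if retrieved else 0.0)
        recalls.append(hits / len(relevant) if relevant else 1.0)
    n = len(test_cases)
    return {"precision@k": sum(precisions) / n, "recall@k": sum(recalls) / n}
```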
Integration with Prompt Chaining
RAG often works better as part of a larger prompt chain. For complex queries, use this pattern:
- Chain Step 1: Analyze the user query, identify intent and key entities
- Chain Step 2: Perform RAG retrieval using the analyzed query
- Chain Step 3: Generate the answer using retrieved context
- Chain Step 4: Format the answer with citations and next steps
This separation lets you optimize each step independently and provides clearer debugging when something fails.
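A sketch of that chain as plain functions, so each step can be tested and tuned independently; every callable here is a placeholder for an LLM call or one of the components described earlier.

```python
# The four chain steps wired together, with an early exit when retrieval
# comes back empty.
def answer_with_rag(query: str, analyze, retrieve, generate, format_answer):
    analysis = analyze(query)                         # step 1: intent + entities
    docs = retrieve(analysis["rewritten_query"])      # step 2: RAG retrieval
    if not docs:
        return {"type": "fallback",
                "text": "I could not find information about that in our knowledge base."}
    draft = generate(query, docs)                     # step 3: answer from context
    return format_answer(draft, docs)                 # step 4: citations + next steps
```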
Real Example: Support Bot
Here is how the system works for a customer question:
- User asks: "My API key is giving a 401 error"
- Query Processing: Identifies "API authentication issue" as the intent
- Retrieval: Finds 3 relevant articles about API key configuration, authentication errors, and troubleshooting
- Context Assembly: Formats the 3 articles with titles and relevance scores
- Generation: LLM synthesizes an answer explaining that 401 means invalid credentials, provides steps to regenerate the key, and cites "API Authentication Guide"
- Validation: Checks that the response cites sources and addresses the error code
- Response: "A 401 error means your API key is invalid. Here's how to fix it: [steps]. Source: API Authentication Guide"
Related Concepts
Building production RAG systems requires understanding embeddings for retrieval quality, prompt chaining for complex workflows, and fine-tuning to improve model performance on your domain.
LLM agents can use RAG as one of their tools, retrieving information when needed during autonomous task execution.