How to Build a RAG System in Practice

Overview: This guide walks through designing a production RAG system for a customer support knowledge base, including document processing, retrieval, generation, error handling, and evaluation.

A production RAG system requires more than connecting a vector database to an LLM. You need document preprocessing, retrieval tuning, fallback strategies, and quality monitoring. This guide shows how to design each component for reliability.

System Architecture

A production RAG system has five main components working together:

  1. Document Processing Pipeline - Ingests, chunks, and embeds knowledge base content
  2. Retrieval Layer - Searches for relevant documents using semantic and keyword matching
  3. Context Assembler - Ranks and formats retrieved documents for the LLM
  4. Generation Layer - Produces answers using retrieved context
  5. Evaluation System - Monitors quality and catches failures

RAG System Flow:

User Query -> Query Processing (rewrite, expand) -> Retrieval (semantic + keyword) -> Score & Filter Results -> Assemble Context -> LLM Generation -> Quality Check -> Response to User

Document Processing Pipeline

Before retrieval works, documents need proper preparation. Poor document processing causes most RAG failures.

Chunking Strategy

Split documents into chunks small enough to be meaningful but large enough to contain complete thoughts. For technical documentation, 300-500 tokens per chunk works well.

Use semantic chunking where possible. Break at section headers, paragraph boundaries, or logical topic shifts rather than arbitrary character counts. Each chunk should be independently understandable.

Add metadata to chunks including document title, section, date, and any relevant tags. This metadata helps with filtering and ranking during retrieval.
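
Below is a minimal sketch of header-aware chunking with metadata, assuming markdown-style source documents; the Chunk type, chunk_by_sections, and the 1600-character budget (roughly 400 tokens at 4 characters per token) are illustrative choices, not part of any particular library.

```python
import re
from dataclasses import dataclass, field

@dataclass
class Chunk:
    text: str
    metadata: dict = field(default_factory=dict)

def chunk_by_sections(document: str, doc_title: str, max_chars: int = 1600) -> list[Chunk]:
    """Split at section headers, falling back to paragraph boundaries
    when a section exceeds the character budget."""
    chunks: list[Chunk] = []
    # Split on markdown headers, keeping each header with its section body.
    sections = re.split(r"\n(?=#{1,3} )", document)
    for section in sections:
        if not section.strip():
            continue
        header = section.strip().splitlines()[0]
        pieces, current = [], ""
        for para in section.split("\n\n"):
            if current and len(current) + len(para) > max_chars:
                pieces.append(current)
                current = para
            else:
                current = f"{current}\n\n{para}" if current else para
        if current:
            pieces.append(current)
        for piece in pieces:
            chunks.append(Chunk(
                text=piece.strip(),
                metadata={"title": doc_title, "section": header, "tags": []},
            ))
    return chunks
```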

Embedding Generation

Generate embeddings for each chunk using a model optimized for your domain. For general technical content, models like OpenAI text-embedding-3 or sentence-transformers work well.

Store both the embedding vector and the original text. You need the text for the LLM context and the vector for retrieval.
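
A small sketch of that pairing using sentence-transformers; the model name and the dict layout ("text", "metadata", "embedding") are assumptions you would adapt to your vector store's schema.

```python
# pip install sentence-transformers
from sentence_transformers import SentenceTransformer

def embed_chunks(chunks: list[dict], model_name: str = "all-MiniLM-L6-v2") -> list[dict]:
    """Embed each chunk and keep the original text alongside its vector.
    Each input dict is expected to carry "text" and "metadata" keys."""
    model = SentenceTransformer(model_name)
    vectors = model.encode([c["text"] for c in chunks], normalize_embeddings=True)
    return [
        {**chunk, "embedding": vector.tolist()}
        for chunk, vector in zip(chunks, vectors)
    ]
```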

Retrieval Layer Design

Retrieval makes or breaks your RAG system. Invest time here.

Hybrid Search

Combine semantic search with keyword search. Semantic search finds conceptually similar content. Keyword search catches specific terms, codes, or names that embeddings might miss.

Weight semantic results higher for conceptual questions. Weight keyword results higher for specific lookups like product codes or error messages.
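
One way to sketch that weighting, assuming unit-normalized embedding vectors and a simple term-overlap keyword score; a production system would typically substitute BM25 or its search engine's keyword scorer.

```python
import numpy as np

def hybrid_search(query_vec, query_terms, index, semantic_weight=0.7, top_k=20):
    """Blend semantic and keyword scores with a tunable weight.
    `index` is a list of dicts with "embedding" (unit-normalized) and "text".
    Lower semantic_weight for exact lookups like product codes or error strings."""
    scored = []
    q = np.asarray(query_vec)
    terms = {t.lower() for t in query_terms}
    for item in index:
        # Dot product equals cosine similarity because vectors are pre-normalized.
        semantic = float(np.dot(q, np.asarray(item["embedding"])))
        tokens = set(item["text"].lower().split())
        keyword = len(terms & tokens) / max(len(terms), 1)  # fraction of query terms present
        score = semantic_weight * semantic + (1 - semantic_weight) * keyword
        scored.append((score, item))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]
```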

Query Processing

Before retrieval, process the user query. Expand acronyms, correct typos, and identify key entities. This improves retrieval accuracy.

For ambiguous queries, consider generating multiple query variations and retrieving for each. Combine and deduplicate the results.
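
A sketch of that merge step; it assumes the query variants are already generated (for example by an LLM) and that retrieve(query) returns (score, chunk) pairs, as in the hybrid search sketch above.

```python
def multi_query_retrieve(variants: list[str], retrieve, top_k: int = 20) -> list[dict]:
    """Retrieve for each query variant and merge the results, keeping the best
    score seen for any duplicate chunk."""
    best: dict[str, tuple[float, dict]] = {}
    for query in variants:
        for score, chunk in retrieve(query):
            key = chunk["text"]  # or a stable chunk id if your store has one
            if key not in best or score > best[key][0]:
                best[key] = (score, chunk)
    merged = sorted(best.values(), key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in merged[:top_k]]
```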

Filtering and Ranking

Set a minimum relevance score threshold. If no documents exceed this threshold, do not attempt to generate an answer. Return a message that the information is not available.

Retrieve 20-30 candidates, then re-rank the top 5-10 using a cross-encoder model for better accuracy. This two-stage approach balances speed and quality.
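
A sketch of the two-stage pattern using a sentence-transformers cross-encoder; the model name, minimum score, and final_k are placeholders, and the threshold must be calibrated to the reranker's actual score scale.

```python
# pip install sentence-transformers
from sentence_transformers import CrossEncoder

def rerank_with_threshold(query: str, candidates: list[dict],
                          min_score: float = 0.2, final_k: int = 5) -> list[dict]:
    """Re-rank first-stage candidates with a cross-encoder, then drop anything
    below the relevance floor. An empty return signals "do not answer"."""
    reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
    scores = reranker.predict([(query, c["text"]) for c in candidates])
    ranked = sorted(zip(scores, candidates), key=lambda pair: pair[0], reverse=True)
    return [c for score, c in ranked[:final_k] if score >= min_score]
```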

Context Assembly

You have retrieved relevant documents. Now prepare them for the LLM.

Format each retrieved chunk with its metadata. Include the source document, section, and date. This helps the LLM cite sources and understand freshness.

Order chunks by relevance score, but consider diversity. If the top 5 results are all from the same document, include a result from a different source to provide broader context.

Respect token limits. Count tokens in your prompt template, the user query, and retrieved context. Leave room for the generated response. Typical allocation: 30% prompt, 50% context, 20% response buffer.
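
A sketch of budget-aware assembly; the 4-characters-per-token estimate and the "[Document N]" formatting are assumptions, and a real tokenizer matching your LLM should replace the estimate in production. With the 30/50/20 split above, budget_tokens would be roughly half the model's context window.

```python
def assemble_context(chunks: list[dict], budget_tokens: int) -> str:
    """Format retrieved chunks with their metadata and stop before the
    token budget is exhausted. Token counts are rough estimates here."""
    parts, used = [], 0
    for i, chunk in enumerate(chunks, start=1):
        meta = chunk.get("metadata", {})
        block = (
            f"[Document {i}] {meta.get('title', 'unknown')} - "
            f"{meta.get('section', '')} ({meta.get('date', 'n.d.')})\n"
            f"{chunk['text']}\n"
        )
        cost = len(block) // 4  # crude 4-chars-per-token estimate
        if used + cost > budget_tokens:
            break
        parts.append(block)
        used += cost
    return "\n".join(parts)
```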

Generation with Safeguards

The LLM generates the answer using retrieved context.

Prompt Design

Your prompt should instruct the model to only use the provided context, cite sources, and admit when it cannot answer from the given information.

Example prompt structure:

You are a customer support assistant. Answer the user's question using ONLY the information in the context below. If the context does not contain enough information, say so.

Context:
[Retrieved documents here]

User Question: [question]

Instructions:
- Answer directly and concisely
- Cite which document you used
- If unsure, say you don't have that information

Response Validation

Check the generated response before returning it to the user. Verify that it cites sources, does not contradict the retrieved context, and answers the question.

Flag responses that hedge excessively or provide generic answers. These often indicate retrieval failure.
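
A few cheap heuristic checks along those lines; the hedge patterns, the 15-word floor, and the title-substring citation check are illustrative stand-ins for whatever validation rules fit your knowledge base.

```python
import re

HEDGE_PATTERNS = [r"\bI (don't|do not) have\b", r"\bas an AI\b", r"\bit depends\b"]

def validate_response(response: str, retrieved: list[dict]) -> list[str]:
    """Return a list of validation flags; empty means the response passes.
    These are cheap heuristics, not a full faithfulness evaluation."""
    flags = []
    titles = [c.get("metadata", {}).get("title", "") for c in retrieved]
    if not any(title and title in response for title in titles):
        flags.append("no_source_cited")
    if any(re.search(p, response, re.IGNORECASE) for p in HEDGE_PATTERNS):
        flags.append("excessive_hedging")
    if len(response.split()) < 15:
        flags.append("possibly_generic_or_empty")
    return flags
```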

Error Handling and Fallbacks

Production systems need fallback strategies for common failure modes.

No Relevant Documents Found

When retrieval confidence is low, skip generation. Return: "I could not find information about that in our knowledge base. Let me connect you with a human agent."

Retrieval Returns Contradictory Documents

If top-ranked documents contradict each other, either return the most recent document or escalate to a human. Do not let the LLM try to reconcile contradictions.

Generation Fails or Times Out

Implement retries with exponential backoff. If generation fails after retries, return the retrieved documents directly with a message like "Here are the relevant help articles."
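
A sketch of that fallback path; call_llm stands in for whatever client you use, and the backoff schedule (1s, 2s, 4s plus jitter) is an arbitrary starting point.

```python
import random
import time

def generate_with_fallback(call_llm, prompt: str, retrieved: list[dict],
                           max_retries: int = 3) -> str:
    """Retry generation with exponential backoff; if it still fails, return
    the retrieved articles directly instead of an error."""
    for attempt in range(max_retries):
        try:
            return call_llm(prompt)
        except Exception:
            # Exponential backoff with jitter: roughly 1s, 2s, 4s.
            time.sleep(2 ** attempt + random.random())
    titles = [c.get("metadata", {}).get("title", "untitled") for c in retrieved]
    return "Here are the relevant help articles:\n- " + "\n- ".join(titles)
```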

User Question Outside Scope

Train a classifier to detect off-topic questions. Route these to a fallback flow rather than attempting RAG retrieval.
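
A trained classifier is the more robust option; as a lightweight stand-in, an embedding-similarity check against a handful of in-scope example questions can serve as a first pass. The example questions and the 0.4 threshold below are placeholders.

```python
from sentence_transformers import SentenceTransformer, util

IN_SCOPE_EXAMPLES = [
    "How do I reset my API key?",
    "Why am I getting a 401 error?",
    "How do I configure webhooks?",
]

def is_in_scope(query: str, threshold: float = 0.4) -> bool:
    """Route low-similarity queries to a fallback flow instead of RAG."""
    model = SentenceTransformer("all-MiniLM-L6-v2")
    query_vec = model.encode(query, normalize_embeddings=True)
    example_vecs = model.encode(IN_SCOPE_EXAMPLES, normalize_embeddings=True)
    best = float(util.cos_sim(query_vec, example_vecs).max())
    return best >= threshold
```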

Evaluation and Monitoring

Measure system performance continuously. Track retrieval relevance scores, the share of queries that hit the no-answer fallback, response validation failures, and generation errors or timeouts.

Create a test set of questions with known correct answers. Run this test set weekly to catch regressions.

Log all queries, retrieved documents, and responses. Sample and manually review 50-100 interactions per week to find issues.
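
A sketch of a weekly regression run; the file path, JSON schema, and substring-match grading are assumptions, and an LLM-based grader is a common upgrade once exact matching proves too brittle.

```python
import json

def run_regression_suite(answer_fn, test_path: str = "rag_test_set.json") -> float:
    """Run a saved question/answer test set through the pipeline and report
    the fraction of answers that contain the expected key fact."""
    with open(test_path) as f:
        cases = json.load(f)  # [{"question": ..., "expected_fact": ...}, ...]
    passed = 0
    for case in cases:
        answer = answer_fn(case["question"])
        if case["expected_fact"].lower() in answer.lower():
            passed += 1
    score = passed / len(cases)
    print(f"Regression suite: {passed}/{len(cases)} passed ({score:.0%})")
    return score
```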

Integration with Prompt Chaining

RAG often works better as part of a larger prompt chain. For complex queries, use this pattern:

  1. Chain Step 1: Analyze the user query, identify intent and key entities
  2. Chain Step 2: Perform RAG retrieval using the analyzed query
  3. Chain Step 3: Generate the answer using retrieved context
  4. Chain Step 4: Format the answer with citations and next steps

This separation lets you optimize each step independently and provides clearer debugging when something fails.
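
A sketch of that wiring; all four step functions are assumptions standing in for your own implementations, which is exactly what makes each step independently testable and debuggable.

```python
def answer_with_chain(user_query: str, analyze, retrieve, generate, format_answer) -> str:
    """Run the four chain steps as separate functions so each can be logged,
    tested, and tuned on its own."""
    analysis = analyze(user_query)            # Step 1: intent + key entities
    documents = retrieve(analysis)            # Step 2: RAG retrieval
    draft = generate(user_query, documents)   # Step 3: grounded answer
    return format_answer(draft, documents)    # Step 4: citations + next steps
```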

Real Example: Support Bot

Here is how the system works for a customer question:

User asks: "My API key is giving a 401 error"

Query Processing: Identifies "API authentication issue" as the intent

Retrieval: Finds 3 relevant articles about API key configuration, authentication errors, and troubleshooting

Context Assembly: Formats the 3 articles with titles and relevance scores

Generation: LLM synthesizes an answer explaining that 401 means invalid credentials, provides steps to regenerate the key, and cites "API Authentication Guide"

Validation: Checks that the response cites sources and addresses the error code

Response: "A 401 error means your API key is invalid. Here's how to fix it: [steps]. Source: API Authentication Guide"

Related Concepts

Building production RAG systems requires understanding embeddings for retrieval quality, prompt chaining for complex workflows, and fine-tuning to improve model performance on your domain.

LLM agents can use RAG as one of their tools, retrieving information when needed during autonomous task execution.

Frequently Asked Questions

What is the most common failure point in RAG systems?
Poor retrieval quality is the most common failure point. If the system retrieves irrelevant documents, even the best language model cannot generate a good answer. Focus on retrieval quality first.
How do you handle queries when no relevant documents are found?
Implement a confidence threshold. If retrieval scores fall below the threshold, return a message like "I could not find relevant information in our knowledge base" rather than generating an answer from irrelevant context.
Should you use multiple retrieval strategies?
Yes, hybrid retrieval combining semantic search with keyword search often performs better than either alone. Start with semantic search, then add keyword matching if needed.