How to Build a RAG System: Retrieval Augmented Generation From Scratch
Large language models are impressive, but they have a fatal flaw: they only know what was in their training data. Ask about your company's internal docs, last week's support tickets, or a niche research paper published yesterday, and the model will either hallucinate an answer or politely tell you it doesn't know.
Retrieval Augmented Generation — RAG — fixes this. Instead of relying solely on what the model memorized during training, a RAG system fetches relevant documents at query time and feeds them into the prompt alongside the user's question. The model generates its answer grounded in actual source material rather than guessing.
I've built several RAG systems, including the chatbot on this site that answers questions about my tools and articles. This guide covers everything I learned in the process — the architecture, the tricky parts, and the mistakes that cost me time.
1. What RAG Actually Is (and Isn't)
RAG was introduced in a 2020 paper by Lewis et al. at Facebook AI Research. The core idea is simple: before generating a response, retrieve relevant documents from an external knowledge base and include them in the context window.
Think of it like an open-book exam. The model doesn't need to memorize every fact — it just needs to know how to find and use the right reference material. This gives you three concrete benefits:
- Reduced hallucination. The model answers based on retrieved text, not vague recollections from training. When the source material is accurate, the output tends to be accurate too.
- Up-to-date information. Your knowledge base can be updated independently of the model. No retraining needed.
- Source attribution. Since you know which documents were retrieved, you can cite them. Users can verify the answer themselves.
What RAG is not: a magic fix for every LLM problem. If your question requires reasoning across hundreds of documents simultaneously, RAG's fixed context window becomes a bottleneck. If your documents are poorly written or contradictory, the model will faithfully reproduce that confusion. RAG amplifies the quality of your knowledge base — for better and worse.
2. The Two-Part Architecture: Retriever + Generator
Every RAG system has two components working in sequence:
The Retriever takes a user query, searches a knowledge base, and returns the most relevant documents (or document chunks). This is typically a vector similarity search, though keyword search and hybrid approaches work too.
The Generator is the language model itself. It receives the original query plus the retrieved documents and produces a grounded response. The generator doesn't search for information — it synthesizes what the retriever already found.
The data flow looks like this:
```
User Query
     ↓
[Embedding Model] → query vector
     ↓
[Vector Database] → top-k relevant chunks
     ↓
[Prompt Assembly] → "Given these documents: {chunks}, answer: {query}"
     ↓
[Language Model] → grounded response
     ↓
User sees answer + sources
```
Each step in this pipeline has choices and tradeoffs. Let's walk through them.
3. Embeddings: Turning Text Into Vectors
Embeddings are numerical representations of text — arrays of floating-point numbers (typically 384 to 3072 dimensions) that capture semantic meaning. Two pieces of text about the same topic will have similar embedding vectors, even if they use completely different words.
This is what makes semantic search possible. Instead of matching keywords, you compare the meaning of the query against the meaning of every document in your knowledge base.
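The comparison itself is usually cosine similarity: the cosine of the angle between two vectors, where 1.0 means "pointing the same direction" and 0.0 means "unrelated." Here's a minimal sketch with toy 3-dimensional vectors standing in for real embeddings (which have hundreds or thousands of dimensions):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 = same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors for illustration — real embedding models produce these for you.
query_vec = [0.9, 0.1, 0.0]
doc_same_topic = [0.8, 0.2, 0.1]
doc_other_topic = [0.0, 0.1, 0.9]

cosine_similarity(query_vec, doc_same_topic)   # high, close to 1
cosine_similarity(query_vec, doc_other_topic)  # low, close to 0
```

"Search" is then just computing this score between the query vector and every stored chunk vector and keeping the highest scorers.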
Picking an embedding model
You have several solid options in 2026:
- OpenAI text-embedding-3-small: Good balance of quality and cost. 1536 dimensions. Works well for most use cases.
- OpenAI text-embedding-3-large: Higher quality, 3072 dimensions. Use when accuracy matters more than speed or cost.
- Cohere embed-v3: Strong multilingual support. Good if your documents aren't all in English.
- Open-source (e5-large-v2, BGE, GTE): Free, run locally, no API dependency. Quality is competitive with commercial options for many tasks.
One rule that will save you headaches: use the same embedding model for indexing and querying. Embeddings from different models are not compatible. If you index your documents with text-embedding-3-small, you must query with text-embedding-3-small. Mixing models produces meaningless similarity scores.
If you're looking for a cost-effective way to access these models, the OpenRouter free API guide covers how to route requests through free and low-cost model endpoints.
4. Choosing a Vector Database
Once you have embeddings, you need somewhere to store and search them. A vector database is optimized for nearest-neighbor search across high-dimensional vectors — something traditional databases handle poorly.
Options by scale
Small scale (under 100K documents):
- ChromaDB: Runs in-process with Python. Zero infrastructure. Perfect for prototyping and small production systems.
- SQLite + sqlite-vss: If you already use SQLite, this extension adds vector search without a separate service.
- FAISS: Facebook's library. Fast and battle-tested, but it's a library, not a database — no built-in persistence or filtering.
Medium scale (100K to 10M documents):
- Qdrant: Rust-based, fast, with good filtering support. My recommendation for most production use cases.
- Weaviate: Full-featured, supports hybrid search (vector + keyword) out of the box.
- Pinecone: Fully managed. You trade control for convenience — no infrastructure to manage, but you're locked into their platform.
Large scale (10M+ documents):
- Milvus: Distributed, handles billions of vectors. Significant operational overhead.
- Elasticsearch with vector search: If you already run Elasticsearch, adding vector search avoids a new system in your stack.
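Whatever you pick, the core operation is the same: score the query vector against every stored chunk and return the top k. Here's a brute-force sketch of that operation (function and data names are illustrative); a real vector database replaces the full scan with an approximate index like HNSW or IVF so it stays fast at millions of vectors:

```python
import heapq
import math

def top_k(query_vec: list[float], index: dict[str, list[float]], k: int = 3):
    """Exact nearest-neighbor search by scanning every stored vector.
    Fine for a few thousand chunks; slow at vector-database scale."""
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.sqrt(sum(x * x for x in a)) *
                      math.sqrt(sum(x * x for x in b)))

    scored = ((cosine(query_vec, vec), chunk_id)
              for chunk_id, vec in index.items())
    return heapq.nlargest(k, scored)  # list of (score, chunk_id), best first

index = {
    "chunk_a": [0.9, 0.1],
    "chunk_b": [0.1, 0.9],
    "chunk_c": [0.8, 0.3],
}
results = top_k([1.0, 0.0], index, k=2)  # chunk_a and chunk_c score highest
```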
5. Chunking Strategies (Where Most People Go Wrong)
This is the part that will make or break your RAG system. Chunking is how you split your documents into smaller pieces for indexing. Get it wrong, and even a perfect retriever will return garbage.
The fundamental tension: chunks need to be small enough to be specific (so retrieval is precise) but large enough to be self-contained (so the model has enough context to generate a useful answer).
Common chunking approaches
Fixed-size chunking: Split text every N tokens (typically 256-512). Simple to implement. Works surprisingly well as a baseline. The problem: chunks often split mid-sentence or mid-paragraph, breaking context.
Recursive character splitting: Try to split on paragraph boundaries first, then sentences, then words. This preserves natural text boundaries. LangChain's RecursiveCharacterTextSplitter uses this approach.
Semantic chunking: Use embeddings to detect topic shifts within a document, then split at those boundaries. More expensive to compute, but produces chunks that are topically coherent.
Document-structure-aware chunking: Use headings, section breaks, or HTML structure to define chunk boundaries. If your documents have clear structure (technical docs, wikis, articles), this is usually the best approach.
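As a concrete baseline, here's a sketch of fixed-size chunking with overlap. It splits on whitespace "tokens" as a rough stand-in for a real tokenizer (in practice you'd count tokens with something like tiktoken):

```python
def chunk_text(text: str, chunk_size: int = 400, overlap: int = 50) -> list[str]:
    """Fixed-size chunking with overlap, counting whitespace-separated
    words as a crude proxy for tokens."""
    assert chunk_size > overlap, "overlap must be smaller than chunk size"
    words = text.split()
    if not words:
        return []
    step = chunk_size - overlap  # each chunk starts `step` words after the last
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this chunk already reached the end of the document
    return chunks
```

The overlap means the last 50 words of one chunk reappear as the first 50 of the next, so a sentence straddling the boundary survives intact in at least one chunk.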
What I've found works in practice
After building several RAG systems, here's what I keep coming back to:
- Chunk size of 400-600 tokens for most text. Smaller chunks (200-300) for FAQ-style content where each entry is self-contained.
- Overlap of 50-100 tokens between adjacent chunks. This prevents information from falling into the gap between two chunks.
- Preserve metadata. Every chunk should carry its source document title, section heading, URL, and position. You'll need this for citations and debugging.
- Prepend context. Add the document title and section heading to the start of each chunk. A chunk that says "To configure this, set the timeout parameter to 30" is useless without knowing what "this" refers to. A chunk that starts with "nginx Configuration > Proxy Settings: To configure this..." is self-contained.
Here's what such a chunk looks like:

```
Source: nginx Documentation > Proxy Settings > Timeouts

To configure upstream timeouts, set the
proxy_read_timeout directive in your location
block. The default is 60 seconds. For long-running
API calls, increase this to 300s:

    location /api/ {
        proxy_pass http://backend;
        proxy_read_timeout 300s;
    }
```
That fourth point — prepending context — is the single biggest improvement I made to my own RAG pipeline. It's cheap to implement and dramatically improves retrieval quality because the embedding now captures what the chunk is actually about, not just what it says in isolation.
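The prepending step itself is a few lines. This sketch (the `build_chunk` name and dict shape are illustrative, not from any particular library) also carries the metadata every chunk should keep for citations:

```python
def build_chunk(title: str, heading: str, body: str, url: str = "") -> dict:
    """Prepend document title and section heading so the chunk — and
    therefore its embedding — is self-contained. Metadata rides along
    separately for citations and debugging."""
    return {
        "text": f"{title} > {heading}: {body}",
        "metadata": {"title": title, "heading": heading, "url": url},
    }

chunk = build_chunk(
    "nginx Configuration",
    "Proxy Settings",
    "To configure this, set the timeout parameter to 30.",
)
# chunk["text"] now starts with "nginx Configuration > Proxy Settings: ..."
```

Embed `chunk["text"]`, store `chunk["metadata"]` alongside the vector.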
6. Building the Retrieval Pipeline
Basic retrieval is straightforward: embed the query, find the top-k nearest chunks, return them. But basic retrieval often isn't good enough. Here are the techniques that actually move the needle:
Hybrid search
Combine vector similarity with keyword matching. Vector search handles semantic similarity ("how do I fix a slow database" matches "query optimization techniques"), while keyword search catches exact terms the embedding model might miss (error codes, product names, configuration flags).
Most production RAG systems use hybrid search. Weaviate and Elasticsearch support it natively. For others, run both searches and merge the results using Reciprocal Rank Fusion (RRF).
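RRF itself is simple enough to sketch in a few lines: each document earns `1/(k + rank)` from every result list it appears in, and documents ranked well by both searches float to the top. The constant `k = 60` is the value from the original RRF paper:

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked result lists into one. A document scores
    1/(k + rank) for each list it appears in; higher total wins."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_results = ["doc_a", "doc_b", "doc_c"]
keyword_results = ["doc_b", "doc_d", "doc_a"]
merged = reciprocal_rank_fusion([vector_results, keyword_results])
# doc_b wins: it ranked high in both lists
```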
Query transformation
Sometimes the user's query is too vague or too short for good retrieval. Techniques that help:
- Query expansion: Use the LLM to rewrite the query into a more detailed version before searching. "Why is my app slow?" becomes "What are common causes of application performance degradation including database queries, memory leaks, and network latency?"
- HyDE (Hypothetical Document Embeddings): Ask the LLM to generate a hypothetical answer, then use that answer's embedding to search. This works because the hypothetical answer is closer in embedding space to the real documents than the short query is.
- Multi-query: Generate 3-5 different phrasings of the same question, search with each, and combine the results. This catches documents that would be missed by any single phrasing.
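The multi-query pattern can be sketched as follows. Here `rewrite_query` stands in for an LLM call that generates alternative phrasings, and `search` for your vector-store lookup — both are placeholders you'd wire to real components:

```python
def multi_query_retrieve(query: str, rewrite_query, search) -> list[str]:
    """Search with the original query plus LLM-generated rephrasings,
    then union the results, preserving first-seen order."""
    phrasings = [query] + rewrite_query(query)
    seen: set[str] = set()
    merged: list[str] = []
    for phrasing in phrasings:
        for doc_id in search(phrasing):
            if doc_id not in seen:  # deduplicate across phrasings
                seen.add(doc_id)
                merged.append(doc_id)
    return merged
```

In practice you'd feed the merged candidates into RRF or a reranker rather than using raw first-seen order.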
For more advanced reasoning patterns that complement retrieval, check out my Graph of Thought implementation — it handles cases where you need to synthesize information from multiple retrieved chunks in a structured way.
Reranking
After initial retrieval, run a reranker model (like Cohere Rerank or a cross-encoder) to re-score the results. Embedding similarity is fast but approximate. A reranker reads the full query-document pair together and produces a more accurate relevance score. This typically improves precision by 10-20% at the cost of added latency.
7. The Generation Step
Once you have your retrieved chunks, you need to assemble a prompt and call the language model. This part is more about prompt engineering than infrastructure.
Prompt structure that works
```
You are a helpful assistant. Answer the user's
question based ONLY on the provided context. If the
context doesn't contain enough information to answer,
say so — do not make up information.

Context:
---
{chunk_1}
---
{chunk_2}
---
{chunk_3}

Question: {user_query}
Answer:
```
Key decisions in the generation step:
- How many chunks to include: 3-5 is typical. More chunks mean more context but also more noise and higher token cost. Diminishing returns kick in fast — chunk #8 is rarely as relevant as chunk #1.
- Ordering: Put the most relevant chunks first. Models pay more attention to the beginning and end of the context.
- Citation instructions: Tell the model to cite which chunk it used. This makes hallucination detectable — if the model cites chunk 2 but chunk 2 doesn't support the claim, you've caught a problem.
- Fallback behavior: Explicitly tell the model what to do when the retrieved context doesn't answer the question. Without this instruction, many models will try to answer anyway using their training data, which defeats the purpose of RAG.
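Putting those decisions together, here's a sketch of a prompt-assembly function. It numbers the chunks so the model can cite them, which is one way to implement the citation instruction above (the exact wording is mine, not canonical):

```python
def build_prompt(chunks: list[str], query: str) -> str:
    """Assemble the generation prompt: grounding instruction, numbered
    context blocks for citation, fallback behavior, then the question."""
    context = "\n---\n".join(f"[{i}] {c}" for i, c in enumerate(chunks, 1))
    return (
        "You are a helpful assistant. Answer the user's question based ONLY "
        "on the provided context, citing chunk numbers like [1]. If the "
        "context doesn't contain enough information to answer, say so — "
        "do not make up information.\n\n"
        f"Context:\n---\n{context}\n---\n\n"
        f"Question: {query}\n"
        "Answer:"
    )

prompt = build_prompt(
    ["Timeouts default to 60s.", "Use proxy_read_timeout for upstreams."],
    "How do I raise the upstream timeout?",
)
```

Pass `chunks` in relevance order so the strongest evidence sits at the top of the context.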
8. Evaluating Your RAG System
You can't improve what you don't measure. RAG evaluation has two separate dimensions: retrieval quality and generation quality.
Retrieval metrics
- Recall@k: Of all the relevant documents in your knowledge base, how many appeared in the top-k results? Low recall means your retriever is missing important information.
- Precision@k: Of the k documents returned, how many were actually relevant? Low precision means your retriever is returning noise.
- MRR (Mean Reciprocal Rank): How high does the first relevant result appear? An MRR of 1.0 means the top result is always relevant. Useful for systems where users mostly look at the first answer.
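All three retrieval metrics are a few lines each once you have labeled data (which documents are relevant for which query). A minimal implementation:

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of all relevant docs that made it into the top k."""
    return len(set(retrieved[:k]) & relevant) / len(relevant)

def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top k that is actually relevant."""
    return len(set(retrieved[:k]) & relevant) / k

def mrr(all_retrieved: list[list[str]], all_relevant: list[set[str]]) -> float:
    """Mean reciprocal rank of the first relevant hit, averaged over queries.
    Queries with no relevant hit contribute 0."""
    total = 0.0
    for retrieved, relevant in zip(all_retrieved, all_relevant):
        for rank, doc_id in enumerate(retrieved, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / len(all_retrieved)
```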
Generation metrics
- Faithfulness: Does the generated answer actually reflect what the retrieved documents say? This catches hallucination — the model producing claims that aren't supported by the context.
- Answer relevance: Does the answer address the question that was asked? A faithful answer to the wrong question is still a bad answer.
- Completeness: Did the answer cover all the relevant information from the retrieved chunks, or did it cherry-pick?
Building an evaluation set
Create a test set of 50-100 question-answer pairs where you know the correct answer and which documents contain it. This sounds tedious — and it is — but it's the only way to make reliable improvements. Without a test set, you're tuning parameters by vibes.
Tools like RAGAS, DeepEval, and TruLens can automate parts of this evaluation. They use an LLM to score faithfulness and relevance, which isn't perfect but catches the obvious failures.
9. Production Considerations
Getting a RAG prototype working takes a weekend. Making it reliable in production takes considerably longer. Here's what surprised me:
Indexing pipeline
Your documents will change. You need an indexing pipeline that can incrementally update your vector store — adding new documents, updating modified ones, and removing deleted ones. Batch re-indexing works for small knowledge bases but becomes slow and expensive at scale.
Caching
Identical or near-identical queries don't need to hit the full pipeline every time. Cache at two levels: embedding cache (same query string = same embedding vector) and result cache (same query + same knowledge base version = same results).
Monitoring
Log every query, the retrieved chunks, and the generated response. You need this data to debug quality issues and identify gaps in your knowledge base. When users ask questions that get poor answers, those queries are your best signal for what to add or improve.
Cost management
RAG is more expensive per query than a standalone LLM call because you're paying for embedding the query, vector search, and a longer prompt (more input tokens). For high-traffic systems, the cost difference is significant. Batch your embedding calls, use smaller models for retrieval, and reserve larger models for generation.
Where to Go From Here
If you've read this far, you have enough knowledge to build a working RAG system. Here's a reasonable order of operations:
- Pick a small document set (10-50 documents). Start with something you know well so you can evaluate quality by reading the outputs.
- Use ChromaDB and a free embedding model. Don't overthink the infrastructure.
- Implement basic chunking with overlap and context prepending.
- Build the pipeline end-to-end. Get a question in, get an answer out.
- Create 20 test questions with known answers. Measure retrieval recall and answer faithfulness.
- Iterate on chunking strategy and retrieval parameters until your metrics improve.
- Then — and only then — consider hybrid search, reranking, and query transformation.
The biggest mistake I see is people starting with the advanced techniques before the basics work. Get naive RAG working first. Measure it. Then add complexity where the measurements tell you it's needed.
Want to see RAG in action? Try the Andy AI Chat — it uses retrieval augmented generation to answer questions about all the tools and articles on this site. Or check out the OpenRouter free API guide for cost-effective ways to power the generator side of your RAG pipeline.
— Andy