helloandy.net Guides

How to Build a RAG System: Retrieval Augmented Generation From Scratch

By Andy · March 14, 2026 · 12 min read

Large language models are impressive, but they have a fatal flaw: they only know what was in their training data. Ask about your company's internal docs, last week's support tickets, or a niche research paper published yesterday, and the model will either hallucinate an answer or politely tell you it doesn't know.

Retrieval Augmented Generation — RAG — fixes this. Instead of relying solely on what the model memorized during training, a RAG system fetches relevant documents at query time and feeds them into the prompt alongside the user's question. The model generates its answer grounded in actual source material rather than guessing.

I've built several RAG systems, including the chatbot on this site that answers questions about my tools and articles. This guide covers everything I learned in the process — the architecture, the tricky parts, and the mistakes that cost me time.

1. What RAG Actually Is (and Isn't)

RAG was introduced in a 2020 paper by Lewis et al. at Facebook AI Research. The core idea is simple: before generating a response, retrieve relevant documents from an external knowledge base and include them in the context window.

Think of it like an open-book exam. The model doesn't need to memorize every fact — it just needs to know how to find and use the right reference material. This gives you three concrete benefits:

  1. Fresh knowledge. Update the knowledge base and the answers update with it, with no retraining required.
  2. Fewer hallucinations. The model answers from retrieved text instead of guessing from memory.
  3. Traceability. Every answer can cite the source documents it was grounded in.

What RAG is not: a magic fix for every LLM problem. If your question requires reasoning across hundreds of documents simultaneously, RAG's fixed context window becomes a bottleneck. If your documents are poorly written or contradictory, the model will faithfully reproduce that confusion. RAG amplifies the quality of your knowledge base — for better and worse.

2. The Two-Part Architecture: Retriever + Generator

Every RAG system has two components working in sequence:

The Retriever takes a user query, searches a knowledge base, and returns the most relevant documents (or document chunks). This is typically a vector similarity search, though keyword search and hybrid approaches work too.

The Generator is the language model itself. It receives the original query plus the retrieved documents and produces a grounded response. The generator doesn't search for information — it synthesizes what the retriever already found.

The data flow looks like this:

RAG Pipeline:

    User Query
        ↓
    [Embedding Model] → query vector
        ↓
    [Vector Database] → top-k relevant chunks
        ↓
    [Prompt Assembly] → "Given these documents: {chunks}, answer: {query}"
        ↓
    [Language Model] → grounded response
        ↓
    User sees answer + sources

Each step in this pipeline has choices and tradeoffs. Let's walk through them.

3. Embeddings: Turning Text Into Vectors

Embeddings are numerical representations of text — arrays of floating-point numbers (typically 384 to 3072 dimensions) that capture semantic meaning. Two pieces of text about the same topic will have similar embedding vectors, even if they use completely different words.

This is what makes semantic search possible. Instead of matching keywords, you compare the meaning of the query against the meaning of every document in your knowledge base.
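The comparison step itself is nothing exotic: cosine similarity between vectors. A toy sketch with hand-made 4-dimensional vectors standing in for real embeddings (which run hundreds to thousands of dimensions):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 4-dimensional "embeddings"; real models emit 384-3072 dimensions.
query     = [0.9, 0.1, 0.0, 0.2]
on_topic  = [0.8, 0.2, 0.1, 0.3]  # similar meaning, similar direction
off_topic = [0.0, 0.1, 0.9, 0.0]

print(cosine_similarity(query, on_topic))   # ~0.98 (related)
print(cosine_similarity(query, off_topic))  # ~0.01 (unrelated)
```

The same comparison runs inside every vector database; the database's job is making it fast across millions of vectors.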

Picking an embedding model

You have several solid options in 2026:

  - Hosted APIs: OpenAI's text-embedding-3-small (cheap, 1536 dimensions) and text-embedding-3-large (3072 dimensions), or Cohere's embed models. No infrastructure to run, pay per token.
  - Open-source via sentence-transformers: all-MiniLM-L6-v2 (384 dimensions, fast) as a baseline, or the bge and gte model families for stronger retrieval quality. Free to run locally.

One rule that will save you headaches: use the same embedding model for indexing and querying. Embeddings from different models are not compatible. If you index your documents with text-embedding-3-small, you must query with text-embedding-3-small. Mixing models produces meaningless similarity scores.
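A cheap way to enforce this rule is to store the model name alongside the index and check it at query time. A minimal sketch, using a hypothetical metadata dict of my own shape:

```python
def assert_same_embedding_model(index_meta: dict, query_model: str) -> None:
    """Refuse to search when the query-time embedding model differs from
    the one the index was built with; mismatched vectors compare meaninglessly."""
    indexed = index_meta["embedding_model"]
    if indexed != query_model:
        raise ValueError(
            f"Index was built with {indexed!r} but query uses {query_model!r}; "
            "re-index or switch the query model."
        )

index_meta = {"embedding_model": "text-embedding-3-small", "dims": 1536}
assert_same_embedding_model(index_meta, "text-embedding-3-small")   # fine
# assert_same_embedding_model(index_meta, "text-embedding-3-large") # raises ValueError
```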

If you're looking for a cost-effective way to access these models, the OpenRouter free API guide covers how to route requests through free and low-cost model endpoints.

4. Choosing a Vector Database

Once you have embeddings, you need somewhere to store and search them. A vector database is optimized for nearest-neighbor search across high-dimensional vectors — something traditional databases handle poorly.

Options by scale

Small scale (under 100K documents): ChromaDB or FAISS. Both run in-process with nothing to operate, and search at this size is plenty fast.

Medium scale (100K to 10M documents): Qdrant, Weaviate, or pgvector if you're already on Postgres. Dedicated servers (or an extension) with metadata filtering, persistence, and replication.

Large scale (10M+ documents): Milvus, or a managed service like Pinecone. At this scale you're tuning index types (HNSW, IVF) and sharding, and operational maturity matters more than feature lists.

Practical advice: Start with ChromaDB or FAISS. Get your pipeline working end-to-end before thinking about scaling. I've seen teams spend weeks evaluating vector databases before they had a single working prototype. Don't do that.
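To demystify what these tools do, here is a brute-force sketch of the core operation: the same top-k similarity search a flat FAISS index performs, minus every optimization. This is not any real database's API, just the idea:

```python
import math
from typing import List, Tuple

class TinyVectorStore:
    """Brute-force nearest-neighbor store: what a vector DB does,
    minus the indexing that makes it fast at scale."""

    def __init__(self) -> None:
        self._items: List[Tuple[str, List[float]]] = []

    def add(self, doc_id: str, vector: List[float]) -> None:
        self._items.append((doc_id, vector))

    def query(self, vector: List[float], k: int = 3) -> List[str]:
        # Score every stored vector against the query, keep the top k.
        scored = [(self._cosine(vector, v), doc_id) for doc_id, v in self._items]
        scored.sort(reverse=True)
        return [doc_id for _, doc_id in scored[:k]]

    @staticmethod
    def _cosine(a: List[float], b: List[float]) -> float:
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

store = TinyVectorStore()
store.add("nginx-timeouts", [0.9, 0.1, 0.0])
store.add("dns-setup",      [0.1, 0.8, 0.2])
store.add("tls-certs",      [0.0, 0.2, 0.9])
print(store.query([0.8, 0.2, 0.1], k=2))  # ['nginx-timeouts', 'dns-setup']
```

Real databases replace the linear scan with approximate indexes (HNSW graphs, inverted files) so the query stays fast at millions of vectors.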

5. Chunking Strategies (Where Most People Go Wrong)

This is the part that will make or break your RAG system. Chunking is how you split your documents into smaller pieces for indexing. Get it wrong, and even a perfect retriever will return garbage.

The fundamental tension: chunks need to be small enough to be specific (so retrieval is precise) but large enough to be self-contained (so the model has enough context to generate a useful answer).

Common chunking approaches

Fixed-size chunking: Split text every N tokens (typically 256-512). Simple to implement. Works surprisingly well as a baseline. The problem: chunks often split mid-sentence or mid-paragraph, breaking context.

Recursive character splitting: Try to split on paragraph boundaries first, then sentences, then words. This preserves natural text boundaries. LangChain's RecursiveCharacterTextSplitter uses this approach.

Semantic chunking: Use embeddings to detect topic shifts within a document, then split at those boundaries. More expensive to compute, but produces chunks that are topically coherent.

Document-structure-aware chunking: Use headings, section breaks, or HTML structure to define chunk boundaries. If your documents have clear structure (technical docs, wikis, articles), this is usually the best approach.

What I've found works in practice

After building several RAG systems, here's what I keep coming back to:

  1. Chunk size of 400-600 tokens for most text. Smaller chunks (200-300) for FAQ-style content where each entry is self-contained.
  2. Overlap of 50-100 tokens between adjacent chunks. This prevents information from falling into the gap between two chunks.
  3. Preserve metadata. Every chunk should carry its source document title, section heading, URL, and position. You'll need this for citations and debugging.
  4. Prepend context. Add the document title and section heading to the start of each chunk. A chunk that says "To configure this, set the timeout parameter to 30" is useless without knowing what "this" refers to. A chunk that starts with "nginx Configuration > Proxy Settings: To configure this..." is self-contained.
Chunk with prepended context:

    Source: nginx Documentation > Proxy Settings > Timeouts

    To configure upstream timeouts, set the proxy_read_timeout directive
    in your location block. The default is 60 seconds. For long-running
    API calls, increase this to 300s:

        location /api/ {
            proxy_pass http://backend;
            proxy_read_timeout 300s;
        }

That fourth point — prepending context — is the single biggest improvement I made to my own RAG pipeline. It's cheap to implement and dramatically improves retrieval quality because the embedding now captures what the chunk is actually about, not just what it says in isolation.
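Points 1, 2, and 4 above fit in one small function. A sketch using whitespace tokens as a stand-in for a real tokenizer (the helper name and chunk shape are my own, not from any library):

```python
def chunk_with_context(text: str, title: str, section: str,
                       chunk_size: int = 500, overlap: int = 75) -> list[dict]:
    """Fixed-size chunking with overlap, measured in whitespace tokens,
    plus prepended context and metadata on every chunk.
    Assumes chunk_size > overlap."""
    tokens = text.split()
    step = chunk_size - overlap  # each chunk starts `overlap` tokens early
    chunks = []
    for i, start in enumerate(range(0, len(tokens), step)):
        body = " ".join(tokens[start:start + chunk_size])
        chunks.append({
            # Prepended context: the embedding now captures what this is about.
            "text": f"{title} > {section}: {body}",
            "metadata": {"title": title, "section": section, "position": i},
        })
        if start + chunk_size >= len(tokens):
            break
    return chunks
```

For a 1000-token document with these defaults you get three chunks, and any fact near a boundary appears in two of them, so it can't fall into the gap.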

6. Building the Retrieval Pipeline

Basic retrieval is straightforward: embed the query, find the top-k nearest chunks, return them. But basic retrieval often isn't good enough. Here are the techniques that actually move the needle:

Hybrid search

Combine vector similarity with keyword matching. Vector search handles semantic similarity ("how do I fix a slow database" matches "query optimization techniques"), while keyword search catches exact terms the embedding model might miss (error codes, product names, configuration flags).

Most production RAG systems use hybrid search. Weaviate and Elasticsearch support it natively. For others, run both searches and merge the results using Reciprocal Rank Fusion (RRF).
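RRF itself is only a few lines: each document earns 1/(k + rank) from every result list it appears in, and the constant k=60 comes from the original RRF paper and works well untuned. A sketch:

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked result lists into one. Documents that rank
    well in multiple lists accumulate the highest scores."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_results  = ["doc-a", "doc-b", "doc-c"]
keyword_results = ["doc-c", "doc-a", "doc-d"]
print(reciprocal_rank_fusion([vector_results, keyword_results]))
# ['doc-a', 'doc-c', 'doc-b', 'doc-d']
```

Documents found by both searches (doc-a, doc-c) beat documents found by only one, which is exactly the behavior you want from hybrid search.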

Query transformation

Sometimes the user's query is too vague or too short for good retrieval. Techniques that help:

  - Query rewriting: have an LLM expand a terse query into a fuller, more specific one (or several variants), then retrieve with the rewrite.
  - HyDE (Hypothetical Document Embeddings): generate a hypothetical answer to the query and embed that instead of the question; answers tend to sit closer to real documents in embedding space than questions do.
  - Query decomposition: split a multi-part question into sub-queries, retrieve for each, and merge the results.

For more advanced reasoning patterns that complement retrieval, check out my Graph of Thought implementation — it handles cases where you need to synthesize information from multiple retrieved chunks in a structured way.

Reranking

After initial retrieval, run a reranker model (like Cohere Rerank or a cross-encoder) to re-score the results. Embedding similarity is fast but approximate. A reranker reads the full query-document pair together and produces a more accurate relevance score. This typically improves precision by 10-20% at the cost of added latency.

7. The Generation Step

Once you have your retrieved chunks, you need to assemble a prompt and call the language model. This part is more about prompt engineering than infrastructure.

Prompt structure that works

RAG prompt template:

    You are a helpful assistant. Answer the user's question based ONLY on
    the provided context. If the context doesn't contain enough information
    to answer, say so — do not make up information.

    Context:
    ---
    {chunk_1}
    ---
    {chunk_2}
    ---
    {chunk_3}

    Question: {user_query}

    Answer:
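Filling that template is a small function. A sketch assuming chunks arrive as dicts with "text" and "metadata" keys (my own shape, matching the chunking section), already sorted by relevance:

```python
def build_rag_prompt(query: str, chunks: list[dict], max_chunks: int = 3) -> str:
    """Assemble the final prompt: instructions, separated context blocks
    with source labels, then the question."""
    selected = chunks[:max_chunks]  # chunks assumed pre-sorted by relevance
    context = "\n---\n".join(
        f"[{c['metadata']['title']}]\n{c['text']}" for c in selected
    )
    return (
        "You are a helpful assistant. Answer the user's question based ONLY "
        "on the provided context. If the context doesn't contain enough "
        "information to answer, say so - do not make up information.\n\n"
        f"Context:\n---\n{context}\n---\n\n"
        f"Question: {query}\n\nAnswer:"
    )
```

Labeling each chunk with its source title lets the model cite sources in its answer, which you'll want for the "answer + sources" display at the end of the pipeline.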

Key decisions in the generation step:

  - How many chunks to include. More context helps up to a point, then adds noise, cost, and "lost in the middle" effects where mid-prompt chunks get ignored.
  - Chunk ordering. Put the strongest matches first (or first and last) rather than trusting the model to weigh everything equally.
  - Temperature. Keep it low (0 to 0.3) for factual Q&A; you want faithful synthesis, not creativity.
  - Refusal behavior. Tell the model explicitly to say it doesn't know when the context doesn't cover the question, and verify that it actually does.

8. Evaluating Your RAG System

You can't improve what you don't measure. RAG evaluation has two separate dimensions: retrieval quality and generation quality.

Retrieval metrics measure whether the right chunks come back:

  - Recall@k: for what fraction of queries does at least one answer-bearing chunk appear in the top k results?
  - Precision@k: of the top k results, what fraction are actually relevant?
  - MRR (Mean Reciprocal Rank): how high does the first relevant chunk rank, averaged across queries?

Generation metrics measure what the model does with those chunks:

  - Faithfulness: is every claim in the answer supported by the retrieved context?
  - Answer relevance: does the response actually address the question asked?
  - Citation accuracy: when the answer cites a source, does that source say what the answer claims?

Building an evaluation set

Create a test set of 50-100 question-answer pairs where you know the correct answer and which documents contain it. This sounds tedious — and it is — but it's the only way to make reliable improvements. Without a test set, you're tuning parameters by vibes.
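Once the test set exists, retrieval recall is simple to compute. A sketch, where `retrieve` stands in for your pipeline's retrieval function:

```python
def recall_at_k(test_set: list[dict], retrieve, k: int = 5) -> float:
    """Fraction of test questions where at least one gold chunk appears
    in the top-k results. `retrieve(question, k)` returns chunk ids."""
    hits = 0
    for case in test_set:
        retrieved = set(retrieve(case["question"], k))
        if retrieved & set(case["gold_chunk_ids"]):
            hits += 1
    return hits / len(test_set)

# Tiny illustration with a fake retriever:
test_set = [
    {"question": "q1", "gold_chunk_ids": ["c1"]},
    {"question": "q2", "gold_chunk_ids": ["c9"]},
]
fake_retrieve = lambda q, k: {"q1": ["c1", "c2"], "q2": ["c3", "c4"]}[q]
print(recall_at_k(test_set, fake_retrieve, k=2))  # 0.5
```

Run this after every chunking or retrieval change; a single number per run is what makes "iterate until metrics improve" possible.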

Tools like RAGAS, DeepEval, and TruLens can automate parts of this evaluation. They use an LLM to score faithfulness and relevance, which isn't perfect but catches the obvious failures.

From experience: The first version of my chatbot had decent retrieval but terrible generation because I wasn't measuring faithfulness. The model would retrieve the right document and then paraphrase it so aggressively that the meaning changed. Adding a faithfulness check to my evaluation pipeline caught this immediately.

9. Production Considerations

Getting a RAG prototype working takes a weekend. Making it reliable in production takes considerably longer. Here's what surprised me:

Indexing pipeline

Your documents will change. You need an indexing pipeline that can incrementally update your vector store — adding new documents, updating modified ones, and removing deleted ones. Batch re-indexing works for small knowledge bases but becomes slow and expensive at scale.

Caching

Identical or near-identical queries don't need to hit the full pipeline every time. Cache at two levels: embedding cache (same query string = same embedding vector) and result cache (same query + same knowledge base version = same results).
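A minimal embedding cache is a dictionary keyed on a hash of the normalized query; `embed_fn` below stands in for whatever calls your embedding model:

```python
import hashlib

class EmbeddingCache:
    """In-memory embedding cache keyed on a hash of the normalized query
    string. Production versions would use Redis and a TTL instead."""

    def __init__(self, embed_fn):
        self._embed_fn = embed_fn
        self._cache: dict[str, list[float]] = {}
        self.hits = 0
        self.misses = 0

    def embed(self, text: str) -> list[float]:
        # Normalizing before hashing makes trivial variants cache hits.
        key = hashlib.sha256(text.strip().lower().encode()).hexdigest()
        if key in self._cache:
            self.hits += 1
        else:
            self.misses += 1
            self._cache[key] = self._embed_fn(text)
        return self._cache[key]

cache = EmbeddingCache(embed_fn=lambda t: [float(len(t))])  # stand-in model
cache.embed("What is RAG?")
cache.embed("what is rag?")      # normalization makes this a cache hit
print(cache.hits, cache.misses)  # 1 1
```

The result cache works the same way, but its key must also include the knowledge base version so re-indexing invalidates stale answers.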

Monitoring

Log every query, the retrieved chunks, and the generated response. You need this data to debug quality issues and identify gaps in your knowledge base. When users ask questions that get poor answers, those queries are your best signal for what to add or improve.

Cost management

RAG is more expensive per query than a standalone LLM call because you're paying for embedding the query, vector search, and a longer prompt (more input tokens). For high-traffic systems, the cost difference is significant. Batch your embedding calls, use smaller models for retrieval, and reserve larger models for generation.
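Batching embedding calls is the easiest of these wins: embedding endpoints accept lists of texts, so one request can cover a whole batch. A sketch with a fake batch function:

```python
def embed_in_batches(texts: list[str], embed_batch, batch_size: int = 128) -> list:
    """One API call per batch instead of per text; batching cuts
    per-request overhead dramatically when indexing large corpora."""
    vectors = []
    for i in range(0, len(texts), batch_size):
        vectors.extend(embed_batch(texts[i:i + batch_size]))
    return vectors

calls = []  # record the size of each simulated API call
fake_embed_batch = lambda batch: (calls.append(len(batch)) or [[0.0]] * len(batch))
vecs = embed_in_batches([f"doc {i}" for i in range(300)], fake_embed_batch, batch_size=128)
print(len(vecs), calls)  # 300 [128, 128, 44]
```

Three requests instead of three hundred; the same pattern applies when re-indexing after document updates.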

Where to Go From Here

If you've read this far, you have enough knowledge to build a working RAG system. Here's a reasonable order of operations:

  1. Pick a small document set (10-50 documents). Start with something you know well so you can evaluate quality by reading the outputs.
  2. Use ChromaDB and a free embedding model. Don't overthink the infrastructure.
  3. Implement basic chunking with overlap and context prepending.
  4. Build the pipeline end-to-end. Get a question in, get an answer out.
  5. Create 20 test questions with known answers. Measure retrieval recall and answer faithfulness.
  6. Iterate on chunking strategy and retrieval parameters until your metrics improve.
  7. Then — and only then — consider hybrid search, reranking, and query transformation.

The biggest mistake I see is people starting with the advanced techniques before the basics work. Get naive RAG working first. Measure it. Then add complexity where the measurements tell you it's needed.

Want to see RAG in action? Try the Andy AI Chat — it uses retrieval augmented generation to answer questions about all the tools and articles on this site. Or check out the OpenRouter free API guide for cost-effective ways to power the generator side of your RAG pipeline.

— Andy