Large language models are remarkable at generating coherent text, reasoning through problems, and following complex instructions. But they have a fundamental limitation: they only know what was in their training data, and that data has a cutoff date. Ask a general-purpose LLM about your company's internal policies, your product documentation, or yesterday's support tickets, and you will get confident-sounding nonsense.
Retrieval-Augmented Generation (RAG) solves this by giving the model access to external knowledge at inference time. Instead of relying solely on parametric memory (the weights learned during training), a RAG system retrieves relevant documents from a knowledge base and includes them in the prompt context. The model then generates its response grounded in actual source material.
This pattern has become the backbone of enterprise AI applications in 2026. At Pepla, we have built RAG systems for clients across industries, from legal document analysis to customer support automation. This article breaks down the architecture, the decisions you will face, and the production considerations that separate a demo from a reliable system.
The Core RAG Pipeline
Every RAG system follows the same fundamental flow: ingest documents, chunk them into manageable pieces, generate vector embeddings for each chunk, store those embeddings in a vector database, and at query time, retrieve the most relevant chunks and feed them to the LLM alongside the user's question.
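The whole flow fits in a few dozen lines once you swap in toy components. The sketch below uses a bag-of-words count vector in place of a real embedding model and a plain Python list in place of a vector database; every function here is an illustrative stand-in, not production code.

```python
from collections import Counter
import math

def embed(text: str) -> Counter:
    # Stand-in for a real embedding model: a bag-of-words count vector.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# 1. Ingest and 2. chunk (here: one chunk per document line)
corpus = [
    "Overtime is paid at 1.5x the base rate.",
    "Vacation requests require two weeks notice.",
    "The VPN must be used on public networks.",
]

# 3. Embed each chunk and 4. store it alongside its vector
index = [(chunk, embed(chunk)) for chunk in corpus]

def retrieve(query: str, k: int = 2) -> list[str]:
    # 5. At query time, return the k most similar chunks.
    qv = embed(query)
    ranked = sorted(index, key=lambda item: cosine(qv, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

question = "What is the overtime pay rate?"
context = retrieve(question, k=1)
prompt = f"Context:\n{context[0]}\n\nQuestion: {question}"
```

A real system replaces `embed` with an embedding API call, the list with a vector database, and hands `prompt` to the LLM; the shape of the flow stays exactly the same.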
RAG layers -- embedding, retrieval, reranking, generation -- each require distinct tuning for production quality.
That description fits on a napkin. Making it work reliably in production is where the engineering lives.
Vector Embeddings: The Foundation
Embeddings are numerical representations of text that capture semantic meaning. Two sentences that mean similar things will have embeddings that are close together in vector space, even if they share no common words. This is what makes semantic search possible, and it is fundamentally different from keyword-based search.
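"Close together in vector space" almost always means cosine similarity: the cosine of the angle between two vectors, independent of their magnitudes. The three-dimensional vectors below are made up for illustration; real embedding models produce hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    # Cosine of the angle between two vectors: 1.0 means same direction,
    # 0.0 means orthogonal, regardless of vector length.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings: "dog" and "puppy" share no characters,
# but a good embedding model places them near each other.
dog = [0.9, 0.1, 0.0]
puppy = [0.8, 0.2, 0.1]
invoice = [0.0, 0.1, 0.9]
```

With these toy vectors, `cosine_similarity(dog, puppy)` is far higher than `cosine_similarity(dog, invoice)`, which is the property semantic search relies on.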
In 2026, the embedding model landscape has matured considerably. OpenAI's text-embedding-3-large remains a solid general-purpose choice. Cohere's Embed v4 models offer strong multilingual support. For teams that need to run embeddings on-premise, open-source models like those from the Nomic and BAAI families deliver competitive quality at zero API cost.
The choice of embedding model is one of the most consequential decisions in your RAG pipeline. It is also one of the hardest to change later, because re-embedding your entire corpus is expensive and disruptive.
When evaluating embedding models, test them against your actual data. Generic benchmarks like MTEB are useful starting points, but domain-specific performance can vary significantly. A model that excels at general web text may struggle with legal contracts or medical records.
Chunking Strategies: Getting the Granularity Right
Before you can embed documents, you need to break them into chunks. This is harder than it sounds. Chunk too large and your embeddings become diluted, losing the ability to match specific queries. Chunk too small and you lose context, returning fragments that do not make sense on their own.
The main strategies, in order of increasing sophistication:
- Fixed-size chunking splits text at regular token intervals (typically 256-512 tokens) with overlap between consecutive chunks. Simple, fast, and surprisingly effective as a baseline.
- Recursive character splitting tries to break at natural boundaries (paragraphs, then sentences, then words) while staying within a target size. This preserves semantic coherence better than fixed-size chunking.
- Semantic chunking uses the embedding model itself to detect topic shifts. When the cosine similarity between consecutive sentences drops below a threshold, a new chunk begins. This produces chunks that align with actual topic boundaries.
- Document-structure-aware chunking leverages headings, sections, tables, and other structural elements to define chunk boundaries. This is particularly valuable for well-structured documents like technical manuals, legal contracts, and API documentation.
In practice, we have found that recursive splitting with a 400-token target and 50-token overlap is a strong default. For highly structured documents, combining structure-aware splitting with semantic chunking produces the best retrieval quality.
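To make the recursive idea concrete, here is a simplified splitter. It measures chunks in characters rather than tokens and omits the overlap step, but it shows the core behaviour: prefer the coarsest boundary (paragraphs, then sentences, then words) that keeps chunks under the size limit, and only hard-split as a last resort.

```python
def recursive_split(text: str, max_len: int = 200,
                    separators: tuple[str, ...] = ("\n\n", ". ", " ")) -> list[str]:
    """Split text at the coarsest boundary that keeps chunks under max_len.

    Falls back to finer separators (paragraph -> sentence -> word) and
    finally to a hard character split. Character-based for simplicity;
    a production implementation would count tokens and add overlap.
    """
    if len(text) <= max_len:
        return [text] if text.strip() else []
    for sep in separators:
        parts = text.split(sep)
        if len(parts) > 1:
            chunks: list[str] = []
            current = ""
            for part in parts:
                piece = part + sep
                if current and len(current) + len(piece) > max_len:
                    chunks.extend(recursive_split(current.strip(), max_len, separators))
                    current = ""
                current += piece
            if current.strip():
                chunks.extend(recursive_split(current.strip(), max_len, separators))
            return chunks
    # No separator present at all: hard split by character count.
    return [text[i:i + max_len] for i in range(0, len(text), max_len)]
```

Libraries like LangChain ship a battle-tested version of this (its `RecursiveCharacterTextSplitter`); the sketch above is only meant to show why the recursion preserves semantic boundaries better than fixed-size splitting.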
Metadata Enrichment
Raw chunks are not enough. Each chunk should carry metadata: the source document, section heading, page number, document date, author, and any domain-specific tags. This metadata enables filtered retrieval (for example, only searching documents from a specific department or date range) and helps the LLM cite its sources accurately.
Chunking strategy has more impact on RAG quality than model choice -- get the granularity right.
The Retrieval Pipeline
When a user asks a question, the naive approach is to embed the query, find the top-K nearest chunks by cosine similarity, and pass them to the LLM. This works for simple cases but breaks down quickly in production.
Hybrid Search: The Best of Both Worlds
Pure vector search excels at semantic matching but can miss exact keyword matches. If a user asks about "Policy 4.2.1" or a specific product code, semantic similarity might not surface the right document. Conversely, keyword search (BM25) excels at exact matching but misses semantic relationships.
Hybrid search combines both approaches. Most modern vector databases, including Weaviate, Qdrant, and Pinecone, support hybrid search natively. The typical approach is to run both searches in parallel, normalise the scores, and combine them using Reciprocal Rank Fusion (RRF) or a weighted linear combination.
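Reciprocal Rank Fusion is simple enough to state in full: each document's fused score is the sum of 1/(k + rank) over every result list it appears in, with k conventionally set to 60. Documents that rank well in both lists float to the top.

```python
def reciprocal_rank_fusion(result_lists: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document ids into one ranking.

    Each document scores sum(1 / (k + rank)) across the lists it
    appears in, where rank is 1-based. k=60 is the usual default.
    """
    scores: dict[str, float] = {}
    for ranking in result_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Vector search and BM25 partially agree; RRF promotes the
# document that both rankings liked.
vector_hits = ["doc_a", "doc_b", "doc_c"]
bm25_hits = ["doc_b", "doc_d", "doc_a"]
fused = reciprocal_rank_fusion([vector_hits, bm25_hits])
```

Because RRF only looks at ranks, not raw scores, it sidesteps the score-normalisation problem entirely, which is why many teams prefer it over weighted linear combinations.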
In our experience at Pepla, hybrid search consistently outperforms either pure vector or pure keyword search by 15-25% on retrieval accuracy metrics. It is now our default recommendation for production RAG systems.
Reranking: The Second Pass
Initial retrieval casts a wide net, typically pulling 20-50 candidate chunks. A reranker then scores each candidate against the original query using a cross-encoder model, which is more accurate than the bi-encoder used for initial retrieval but too slow to run against the entire corpus.
Cross-encoder rerankers from Cohere (Rerank 3.5), Jina AI, and the open-source bge-reranker family have become standard components. The reranker reorders the candidates and the top 5-10 are passed to the LLM. This two-stage approach, fast initial retrieval followed by accurate reranking, is one of the highest-impact improvements you can make to a RAG system.
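The two-stage shape is easy to show in miniature. Both scorers below are crude lexical stand-ins (a real stage one is a bi-encoder over embeddings, and a real stage two is a cross-encoder model), but they capture the division of labour: stage one is recall-oriented and cheap, stage two is precision-oriented and only sees the shortlist.

```python
def retrieve_candidates(query: str, corpus: list[str], n: int = 10) -> list[str]:
    # Stage 1 stand-in for fast bi-encoder retrieval: recall-oriented,
    # scores by how many query terms a document mentions at all.
    q = set(query.lower().split())
    ranked = sorted(corpus,
                    key=lambda d: len(q & set(d.lower().split())),
                    reverse=True)
    return ranked[:n]

def rerank(query: str, candidates: list[str], top_k: int = 3) -> list[str]:
    # Stage 2 stand-in for a cross-encoder: precision-oriented, rewards
    # documents that are *about* the query, not ones that merely mention it.
    q = set(query.lower().split())
    def score(d: str) -> float:
        tokens = d.lower().split()
        return len(q & set(tokens)) / len(tokens)
    return sorted(candidates, key=score, reverse=True)[:top_k]

corpus = [
    "Overtime pay rate policy appendix history revisions misc footnotes",
    "Overtime pay rate",
    "Vacation scheduling and approval workflow",
]
shortlist = retrieve_candidates("overtime pay rate", corpus)
best = rerank("overtime pay rate", shortlist)
```

In production the only structural change is swapping the scoring functions: `retrieve_candidates` becomes a vector (or hybrid) query and `rerank` becomes a call to your reranker of choice.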
Query Transformation
Users rarely phrase their questions in ways that align perfectly with your document corpus. Query transformation techniques bridge this gap:
- Query rewriting uses the LLM to rephrase the user's question into a form more likely to match relevant documents. A conversational question like "What's the deal with overtime?" becomes "Company policy on overtime compensation and eligibility."
- Hypothetical Document Embeddings (HyDE) asks the LLM to generate a hypothetical answer to the query, then uses that answer's embedding for retrieval. This can significantly improve recall for complex questions.
- Multi-query expansion generates multiple variations of the original query, retrieves results for each, and merges the results. This captures different facets of ambiguous questions.
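The merge step of multi-query expansion is the part people get wrong, so here it is in isolation. The query variants and the index are hardcoded stand-ins; a real system would ask the LLM for the variants and point `search_fn` at the vector index.

```python
def multi_query_retrieve(variants: list[str], search_fn, k: int = 3) -> list[str]:
    """Run retrieval once per query variant, then merge round-robin,
    deduplicating so each document appears once at its best position."""
    rankings = [search_fn(v) for v in variants]
    merged: list[str] = []
    seen: set[str] = set()
    for rank in range(max(map(len, rankings), default=0)):
        for ranking in rankings:
            if rank < len(ranking) and ranking[rank] not in seen:
                seen.add(ranking[rank])
                merged.append(ranking[rank])
    return merged[:k]

# Hypothetical variants and hits, purely for illustration.
variants = ["overtime policy", "overtime compensation rules", "extra hours pay"]
fake_index = {
    "overtime policy": ["doc1", "doc2"],
    "overtime compensation rules": ["doc3", "doc1"],
    "extra hours pay": ["doc4"],
}
results = multi_query_retrieve(variants, lambda q: fake_index[q], k=4)
```

Round-robin interleaving ensures each variant contributes its best hits before any variant contributes its second-best, which keeps one verbose variant from dominating the merged list. RRF is an equally valid merge strategy here.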
RAG beats fine-tuning when your knowledge base changes frequently or answers must be auditable.
When RAG Beats Fine-Tuning
Fine-tuning and RAG address different problems, and understanding when to use each saves significant time and money.
RAG is the right choice when:
- Your knowledge base changes frequently (daily or weekly updates)
- You need the model to cite specific sources and provide traceable answers
- The knowledge is factual and document-based rather than stylistic
- You need to control access to different knowledge bases per user or role
Fine-tuning is the right choice when:
- You need to teach the model a specific output format or communication style
- The knowledge is relatively stable and does not change frequently
- You need the model to internalise domain-specific reasoning patterns
- Inference latency is critical and you cannot afford the retrieval step
In many production systems, the answer is both. Fine-tune for style and reasoning patterns, then use RAG for factual grounding. This layered approach gives you the best of both worlds.
Production Considerations
Evaluation and Monitoring
You cannot improve what you cannot measure. A production RAG system needs evaluation at multiple levels:
- Retrieval quality: Are the right documents being retrieved? Measure recall, precision, and Mean Reciprocal Rank (MRR) against a labelled test set.
- Answer quality: Is the LLM generating accurate, complete, and well-grounded responses? Automated evaluation using LLM-as-judge frameworks (like RAGAS) provides scalable quality signals.
- Faithfulness: Is the model actually using the retrieved context, or hallucinating? Faithfulness metrics detect when the answer contains claims not supported by the provided documents.
- Latency: End-to-end response time matters. Measure retrieval latency, reranking latency, and LLM generation latency separately so you know where to optimise.
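Of these metrics, MRR is the easiest to misremember, so here it is spelled out: for each query, take the reciprocal of the rank at which the first relevant document appears (zero if it never appears), then average across queries. The test set below is a simplified one-gold-document-per-query setup.

```python
def mean_reciprocal_rank(retrieved: list[list[str]], relevant: list[str]) -> float:
    """MRR over a labelled test set.

    For each query, score 1/rank of the first relevant document in
    the retrieved list (0 if absent), then average over all queries.
    """
    total = 0.0
    for hits, gold in zip(retrieved, relevant):
        for rank, doc_id in enumerate(hits, start=1):
            if doc_id == gold:
                total += 1.0 / rank
                break
    return total / len(retrieved)

# Three queries: hit at rank 1 (1.0), hit at rank 2 (0.5), miss (0.0).
retrieved = [["d1", "d2"], ["d5", "d3"], ["d9", "d8"]]
gold = ["d1", "d3", "d7"]
mrr = mean_reciprocal_rank(retrieved, gold)   # (1.0 + 0.5 + 0.0) / 3
```

Recall@k and precision@k follow the same pattern and are worth tracking alongside MRR, since MRR only rewards the first hit.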
Handling Updates and Deletions
Real knowledge bases are not static. Documents get updated, deprecated, and deleted. Your ingestion pipeline needs to handle incremental updates without re-processing the entire corpus. This typically means tracking document versions, detecting changes via hashing, and updating only affected chunks and their embeddings.
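Change detection via content hashing is only a few lines. This sketch diffs at document granularity (hash the full text, re-process only what changed); a finer-grained system would hash individual chunks the same way.

```python
import hashlib

def content_hash(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def diff_corpus(stored: dict[str, str],
                incoming: dict[str, str]) -> tuple[list[str], list[str], list[str]]:
    """Compare stored {doc_id: hash} against incoming {doc_id: text}.

    Returns (to_add, to_update, to_delete) so the pipeline re-chunks
    and re-embeds only the affected documents.
    """
    incoming_hashes = {doc_id: content_hash(t) for doc_id, t in incoming.items()}
    to_add = [d for d in incoming_hashes if d not in stored]
    to_update = [d for d in incoming_hashes
                 if d in stored and stored[d] != incoming_hashes[d]]
    to_delete = [d for d in stored if d not in incoming_hashes]
    return to_add, to_update, to_delete
```

An update then becomes: delete the old document's chunks from the vector index, re-chunk and re-embed the new text, and store the new hash, all scoped to the documents in `to_update`.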
Security and Access Control
In enterprise settings, not every user should have access to every document. Your RAG system needs to enforce document-level access control during retrieval. This is typically implemented as metadata filtering: each chunk carries access control metadata, and the retrieval query includes a filter for the current user's permissions.
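The filtering pattern looks like this in miniature. The `allowed_groups` field and the group names are a hypothetical schema; the essential point is that the filter runs inside retrieval, so unauthorised chunks never reach the LLM's context in the first place.

```python
def allowed(chunk_meta: dict, user_groups: set[str]) -> bool:
    # A chunk is visible if the user belongs to at least one of the
    # groups listed in its access-control metadata.
    return bool(set(chunk_meta["allowed_groups"]) & user_groups)

chunks = [
    {"text": "Q3 salary bands", "allowed_groups": ["hr", "execs"]},
    {"text": "VPN setup guide", "allowed_groups": ["all-staff"]},
]

def retrieve_for_user(query: str, user_groups: set[str]) -> list[dict]:
    # Filter BEFORE ranking: post-filtering generated answers is not
    # an access control -- the restricted text would already be in the prompt.
    visible = [c for c in chunks if allowed(c, user_groups)]
    # A real system would now rank `visible` against the query embedding.
    return visible
```

In practice this filter is pushed down into the vector database query (as a metadata filter), which also keeps the top-K semantics correct: you retrieve the K best chunks the user may see, not the K best overall with some removed.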
Cost Management
RAG systems have multiple cost drivers: embedding API calls, vector database hosting, reranker API calls, and LLM token usage (which scales with the amount of retrieved context). Monitor these costs per query and optimise aggressively. Caching frequent queries, using smaller models for initial retrieval, and limiting the number of retrieved chunks all help control costs at scale.
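A minimal query-cache sketch, assuming the access-control setup described above: the cache key normalises the query text and, critically, includes the user's permission set, so a cached answer can never leak across ACL boundaries.

```python
import hashlib

def cache_key(query: str, user_groups: frozenset[str]) -> str:
    # Normalise whitespace and case so trivial variations hit the same
    # entry; include permissions so answers never cross ACL boundaries.
    normalised = " ".join(query.lower().split())
    raw = normalised + "|" + ",".join(sorted(user_groups))
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()

_cache: dict[str, str] = {}

def answer(query: str, user_groups: frozenset[str], generate) -> str:
    # `generate` stands in for the expensive retrieve + rerank + LLM call.
    key = cache_key(query, user_groups)
    if key not in _cache:
        _cache[key] = generate(query)
    return _cache[key]
```

Production caches add expiry (a stale cached answer is a correctness bug once the underlying documents change, which is why cache invalidation should hook into the same change-detection pipeline that drives re-embedding) and usually cap size with an LRU policy.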
Architecture Patterns We Recommend
For most production RAG applications, we recommend starting with this architecture:
- A document processing pipeline that handles PDF, DOCX, HTML, and plain text with structure-aware chunking
- A managed vector database (Pinecone, Weaviate Cloud, or Qdrant Cloud) with hybrid search enabled
- A cross-encoder reranker in the retrieval pipeline
- Query rewriting as a pre-processing step
- An evaluation harness that runs nightly against a golden test set
- Structured logging of every query, retrieval result, and generated answer for debugging and improvement
Start simple, measure everything, and add complexity only when your metrics tell you to. A well-tuned simple pipeline will outperform a poorly-configured complex one every time.
Pepla has implemented RAG systems for clients ranging from legal document search to internal knowledge bases, using a combination of Azure AI Search (formerly Azure Cognitive Search) and custom embedding pipelines.
RAG has moved from experimental to essential in the enterprise AI toolkit. The patterns are well-established, the tooling is mature, and the results are genuinely transformative for organisations drowning in unstructured knowledge. The engineering challenge is no longer whether RAG works, but how to make it work reliably, efficiently, and securely at scale.




