
RAG Architecture: Building Knowledge-Aware Applications

March 20, 2026 · 10 min read

Large language models are remarkable at generating coherent text, reasoning through problems, and following complex instructions. But they have a fundamental limitation: they only know what was in their training data, and that data has a cutoff date. Ask a general-purpose LLM about your company's internal policies, your product documentation, or yesterday's support tickets, and you will get confident-sounding nonsense.

Retrieval-Augmented Generation (RAG) solves this by giving the model access to external knowledge at inference time. Instead of relying solely on parametric memory (the weights learned during training), a RAG system retrieves relevant documents from a knowledge base and includes them in the prompt context. The model then generates its response grounded in actual source material.

This pattern has become the backbone of enterprise AI applications in 2026. At Pepla, we have built RAG systems for clients across industries, from legal document analysis to customer support automation. This article breaks down the architecture, the decisions you will face, and the production considerations that separate a demo from a reliable system.

The Core RAG Pipeline

Every RAG system follows the same fundamental flow: ingest documents, chunk them into manageable pieces, generate vector embeddings for each chunk, store those embeddings in a vector database, and at query time, retrieve the most relevant chunks and feed them to the LLM alongside the user's question.
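The whole flow can be sketched in a few lines. This is a toy stand-in, not a production implementation: `embed` here is a bag-of-words vector so the example is self-contained, where a real system would call an embedding model and a vector database.

```python
# Minimal RAG flow with toy stand-ins for each stage.
from collections import Counter
import math

def embed(text: str) -> Counter:
    # Toy "embedding": bag-of-words counts (a real system uses a model).
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Ingest: chunk documents and index (chunk, embedding) pairs.
docs = ["Refunds are processed within 14 days.",
        "Support tickets are answered within one business day."]
index = [(d, embed(d)) for d in docs]

# Query time: retrieve the most relevant chunk and build the prompt.
query = "How long do refunds take?"
q = embed(query)
best, _ = max(index, key=lambda pair: cosine(q, pair[1]))
prompt = f"Context:\n{best}\n\nQuestion: {query}"
```

The LLM then answers from `prompt`, grounded in the retrieved chunk rather than its parametric memory.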

RAG layers -- embedding, retrieval, reranking, generation -- each require distinct tuning for production quality.


That description fits on a napkin. Making it work reliably in production is where the engineering lives.

Vector Embeddings: The Foundation

Embeddings are numerical representations of text that capture semantic meaning. Two sentences that mean similar things will have embeddings that are close together in vector space, even if they share no common words. This is what makes semantic search possible, and it is fundamentally different from keyword-based search.

In 2026, the embedding model landscape has matured considerably. OpenAI's text-embedding-3-large remains a solid general-purpose choice. Cohere's Embed v4 models offer strong multilingual support. For teams that need to run embeddings on-premise, open-source models like those from the Nomic and BAAI families deliver competitive quality at zero API cost.

The choice of embedding model is one of the most consequential decisions in your RAG pipeline. It is also one of the hardest to change later, because re-embedding your entire corpus is expensive and disruptive.

When evaluating embedding models, test them against your actual data. Generic benchmarks like MTEB are useful starting points, but domain-specific performance can vary significantly. A model that excels at general web text may struggle with legal contracts or medical records.

Chunking Strategies: Getting the Granularity Right

Before you can embed documents, you need to break them into chunks. This is trickier than it sounds. Chunk too large and your embeddings become diluted, losing the ability to match specific queries. Chunk too small and you lose context, returning fragments that do not make sense on their own.

The main strategies, in order of increasing sophistication:

- Fixed-size splitting: cut every N tokens, optionally with overlap. Simple and predictable, but blind to document structure.
- Recursive splitting: try paragraph boundaries first, then sentences, then words, falling back a level only when a piece is still too large.
- Structure-aware splitting: use the document's own headings, sections, and tables (Markdown, HTML, PDF layout) as chunk boundaries.
- Semantic chunking: split where the running text shifts topic, so each chunk covers one coherent idea.

In practice, we have found that recursive splitting with a 400-token target and 50-token overlap is a strong default. For highly structured documents, combining structure-aware splitting with semantic chunking produces the best retrieval quality.
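A simplified sketch of the windowed splitting described above, using whitespace words as a stand-in for tokens; a real recursive splitter would also prefer paragraph and sentence boundaries and count tokens with the embedding model's tokenizer.

```python
# Sliding-window chunker with overlap (words approximate tokens here).
def chunk(text: str, size: int = 400, overlap: int = 50) -> list[str]:
    words = text.split()
    step = size - overlap  # advance by size minus overlap each window
    return [" ".join(words[i:i + size])
            for i in range(0, max(len(words) - overlap, 1), step)]

# 1000 "tokens" with a 400-token window and 50-token overlap -> 3 chunks.
pieces = chunk(" ".join(str(i) for i in range(1000)), size=400, overlap=50)
```

The overlap means a sentence that straddles a boundary appears in full in at least one chunk, at the cost of some duplicated storage.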

Metadata Enrichment

Raw chunks are not enough. Each chunk should carry metadata: the source document, section heading, page number, document date, author, and any domain-specific tags. This metadata enables filtered retrieval (for example, only searching documents from a specific department or date range) and helps the LLM cite its sources accurately.
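A sketch of what an enriched chunk and a metadata pre-filter might look like; the field names are illustrative, not a fixed schema, and in production the filter runs inside the vector database query rather than in Python.

```python
# Metadata-enriched chunks with a simple filtered-retrieval step.
from dataclasses import dataclass
from datetime import date

@dataclass
class Chunk:
    text: str
    source: str            # originating document
    section: str           # nearest heading
    page: int
    doc_date: date
    tags: frozenset = frozenset()

def filter_chunks(chunks, tag=None, since=None):
    # Pre-filter by metadata before (or alongside) vector search.
    return [c for c in chunks
            if (tag is None or tag in c.tags)
            and (since is None or c.doc_date >= since)]

chunks = [
    Chunk("Refund policy ...", "policies.pdf", "Refunds", 3,
          date(2026, 1, 10), frozenset({"finance"})),
    Chunk("Onboarding steps ...", "hr.pdf", "Onboarding", 1,
          date(2024, 6, 2), frozenset({"hr"})),
]
hits = filter_chunks(chunks, tag="finance", since=date(2025, 1, 1))
```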

Chunking strategy can affect RAG quality as much as model choice -- get the granularity right.

The Retrieval Pipeline

When a user asks a question, the naive approach is to embed the query, find the top-K nearest chunks by cosine similarity, and pass them to the LLM. This works for simple cases but breaks down quickly in production.


Hybrid Search: The Best of Both Worlds

Pure vector search excels at semantic matching but can miss exact keyword matches. If a user asks about "Policy 4.2.1" or a specific product code, semantic similarity might not surface the right document. Conversely, keyword search (BM25) excels at exact matching but misses semantic relationships.

Hybrid search combines both approaches. Most modern vector databases, including Weaviate, Qdrant, and Pinecone, support hybrid search natively. The typical approach is to run both searches in parallel, normalise the scores, and combine them using Reciprocal Rank Fusion (RRF) or a weighted linear combination.
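Reciprocal Rank Fusion is simple enough to show in full: each result list contributes a score of 1/(k + rank) per document, and the fused order is the sum of those scores. The constant k = 60 is the commonly used default.

```python
# Reciprocal Rank Fusion over multiple ranked lists (e.g. vector + BM25).
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            # rank is 0-based, so rank + 1 is the 1-based position.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["doc_a", "doc_b", "doc_c"]
bm25_hits = ["doc_b", "doc_c", "doc_a"]
fused = rrf([vector_hits, bm25_hits])
```

Because RRF uses only ranks, it sidesteps the problem of normalising incomparable similarity and BM25 scores, which is why it is often preferred over a weighted linear combination.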

In our experience at Pepla, hybrid search consistently outperforms either pure vector or pure keyword search by 15-25% on retrieval accuracy metrics. It is now our default recommendation for production RAG systems.

Reranking: The Second Pass

Initial retrieval casts a wide net, typically pulling 20-50 candidate chunks. A reranker then scores each candidate against the original query using a cross-encoder model, which is more accurate than the bi-encoder used for initial retrieval but too slow to run against the entire corpus.

Cross-encoder rerankers from Cohere (Rerank 3.5), Jina AI, and the open-source bge-reranker family have become standard components. The reranker reorders the candidates and the top 5-10 are passed to the LLM. This two-stage approach, fast initial retrieval followed by accurate reranking, is one of the highest-impact improvements you can make to a RAG system.
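The two-stage shape looks like this. The scorer here is a toy Jaccard overlap standing in for a real cross-encoder such as a bge-reranker; what matters is the structure: a wide, cheap first pass, then a jointly scored, accurate second pass over only those candidates.

```python
# Two-stage retrieval: wide candidate set, then rerank with a scorer
# that sees query and chunk together (toy overlap score shown here).
def cross_score(query: str, doc: str) -> float:
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q | d) if q or d else 0.0

def rerank(query: str, candidates: list[str], top_n: int = 5) -> list[str]:
    return sorted(candidates, key=lambda c: cross_score(query, c),
                  reverse=True)[:top_n]

candidates = ["reset your password from the login page",
              "pricing tiers for enterprise accounts",
              "how to reset a forgotten password"]
top = rerank("reset forgotten password", candidates, top_n=2)
```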

Query Transformation

Users rarely phrase their questions in ways that align perfectly with your document corpus. Query transformation techniques bridge this gap: rewriting the query into a cleaner, self-contained form, expanding it into multiple paraphrases and retrieving for each, or generating a hypothetical answer and searching with its embedding (HyDE).

RAG beats fine-tuning when your knowledge base changes frequently or answers must be auditable.

When RAG Beats Fine-Tuning

Fine-tuning and RAG address different problems, and understanding when to use each saves significant time and money.

RAG is the right choice when:

- Your knowledge base changes frequently and answers must reflect the latest documents.
- Answers need to cite sources and be auditable.
- The knowledge is large and factual rather than a matter of style or format.

Fine-tuning is the right choice when:

- You need a consistent style, tone, or output format.
- The model must internalise domain-specific reasoning patterns rather than look up facts.
- Latency or context-window budgets rule out stuffing retrieved documents into every prompt.

In many production systems, the answer is both. Fine-tune for style and reasoning patterns, then use RAG for factual grounding. This layered approach gives you the best of both worlds.

Production Considerations

Evaluation and Monitoring

You cannot improve what you cannot measure. A production RAG system needs evaluation at multiple levels:

- Retrieval quality: do the top-K chunks actually contain the answer (recall@K, MRR)?
- Generation quality: is the answer faithful to the retrieved context, and does it address the question?
- End-to-end behaviour: latency, cost per query, and user feedback in production.

Handling Updates and Deletions

Real knowledge bases are not static. Documents get updated, deprecated, and deleted. Your ingestion pipeline needs to handle incremental updates without re-processing the entire corpus. This typically means tracking document versions, detecting changes via hashing, and updating only affected chunks and their embeddings.
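One way to sketch the change detection just described, using content hashes to decide which documents need re-chunking and re-embedding; the function and field names are illustrative.

```python
# Hash-based change detection for incremental re-ingestion.
import hashlib

def content_hash(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def diff_corpus(stored_hashes: dict[str, str],
                current_docs: dict[str, str]) -> tuple[list[str], list[str]]:
    # Re-index documents that are new or whose content changed.
    to_reindex = [doc_id for doc_id, text in current_docs.items()
                  if stored_hashes.get(doc_id) != content_hash(text)]
    # Delete vector-store entries for documents that disappeared.
    to_delete = [doc_id for doc_id in stored_hashes
                 if doc_id not in current_docs]
    return to_reindex, to_delete

stored = {"a": content_hash("v1"), "b": content_hash("old")}
current = {"a": "v1", "b": "new", "c": "brand new"}
to_reindex, to_delete = diff_corpus(stored, current)
```

Document "a" is untouched, so only the changed "b" and the new "c" are re-embedded.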

Security and Access Control

In enterprise settings, not every user should have access to every document. Your RAG system needs to enforce document-level access control during retrieval. This is typically implemented as metadata filtering: each chunk carries access control metadata, and the retrieval query includes a filter for the current user's permissions.
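In miniature, that filtering looks like the sketch below; in production the ACL check is pushed into the vector database query as a metadata filter, never applied after retrieval, so restricted chunks are never even candidates.

```python
# Document-level access control as a retrieval-time metadata filter:
# each chunk lists the groups allowed to see it.
def allowed(chunks: list[dict], user_groups: set[str]) -> list[dict]:
    return [c for c in chunks if c["acl"] & user_groups]

chunks = [
    {"text": "Q3 revenue figures ...", "acl": {"finance", "exec"}},
    {"text": "Office wifi setup ...", "acl": {"all-staff"}},
]
visible = allowed(chunks, {"all-staff"})
```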

Cost Management

RAG systems have multiple cost drivers: embedding API calls, vector database hosting, reranker API calls, and LLM token usage (which scales with the amount of retrieved context). Monitor these costs per query and optimise aggressively. Caching frequent queries, using smaller models for initial retrieval, and limiting the number of retrieved chunks all help control costs at scale.
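Query caching is the cheapest of these optimisations to add. The sketch below uses a plain exact-match cache with a counter standing in for the expensive embed-search-rerank round trip; real systems often normalise queries first or use semantic (embedding-based) caches to catch paraphrases too.

```python
# Exact-match query cache: repeated queries skip the backend entirely.
from functools import lru_cache

backend_calls = {"count": 0}

@lru_cache(maxsize=1024)
def retrieve(query: str) -> tuple[str, ...]:
    backend_calls["count"] += 1   # one paid round trip per cache miss
    return (f"chunks for: {query}",)

retrieve("refund policy")
retrieve("refund policy")   # cache hit: no second backend call
retrieve("pricing tiers")
```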


Architecture Patterns We Recommend

For most production RAG applications, we recommend starting with this architecture:

- Recursive chunking with roughly 400-token chunks and 50-token overlap, enriched with source metadata.
- Hybrid search (vector plus BM25) fused with Reciprocal Rank Fusion.
- A cross-encoder reranker over the top 20-50 candidates, passing the top 5-10 chunks to the LLM.
- Retrieval-level and answer-level evaluation wired in from day one.

Start simple, measure everything, and add complexity only when your metrics tell you to. A well-tuned simple pipeline will outperform a poorly-configured complex one every time.

Pepla has implemented RAG systems for clients ranging from legal document search to internal knowledge bases, using a combination of Azure AI Search (formerly Azure Cognitive Search) and custom embedding pipelines.

RAG has moved from experimental to essential in the enterprise AI toolkit. The patterns are well-established, the tooling is mature, and the results are genuinely transformative for organisations drowning in unstructured knowledge. The engineering challenge is no longer whether RAG works, but how to make it work reliably, efficiently, and securely at scale.

Need help with this?

Pepla builds production-grade RAG systems for enterprises. Let us help you unlock the value in your data.
