Why RAG architecture decisions matter
Retrieval-augmented generation is the most common pattern for connecting LLMs to private enterprise data. The concept is straightforward: retrieve relevant documents, inject them into the model's context, and generate grounded responses. The implementation details, however, determine whether the system is useful or frustrating.
We've built RAG systems for enterprise knowledge bases with millions of documents, multi-tenant chatbot platforms, and domain-specific search applications. This post walks through the architecture patterns and trade-offs we've encountered in production.
Document ingestion and chunking
The first decision is how to break documents into retrievable units. This is more consequential than most teams expect. Chunking strategy directly affects retrieval quality, and bad chunking is difficult to fix downstream.
Fixed-size chunking
The simplest approach: split documents into chunks of N tokens with M tokens of overlap. We typically start with 512 tokens and 64 tokens of overlap.
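def chunk_fixed(text: str, chunk_size: int = 512, overlap: int = 64) -> list[str]:
    # `tokenizer` is assumed to be any object exposing encode() and decode()
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    tokens = tokenizer.encode(text)
    chunks = []
    step = chunk_size - overlap  # each window starts `step` tokens after the previous one
    for i in range(0, len(tokens), step):
        chunks.append(tokenizer.decode(tokens[i:i + chunk_size]))
        if i + chunk_size >= len(tokens):
            break  # this window reached the end; another chunk would be pure overlap
    return chunks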
Fixed-size chunking is predictable and fast, but it ignores document structure. A chunk might split mid-sentence, mid-paragraph, or mid-section, losing semantic coherence.
Structure-aware chunking
For documents with meaningful structure (markdown, HTML, PDFs with headings), we chunk along structural boundaries. This preserves the semantic units the author intended.
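def chunk_by_structure(doc: Document) -> list[Chunk]:
    sections = doc.split_by_headings()
    chunks = []
    for section in sections:
        if section.token_count <= MAX_CHUNK_SIZE:
            # The whole section fits in a single chunk
            chunks.append(section)
        else:
            # Fall back to paragraph-level splitting
            for para in section.paragraphs:
                if para.token_count <= MAX_CHUNK_SIZE:
                    chunks.append(para)
                else:
                    # Oversized paragraph: fall back to fixed-size windows.
                    # chunk_fixed returns plain strings, so wrap each one
                    # (assumes Chunk can be constructed from text alone).
                    chunks.extend(Chunk(text=t) for t in chunk_fixed(para.text))
    return chunks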
The trade-off is complexity: you need document-type-specific parsers, and the chunk sizes are variable, which complicates batch embedding.
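One way to soften the batching problem is to group chunks by length before embedding them. The sketch below is illustrative rather than our production code: `batch_by_token_budget`, `whitespace_tokens`, and the budget value are hypothetical names and numbers, and the word-count tokenizer is a stand-in for whatever tokenizer your embedding model uses.

```python
from typing import Callable


def batch_by_token_budget(
    texts: list[str],
    count_tokens: Callable[[str], int],
    max_batch_tokens: int = 8192,
) -> list[list[str]]:
    """Group texts into batches whose padded size stays within a token budget."""
    # Sorting by length puts similarly sized chunks in the same batch,
    # which keeps padding waste low for self-hosted embedding models.
    ordered = sorted(texts, key=count_tokens)
    batches: list[list[str]] = []
    current: list[str] = []
    for text in ordered:
        # `ordered` is ascending, so `text` is the longest item seen so far and
        # the padded batch would cost (batch size) * count_tokens(text).
        padded_cost = (len(current) + 1) * count_tokens(text)
        if current and padded_cost > max_batch_tokens:
            batches.append(current)
            current = []
        current.append(text)
    if current:
        batches.append(current)
    return batches


def whitespace_tokens(text: str) -> int:
    # Stand-in tokenizer for the example: one token per whitespace-separated word.
    return len(text.split())


chunks = ["a b c d e f", "a b", "a b c", "a"]
batches = batch_by_token_budget(chunks, whitespace_tokens, max_batch_tokens=6)
```

Note that the batches no longer follow the input order, so a real pipeline would need to carry chunk ids alongside the text to map embeddings back to their chunks.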
Our default approach
In practice, we use a hybrid: structure-aware splitting as the primary strategy, with fixed-size fallback for unstructured content. Every chunk gets prepended with a context header containing the document title, section hierarchy, and metadata. This gives the embedding model (and later the LLM) context about where the chunk sits within the broader document.
def enrich_chunk(chunk: Chunk, doc: Document) -> str:
    header = f"Document: {doc.title}\n"
    if chunk.section_path:
        header += f"Section: {' > '.join(chunk.section_path)}\n"
    header += f"Source: {doc.source_url}\n---\n"
    return header + chunk.text
Embedding selection
The embedding model converts chunks into vector representations for similarity search. The choice matters more than most benchmarks suggest, because benchmark performance on academic datasets doesn't always translate to enterprise document retrieval.
What we evaluate
For each deployment, we test embedding models against a sample of the client's actual data with known-good query-document pairs. The metrics that matter:
Recall@10. Does the correct document appear in the top 10 results for a given query? This is more important than Recall@1 because the LLM can usually identify the right answer from 10 candidates.
Latency. We measure embedding time per chunk during ingestion and embedding time per query at retrieval. For real-time conversational use, query embedding needs to be under 100ms.
Dimension size. Higher dimensions give more expressive vectors but increase storage costs and retrieval latency. For most enterprise workloads, 1024 dimensions is a practical ceiling.
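To make the first metric concrete, here is a minimal sketch of how Recall@k can be computed from known-good query-document pairs. The function name, the data shapes, and the sample queries are assumptions made for illustration; it also assumes exactly one labelled document per query.

```python
def recall_at_k(
    ranked_ids_by_query: dict[str, list[str]],
    relevant_id_by_query: dict[str, str],
    k: int = 10,
) -> float:
    """Fraction of queries whose known-good document appears in the top k results."""
    if not relevant_id_by_query:
        raise ValueError("need at least one labelled query")
    hits = 0
    for query, relevant_id in relevant_id_by_query.items():
        top_k = ranked_ids_by_query.get(query, [])[:k]
        if relevant_id in top_k:
            hits += 1
    return hits / len(relevant_id_by_query)


# Hypothetical evaluation data: retrieval output and one labelled document per query.
ranked = {
    "reset password": ["doc-7", "doc-2", "doc-9"],
    "vpn error 809": ["doc-4", "doc-1", "doc-3"],
}
labels = {"reset password": "doc-2", "vpn error 809": "doc-8"}

recall_at_2 = recall_at_k(ranked, labels, k=2)
```

Running the same labelled set against each candidate embedding model gives a like-for-like comparison on the client's own data.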
Models we commonly deploy
For English-language enterprise content, OpenAI's text-embedding-3-large at 1024 dimensions is our current default. It balances quality, latency, and cost well for most use cases.
For multilingual deployments or when we need on-premise embedding, we use Cohere's embed-v3 or self-hosted BGE models. The self-hosted option adds operational overhead but keeps document content inside the client's own infrastructure, which removes most data residency concerns.
Retrieval and scoring
Vector similarity search is the starting point, not the endpoint. Raw cosine similarity scores are often noisy, because semantically similar but irrelevant documents can score highly.
Hybrid retrieval
We combine dense vector search with sparse keyword matching (BM25) using reciprocal rank fusion. This catches cases where exact terminology matters (product names, error codes, regulatory references) and that embedding models sometimes handle poorly.
RRF_K = 60  # rank-smoothing constant for reciprocal rank fusion

def hybrid_search(
    query: str,
    vector_store: VectorStore,
    bm25_index: BM25Index,
    k: int = 10,
    alpha: float = 0.7,
) -> list[SearchResult]:
    # Over-fetch from both retrievers so fusion has enough candidates to work with
    vector_results = vector_store.search(embed(query), k=k * 2)
    keyword_results = bm25_index.search(query, k=k * 2)

    # Weighted reciprocal rank fusion: each list contributes weight / (rank + RRF_K)
    scores: dict[str, float] = {}
    for rank, result in enumerate(vector_results):
        scores[result.id] = scores.get(result.id, 0.0) + alpha / (rank + RRF_K)
    for rank, result in enumerate(keyword_results):
        scores[result.id] = scores.get(result.id, 0.0) + (1 - alpha) / (rank + RRF_K)

    ranked = sorted(scores.items(), key=lambda x: x[1], reverse=True)
    return [get_chunk(doc_id) for doc_id, _ in ranked[:k]]
The alpha parameter controls the balance between semantic and keyword matching. We typically start at 0.7 (favouring semantic) and tune based on evaluation results.
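To see what tuning alpha does, it helps to run the fusion step on its own with toy rankings. The sketch below reuses the scoring rule and the constant 60 from hybrid_search above; the `fuse` helper and the chunk ids are hypothetical and exist only for this illustration.

```python
RRF_K = 60  # same rank-smoothing constant as in hybrid_search


def fuse(vector_ids: list[str], keyword_ids: list[str], alpha: float) -> list[str]:
    """Weighted reciprocal rank fusion over two ranked lists of chunk ids."""
    scores: dict[str, float] = {}
    for rank, chunk_id in enumerate(vector_ids):
        scores[chunk_id] = scores.get(chunk_id, 0.0) + alpha / (rank + RRF_K)
    for rank, chunk_id in enumerate(keyword_ids):
        scores[chunk_id] = scores.get(chunk_id, 0.0) + (1 - alpha) / (rank + RRF_K)
    # Highest fused score first
    return sorted(scores, key=lambda chunk_id: scores[chunk_id], reverse=True)


# Toy rankings: "faq" wins on semantics, "err-809" wins on exact keyword match.
vector_ids = ["faq", "guide", "err-809"]
keyword_ids = ["err-809", "guide", "faq"]

semantic_heavy = fuse(vector_ids, keyword_ids, alpha=0.7)
keyword_heavy = fuse(vector_ids, keyword_ids, alpha=0.3)
```

With alpha at 0.7 the semantically closest chunk ranks first, and with alpha at 0.3 the exact keyword match overtakes it. Sweeping alpha against a labelled evaluation set is one way to pick the value empirically.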
Reranking
After initial retrieval, we optionally apply a cross-encoder reranker to rescore the top candidates. Cross-encoders are more accurate than embedding similarity because they see the query and document together, but they're too slow to run over the full corpus.
The pattern is: retrieve 20-50 candidates cheaply with hybrid search, rerank them with a cross-encoder, and pass the top 5-10 to the LLM.
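The second stage of that pattern can be sketched as a small function that takes any pairwise scorer. Everything here is illustrative: `rerank` and `word_overlap` are hypothetical names, and the word-overlap scorer is only a toy stand-in so the example runs without a model. In a real deployment, `score_pair` would wrap whichever cross-encoder you use.

```python
from typing import Callable


def rerank(
    query: str,
    candidates: list[str],
    score_pair: Callable[[str, str], float],
    top_n: int = 5,
) -> list[str]:
    """Rescore candidates against the query and keep the top_n best."""
    scored = [(score_pair(query, text), text) for text in candidates]
    # Sort on the score only, so ties keep their first-stage retrieval order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:top_n]]


def word_overlap(query: str, text: str) -> float:
    # Toy stand-in for a cross-encoder: fraction of query words found in the text.
    query_words = set(query.lower().split())
    text_words = set(text.lower().split())
    return len(query_words & text_words) / max(len(query_words), 1)


candidates = [
    "Holiday policy for contractors",
    "How to reset your VPN password",
    "VPN client installation guide",
]
top = rerank("reset vpn password", candidates, word_overlap, top_n=2)
```

Keeping the scorer as a parameter means the reranking model can be swapped without touching the retrieval code.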
Context assembly and generation
The retrieved chunks need to be assembled into a coherent context for the LLM. This is where many RAG implementations fall apart. Stuffing too many chunks into context produces worse results than being selective.
Context window management
We track token budgets explicitly: system prompt, conversation history, retrieved context, and generation space. Retrieved context gets the remainder after the other components are accounted for.
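A minimal sketch of that bookkeeping is shown here. The `TokenBudget` class and every number in the example are assumptions for illustration, not values tied to any particular model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenBudget:
    context_window: int        # total tokens the model accepts
    system_prompt: int         # tokens used by the system prompt
    conversation_history: int  # tokens used by prior turns
    generation_reserve: int    # tokens held back for the model's answer

    def retrieved_context(self) -> int:
        """Tokens left for retrieved chunks once everything else is reserved."""
        remainder = self.context_window - (
            self.system_prompt + self.conversation_history + self.generation_reserve
        )
        if remainder <= 0:
            raise ValueError("no room left for retrieved context; trim the history")
        return remainder


# Illustrative numbers only, not a recommendation for any specific model.
budget = TokenBudget(
    context_window=16_000,
    system_prompt=600,
    conversation_history=2_400,
    generation_reserve=1_000,
)
max_context_tokens = budget.retrieved_context()
```

The resulting value is what gets passed as the limit when the retrieved chunks are assembled.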
def assemble_context(
    chunks: list[Chunk],
    max_context_tokens: int = 4096,
) -> str:
    context_parts = []
    token_count = 0
    # Chunks are assumed to arrive in ranked order, so stop at the first one that does not fit
    for chunk in chunks:
        chunk_tokens = count_tokens(chunk.text)
        if token_count + chunk_tokens > max_context_tokens:
            break
        context_parts.append(chunk.text)
        token_count += chunk_tokens
    # Note: separator tokens are not counted against the budget
    return "\n\n---\n\n".join(context_parts)
Grounding and citation
For enterprise use, the LLM must cite its sources. We instruct the model to reference specific documents and include source metadata in the response. This makes outputs verifiable and builds user trust.
The system prompt includes explicit instructions to only answer from provided context, cite document titles and sections, and say "I don't have enough information" when the retrieved context doesn't contain the answer.
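As a rough sketch of how those instructions and the source metadata can be combined, the example below numbers each retrieved source and places it under the instructions. The `Source` class, the prompt wording, and the sample document are hypothetical; the exact prompt text in a real system will differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Source:
    title: str
    section: str
    text: str


SYSTEM_PROMPT = (
    "Answer using only the numbered sources below. "
    "Cite each claim with its source number, document title, and section. "
    "If the sources do not contain the answer, reply exactly: "
    "\"I don't have enough information.\""
)


def build_grounded_prompt(question: str, sources: list[Source]) -> str:
    """Render the instructions, numbered sources, and the user question."""
    blocks = [
        f"[{number}] {source.title} > {source.section}\n{source.text}"
        for number, source in enumerate(sources, start=1)
    ]
    body = "\n\n".join(blocks)
    return f"{SYSTEM_PROMPT}\n\nSources:\n{body}\n\nQuestion: {question}"


prompt = build_grounded_prompt(
    "How many days of annual leave do employees get?",
    [Source("HR Handbook", "Leave policy", "Employees receive 25 days of annual leave.")],
)
```

Numbering the sources gives the model a short, unambiguous handle to cite, which also makes citations easier to validate after generation.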
Guardrails in production
Enterprise RAG systems need guardrails beyond what the LLM provides by default.
Input filtering: detect and block queries that attempt prompt injection, request data outside the user's access scope, or contain sensitive information that shouldn't be logged.
Output monitoring: check responses for hallucinated content (claims not grounded in retrieved documents), leaked sensitive data, and policy violations before returning to the user.
Access control: different users should see different documents. The retrieval layer must respect the same permission model as the source systems. We implement this as metadata filters on the vector store. Each chunk inherits the access control list from its source document.
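The access control rule can be illustrated with an in-memory stand-in for the vector store's metadata filter. `IndexedChunk`, `filter_by_access`, and the group names are hypothetical. In a real deployment the same predicate would be expressed in the vector store's own filter syntax, so it applies during the search rather than after it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexedChunk:
    chunk_id: str
    text: str
    allowed_groups: frozenset[str]  # copied from the source document's ACL


def filter_by_access(
    chunks: list[IndexedChunk], user_groups: set[str]
) -> list[IndexedChunk]:
    """Keep only chunks whose ACL shares at least one group with the user."""
    return [chunk for chunk in chunks if chunk.allowed_groups & user_groups]


index = [
    IndexedChunk("c1", "Q3 salary bands", frozenset({"hr"})),
    IndexedChunk("c2", "VPN setup guide", frozenset({"hr", "engineering"})),
    IndexedChunk("c3", "Incident runbook", frozenset({"engineering"})),
]
visible = filter_by_access(index, user_groups={"engineering"})
visible_ids = [chunk.chunk_id for chunk in visible]
```

A user with no matching groups sees nothing, which is the safe default when permission data is missing.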
The operational reality
The architecture above is the starting point. Production RAG systems require ongoing maintenance: re-indexing when source documents change, monitoring retrieval quality as the document corpus grows, tuning chunk sizes and retrieval parameters as usage patterns emerge, and managing embedding model upgrades without breaking existing indexes.
We design for this from day one by instrumenting retrieval quality metrics, building re-indexing pipelines that run incrementally, and keeping the retrieval layer modular so individual components can be upgraded independently.
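One common way to make re-indexing incremental is to store a content hash per document and only re-embed documents whose hash has changed. The sketch below assumes that approach. The function names and file names are hypothetical, and this is not necessarily how any given pipeline detects changes.

```python
import hashlib


def content_hash(text: str) -> str:
    # Stable fingerprint of a document's content
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_reindex(
    current_docs: dict[str, str],
    indexed_hashes: dict[str, str],
) -> tuple[list[str], list[str]]:
    """Return (doc ids to embed again, doc ids to delete from the index)."""
    # New documents have no stored hash; changed documents have a different one.
    to_upsert = sorted(
        doc_id
        for doc_id, text in current_docs.items()
        if indexed_hashes.get(doc_id) != content_hash(text)
    )
    # Documents that vanished from the source must also leave the index.
    to_delete = sorted(doc_id for doc_id in indexed_hashes if doc_id not in current_docs)
    return to_upsert, to_delete


# Hypothetical state: what the index saw last run vs. what the source holds now.
indexed = {
    "policy.md": content_hash("Leave policy v1"),
    "faq.md": content_hash("FAQ"),
    "old.md": content_hash("Deprecated page"),
}
current = {
    "policy.md": "Leave policy v2",  # changed
    "faq.md": "FAQ",                 # unchanged
    "new.md": "Onboarding guide",    # added
}
to_upsert, to_delete = plan_reindex(current, indexed)
```

Only the changed and new documents are sent back through chunking and embedding, so the cost of a run scales with the size of the change rather than the size of the corpus.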
If you're building a RAG system for enterprise data and want to avoid the common pitfalls, book a diagnostic. We'll review your architecture and data landscape and identify the approach that fits your specific requirements.