RAG (retrieval-augmented generation) is the fastest way to give an LLM knowledge of your company without fine-tuning a model. Instead, you build a pipeline: on every request the agent finds relevant fragments in your knowledge base and passes them into the context. Sounds simple, but the distance between “works in a demo” and “handles 10,000 users” is enormous.
Architecture of a minimally viable RAG system
The baseline has four layers: data preparation (chunking, cleanup, metadata), vector storage, retrieval logic with a reranker, and an LLM wrapper that controls answer quality.
1. Document parser: PDF, DOCX, HTML, tickets from Jira/Zendesk, Slack threads
2. Chunker with 10–15% overlap and headings preserved as metadata
3. Embedding model (text-embedding-3-large or open-source bge-m3)
4. Vector DB: Pinecone, Qdrant, or pgvector if you're already on Postgres
5. Reranker (Cohere Rerank or bge-reranker-v2), critical for quality
6. LLM with a system prompt that forbids answering from outside the retrieved context
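To make the layers concrete, here is a minimal sketch of the retrieval half of the pipeline. It assumes sentence-transformers for bge-m3 embeddings and uses an in-memory index as a stand-in for a real vector DB; `handbook.txt` and the sample query are placeholders, and the reranker and LLM steps appear in the sections below.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# bge-m3 via sentence-transformers; or call text-embedding-3-large through the OpenAI API
embedder = SentenceTransformer("BAAI/bge-m3")

def chunk(text: str, size: int = 800, overlap: int = 100) -> list[str]:
    """Naive fixed-size chunking with overlap; see the structure-aware version below."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

class InMemoryIndex:
    """Stand-in for Pinecone/Qdrant/pgvector: stores vectors, runs cosine top-k."""
    def __init__(self) -> None:
        self.vectors: list[np.ndarray] = []
        self.chunks: list[str] = []

    def add(self, texts: list[str]) -> None:
        self.chunks.extend(texts)
        self.vectors.extend(embedder.encode(texts, normalize_embeddings=True))

    def search(self, query: str, k: int = 50) -> list[str]:
        q = embedder.encode([query], normalize_embeddings=True)[0]
        scores = np.asarray(self.vectors) @ q  # cosine similarity on unit vectors
        return [self.chunks[i] for i in np.argsort(scores)[::-1][:k]]

index = InMemoryIndex()
index.add(chunk(open("handbook.txt").read()))
candidates = index.search("How do I request parental leave?")  # top-50 goes to the reranker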
Five mistakes that kill projects in pilot
1. Bad chunking
Splitting documents by a fixed length is the worst choice for technical docs. The section heading ends up in one chunk and the matching code in another. Use semantic chunking or chunking based on document structure.
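For Markdown or HTML sources, structure-aware chunking can be as simple as splitting on headings so a title never gets separated from its body. A sketch for Markdown input; the regex and the 1,200-character cap are illustrative, not tuned values:

```python
import re

def chunk_by_headings(markdown: str, max_chars: int = 1200) -> list[dict]:
    """Split before each #/##/### heading so sections stay intact,
    and carry the heading along as metadata for later filtering."""
    sections = re.split(r"(?m)^(?=#{1,3} )", markdown)
    chunks = []
    for section in sections:
        if not section.strip():
            continue
        heading = section.splitlines()[0].lstrip("# ").strip()
        # Oversized sections get split further, but every piece keeps its heading.
        for i in range(0, len(section), max_chars):
            chunks.append({"text": section[i:i + max_chars], "heading": heading})
    return chunks
```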
2. No reranker
Embedding similarity gives you a top-50 of relevant fragments, but it's the reranker that picks the 5–7 that actually answer the question. Skip it and answer quality drops by 30–40%, even though adding one is the cheapest improvement in the whole pipeline.
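As a sketch of that step, assuming the bge-reranker-v2-m3 cross-encoder loaded through sentence-transformers (Cohere Rerank is a hosted alternative); `candidates` is the top-50 list from the vector search above:

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("BAAI/bge-reranker-v2-m3")

def rerank(query: str, candidates: list[str], top_n: int = 6) -> list[str]:
    """Score every (query, chunk) pair with a cross-encoder and keep the best few."""
    scores = reranker.predict([(query, c) for c in candidates])
    ranked = sorted(zip(scores, candidates), key=lambda pair: pair[0], reverse=True)
    return [text for _, text in ranked[:top_n]]
```

Only the survivors of this step go into the LLM prompt, which also keeps the context window and the token bill small.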
3. Ignoring metadata
Document version, update date, department, language — without metadata filters the agent will recommend outdated policies. Always store “freshness” and weight it during retrieval.
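One common way to weight freshness is exponential decay on document age. A sketch; the 180-day half-life is an arbitrary placeholder to tune per corpus, and hard filters (department, language, version) should run in the vector DB itself:

```python
from datetime import datetime, timezone

def freshness_weighted(score: float, updated_at: datetime,
                       half_life_days: float = 180.0) -> float:
    """Halve a chunk's retrieval score for every half-life elapsed since its last update."""
    age_days = (datetime.now(timezone.utc) - updated_at).days
    return score * 0.5 ** (age_days / half_life_days)
```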
4. One giant index
Don't dump the whole company into a single index. Separate namespaces for HR, legal, product and customer data — both for security and quality.
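With Pinecone this maps to namespaces, with Qdrant to separate collections. As a generic sketch, reusing the `InMemoryIndex` from the pipeline skeleton above, routing can be a dictionary of per-domain indexes plus an entitlement check:

```python
# One index per domain; only the namespaces the caller is entitled to see get queried.
indexes = {"hr": InMemoryIndex(), "legal": InMemoryIndex(), "product": InMemoryIndex()}

def search_scoped(query: str, user_domains: set[str], k: int = 50) -> list[str]:
    """Query only the domains this user has access to."""
    results: list[str] = []
    for domain in user_domains & indexes.keys():
        results.extend(indexes[domain].search(query, k))
    return results
```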
5. No evaluations
Without a set of 100–200 reference questions, you won't know whether a prompt change actually improved anything. Invest in evals from day one.
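A bare-bones harness is enough to start. A sketch, assuming the reference cases live in a JSONL file (`evals.jsonl` is a placeholder) and using a crude keyword check; in practice you would grade with exact-match rules or an LLM judge:

```python
import json

def run_evals(pipeline, path: str = "evals.jsonl") -> float:
    """Each JSONL line: {"question": "...", "must_mention": ["...", ...]}. Returns pass rate."""
    cases = [json.loads(line) for line in open(path)]
    passed = 0
    for case in cases:
        reply = pipeline(case["question"]).lower()
        if all(term.lower() in reply for term in case["must_mention"]):
            passed += 1
    print(f"{passed}/{len(cases)} cases passed ({passed / len(cases):.0%})")
    return passed / len(cases)
```

Run it before and after every prompt or chunking change; the pass rate, not gut feeling, decides whether the change ships.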
RAG is 80% data engineering and 20% prompt engineering. Teams that think it's the other way around get stuck in pilot.
When RAG is not your answer
- When your company's knowledge lives in people's heads, not documents
- When documents update every hour and you can't run real-time indexing
- When you need numerical accuracy (finance, legal) — structured retrieval with a SQL agent fits better there
Facing a similar challenge in your company?
Tell us about the task — we'll get back within one business day with a call agenda.