I Built a RAG System for Ancient Hindu Scriptures: Here’s What Production Actually Looks Like

The first time I got a RAG system demo working, I felt like a genius. One PDF, forty lines of Python, and I could ask questions about any document and get answers. “This is it,” I thought. “I’ve figured this out.” If you’re trying to build a production RAG app in 2026, you’ve probably had that same moment. And then reality showed up.

I found out what “production” really means when I started building DharmaSutra.org, an AI platform to make ancient Hindu scriptures searchable and queryable in Hindi, Sanskrit, and English. I needed users to be able to ask questions like “What does the Bhagavad Gita say about karma yoga?” and get accurate, cited answers. Not hallucinated answers. Not close-enough answers. Accurate answers from actual scripture.

That project broke almost every assumption my tutorial experience had given me. And it taught me what production RAG actually looks like.


First, Let’s Get Clear on What RAG Actually Is

You’ve probably seen the acronym everywhere. RAG stands for Retrieval-Augmented Generation. But here’s the thing: most explanations make it sound more complicated than it needs to be.

Think about it this way. Imagine you visit two doctors.

The first doctor sees fifty patients a day. When you walk in and describe your symptoms, she draws on her medical school training and memory. She’s smart. She’s experienced. But she’s working from general knowledge, not your specific history.

The second doctor pulls up your chart before saying a word. He sees your past diagnoses, your medication list, your allergies. When he gives you a recommendation, it’s grounded in your data, not just his general memory.

That second doctor is a RAG system.

A RAG system retrieves relevant documents from your knowledge base before passing anything to the language model. The LLM doesn’t just rely on what it learned during training. It reads your documents first, then answers. The result is more accurate, more specific, and much harder to fake.

Without RAG, you get general answers. With RAG, you get answers grounded in your content.

That’s it. That’s the core concept.

Where it gets complicated is building one that actually works at scale.


The Tutorial-to-Production Gap Is Real (And Bigger Than You Think)

Every RAG tutorial follows the same formula. Load one PDF. Chunk it into 1000-character pieces. Embed those chunks. Store them in a vector database. Ask a question. Get an answer. Ship it.

That demo works beautifully. I’ve built dozens of them.

Here’s what none of those tutorials show you:

  • What happens when you have thousands of documents instead of one
  • What happens when your content spans three languages instead of English alone
  • What happens when concurrent users hit your system and latency starts creeping up
  • What happens when your costs scale with every query
  • What happens when the model makes something up and it’s wrong in a way that matters

Does this sound familiar? If you’ve gone beyond the tutorial stage, you’ve hit at least one of these walls.

I hit all five. At once. Because DharmaSutra is not a simple problem.

Let that sink in for a second. I was building a system to handle sacred texts, in multiple languages, for users who take scriptural accuracy seriously. There is no “close enough” when someone asks what Krishna said in the Gita. Either you get it right, or you’ve misrepresented a scripture that more than a billion people consider divine.

The stakes made me a much better RAG engineer.


Why DharmaSutra Was (and Still Is) a Hard RAG Problem

Let me give you some context on why this particular project pushed everything to the limit. I’m not done with it; I’m working to improve it every day. It’s not a perfect system yet, but the technology is improving, the techniques are evolving, and I’m confident it will keep getting better with time. Okay, let’s get back to it.

The content is multilingual and ancient. Hindu scriptures exist in Sanskrit (the original), Hindi translations, and increasingly in English translations. A single verse might appear in all three languages across different documents. Modern embedding models were not designed for Sanskrit. Some of them don’t even recognize it as a distinct language.

The structure is verse-based, not paragraph-based. Standard chunking treats text like prose. You split on character count or sentence boundaries. Ancient scripture doesn’t work that way. A single shloka (verse) from the Bhagavad Gita might be sixteen syllables in Sanskrit but expand to a full paragraph of commentary in English. If you chunk it the wrong way, you separate the verse from its translation, the question from its answer, the teaching from its context.

Accuracy is non-negotiable. If a general-purpose chatbot hallucinates a product feature, the user is annoyed. If a scripture AI hallucinates a verse that Krishna never said, you’ve created a theological error that could spread. That’s a completely different category of failure.

The user queries are complex and contextual. People asking about scripture are not asking simple lookup questions. They’re asking interpretive questions. “What does the Gita teach about detachment from results?” requires the system to understand concept, not just keyword.

Every one of these factors made my RAG architecture choices matter in ways that tutorials never prepare you for.


The 5 Production Challenges Nobody Talks About

1. Chunking Strategy Is Everything

The single biggest mistake I see in RAG projects is treating chunking as a default setting. Load your documents, set chunk size to 1000, overlap to 200, done.

For DharmaSutra, that approach was a disaster.

Ancient texts are structured by verse. Each verse is a complete unit of meaning. A verse from the Shiva Mahapurana might be four lines long. The commentary on that verse might be four paragraphs. If I split them with a character-count chunker, I’d end up with half a verse in one chunk, half the commentary in another. The retrieval system would find neither context correctly.

The solution was semantic chunking based on the document’s natural structure. I parse by chapter and verse markers. Each chunk represents one complete unit: verse plus translation plus brief commentary. That chunk is what gets embedded and stored.

Think about it this way. If you were building a data warehouse, you wouldn’t dump raw rows into a fact table without defining your grain first. The grain of your DW determines everything downstream. The grain of your chunks determines everything downstream in RAG. Same principle.

Your chunking strategy should answer: what is the smallest meaningful unit in my content? Start there.
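Here’s a minimal sketch of what verse-grain chunking can look like. The marker format (a bare “2.47”-style chapter.verse line) and the field names are assumptions for illustration, not how DharmaSutra’s actual parser works:

```python
import re

# Minimal verse-grain chunker. Assumes (hypothetically) that each unit starts
# with a "chapter.verse" marker on its own line, followed by the Sanskrit verse,
# its translation, and brief commentary, until the next marker.
VERSE_MARKER = re.compile(r"^(\d+)\.(\d+)\s*$", re.MULTILINE)

def chunk_by_verse(text: str, scripture: str) -> list[dict]:
    """Split a scripture text into one chunk per complete verse unit."""
    markers = list(VERSE_MARKER.finditer(text))
    chunks = []
    for i, m in enumerate(markers):
        start = m.end()
        end = markers[i + 1].start() if i + 1 < len(markers) else len(text)
        chunks.append({
            "text": text[start:end].strip(),
            # Metadata travels with the chunk so answers can cite chapter/verse.
            "metadata": {
                "scripture": scripture,
                "chapter": int(m.group(1)),
                "verse": int(m.group(2)),
            },
        })
    return chunks

sample = ("2.47\nkarmany evadhikaras te ma phaleshu kadachana\n"
          "You have a right to action alone, never to its fruits.\n"
          "2.48\nyoga-sthah kuru karmani\n"
          "Perform your duty established in yoga.")
chunks = chunk_by_verse(sample, "Bhagavad Gita")
```

The key point is that the chunk boundary follows the document’s structure, so a verse never gets separated from its translation.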

2. Embedding Model Choice Changes Everything

I cannot overstate how important this decision is, and how little tutorials discuss it.

Most tutorials use OpenAI’s text-embedding-ada-002 or the newer text-embedding-3-small. These are excellent models. For English content, they work extremely well.

For Sanskrit/Hindi/English mixed content? Not so much.

I needed a multilingual embedding model that could understand semantic similarity across language boundaries. A user asking “what is dharma” in English should retrieve relevant chunks even if the most relevant passage is in Hindi or Sanskrit transliteration.

I tested several options:

Model                          | Multilingual Support | Quality        | Cost
OpenAI text-embedding-3-small  | Limited              | High (English) | Low
Google text-embedding-004      | Good                 | High           | Low
multilingual-e5-large          | Strong               | High           | Self-hosted
Cohere embed-multilingual-v3.0 | Excellent            | High           | Medium

For DharmaSutra, I ended up using a combination: Google’s embedding model for broad multilingual coverage plus a specialized approach for Sanskrit content.

The lesson: match your embedding model to your content’s language and domain. Do not skip this decision.

3. Pure Vector Similarity Is Not Enough

Every beginner tutorial shows you the same retrieval flow. Embed the query. Find the nearest vectors. Return the top-k results. Feed to LLM.

This works okay. In production, “okay” is not good enough.

Here’s what I found: pure semantic search is excellent at finding conceptually similar content, even when the exact words differ. But it struggles with specific names, proper nouns, and exact terminology. In a corpus of Hindu scriptures, the name “Arjuna” appears thousands of times. A semantic search for “Arjuna” might retrieve passages that are conceptually about warriors in general rather than the specific character.

Hybrid search solves this. You run both semantic search (vector similarity) and keyword search (BM25 or similar) in parallel, then combine the results.

The improvement in retrieval quality was significant when I made this switch. Semantic search handles the “what does the Gita teach about” part. Keyword search handles the “find me the specific verse number” part. Together they cover both use cases.

In LangChain, this is the EnsembleRetriever. In LlamaIndex, it’s the QueryFusionRetriever. None of the tutorials I read in 2024 mentioned this. It should be the default.
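To see the fusion idea without any framework, here is a dependency-free sketch of Reciprocal Rank Fusion, one common way hybrid retrievers merge the two ranked lists (the document IDs are made up for illustration):

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked result lists with Reciprocal Rank Fusion (RRF).

    Each ranking is a list of document IDs, best first. A document's fused
    score is the sum of 1 / (k + rank) over every list it appears in, so
    documents ranked well by both retrievers rise to the top.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical top-4 results from each retriever for the query "Arjuna":
semantic = ["warrior_duty", "gita_2_47", "arjuna_doubt", "kurukshetra"]
keyword  = ["arjuna_doubt", "arjuna_vishada", "gita_2_47", "warrior_duty"]

fused = reciprocal_rank_fusion([semantic, keyword])
```

Here “arjuna_doubt” wins because both retrievers rank it well, which is exactly the behavior you want from hybrid search.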

4. Hallucination Prevention at the Prompt and Architecture Level

You’re probably wondering how I handled the hardest problem: making sure the system didn’t make up scripture.

Here’s what most people miss: hallucination prevention is not just a prompt engineering problem. It’s an architecture problem.

At the architecture level, I implemented strict retrieval grounding. The system is not allowed to answer from general LLM knowledge. If it cannot find a relevant passage in the retrieved chunks, it says so. The prompt explicitly instructs: “Answer only using the provided source passages. If the passages do not contain sufficient information, say so. Do not draw on your general knowledge.”

At the prompt level, every answer is required to include a citation: which scripture, which chapter, which verse. If the model cannot cite a verse, it cannot make the claim.

This combination is not perfect. But it reduced unsourced claims dramatically. When the system did fail, it tended to say “I don’t have enough information” rather than inventing something. That failure mode is much safer than confident hallucination.
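As a sketch, the citation requirement can also be enforced mechanically after generation. The citation format and fallback wording below are assumptions for illustration, not DharmaSutra’s exact implementation:

```python
import re

# Post-generation guard. Assumes (hypothetically) that the prompt instructs the
# model to cite in the form "(Scripture Chapter.Verse)", e.g. "(Bhagavad Gita 2.47)".
# Answers without a citation get replaced by an explicit refusal, never shipped as-is.
CITATION = re.compile(r"\([A-Z][\w ]+ \d+\.\d+\)")

FALLBACK = ("I could not find this in the provided source passages, "
            "so I will not answer from general knowledge.")

def enforce_citation(answer: str) -> str:
    """Pass answers that cite a verse; replace uncited answers with a refusal."""
    return answer if CITATION.search(answer) else FALLBACK

grounded = "Act without attachment to results (Bhagavad Gita 2.47)."
ungrounded = "Krishna says success comes to those who hustle."
```

A regex check is crude, but it turns “the model should cite” from a hope into an invariant.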

For sacred texts specifically, I also added a human review layer for any response that generated a new citation pattern. That’s extra for most use cases. For mine, it was essential.

The broader lesson: define your acceptable failure mode before you build. Would you rather the system say “I don’t know” or risk giving a wrong answer confidently? For most production use cases, humble uncertainty beats confident fabrication.

5. Latency and Cost at Scale

Every query in a RAG system makes at least two expensive calls: one to your vector database (retrieval) and one to your LLM (generation). In a demo, that’s fine. In production with concurrent users, those costs compound.

My architecture went through three versions before I got latency to acceptable levels.

Version 1 (naive): Embed query. Retrieve top-10 chunks. Send all 10 chunks plus query to LLM. Generate answer. Total: 3-4 seconds per query, high token usage.

Version 2 (with re-ranking): Retrieve top-20 chunks. Use a cross-encoder re-ranker to score each chunk against the query. Pass only the top 3 to the LLM. This improved answer quality dramatically and reduced token costs. The re-ranking step added compute time, but the smaller LLM context brought end-to-end latency down to 2-3 seconds.

Version 3 (with caching): Added a semantic cache layer. If someone asks a question that’s semantically close to a question already answered, return the cached response. Common questions (like “what is karma?”) now respond in milliseconds. Unique questions still take 2-3 seconds but that’s acceptable.

The good news is that you don’t have to solve all of this on day one. Start with the naive approach. Measure. Then optimize the bottlenecks.


The Enterprise Data Parallel

I spent years building data warehouses and ETL pipelines before I touched a vector database. When I started building production RAG systems, something clicked: these problems are not new. The patterns are the same.

Chunking is ETL. How you split and structure your documents is exactly like how you define your ETL transformation rules. Garbage in, garbage out. Careful preprocessing in, quality retrieval out.

The vector database is your data warehouse. It stores processed, indexed data optimized for a specific query pattern (semantic similarity rather than relational joins, but the concept is identical). Your ingestion pipeline is your ETL. Your embedding model is your transformation logic.

Retrieval is query optimization. Getting hybrid search tuned, re-ranking implemented, and caching layered in is the same discipline as writing efficient SQL, using indexes correctly, and managing query plans.

Document preprocessing quality is data quality. In data warehousing, bad source data is the root cause of most reporting failures. In RAG, messy source documents are the root cause of most retrieval failures. The discipline is the same: clean your data before it enters the system.

If you have a background in data engineering or data architecture, you already have the mental models for this. RAG is not a foreign concept. It’s a new application of patterns you already know.


Common Mistakes That Kill Production RAG Systems

Before I share the steps to build yours, let me save you the pain of the mistakes I see repeatedly.

Using chunk size 1000 blindly. Your chunk size should match the semantic grain of your content. For contracts, that might be one clause. For scripture, one verse. For support documentation, one FAQ entry. There is no universal right answer.

Ignoring re-ranking. Vector similarity retrieves candidates. Re-ranking selects the best ones. Skipping re-ranking means your LLM gets mediocre context and produces mediocre answers. Add a cross-encoder re-ranker. You’ll see the improvement immediately.
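The flow is easy to sketch. The score() function below is a toy term-overlap stand-in (in a real system it would be a cross-encoder model scoring query and chunk jointly), but the retrieve-many, re-score, keep-few shape is the point:

```python
def score(query: str, chunk: str) -> float:
    # Stand-in scorer: fraction of query terms present in the chunk.
    # A production system would use a cross-encoder model instead.
    q = set(query.lower().split())
    c = set(chunk.lower().split())
    return len(q & c) / len(q) if q else 0.0

def rerank(query: str, chunks: list[str], top_n: int = 3) -> list[str]:
    """Re-score retrieved candidates and keep only the best top_n for the LLM."""
    return sorted(chunks, key=lambda ch: score(query, ch), reverse=True)[:top_n]

candidates = [
    "Perform your duty without attachment to the results of action.",
    "The battlefield of Kurukshetra was vast.",
    "Attachment to results binds the soul; act without attachment.",
    "Arjuna lowered his bow in despair.",
]
best = rerank("attachment to results", candidates, top_n=2)
```

The topically relevant chunks survive; the merely adjacent ones get dropped before they can dilute the LLM’s context.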

Skipping evaluation. How do you know if your RAG system is working? Most teams don’t have a good answer. RAGAS (Retrieval Augmented Generation Assessment) is a framework built specifically to evaluate RAG systems. It measures faithfulness, answer relevance, context precision, and context recall. Run this before you ship.

Not monitoring in production. Which queries are failing? Which ones are returning low-confidence answers? Which ones are taking too long? You need observability. Log your retrievals, your LLM responses, your latencies. RAG systems degrade over time as your knowledge base changes. You need to know when it happens.

Using the cheapest embedding model for specialized content. I get it. Costs matter. But the embedding model determines your retrieval quality ceiling. If your content is specialized (legal, medical, multilingual, technical), match your embedding model to your domain.


The Tech Stack That Actually Works in Production

Here’s what I use and recommend for building production RAG systems in 2026:

Orchestration:
  • LangChain (better ecosystem, more integrations)
  • LlamaIndex (better for document-heavy use cases, excellent ingestion pipeline)

Vector Databases:
  • Chroma for development and prototyping (free, local, easy)
  • Pinecone for production (managed, scalable, reliable)
  • Weaviate for production when you need hybrid search built-in

Embedding Models:
  • OpenAI text-embedding-3-small for English content (cheap, excellent)
  • Google text-embedding-004 for multilingual content
  • Cohere embed-multilingual-v3.0 for the best multilingual quality

LLMs:
  • GPT-4o for quality-critical applications
  • Gemini 1.5 Pro for large context windows and multilingual content
  • Claude 3.5 Sonnet for long-document reasoning

Evaluation:
  • RAGAS (non-negotiable: evaluate before you ship)

Observability:
  • LangSmith (if using LangChain)
  • Arize Phoenix (model-agnostic, excellent for RAG tracing)


How to Build Your Production RAG App: 7 Steps

This is the process I follow now, after learning the hard way.

  1. Preprocess your documents thoughtfully. Clean your source content. Remove headers, footers, page numbers, artifacts. Identify the semantic grain of your content. Define your chunking strategy based on that grain, not on a default parameter.
  2. Choose your embedding model based on your content. English-only? OpenAI works great. Multilingual? Use Google or Cohere. Domain-specific? Consider fine-tuning or domain-adapted models.
  3. Pick the right vector database for your scale. Chroma for development. Pinecone or Weaviate for production. Consider whether you need managed infrastructure or can self-host.
  4. Implement hybrid retrieval. Combine semantic search and keyword search from the start. Do not build pure vector search and try to retrofit hybrid later. It’s much harder to add than to start with.
  5. Add a re-ranker. After retrieval, score your candidates with a cross-encoder model. Pass only the top 3-5 chunks to your LLM. Your answer quality will improve measurably.
  6. Evaluate with RAGAS before you ship. Build a test set of 50-100 question-answer pairs. Run RAGAS metrics. Set a minimum threshold. Do not ship below that threshold.
  7. Monitor in production. Log everything. Track latency, retrieval quality, user feedback. Set up alerts for degradation. Your knowledge base changes. Your system needs to keep up.

This Works for SMBs, Not Just Big Projects

You might be thinking: “DharmaSutra sounds like a complex, specialized project. My use case is simpler.”

Here’s the thing. The same architecture applies to:

  • Customer support documentation (users asking about product features, troubleshooting steps)
  • Internal knowledge bases (employees searching company policies, SOPs, HR docs)
  • Product catalogs (complex specifications, compatibility matrices, pricing logic)
  • Legal documents (contract review, clause extraction, compliance checking)
  • Medical/clinical guidelines (care protocols, drug interactions, treatment options)

Any business that has documents and needs people to find answers in those documents is a RAG use case. The production principles are the same whether your corpus is 50 documents or 50,000.

The companies I work with through Krishna Worldwide are often surprised at how quickly a well-built RAG system can replace hours of manual search. A support team that spent three hours a day hunting through documentation can get that time back. That’s a real business outcome.

The good news is you do not need to hire a large team or spend millions. A well-designed production RAG system can be built and deployed by a small team if you follow the right architecture from the start.


Where to Go From Here

If reading this made you realize your current RAG setup is closer to “tutorial” than “production,” you’re not alone. Most systems I see in the wild are one PDF and a prayer.

The gap is closeable. The architecture is not secret. The tools are open source or low-cost. What’s missing for most people is a structured path from concept to production-grade system.

That’s what my AI Engineering course covers. We go from the basics of how LLMs work all the way through building, evaluating, and deploying production RAG systems. We build real projects, including a document query system you can actually use. Every architecture decision in this post is something we work through together in the course.

If you’re serious about building AI applications that actually work, that’s where I’d point you.

In the meantime: start with your chunking strategy. It’s the thing most tutorials skip that matters most. Get your grain right, and everything downstream gets easier.


Your Turn To Share

I’ve shared what building DharmaSutra taught me about production RAG. Now I want to hear from you.

What’s the hardest RAG problem you’ve faced? Was it retrieval quality, hallucination, latency, or something else entirely? Or maybe you’re just starting out and wondering where to begin?

Drop your question or experience in the comments. I read every one, and I’m always happy to think through these problems together.

Leave a Comment