Nov 11, 2024
Production insights from implementing retrieval augmented generation at OpenAI, Docker, LangChain, CircleCI, and more.
Let's start with a hard truth: Most retrieval augmented generation (RAG) implementations fail to make it out of the proof-of-concept stage. A recent global survey of 500 technology leaders shows that more than 80% of in-house generative AI projects fall short. But it doesn't have to be this way.
At kapa.ai, we've worked with over 100 technical teams like Docker, CircleCI, Reddit and Monday.com to implement RAG-based systems in production. Below is what we've learned about making it past the proof-of-concept stage.
Before diving into what we've learned, here's a quick primer on RAG. Think of it as giving an AI a carefully curated reference library before asking it questions. Instead of relying solely on its training data (which can lead to hallucinations), RAG-based systems first retrieve relevant information from your knowledge base, then use that to generate accurate answers. It's like the difference between asking someone to recall a conversation from memory versus having them reference the actual transcript.
In practice, this means indexing your knowledge in a vector database—think super-powered search engine—and connecting it to large language models that can use this information to answer questions naturally and accurately. This approach has become the go-to method for building reliable AI systems that can discuss your specific product or domain.
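To make that concrete, here's a toy, dependency-free sketch of the retrieve-then-generate loop. The "vector database" is a bag-of-words index and the LLM call is stubbed out; both are stand-ins for illustration, not any specific library:

```python
# Toy retrieve-then-generate loop: swap in real embeddings, a real vector
# store, and a real LLM call for production.
from collections import Counter
import math

DOCS = [
    "To rotate an API key, open Settings > API Keys and click Rotate.",
    "Release 2.4 adds support for incremental index refreshes.",
]

def embed(text: str) -> Counter:
    # Stand-in for a real embedding model: bag-of-words token counts.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(question: str, k: int = 1) -> list[str]:
    # Rank the knowledge base by similarity to the question, keep top-k.
    q = embed(question)
    return sorted(DOCS, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

def answer(question: str) -> str:
    context = "\n".join(retrieve(question))
    # In production this prompt goes to an LLM; here we just show the grounding.
    return f"Answer using ONLY this context:\n{context}\n\nQuestion: {question}"

print(answer("How do I rotate an API key?"))
```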
1. Carefully Curate Your Data Sources
The old programming adage "garbage in, garbage out" holds especially true for RAG systems. The quality of your RAG system is directly proportional to the quality of your knowledge base.
Here's what many teams get wrong: They dump their entire knowledge base—every Slack message, support ticket, and documentation page from the last decade—into their RAG system. They assume more data equals better results. Instead, we recommend starting with your core content sources.
For technical AI assistants, that often includes primary sources like:
Technical documentation and API references
Product updates and release notes
Verified support solutions
Knowledge base articles
Once you have your primary sources covered, you can thoughtfully expand to secondary sources like Slack channels, forum discussions, and support tickets. But be selective—apply filters like recency (only posts from last year) and authority (only replies from verified community members).
When it comes to implementation, you have options. You can use open-source tools like LangChain, which provides information retrieval connectors to various data sources including Slack. Or if you prefer a no-code approach with pre-built filters and data connectors, that's exactly what we've built at kapa.ai (that's us 👋). Either way, the key is to be intentional about what data you include and how you filter it.
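As a rough illustration, here's what a recency-plus-authority filter might look like at indexing time. The metadata field names ("posted_at", "author_verified") are our own assumptions, not any particular connector's schema:

```python
# Filter secondary sources before indexing: keep only recent posts from
# verified authors, per the recency and authority filters described above.
from datetime import datetime, timedelta

def keep_for_indexing(doc: dict) -> bool:
    one_year_ago = datetime.now() - timedelta(days=365)
    is_recent = doc["posted_at"] >= one_year_ago          # recency filter
    is_authoritative = doc.get("author_verified", False)  # authority filter
    return is_recent and is_authoritative

docs = [
    {"text": "Fix: bump the SDK to v2.", "posted_at": datetime(2024, 9, 1), "author_verified": True},
    {"text": "Old workaround from 2019.", "posted_at": datetime(2019, 3, 5), "author_verified": True},
]
to_index = [d for d in docs if keep_for_indexing(d)]  # keeps only the 2024 post
```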
A final important consideration is separating your public knowledge sources from private data. We recommend maintaining distinct vector stores: one for external data like public documentation, and another for sensitive enterprise data and relevant documents. This separation ensures better security and makes it easier to manage access controls.
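As a sketch, here's one way to express that split, using Chroma as an example store; any vector database with named collections or indexes works the same way:

```python
# Distinct collections keep access controls and data lifecycles independent.
import chromadb

client = chromadb.Client()

public_store = client.get_or_create_collection(name="public_docs")
private_store = client.get_or_create_collection(name="enterprise_docs")

public_store.add(ids=["doc-1"], documents=["Public API reference for /v1/users."])
private_store.add(ids=["doc-2"], documents=["Internal runbook: rotating prod secrets."])

# Queries against public_store can never surface enterprise content.
results = public_store.query(query_texts=["user API"], n_results=1)
```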
2. Implement a Robust Refresh Pipeline
Your data sources for an AI agent are not static—they are constantly evolving. Just look at companies like Stripe, whose documentation repositories see dozens of updates daily. Your RAG system needs to keep up with this pace to stay current with the underlying knowledge base.
Here's what happens if you don't get this right: Your AI starts giving outdated answers, missing critical updates, or worse, mixing old and new information in confusing ways. Yet we've seen teams treat their RAG knowledge base like a one-time setup.
A production-ready system needs automated refresh pipelines. But here's the trick: you don't want to reindex everything every time. That's expensive and unnecessary. Instead, build a delta processing system similar to a Git diff that only updates what's changed. Think of it like continuous deployment for your AI's knowledge.
Key pipeline components to consider:
Change detection system to monitor documentation updates
Content validation to catch breaking layout changes
Incremental updating for efficiency
Version control to track changes
Quality monitoring to prevent degradation
For technical teams building this in-house, here's a practical approach (a minimal change-detection sketch follows the list):
Set up cron jobs to regularly check for content changes
Use a message queue (like RabbitMQ) to handle update processing
Implement validation checks before indexing
Deploy monitoring to track refresh performance
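Here's that change-detection step as a minimal sketch: hash each document and reindex only what changed. Where you persist the hashes and what you do with the returned IDs are left to your pipeline:

```python
# Delta processing: compare content hashes against the previous run and
# return only the IDs of documents that are new or changed.
import hashlib

def content_hash(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def find_deltas(docs: dict[str, str], previous_hashes: dict[str, str]) -> list[str]:
    """Return IDs of new or changed docs; everything else is skipped."""
    changed = []
    for doc_id, text in docs.items():
        h = content_hash(text)
        if previous_hashes.get(doc_id) != h:
            changed.append(doc_id)
            previous_hashes[doc_id] = h
    return changed

# Example: only "getting-started" is reindexed on the second run.
hashes: dict[str, str] = {}
find_deltas({"getting-started": "v1", "api-ref": "v1"}, hashes)  # both new
find_deltas({"getting-started": "v2", "api-ref": "v1"}, hashes)  # ['getting-started']
```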
This can work well, but it requires significant engineering effort to build and maintain. That's actually why we built kapa.ai with automatic content refreshes out of the box. Our platform handles all the complexity of keeping your knowledge base current—from detecting changes to validating content before it goes live.
Whether you build it yourself or use a platform like ours, the key is making sure your RAG system stays as current as your knowledge sources. After all, your AI assistant is only as good as the information it has access to.
As a sidenote: this ability to update your knowledge store without retraining the core model is one of the key benefits of RAG versus fine-tuning. It's what makes RAG particularly powerful for teams with frequently changing documentation.
3. Build Comprehensive Evaluations
Here's where most teams drop the ball: they lack rigorous evaluation frameworks. When you're building RAG applications, you're juggling countless parameters—chunk sizes, embedding models, context windows, citation strategies, and more. Each choice cascades into the next, creating a maze of optimization possibilities.
Modern RAG architectures have evolved far beyond simple embeddings and retrieval. Companies like Perplexity have pioneered techniques like query decomposition, while others push the boundaries with cross-encoder reranking and hybrid search approaches. But here's the catch: you can't optimize what you can't measure.
Vibe checks ("does this answer look right?") might work for proof-of-concepts. You know the drill—throw a few test questions at your system, eyeball the responses, and call it a day. While that's a fine way to start, it won't get you to production.
You have two options for evaluation frameworks:
Open-source tools like Ragas provide out-of-the-box metrics for answer correctness, context relevance, and hallucination detection
Custom evaluation frameworks built specifically for your use case
While tools like Ragas offer a great starting point (see the sketch after this list), they often need significant extension to match real-world needs. Your evaluation framework should cover:
Query understanding accuracy
Citation and source tracking
Response completeness
Hallucination detection
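For reference, here's roughly what an out-of-the-box Ragas harness looks like (v0.1-style API at the time of writing; check the current docs). Note that Ragas calls an LLM under the hood, so it expects an API key such as OPENAI_API_KEY:

```python
# Minimal Ragas evaluation harness: score answers for faithfulness,
# relevance, and context precision against a small labeled set.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision

eval_set = Dataset.from_dict({
    "question": ["How do I rotate an API key?"],
    "answer": ["Open Settings > API Keys and click Rotate."],          # system output
    "contexts": [["API keys are rotated from Settings > API Keys."]],  # retrieved chunks
    "ground_truth": ["Rotate keys from the Settings > API Keys page."],
})

scores = evaluate(eval_set, metrics=[faithfulness, answer_relevancy, context_precision])
print(scores)  # per-metric aggregate scores
```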
The key is building evaluations that match your specific use case. If you're building a product AI copilot for sales, your evaluation criteria will be very different from a system designed for customer support or legal document analysis.
At kapa.ai, we've taken a focused approach: optimizing specifically for accurately answering product questions. Rather than trying to build a one-size-fits-all solution, we've spent years developing evaluation frameworks that reflect real-world usage patterns and customer feedback in this specific domain. We've learned that academic benchmarks and generic evaluation tools only take you so far—you need evaluations that truly reflect your users' needs.
Whether you build your own evaluation framework, extend open-source tools, or use a specialized platform like ours, remember: every improvement to your RAG system should be validated through rigorous testing. It's the only way to ensure your optimizations actually improve the end-user experience rather than just looking good on paper.
4. Optimize Prompting for Your Use Case
Getting your prompting strategy right is crucial for a production RAG system. It's not just about crafting clever prompts—it's about building a comprehensive strategy that aligns with your specific use case. Here are the key principles we've learned from working with technical teams:
A. Ground Your Answers
The first rule of RAG systems is: never let your AI make things up. Your prompting strategy should explicitly instruct the model to (1) only use provided context, and (2) include clear citations for claims.
B. Know When to Say "I Don't Know"
This might sound obvious, but it's crucial: your AI should confidently acknowledge its limitations. A good RAG system should (1) recognize when it lacks sufficient information, (2) suggest alternative resources when possible, and (3) never guess or hallucinate answers.
C. Stay on Topic
A production RAG system needs clear boundaries. Your prompts should ensure the AI (1) stays within its knowledge domain, (2) refuses to answer questions about unrelated products, and (3) maintains consistent tone and formatting.
D. Handle Multiple Sources
Your prompting strategy needs to elegantly handle information from different sources. This includes (1) synthesizing information from multiple documents, (2) handling version-specific information, (3) managing conflicting information, and (4) providing relevant context.
These principles work together. For instance, when handling multiple sources, your system might need to say "I don't know" about newer product versions, or when grounding answers, it might need to acknowledge conflicting information between sources.
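To illustrate, here's a sketch of a system prompt that encodes all four principles. The wording and placeholders are illustrative, not a drop-in production prompt:

```python
# A grounded system prompt template: context-only answers with citations,
# explicit "I don't know" behavior, scope limits, and conflict handling.
SYSTEM_PROMPT = """\
You are a technical assistant for <PRODUCT>.

Rules:
1. Answer ONLY from the numbered context passages below; cite them as [1], [2], ...
2. If the context is insufficient, say "I don't know" and, when possible,
   point the user to <DOCS_URL> instead of guessing.
3. Refuse questions unrelated to <PRODUCT>.
4. If passages conflict (e.g. across versions), say so and state which
   version each answer applies to.

Context:
{context}
"""

def build_prompt(context_chunks: list[str]) -> str:
    numbered = "\n".join(f"[{i + 1}] {c}" for i, c in enumerate(context_chunks))
    return SYSTEM_PROMPT.format(context=numbered)
```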
Implementation Options
When it comes to implementing these principles, you have several approaches:
DIY Approach: Tools like Anthropic's Workbench let you iterate on prompts rapidly, testing different approaches against various scenarios. It's particularly useful for finding the right balance between informative and cautious responses for your specific use case.
Managed Solutions: At kapa.ai, we've built our Answer Engine to handle these challenges out of the box, continuously balancing these principles to provide reliable, accurate responses to technical questions.
The key is to test your prompting strategy extensively with real-world scenarios before deploying to production. Pay particular attention to edge cases and potentially problematic queries that might tempt your system to guess or hallucinate.
5. Implement Security Best Practices
Security can't be an afterthought for production RAG systems. Two major risk factors make RAG systems particularly vulnerable: prompt hijacking (where users craft inputs to manipulate the system's behavior) and hallucinations (where systems generate false or sensitive information, as discussed above). If you're handling customer data or internal documentation, these risks become even more critical. However, there's also a long tail of additional risks that often goes unaddressed in production systems, which we cover here.
A. PII Detection and Masking
Your RAG system needs to handle sensitive information carefully. Users often accidentally share sensitive data in their questions—API keys in error messages, email addresses in examples, or customer information in support tickets. Once this information enters your system, it's hard to guarantee it's completely removed.
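A minimal sketch of pre-ingestion masking with regexes is below. Real systems typically add a dedicated detector (e.g. NER-based); these patterns are illustrative, not exhaustive:

```python
# Mask obvious PII before text enters the index or a prompt.
import re

PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "API_KEY": re.compile(r"\b(?:sk|pk|key)[-_][A-Za-z0-9]{16,}\b"),
}

def mask_pii(text: str) -> str:
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"<{label}>", text)
    return text

print(mask_pii("Error for jane@acme.com using sk-abc123def456ghi789"))
# -> "Error for <EMAIL> using <API_KEY>"
```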
B. Bot Protection and Rate Limiting
The moment you deploy a public-facing RAG system, it becomes a target. We've seen cases where unprotected endpoints got hammered with thousands of requests per minute, not just driving up costs but potentially extracting sensitive information. Essential protections include rate limiting, reCAPTCHA integration, and request validation.
Modern solutions are emerging to address these challenges. Cloudflare recently launched their Firewall for AI, showing how the industry is evolving to protect AI systems at scale.
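For illustration, here's a toy in-process sliding-window limiter. In production you'd back this with shared state (e.g. Redis) or push it to the edge:

```python
# Sliding-window rate limiter: allow at most MAX_REQUESTS per client
# per WINDOW_SECONDS; callers should return HTTP 429 on False.
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 60
MAX_REQUESTS = 30  # per client per window

_requests: dict[str, deque] = defaultdict(deque)

def allow_request(client_id: str) -> bool:
    now = time.monotonic()
    q = _requests[client_id]
    while q and now - q[0] > WINDOW_SECONDS:  # drop timestamps outside window
        q.popleft()
    if len(q) >= MAX_REQUESTS:
        return False
    q.append(now)
    return True
```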
C. Access Controls
Not everyone should see everything. Without proper access controls, internal documentation or customer data can leak across team boundaries. Role-based access control ensures your knowledge base stays secure while remaining accessible to the right users. More importantly, it lets you track who's accessing what.
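As a sketch, role-based filtering can happen at retrieval time: tag every chunk with allowed roles and filter results before they reach the LLM. The metadata shape here is an assumption:

```python
# Drop any retrieved chunk the current user's roles don't grant access to.
def filter_by_role(chunks: list[dict], user_roles: set[str]) -> list[dict]:
    return [c for c in chunks if c["allowed_roles"] & user_roles]

chunks = [
    {"text": "Public API reference.", "allowed_roles": {"everyone"}},
    {"text": "Internal pricing playbook.", "allowed_roles": {"sales"}},
]
visible = filter_by_role(chunks, user_roles={"everyone"})  # public chunk only
```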
You can build these protections yourself using established security libraries and services like Cloudflare's AI Firewall. Alternatively, managed solutions like kapa.ai include these protections out of the box in a SOC 2 Type II certified environment.
The key is implementing these protections before you deploy—not after an incident.
6. Conclusion: Making it Work
After working with hundreds of teams, here's what successful RAG implementations have in common:
A. Start Small, Start Strong
Begin with:
A focused set of high-quality documentation
One or two well-defined use cases
Clear evaluation metrics
Basic security measures
B. Common Pitfalls to Avoid
Including too much data too quickly
Neglecting refresh pipelines
Relying on manual testing
Treating security as an afterthought
Whether you build it yourself with open-source tools or use a managed solution like kapa.ai, these principles remain the same. The key is treating your RAG system as core infrastructure rather than an experimental add-on.
Want to learn more about implementing RAG for your technical product questions? If you are interested in seeing how kapa.ai can transform your knowledge base into an intelligent assistant, you can test it out by requesting a demo here.