Feb 21, 2025
A debrief of our recent exploration of OpenAI’s o3-mini reasoning models in the context of a Retrieval-Augmented Generation (RAG) system.


kapa.ai is an LLM-powered AI assistant that answers technical questions about your products. We integrate with knowledge sources to answer user questions and support tickets through a RAG pipeline.
Building and maintaining a robust, general-purpose RAG system is hard. There are many controls and parameters that influence the quality of the final output, and they all interact in complex ways (a configuration sketch follows this list):
Prompt templates
Context size
Query expansion
Chunking
Reranking
Etc.
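To make these knobs concrete, here is a minimal sketch of the kind of configuration object such a pipeline might expose. The field names and defaults are hypothetical illustrations, not our production settings.

```python
from dataclasses import dataclass

@dataclass
class RAGConfig:
    # Hypothetical tuning knobs; names and defaults are illustrative only.
    prompt_template: str = "Answer using only the provided context:\n{context}\n\nQuestion: {question}"
    max_context_tokens: int = 8_000   # how much retrieved text is passed to the generator
    query_expansion: bool = True      # rewrite/expand the user query before retrieval
    chunk_size: int = 512             # tokens per document chunk at indexing time
    chunk_overlap: int = 64           # overlap between adjacent chunks
    top_k: int = 50                   # candidates fetched from the vector store
    rerank_top_n: int = 8             # candidates kept after reranking
```

Each of these values interacts with the others: a smaller chunk size may call for a larger top_k, a bigger context window changes what reranking should keep, and so on.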
As we make changes to the system, and particularly when integrating new models, revisiting and refining these parameters is crucial to maintaining good performance. But this is time-consuming and takes experience to get right.
New reasoning models like DeepSeek-R1 and OpenAI’s o3-mini produce impressive results through built-in Chain-of-Thought (CoT) reasoning, where the model is designed to “think” through problems step by step and self-correct when needed. These models reportedly perform better on difficult challenges that require logical reasoning, where the answer to a question is verifiable.
So we asked ourselves: if reasoning models can break down complex challenges and self-correct, could we also use them in our RAG pipeline for handling tasks like query expansion, document retrieval, and reranking? By building an information retrieval toolbox and handing it to a reasoning model, we might be able to build a more adaptive system that adjusts to changes in data and models, reducing the need for constant manual tweaking by a human.
This paradigm is sometimes referred to as Modular RAG. In this article, we’ll share our findings from recent research on refactoring a standard RAG pipeline to a pipeline based on reasoning models.
Hypothesis
The main reason for exploring this idea was to see whether it could let us simplify the pipeline and remove the need for manual parameter tuning. The core components of RAG pipelines are dense embeddings and document retrieval. A typical advanced RAG pipeline goes something like this (a code sketch follows the list):
Receive the user’s prompt.
Preprocess the query to improve information retrieval.
Find relevant documents through a similarity search in a vector database.
Rerank the results and use the most relevant documents.
Generate a response.
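In code, this fixed pipeline amounts to a hard-wired sequence of calls. The sketch below is purely illustrative: expand_query, vector_search, rerank, and generate are hypothetical stand-ins for whatever implementations a given stack uses.

```python
def answer(question: str) -> str:
    """A deliberately linear RAG pipeline: every step always runs, in this fixed order.
    expand_query, vector_search, rerank and generate are hypothetical stand-ins."""
    query = expand_query(question)                       # 2. preprocess the query
    candidates = vector_search(query, top_k=50)          # 3. similarity search in the vector DB
    context = rerank(query, candidates, top_n=8)         # 4. keep only the most relevant documents
    return generate(context=context, question=question)  # 5. generate the final response
```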
Each step in the pipeline is optimized through heuristics, e.g., filtering rules and ranking adjustments, to prioritize relevant data. These hard-coded optimizations define the pipeline’s behavior but also limit its adaptability.
To let a reasoning model use the components of our pipeline, we had to set things up differently. Rather than defining a linear sequence of steps, we needed to make each component an independent module that the model could call.
In this architecture, instead of following a fixed pipeline, a model with reasoning capabilities dynamically controls its own workflow. Through tool use, the model determines when and how often to run a full retrieval or a simpler lookup, and which retrieval parameters to use. If successful, this approach could potentially replace traditional RAG orchestration frameworks like LangGraph.
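As a rough sketch of what “each component as an independent module” looks like in practice, the retrieval steps can be exposed to the model as tools using the standard function-calling schema. The tool names and parameters below are hypothetical, not our actual module interfaces.

```python
# Hypothetical tool definitions exposing retrieval modules to the reasoning model.
tools = [
    {
        "type": "function",
        "function": {
            "name": "search_documents",
            "description": "Run a similarity search against the vector database.",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {"type": "string", "description": "Search query, possibly rewritten."},
                    "top_k": {"type": "integer", "description": "Number of candidates to retrieve."},
                },
                "required": ["query"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "rerank_documents",
            "description": "Rerank previously retrieved candidates and return the best ones.",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {"type": "string"},
                    "top_n": {"type": "integer"},
                },
                "required": ["query"],
            },
        },
    },
]
```

With the modules exposed this way, the model decides which tool to call, with which arguments, and how many times, instead of the pipeline dictating it.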
A more modular system could also unlock a few other advantages:
Individual modules can be swapped or upgraded without overhauling the entire pipeline.
Clearer separation of responsibilities makes debugging and testing more manageable.
Different modules (e.g., retrievers with different embeddings) can be tested and swapped out to compare performance.
Modules can scale independently for different data sources.
It could potentially allow us to build different modules tailored for specific tasks or domains.
Lastly, another use case we wanted to explore was whether this approach could help us “short-circuit” abusive or off-topic queries more effectively. The hardest cases involve ambiguity, where it’s unclear whether a query is related to the product, and abusive queries are often deliberately crafted to evade detection. While simpler cases can already be handled efficiently, we hoped reasoning models would let us identify more complex cases and exit from them earlier.
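One way to let the model short-circuit is to give it an explicit early-exit tool alongside the retrieval tools, so it can end a run without retrieving anything. This is a hypothetical sketch of the idea, not our production guardrail.

```python
# Hypothetical early-exit tool: the model can call this instead of retrieving,
# ending the run immediately with a refusal or redirect.
early_exit_tool = {
    "type": "function",
    "function": {
        "name": "end_conversation",
        "description": "Stop processing when the query is abusive or clearly unrelated to the product.",
        "parameters": {
            "type": "object",
            "properties": {
                "reason": {"type": "string", "enum": ["off_topic", "abusive"]},
            },
            "required": ["reason"],
        },
    },
}
```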
Testing setup
To trial this workflow, we set up a sandboxed RAG system with the necessary components, static data, and an LLM-as-a-judge evaluation suite. As a baseline configuration, we used a typical fixed, linear pipeline with hard-coded optimizations built in.
For the modular RAG pipeline, using o3-mini as the reasoning model, we ran various configurations of the pipeline under different strategies to evaluate what worked well and what didn’t:
Tool usage: we tried giving the model full access to all the tools and the entire pipeline, and we also tried restricting tool usage to a single tool in combination with a fixed linear pipeline.
Prompting and parameterization: we tested both open-ended prompts with minimal instructions and highly structured prompts. We also experimented with various degrees of pre-parameterized tool calls versus letting the model decide on the parameters itself.
For all the tests, we capped tool usage at a maximum of 20 calls per query. We also ran every test under both medium and high reasoning effort, as sketched after this list:
Medium: Shorter CoT steps
High: Longer CoT steps with more detailed reasoning
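For reference, here is a minimal version of the agent loop behind these runs, showing where the tool-call cap and the reasoning-effort setting plug in. It assumes the OpenAI Python SDK; execute_tool is a hypothetical dispatcher into our retrieval modules, and the loop is a simplified sketch rather than our exact harness.

```python
from openai import OpenAI

client = OpenAI()
MAX_TOOL_CALLS = 20  # hard cap per query, as in our tests

def run_modular_rag(question: str, tools: list, effort: str = "medium") -> str:
    """Minimal agent loop: let o3-mini call retrieval tools until it answers or hits the cap.
    execute_tool() is a hypothetical dispatcher into the retrieval modules."""
    messages = [{"role": "user", "content": question}]
    calls_used = 0
    while True:
        response = client.chat.completions.create(
            model="o3-mini",
            reasoning_effort=effort,  # "medium" or "high" in our runs
            messages=messages,
            tools=tools,
            # Once the cap is reached, disallow further tool calls and force an answer.
            tool_choice="auto" if calls_used < MAX_TOOL_CALLS else "none",
        )
        msg = response.choices[0].message
        if not msg.tool_calls:
            return msg.content  # model produced its final answer
        messages.append(msg)
        for call in msg.tool_calls:
            calls_used += 1
            # call.function.arguments is a JSON string; execute_tool handles parsing.
            result = execute_tool(call.function.name, call.function.arguments)
            messages.append({"role": "tool", "tool_call_id": call.id, "content": result})
```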
In total, we ran 58 evaluations of different modular RAG configurations.
Results
Our experiments showed mixed results. In some configurations, we observed modest gains, most notably in areas like code generation and, to a limited extent, factuality. However, key metrics such as information retrieval quality and knowledge extraction remained largely unchanged when compared to our traditional, manually tuned workflow.
A recurring theme throughout our tests was the increased latency introduced by Chain-of-Thought (CoT) reasoning. While deeper reasoning allows the model to break down complex queries and self-correct, it comes at the cost of additional time required for iterative tool calls.
The most significant challenge we found was the “reasoning ≠ experience” fallacy: the reasoning model, despite its ability to think step by step, lacks prior experience with retrieval tools. Even with strict prompting, it struggled to retrieve quality results and to differentiate between good and bad outputs. The model often hesitated to use the tools we provided, much as we saw in our previous experiments with o1 last year. This highlights a broader issue: reasoning models excel at abstract problem-solving, but optimizing tool usage without prior training remains an open challenge.
Key takeaways
Our experiments reveal a clear “reasoning ≠ experience” fallacy: a reasoning model does not inherently understand retrieval tools. It knows what a tool does and what it is for, but not how to use it well; it lacks the tacit knowledge a human builds up after working with the tool. Unlike traditional pipelines, where that experience is encoded in heuristics and optimizations, reasoning models must be explicitly taught how to use tools effectively.
Despite o3-mini's ability to handle larger contexts, we observed no significant improvement over models like 4o or Sonnet in terms of knowledge extraction. Simply increasing context size is not a magic fix for retrieval performance.
Increasing the model’s reasoning effort only marginally improved factual accuracy. Our dataset focused on technical content relevant to real-world use cases, rather than math competition problems or advanced coding challenges. The impact of reasoning effort may vary depending on the domain, with different results possible for datasets containing more structured or computationally complex queries.
One area where the model did excel was code generation, suggesting that reasoning models might be particularly useful in domains requiring structured, logical output rather than pure retrieval.
Reasoning ≠ experience fallacy
The key takeaway from our experiment is that reasoning models do not naturally possess tool-specific knowledge. Unlike a finely tuned RAG pipeline, which encodes retrieval logic into predefined steps, a reasoning model approaches each retrieval call from a blank slate. This leads to inefficiencies, hesitations, and suboptimal tool usage.
To mitigate this, a few possible strategies come to mind. Refining prompting further, i.e., structuring tool-specific instructions to give the model more explicit guidance, would likely help. Pre-training or fine-tuning a model on tool usage could also help it gain familiarity with specific retrieval mechanisms.
Additionally, a hybrid approach could be considered, where predefined heuristics handle certain tasks while reasoning models selectively intervene where needed.
These ideas are still speculative, but they point to ways we might bridge the gap between reasoning capabilities and practical tool execution.
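To make the hybrid idea concrete, a minimal sketch might route straightforward queries through the fixed pipeline and hand only ambiguous or complex ones to the reasoning agent. Here, looks_complex is a hypothetical classifier (it could be a heuristic or a small model), and answer / run_modular_rag refer to the earlier sketches; none of this is something we have built.

```python
def hybrid_answer(question: str) -> str:
    """Speculative hybrid routing: heuristics by default, reasoning agent only when needed."""
    if looks_complex(question):                  # hypothetical complexity/ambiguity classifier
        return run_modular_rag(question, tools)  # reasoning model orchestrates retrieval itself
    return answer(question)                      # fixed, manually tuned pipeline handles the rest
```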
Conclusion
While we couldn't see a clear advantage of reasoning-based modular RAG over traditional pipelines within the scope of our use case, the experiment did provide valuable insights into its potential and limitations. The flexibility of a modular approach remains appealing: it allows for greater adaptability, easier upgrades, and dynamic adjustments to new models or data sources.
Moving forward, some promising techniques for further exploration include:
Using different prompting strategies and pre-training/fine-tuning to improve how models understand and interact with retrieval tools.
Using reasoning models strategically in parts of the pipeline, e.g. for tasks such as complex question answering or code generation, rather than having them orchestrate the entire workflow.
At this stage, reasoning models like o3-mini do not yet surpass traditional RAG pipelines in core retrieval tasks within reasonable time constraints. As models advance and tool usage strategies evolve, a reasoning-based modular RAG system could become a viable alternative, particularly for domains requiring dynamic, logic-intensive workflows.

Turn your knowledge base into a production-ready AI assistant
Request a demo to try kapa.ai on your data sources today.