Nov 25, 2024
Nine Technical Strategies for Reducing Hallucination
An AI assistant casually promises a refund policy that never existed, leaving a company liable for an invented commitment. This incident with Air Canada’s chatbot is a clear example of 'AI hallucination,' where AI can generate confident, yet entirely fictional, answers. These errors—ranging from factual inaccuracies and biases to reasoning failures—are collectively referred to as 'hallucinations.'
In simple terms, Large Language Models (LLMs) work like advanced 'autocomplete' tools, generating content by predicting the next word in a sequence based on patterns in their training data. This process is like 'filling in the blanks' without understanding the topic. Since LLMs lack true reasoning skills to check their outputs, they rely only on word probability patterns rather than comprehension.
At kapa.ai we've worked with over 100 technical teams like Docker, CircleCI, Reddit and Monday.com to implement LLMs in production, so we wanted to share what we've learned about managing hallucinations, which starts with a technical deep-dive into what they are.
1. Why Are LLM Hallucinations Important?
As Artificial Intelligence (AI) models become central to information retrieval and decision-making, trust in these technologies is paramount. AI chatbots have produced several well-known misleading statements that have led to trust and reputational issues:
Misinformation: Google's Bard once inaccurately stated in a promotional video that the James Webb Space Telescope captured the first image of an exoplanet, when that image was actually taken by the European Southern Observatory's Very Large Telescope.
Ethical concern: Microsoft's AI chatbot generated inappropriate responses, such as professing emotions and attributing motivations to itself, which led to user discomfort and raised ethical questions about AI behavior.
Legal implications: A lawyer who used ChatGPT for legal research submitted a filing containing fabricated citations and quotes, resulting in a fine, reputational damage, and wasted judicial resources.
These examples highlight how AI hallucinations can create reputational and trust issues for organizations.
2. Why Do LLMs Hallucinate?
LLM hallucinations stem from three core technical challenges: (A) model architecture limitations, (B) fundamental constraints of probabilistic generation, and (C) training data gaps.
A. Design and Architectural Constraints:
Theoretical limitations: The transformer attention mechanism lets an LLM focus on the parts of the input that are relevant, but a fixed attention window restricts how much input context the model can retain, so earlier content gets 'dropped' when sequences are too long. This constraint often causes a breakdown in coherence and increases the likelihood of hallucinated or irrelevant content in longer outputs.
Sequential token generation: LLMs generate responses one token at a time. Each token is conditioned only on the tokens generated before it, and there is no way to revise earlier output. This design limits real-time error correction, causing initial mistakes to escalate into confidently incorrect completions.
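To make these two constraints concrete, here is a toy sketch. The window size and the fake next_token function are invented for illustration and bear no resemblance to a real model; the point is only that generation proceeds strictly left to right, each step sees a truncated window, and already-emitted tokens can never be revised.

```python
# Toy illustration only: greedy, token-by-token generation with a fixed context
# window. The "model" below is a fabricated stand-in, not a real LLM.
from typing import List

WINDOW = 8  # pretend the model can only attend to the last 8 tokens

def next_token(context: List[str]) -> str:
    # A real model would score the whole vocabulary and pick or sample a token;
    # here we just fabricate something deterministic.
    return f"tok{len(context) % 5}"

def generate(prompt: List[str], max_new_tokens: int = 12) -> List[str]:
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        visible = tokens[-WINDOW:]           # earlier tokens silently fall out of view
        tokens.append(next_token(visible))   # emitted tokens can never be revised
    return tokens

print(generate(["the", "refund", "policy", "says"]))
```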
B. Probabilistic Output Generation:
Limitations of generative models: Generative AI models can produce responses that, while appearing plausible, lack true comprehension of the subject matter. For example, a supermarket's AI meal planner suggested a recipe that would produce toxic chlorine gas, describing it as "the perfect nonalcoholic beverage to quench your thirst and refresh your senses", showing how, even assuming it was trained on valid data, AI can generate unsafe outputs without understanding context.
Unclear input handling: When faced with ambiguous or vague prompts, LLMs attempt to “fill in the blanks,” leading to speculative and sometimes incorrect responses.
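The following toy snippet shows why plausible-but-wrong outputs happen: the next word is simply sampled from a probability distribution, so a fluent but unsafe continuation can be chosen because the model assigns it high probability, not because it is true. The vocabulary and logits are invented.

```python
# Minimal sketch of probabilistic next-token sampling with temperature.
# The vocabulary and logits are made up; the point is that tokens are picked
# by probability, not by checking whether the resulting claim is correct.
import math
import random

vocab  = ["chlorine", "lemonade", "water", "bleach"]
logits = [2.1, 2.0, 1.8, 0.5]   # hypothetical scores for the next word

def sample(logits, temperature=1.0):
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(l - m) for l in scaled]   # numerically stable softmax
    probs = [e / sum(exps) for e in exps]
    return random.choices(vocab, weights=probs, k=1)[0], probs

token, probs = sample(logits, temperature=0.8)
print(dict(zip(vocab, [round(p, 2) for p in probs])), "->", token)
```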
C. Training Data Gaps:
Exposure bias: During training, models predict the next token from 'ground truth' prefixes provided by human annotators. During inference, however, they must condition on their own previously generated tokens. This mismatch creates a feedback loop in which small early mistakes compound over the course of generation, often causing the output to drift away from coherence and accuracy.
Training data coverage gaps: Despite being trained on vast datasets, models often lack coverage of rare or niche information. When queried on these topics, they tend to produce responses that contain hallucinations. Underrepresented patterns and overfitting to frequently occurring information hurt generalization, especially for out-of-scope inputs.
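Here is a fabricated contrast of the two regimes behind exposure bias. The bigram table and tokens are invented; during training the model always sees the correct prefix, while at inference a single early sampling error feeds every subsequent prediction.

```python
# Fabricated example of exposure bias. The "model" is a hard-coded bigram lookup.
ground_truth = ["mix", "water", "with", "lemon", "juice"]
bigrams = {"mix": "water", "water": "with", "with": "lemon", "lemon": "juice"}

def predict(prev: str) -> str:
    return bigrams.get(prev, "bleach")  # unseen prefix -> confident nonsense

# Training (teacher forcing): every prediction conditions on the true prefix.
teacher_forced = [predict(prev) for prev in ground_truth[:-1]]

# Inference (free running): predictions feed back in, so one early error
# ("soda" sampled instead of "water") keeps propagating.
generated, prev = [], "soda"
for _ in range(4):
    prev = predict(prev)
    generated.append(prev)

print("teacher forcing:", teacher_forced)  # ['water', 'with', 'lemon', 'juice']
print("free running  :", generated)        # drifts into nonsense
```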
3. How to Mitigate AI Hallucinations?
While hallucinations in LLMs cannot be eliminated entirely, they can be significantly reduced through a three-layer defense strategy: (A) input layer controls that optimize queries and context, (B) design layer implementations that enhance model architecture and training, and (C) output layer validations that verify and filter responses.
Each layer serves as a critical checkpoint, working together to improve the reliability and accuracy of AI outputs. Let's explore a high-level introduction to these techniques within each layer:
A. Input Layer Mitigation Strategies
Design and deploy layers that process queries before they reach the model. These layers assess ambiguity, relevance, and complexity, ensuring queries are refined and optimized for better model performance.
Query processing: Evaluate whether the query contains sufficient context or needs clarification. Refine the query to make it more relevant, for example by discarding irrelevant noise. Assess the query's complexity to trigger different behaviors, such as routing simple requests to smaller models or generating clarification questions when uncertainty is high.
Context size optimization: Compress the input so that more relevant context fits into the model's context window without losing quality, for example by using self-information filtering to retain only the key content.
Context injection: This technique involves defining and "injecting" a contextual template or structured prompt before the user's main query to help the model understand it better. Structuring prompts with specific role tags, content separators, and turn markers helps the model parse complex prompts and keeps responses closer to what the user wants rather than unsupported or speculative output (see the input-layer sketch below).
One way to reduce hallucinations might be to “do it the Apple way” and simply ask the model to “Do not hallucinate.”
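To make the input-layer ideas concrete, here is a minimal sketch combining a naive ambiguity check with a structured "context injection" template. The tags, thresholds, product name, and wording are illustrative assumptions, not a prescribed format.

```python
# Input-layer sketch: rough ambiguity check + structured prompt assembly.
# All tags, thresholds, and wording below are invented for illustration.

SYSTEM_TEMPLATE = """<role>You are a support assistant for ACME Docs.</role>
<rules>Answer ONLY from the provided context. If the context does not contain
the answer, reply "I don't know" instead of guessing.</rules>"""

def needs_clarification(query: str) -> bool:
    """Very rough heuristic: short or pronoun-only queries are probably ambiguous."""
    return len(query.split()) < 3 or query.strip().lower() in {"it", "that", "why"}

def build_prompt(query: str, context_chunks: list[str]) -> str:
    context = "\n---\n".join(context_chunks)  # separators between retrieved chunks
    return f"{SYSTEM_TEMPLATE}\n<context>\n{context}\n</context>\n<user>{query}</user>"

query = "How do I rotate an API key?"
if needs_clarification(query):
    print("Ask the user a clarifying question instead of calling the model.")
else:
    prompt = build_prompt(query, ["API keys can be rotated under Settings > Security."])
    print(prompt)  # this string is what would be sent to the LLM
```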
B. Design Layer Mitigation Strategies
The design layer focuses on enhancing how models process and generate information through architectural improvements and better training approaches. These strategies work at the core of the model to produce more reliable outputs.
Chain-of-thought prompting: Chain-of-thought (CoT) prompting asks the model to "think" through a problem sequentially rather than immediately producing a final answer, simulating a reasoning process and improving output accuracy and coherence. The CoT approach works best for models with substantial parameter counts, often in the range of 100 billion or more, which can produce step-by-step answers that mirror human-like thought processes. Smaller models often fail to exploit the multi-step dependencies that CoT relies on and can end up less accurate than with standard prompting. A recent tweet by an OpenAI researcher highlights how the shift to the o1 model improves CoT by producing a more uniform information density, resembling a model's 'inner monologue' rather than mimicking pre-training data.
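As a small illustration, here is the difference between a direct prompt and a CoT-style prompt. The question and the exact instruction wording are invented; any phrasing that elicits intermediate steps works along the same lines.

```python
# Sketch of a direct prompt versus a chain-of-thought prompt (wording is illustrative).
question = "A subscription costs $12/month with a 25% annual discount. What is the yearly price?"

direct_prompt = f"{question}\nAnswer:"

cot_prompt = (
    f"{question}\n"
    "Think step by step: first compute the undiscounted yearly price, "
    "then apply the discount, and only then state the final answer."
)

# With a capable model, the CoT prompt tends to surface the intermediate arithmetic:
#   12 * 12 = 144; 25% of 144 = 36; 144 - 36 = 108 -> $108
# whereas the direct prompt is more likely to jump straight to a (possibly wrong) number.
print(cot_prompt)
```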
Retrieval-Augmented Generation (RAG): RAG extends an LLM with a retrieval mechanism that draws relevant, timely information from external data sources, anchoring outputs in factual context and reducing hallucinations. Researchers describe three paradigms of RAG:
The simplest form of RAG is Naive RAG, where the top-ranked retrieved documents are fed directly into the LLM's prompt.
Advanced RAG builds upon this by introducing additional pre- and post-processing steps, including query expansion, subquery generation, Chain-of-Verification, and document reranking, to further refine the relevance of its retrieved chunks. We've covered our recommended advanced RAG techniques in another post.
Modular RAG transforms the traditional RAG pipeline into a flexible, reconfigurable framework, like LEGO, with adaptive retrieval and prediction modules that retrieve new data only when the model is uncertain. This configurability makes it easier to support iterative querying for complex, multi-step reasoning. Features such as a memory pool let the system store conversational context for coherence in dialogue applications. Retrieval precision can be improved further with customizable components such as hierarchical indexing and metadata tagging; hierarchical structures like knowledge graphs represent relationships between entities that better align with the intent of the query. Techniques such as Small2Big chunking or sentence-level retrieval offer additional relevance optimizations.
Comparison between the three paradigms of RAG (Gao et al. 2024)
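To ground the Naive RAG baseline, here is a minimal sketch: embed documents, retrieve the most similar ones for a query, and prepend them to the prompt. The embedding model name, the documents, and the prompt wording are assumptions; swap in your own retrieval and generation stack.

```python
# A minimal "naive RAG" sketch. Model name, documents, and prompt are illustrative.
import numpy as np
from sentence_transformers import SentenceTransformer

docs = [
    "Refunds are available within 30 days of purchase with a valid receipt.",
    "API keys can be rotated from the Settings > Security page.",
    "The free tier includes 1,000 requests per month.",
]

embedder = SentenceTransformer("all-MiniLM-L6-v2")
doc_vecs = embedder.encode(docs, normalize_embeddings=True)

def retrieve(query: str, k: int = 2) -> list[str]:
    q_vec = embedder.encode([query], normalize_embeddings=True)[0]
    scores = doc_vecs @ q_vec                 # cosine similarity (vectors are normalized)
    top = np.argsort(scores)[::-1][:k]
    return [docs[i] for i in top]

query = "What is your refund policy?"
context = "\n".join(retrieve(query))
prompt = (
    "Answer using ONLY the context below. If the answer is not in the context, say so.\n"
    f"Context:\n{context}\n\nQuestion: {query}"
)
print(prompt)  # this prompt would then be passed to the LLM of your choice
```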
Fine-tuning: Ideal for scenarios with sufficient task-specific training data and standardized tasks. Fine-tuning tailors a model on domain- or task-specific data, improving its accuracy in specialized domains where the general pretraining data is imprecise. It does not entirely override the pre-trained weights but updates them, allowing the model to absorb new information without losing foundational knowledge. This balance helps the model maintain general context and prevents the loss of essential pre-existing understanding.
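For illustration, here is a minimal fine-tuning sketch using Hugging Face transformers, datasets, and peft with LoRA adapters, which update a small set of low-rank weights so the pre-trained knowledge is largely preserved. The base model name, data file, and hyperparameters are placeholders, not a recommended recipe.

```python
# Minimal LoRA fine-tuning sketch. Model name, data file, and hyperparameters are placeholders.
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

base_model = "meta-llama/Llama-2-7b-hf"       # assumed base model
tokenizer = AutoTokenizer.from_pretrained(base_model)
tokenizer.pad_token = tokenizer.eos_token     # common workaround for models without a pad token
model = AutoModelForCausalLM.from_pretrained(base_model)

# LoRA trains small low-rank adapter matrices instead of all weights,
# adding domain knowledge while leaving the original weights mostly intact.
model = get_peft_model(model, LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM",
))

dataset = load_dataset("json", data_files="domain_qa.jsonl")["train"]  # hypothetical domain data
tokenized = dataset.map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=512),
    remove_columns=dataset.column_names,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", per_device_train_batch_size=4,
                           num_train_epochs=1, learning_rate=2e-4),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```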
C. Output Layer Mitigation Strategies
While the input and design layers aim to prevent hallucinations from occurring, the output layer acts as a final line of defense by filtering and verifying generated content. These verification methods help ensure that only accurate and relevant information reaches the end user:
Rule-based filtering: Use rule-based systems or algorithms to filter out incorrect or irrelevant responses, for example by checking the model's output against verified databases or explicit content rules before it reaches the user (a combined sketch follows this list).
Output re-ranking: Ranking multiple outputs based on relevance and factual consistency, ensuring only the most accurate responses reach the user.
Fact-checking and verification: Advanced fact-checking frameworks such as the Search-Augmented Factuality Evaluator (SAFE) or OpenAI's WebGPT dissect long-form responses into discrete factual statements. Each statement is then cross-referenced with up-to-date sources via online searches, allowing the framework to determine whether each claim is supported by current information found on the web.
Encourage contextual awareness: Encouraging models to refrain from generating answers when they lack sufficient context or certainty helps avoid speculative or incorrect content.
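Here is a minimal output-layer sketch combining a rule-based filter with a simple re-ranking of candidate answers. The banned patterns, the word-overlap scoring function, and the candidates are all illustrative assumptions; production systems would use verified databases or a trained consistency model instead.

```python
# Output-layer sketch: rule-based filter + naive re-ranking. Everything here is illustrative.
import re

BANNED_PATTERNS = [r"\bguarantee(d)?\s+refunds?\b", r"\bfree forever\b"]  # assumed policy rules

def passes_rules(answer: str) -> bool:
    return not any(re.search(p, answer, flags=re.IGNORECASE) for p in BANNED_PATTERNS)

def support_score(answer: str, context: str) -> float:
    """Crude factual-consistency proxy: fraction of answer words found in the retrieved context."""
    answer_words = set(re.findall(r"\w+", answer.lower()))
    context_words = set(re.findall(r"\w+", context.lower()))
    return len(answer_words & context_words) / max(len(answer_words), 1)

context = "Refunds are available within 30 days of purchase with a valid receipt."
candidates = [
    "We guarantee refunds at any time, no questions asked.",
    "Refunds are available within 30 days if you have a valid receipt.",
]

valid = [c for c in candidates if passes_rules(c)]                       # rule-based filtering
best = max(valid, key=lambda c: support_score(c, context)) if valid else "I don't know."
print(best)
```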
4. Future Outlook
Current research into AI reliability focuses on refining these mitigation techniques, on better understanding the inner workings of LLMs, and potentially on new model architectures that allow models to "understand" the data they are trained on:
Encoded truth: Recent work suggests that LLMs encode more truthfulness than previously understood, with certain tokens concentrating this information, which improves error detection. However, this encoding is complex and dataset-specific, which limits generalization. Notably, models may be encoding the correct answers internally despite generating errors, highlighting areas for targeted mitigation strategies.
Detection methods: Another recent study highlights entropy-based methods for detecting hallucinations in LLMs, offering a way to identify them by assessing uncertainty at a semantic level. This approach generalizes well across diverse tasks and new datasets, letting users anticipate when extra caution is needed with LLM outputs (a simplified sketch follows this list).
Self-improvement: With self-evaluation and self-update modules, an LLM can improve response consistency at the representation level, especially through methods such as self-consistency and self-refinement. This line of work promises to alleviate hallucinations and improve internal coherence across different reasoning tasks.
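As a rough illustration of the entropy-based idea: sample several answers to the same question, group them into meaning clusters, and treat high entropy over clusters as a warning sign. The clustering below is a naive string normalization rather than the semantic-equivalence (NLI) check used in the research, and the sample answers are invented.

```python
# Simplified sketch of entropy-based hallucination detection.
# Real systems cluster answers by semantic equivalence; we fake it with normalization.
import math
from collections import Counter

def normalize(answer: str) -> str:
    return " ".join(answer.lower().replace(".", "").split())

def semantic_entropy(samples: list[str]) -> float:
    clusters = Counter(normalize(s) for s in samples)
    n = len(samples)
    return -sum((c / n) * math.log(c / n) for c in clusters.values())

consistent = ["The VLT took the image.", "the VLT took the image"] * 3
scattered  = ["JWST took it.", "The VLT took it.", "Hubble took it.",
              "Kepler took it.", "JWST took it.", "Spitzer took it."]

print(round(semantic_entropy(consistent), 3))  # low entropy -> samples agree
print(round(semantic_entropy(scattered), 3))   # high entropy -> likely hallucination
```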
5. Final Thoughts and Conclusion
Hallucinations in LLMs result from neural network limitations and probabilistic model architectures. While they cannot be eliminated, understanding their causes provides the foundation for effective mitigation. Techniques such as selective context filtering, retrieval-augmented generation, chain-of-thought prompting, and task-specific fine-tuning significantly reduce hallucination risks, enhancing the reliability and trustworthiness of LLM outputs. As the field continues to evolve, these strategies will likely play a central role in developing AI systems that are both accurate and contextually aware, advancing the practical application of LLMs in diverse domains.
Whether you choose to build an open-source solution or opt for a managed tool like kapa.ai, the principles and strategies for mitigating AI hallucination remain consistent. At kapa, we address many of these challenges directly, ensuring more reliable and accurate outputs. If you're interested in seeing how kapa.ai can transform your knowledge base into an intelligent assistant, request a demo here.