Nathana Sharma, March 8, 2023

What does it mean when an LLM “hallucinates” & why do LLMs hallucinate?

How do you know if your LLM is generating true information? And why do LLMs need reinforcement learning from human feedback (RLHF)?

With the emergence of ChatGPT and similar tools, it's becoming common to type a seemingly perfect prompt into ChatGPT and get back a polished, sophisticated essay on the science of cucumber farming in Costa Rica. The only problem is that some parts of the essay are simply not factually true. But which ones?

Large Language Models (“LLMs”) are a powerful tool for generating coherent and contextually appropriate text. LLMs can be used for everything from travel suggestions and marketing advice to “helping” with homework. However, LLMs are susceptible to “hallucination”, where the model generates text that is factually incorrect or entirely fictional.

In this blog post, we will explore why hallucination occurs from a technical perspective and what can be done to mitigate this. We'll also dive into why there is a tremendous opportunity to supplement LLMs with steps to verify their output using reinforcement learning with human feedback (“RLHF”).

LLM Hallucination

I spent the summer of 2008 doing research on Avraham ben Yitzchak, an influential modern Hebrew poet who published only 11 poems during his lifetime. When asked about this poet, ChatGPT confidently provided incredibly detailed answers, including a poem and translation purportedly by Avraham ben Yitzchak that were in fact entirely invented by ChatGPT. It's an incredible technology, but if you don't know that the poem is not real, you might take it for a translation of a genuine poem.

Example LLM hallucination in historical poetry using ChatGPT

At its core, ChatGPT (built on the latest versions of GPT) is a neural network trained on vast amounts of text data. It's a statistical machine, learning patterns and relationships from the data it ingests. During training, it is exposed to diverse sources of text, from scientific articles to works of fiction, and it learns to predict the next word in a sentence based on the context provided by the preceding words.

LLM hallucination occurs because the model's primary objective is to generate text that is coherent and contextually appropriate, rather than factually accurate. The model's training data may contain inaccuracies, inconsistencies, and fictional content, and the model has no way of distinguishing between fact and fiction. As a result, it may generate text that aligns with the patterns observed in the training data but is not grounded in reality.

Why do LLMs hallucinate?

From a technical perspective, hallucination in large language models can be attributed to a lack of ground truth from external sources. Ground truth refers to data that accurately represents the real-world phenomena or outcomes that an AI model aims to predict, classify, or generate. It is considered the "gold standard" or reference point against which the model's predictions or outputs are compared.

Unlike traditional supervised learning tasks where ground truth labels are explicitly provided, ChatGPT is trained using a variant of the transformer architecture and is designed to predict the next word in a sentence given the preceding words.

The training process involves feeding the model large amounts of text data, and the model learns to predict the next word based on the context provided by the previous words. In this scenario, the ground truth is derived from the text data itself. For example, if the input sentence is "The cat is sitting on the ___," the ground truth for the next word might be "mat." The model's objective is to generate text that aligns with the patterns observed in the training data, which serves as the ground truth.
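This self-supervised setup can be illustrated with a deliberately tiny sketch (a bigram-counting toy, not a real transformer): the "labels" are just the next words that actually appear in the training text, so the model's notion of ground truth comes entirely from the corpus itself.

```python
from collections import Counter, defaultdict

# Toy illustration of self-supervised next-word prediction: the "ground
# truth" for each position is simply the word that follows in the text.
corpus = [
    "the cat is sitting on the mat",
    "the dog is sitting on the rug",
    "the cat is sleeping on the mat",
]

# Count which word follows each two-word context in the training data.
next_word_counts = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for i in range(2, len(words)):
        context = tuple(words[i - 2:i])
        next_word_counts[context][words[i]] += 1

def predict_next(w1, w2):
    """Return the most frequent continuation observed in training."""
    counts = next_word_counts[(w1, w2)]
    return counts.most_common(1)[0][0] if counts else None

print(predict_next("on", "the"))  # "mat" (seen twice) beats "rug" (seen once)
```

Notice that nothing in this loop checks whether "the cat is sitting on the mat" is true of any actual cat; the model only learns what words tend to follow other words, which is exactly why statistical plausibility and factual accuracy can come apart.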

ChatGPT generates answers based on patterns in the training data and does not check for factual accuracy in the model’s predictions.

Unlike traditional supervised learning tasks, language models like ChatGPT do not rely on explicitly labeled data. Instead, they learn from the inherent structure of the text itself. While this self-supervised approach allows for training on large, unlabeled datasets, it also means that the model lacks access to external sources of ground truth for verification. As a result, the model may learn and propagate inaccuracies present in the training data.

Additionally, text data used for training large language models may include fictional content, such as literature, as well as subjective content, such as opinions and beliefs. The presence of fictional and subjective content poses challenges for defining ground truth, as the model must learn to generate text that is coherent and contextually appropriate, even if it is not factually accurate or objective.

How to mitigate hallucinations

Our LLM spins tales as easily as it recounts facts—a digital bard, if you will. It's a marvelous tool, but it has a quirk: sometimes it makes things up. It weaves stories that sound plausible, but they're pure fiction. How do we teach LLMs to stick to the truth?

Enter reinforcement learning with human feedback (RLHF), a method that lets us train language models like GPT-4 to be more discerning about the accuracy of their output. As Ilya Sutskever, Chief Scientist at OpenAI, has suggested, “I'm quite hopeful that by simply improving this subsequent reinforcement learning from human feedback step, we can teach it to not hallucinate.” The idea is simple: we use human feedback as a guiding light to reward the model when it's right and nudge it back on track when it strays.

Reinforcement learning (RL) is all about an agent learning to make decisions in an environment. The agent takes actions, receives feedback in the form of rewards or penalties, and iteratively adjusts its behavior to maximize cumulative reward. In our case, the agent is the language model, the actions are generating text, and the rewards come from human evaluators who assess the quality of the generated text.
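The core RL loop can be sketched in miniature. In this toy (everything here is illustrative: the candidate responses, the reward values, and the epsilon-greedy update are stand-ins, not anything OpenAI uses), the agent repeatedly picks an action, receives a noisy reward from a simulated evaluator, and updates its estimates to favor high-reward behavior:

```python
import random

random.seed(0)

# Toy RL loop: the "agent" chooses among candidate responses and learns,
# from simulated evaluator rewards, which one to prefer. In RLHF the reward
# would come from a model trained on human preference data.
candidates = ["accurate answer", "plausible fiction", "off-topic text"]
true_reward = {"accurate answer": 1.0, "plausible fiction": 0.2, "off-topic text": 0.0}

value = {c: 0.0 for c in candidates}   # estimated reward per action
counts = {c: 0 for c in candidates}
epsilon = 0.1                          # exploration rate

for step in range(500):
    if random.random() < epsilon:
        action = random.choice(candidates)       # explore
    else:
        action = max(candidates, key=value.get)  # exploit best estimate
    reward = true_reward[action] + random.gauss(0, 0.05)  # noisy feedback
    counts[action] += 1
    # Incremental average: nudge the estimate toward the observed reward.
    value[action] += (reward - value[action]) / counts[action]

best = max(candidates, key=value.get)
print(best)  # the agent converges on the highest-reward action
```

The same shape scales up in real RLHF: the action space becomes the space of generated token sequences, and the hand-coded `true_reward` table is replaced by a learned reward model.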

Human feedback is the linchpin of this process, acting as the compass that points the model toward factual accuracy. Human evaluators assess the coherence, relevance, and truthfulness of the generated text, providing feedback that shapes the model's learning trajectory. This feedback is distilled into a reward signal that guides the model's quest for optimization.

The process of fine-tuning a language model like ChatGPT using reinforcement learning with human feedback involves several key steps:

1. Pre-training: We first start with a language model pre-trained on a vast corpus of text data. It's a treasure trove of language patterns, syntax, and semantics. This step lays the groundwork for the fine-tuning that follows.

2. Data Collection: Human evaluators step in, reviewing input prompts and corresponding model-generated responses. They rank or rate the responses based on criteria like coherence, relevance, and factual accuracy. This step assembles a dataset of human preferences and evaluations.

3. Reward Modeling: We use the collected human feedback to forge a reward model—a quantifier of text quality. This serves as a proxy for human judgment, a beacon that illuminates the path for reinforcement learning.

4. Proximal Policy Optimization: Next, we fine-tune the language model using an RL algorithm like Proximal Policy Optimization (PPO). The model generates text, receives rewards from the reward model, and iteratively updates its parameters to maximize cumulative reward.

5. Evaluation and Validation: Lastly, we put the fine-tuned model to the test on new and unseen data. Human evaluators can join in to provide additional feedback and validate the model's output. We measure coherence, factual accuracy, and alignment with human preferences.
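Steps 2 through 4 above can be compressed into a miniature sketch. Everything here is a simplification: responses are reduced to a single hand-crafted "factuality" feature, the reward model is a one-parameter Bradley-Terry-style fit to pairwise preferences, and selecting the highest-reward candidate stands in for a full PPO policy update:

```python
import math

# Step 2 (in miniature): collected preference data as (preferred, rejected)
# pairs, each response represented by a scalar "factuality" feature.
preferences = [(0.9, 0.2), (0.8, 0.1), (0.7, 0.3), (0.95, 0.5)]

# Step 3: reward model r(x) = w * x, fit so preferred responses score higher.
# P(preferred) = sigmoid(r(good) - r(bad)); gradient ascent on log-likelihood.
w = 0.0
lr = 0.5
for _ in range(200):
    for good, bad in preferences:
        p = 1 / (1 + math.exp(-(w * good - w * bad)))
        w += lr * (1 - p) * (good - bad)

# Step 4 (stand-in for PPO): score candidate responses with the learned
# reward model and favor the highest-reward one.
candidates = {"grounded summary": 0.85, "confident fabrication": 0.15}
best = max(candidates, key=lambda name: w * candidates[name])
print(best)
```

Real RLHF replaces the scalar feature with a neural reward model over full token sequences and uses PPO to update the policy's parameters rather than merely reranking candidates, but the flow of information is the same: human comparisons train a reward model, and the reward model steers generation.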

As you can see, fine-tuning LLM outputs requires us to set up ways to reliably collect high-quality human feedback, which is a complex coordination challenge in and of itself. Other active areas of research to mitigate hallucinations in LLMs include domain-specific fine-tuning, adversarial training, and multi-modal models. Note that all of these approaches require some level of verification of factual accuracy outside the model itself, currently best done with RLHF.

Domain-Specific Fine-Tuning: Fine-tuning the model on domain-specific or task-specific data can help improve its performance and reduce hallucination for specific use cases. Domain-specific fine-tuning can help the model better understand the context and conventions of a particular field, leading to more accurate and reliable output. BloombergGPT is an example of a new LLM fine-tuned for financial knowledge.

Adversarial Training: Adversarial training involves training the model to recognize and avoid generating hallucinated content. This can be achieved by using adversarial examples, where the model is presented with text containing hallucinations and is trained to identify and correct them.

Multi-Modal Models: Combining language models with other modalities, such as images or structured data, can provide additional context and help ground the model's output in reality. Multi-modal models can leverage information from multiple sources to generate more accurate and reliable text.

Final thoughts on LLM hallucination

While LLMs are proving to be a breakthrough innovation, we need to remember that not everything they confidently state is actually true. While they show strong promise in domains such as natural language processing, machine translation, and content generation, beware of hallucinations: LLMs can generate outputs that are incorrect or potentially harmful. There is an immense opportunity to take the outputs from LLMs and improve them by adding extra verification steps through reinforcement learning with human feedback.


References

  • https://cdn.openai.com/papers/gpt-4.pdf
  • https://openai.com/research/learning-to-summarize-with-human-feedback
  • https://www.forbes.com/sites/craigsmith/2023/03/15/gpt-4-creator-ilya-sutskever-on-ai-hallucinations-and-ai-democracy/?sh=4759cd191218
  • https://wandb.ai/ayush-thakur/RLHF/reports/Understanding-Reinforcement-Learning-from-Human-Feedback-RLHF-Part-1--VmlldzoyODk5MTIx
  • Note that integrating ChatGPT with WolframAlpha can help with some computational hallucinations but does not in itself solve the hallucination problem(https://writings.stephenwolfram.com/2023/03/chatgpt-gets-its-wolfram-superpowers/). ChatGPT still makes up confident answers in response to questions without flagging those answers as made up every time as you can see in the example of making up a poem in this post.
  • https://www.arxiv-vanity.com/papers/2302.12813/
  • https://www.bloomberg.com/company/press/bloomberggpt-50-billion-parameter-llm-tuned-finance/
  • https://openaccess.thecvf.com/content_CVPR_2020/papers/Li_Adversarial_Feature_Hallucination_Networks_for_Few-Shot_Learning_CVPR_2020_paper.pdf
  • https://arxiv.org/abs/2302.04023