In a groundbreaking study, researchers from Technion, Google Research, and Apple have revealed surprising insights into the inner workings of large language models (LLMs). Their findings suggest that LLMs, like the popular Mistral and Llama 2 models, possess a more nuanced understanding of truthfulness than previously assumed, potentially reshaping the way we approach AI-generated errors, or “hallucinations.”
The term “hallucination” has become a catch-all in AI for various types of LLM errors, from factual inaccuracies to reasoning missteps. In this study, researchers took a broad view, defining hallucinations as any error produced by an LLM, whether related to fact, logic, or language processing. Traditionally, attempts to mitigate hallucinations have focused on how users perceive these errors, but this approach offers limited insight into how models actually process and encode truthfulness internally. This study, however, examines LLMs’ internal states to understand how they perceive correctness, providing a fresh perspective on error detection and mitigation.
A unique aspect of this study lies in its approach to examining LLM responses. Rather than assessing only the final output tokens (the last words or phrases generated), the researchers analyzed specific “exact answer tokens”—the tokens in a response that determine whether the answer is correct. This analysis revealed that truthfulness signals are concentrated in these exact answer tokens, and that such signals could potentially be leveraged to predict errors before they reach the user.
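To make the idea concrete, the snippet below sketches how exact-answer-token activations might be pulled from an open model. It assumes a Hugging Face causal LM (mistralai/Mistral-7B-Instruct-v0.2 is used purely as an example), and the substring-based localization of the answer span is a simplified heuristic rather than the paper’s exact procedure.

```python
# Sketch: extract hidden states at the "exact answer" token positions of a generation.
# The token-matching step below is an illustrative heuristic, not the study's method.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "mistralai/Mistral-7B-Instruct-v0.2"  # assumed model choice
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto")

def exact_answer_activations(prompt: str, exact_answer: str, layer: int = -1):
    """Return hidden states at the token positions spanning `exact_answer`."""
    # Generate a response and keep the full token sequence (prompt + answer).
    inputs = tok(prompt, return_tensors="pt").to(model.device)
    gen = model.generate(**inputs, max_new_tokens=64, do_sample=False)
    full_ids = gen[0]

    # Locate the exact-answer span by matching its token ids inside the generated part.
    # In-context tokenization can differ (e.g. leading spaces), so this is best-effort.
    ans_ids = tok(exact_answer, add_special_tokens=False)["input_ids"]
    seq = full_ids.tolist()
    prompt_len = inputs["input_ids"].shape[1]
    try:
        start = next(i for i in range(prompt_len, len(seq) - len(ans_ids) + 1)
                     if seq[i:i + len(ans_ids)] == ans_ids)
    except StopIteration:
        raise ValueError("exact answer tokens not found in the generation")

    # Re-run a forward pass to obtain hidden states for every position.
    with torch.no_grad():
        out = model(full_ids.unsqueeze(0), output_hidden_states=True)
    layer_states = out.hidden_states[layer][0]        # (seq_len, hidden_dim)
    return layer_states[start:start + len(ans_ids)]   # activations of the exact answer tokens
```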
In their experiments, the researchers trained “probing classifiers” to identify patterns related to truthfulness in the internal activations of LLMs, spanning tasks such as question answering, natural language inference, math problem-solving, and sentiment analysis. Notably, these classifiers, trained to detect errors from the activations at exact answer tokens, achieved higher accuracy in error prediction than prior approaches that examine only the model’s final output.
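A minimal sketch of how such a probe could be trained follows, assuming per-example activation vectors (for instance, mean-pooled over the exact answer span) and 0/1 correctness labels have already been collected; the scikit-learn logistic-regression probe is an illustrative choice, not necessarily the classifier used in the study.

```python
# Sketch: train a linear probe to predict answer correctness from internal activations.
# `activations` is an (n_examples, hidden_dim) array of exact-answer-token features;
# `is_correct` holds 0/1 labels indicating whether each generated answer was right.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

def train_truthfulness_probe(activations: np.ndarray, is_correct: np.ndarray):
    X_train, X_test, y_train, y_test = train_test_split(
        activations, is_correct, test_size=0.2, random_state=0)
    probe = LogisticRegression(max_iter=2000)
    probe.fit(X_train, y_train)
    auc = roc_auc_score(y_test, probe.predict_proba(X_test)[:, 1])
    print(f"held-out AUC for error prediction: {auc:.3f}")
    return probe
```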
“Our demonstration that a trained probing classifier can predict errors suggests that LLMs encode information related to their own truthfulness,” the study authors write, hinting at new possibilities for understanding how AI perceives correctness.
Interestingly, the probing classifiers exhibited what researchers called “skill-specific truthfulness,” meaning they could predict errors within tasks requiring similar skills but struggled to generalize across unrelated tasks. For example, an error classifier trained on factual retrieval could generalize within that domain but not effectively detect errors in sentiment analysis. This insight hints at a multifaceted representation of truthfulness within LLMs, suggesting that models may approach truth in varying ways depending on the type of task or skill required.
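The generalization test can be pictured as a simple train-on-one-task, evaluate-on-another loop, sketched below with placeholder task names and the same kind of hypothetical activation and label arrays as above.

```python
# Sketch: test whether a probe trained on one task's activations transfers to another.
# `datasets` maps a task name to (activations, is_correct) arrays; names are placeholders.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def cross_task_auc(datasets: dict) -> dict:
    scores = {}
    for train_task, (X_tr, y_tr) in datasets.items():
        probe = LogisticRegression(max_iter=2000).fit(X_tr, y_tr)
        for eval_task, (X_ev, y_ev) in datasets.items():
            scores[(train_task, eval_task)] = roc_auc_score(
                y_ev, probe.predict_proba(X_ev)[:, 1])
    return scores

# e.g. cross_task_auc({"factual_qa": (X_qa, y_qa), "sentiment": (X_sent, y_sent)})
# Off-diagonal scores near 0.5 would indicate the probe fails to transfer across skills.
```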
Further experiments indicated that the internal representations within LLMs hold information not only on error likelihood but also on the types of errors likely to occur, providing a foundation for more targeted error mitigation.
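Extending the same probing idea from “is this answer wrong?” to “what kind of error is it?” amounts to swapping binary labels for error-type labels; the sketch below uses hypothetical category names purely for illustration.

```python
# Sketch: probe for the *type* of error rather than just its presence.
# `error_type` holds hypothetical per-example labels such as "correct",
# "consistently_wrong", or "sometimes_correct" assigned during data collection.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

def train_error_type_probe(activations, error_type):
    X_tr, X_te, y_tr, y_te = train_test_split(
        activations, error_type, test_size=0.2, random_state=0)
    probe = LogisticRegression(max_iter=2000).fit(X_tr, y_tr)  # handles multiple classes
    print(classification_report(y_te, probe.predict(X_te)))
    return probe
```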
A particularly intriguing discovery was that a model’s internal activations could often “know” the correct answer, yet the model would still generate an incorrect response. This discrepancy suggests that while the model’s inner layers may encode truthfulness accurately, the final generation steps can still steer it toward the wrong output. Such findings challenge the effectiveness of current evaluation methods that rely solely on external outputs, pointing to hidden layers of knowledge that, if harnessed, could substantially reduce errors.
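One way to expose this gap, assuming the hypothetical probe and activation helpers sketched above, is to sample several candidate answers and let the probe score them: when the probe’s top-ranked candidate differs from the answer the model actually generated, the internal signal and the external output disagree.

```python
# Sketch: contrast the model's greedy answer with the candidate the probe scores highest.
# Relies on the hypothetical helpers above: `get_activation(prompt, answer)` returns a
# feature vector, and `probe` is a binary classifier where class 1 means "correct".
import numpy as np

def probe_selected_answer(prompt, candidates, get_activation, probe):
    """`candidates` are sampled answers; returns the probe's top pick and all scores."""
    feats = np.stack([get_activation(prompt, ans) for ans in candidates])
    scores = probe.predict_proba(feats)[:, 1]   # estimated probability each answer is correct
    best = candidates[int(np.argmax(scores))]
    return best, dict(zip(candidates, scores.round(3)))

# If the greedy answer differs from the probe's top-scored candidate, the model's
# internal representations "preferred" an answer it did not actually generate.
```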
By shedding light on how LLMs internally handle truthfulness, this study opens the door for advanced error mitigation techniques. Currently, accessing these internal representations is feasible only with open-source models, but the findings are likely to drive demand for more accessible insights from closed-source AI. The research also aligns with the ongoing work of other leading AI labs, like OpenAI and DeepMind, to interpret the inner workings of LLMs, building toward systems with greater transparency and reliability.
These findings reinforce that the key to improving LLM reliability may lie not only in refining their external outputs but in unlocking the depth of internal processes that underpin them. The potential for LLMs to recognize their own mistakes is more than a technical breakthrough—it’s a promising step toward creating AI systems that understand and adapt to the truth more like humans.