While AI "hallucination" is typically viewed as a flaw (often providing comic relief), a recent whitepaper by OpenAI suggests it's actually a natural, emergent property of the way language models are currently trained and benchmarked.
Hallucinations are a logical consequence of optimization strategies that prioritize performance on standard benchmarks. Because these benchmarks are used to compare AI models, providers have every incentive to game the system so their models outscore their competitors.
Though that strategy does improve benchmark scores, it can't hide the inherent flaws in how models are trained.
For example, according to the whitepaper, on "Humanity's Last Exam" (HLE), a benchmark designed to be "google-proof" across dozens of fields, all reported state-of-the-art model scores were below 30% accuracy. This indicates a failure rate of over 70% on expert-level questions.
AI models are often inaccurate both in their answers and in their self-assessment of correctness. Most current models on the HLE benchmark show calibration error rates above 70%. Though pretrained base models can be well calibrated (with errors as low as 0.7%), post-training processes like reinforcement learning (PPO) can raise that error to 7.4% or higher, making models more overconfident in their incorrect guesses.
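To make "calibration error" concrete, here is a minimal sketch of how an expected calibration error (ECE) metric is typically computed: predictions are bucketed by stated confidence, and each bucket's average confidence is compared to its actual accuracy. The function name and the confidence/outcome data are invented for illustration; they are not taken from the whitepaper.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Bucket predictions by stated confidence, then compare each bucket's
    average confidence against its actual accuracy. A perfectly calibrated
    model scores 0.0."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(ok for _, ok in bucket) / len(bucket)
        ece += (len(bucket) / total) * abs(avg_conf - accuracy)
    return ece

# A hypothetical overconfident model: claims ~90% certainty
# but is right only half the time.
confs = [0.9] * 10
hits = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]
print(round(expected_calibration_error(confs, hits), 2))  # → 0.4
```

The gap between what the model *claims* (90%) and what it *achieves* (50%) is exactly the overconfidence the whitepaper attributes to post-trained models.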
Also according to the report, if a certain percentage of facts (like birthdays) appear only once in a training data set, the model is expected to hallucinate on at least that same percentage of those facts. For example, if 20% of birthday facts are singletons, the model’s hallucination rate will be at least 20% for those prompts.
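The singleton argument can be illustrated with a toy computation. The "facts" below are hypothetical stand-ins for birthday records in a training set; the point is only that the fraction of facts appearing exactly once is, per the whitepaper's argument, a lower bound on the hallucination rate for those prompts.

```python
from collections import Counter

# Hypothetical training "facts": each string stands for one birthday record.
# Facts that appear only once (singletons) give the model no redundant signal.
training_facts = [
    "alice:1990-03-14", "alice:1990-03-14",            # repeated fact
    "bob:1985-07-02",                                  # singleton
    "carol:1978-11-30", "carol:1978-11-30",
    "dave:2001-01-09",                                 # singleton
    "erin:1969-05-21", "erin:1969-05-21", "erin:1969-05-21",
]

counts = Counter(training_facts)
singletons = sum(1 for c in counts.values() if c == 1)
singleton_fraction = singletons / len(counts)  # 2 of 5 distinct facts

# By the whitepaper's argument, the expected hallucination rate on these
# prompts is at least this fraction.
print(f"singleton fraction: {singleton_fraction:.0%}")  # → 40%
```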
Because models are trained to minimize cross-entropy loss over large data sets, statistical pressure pushes them to match the patterns in their training data. A well-calibrated model must assign probabilities that reflect the true likelihood of a statement being correct. So when the underlying fact is uncertain, the statistical objective forces the model to produce a "best guess" rather than admit it doesn't know.
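A tiny numeric sketch (with made-up frequencies, not figures from the whitepaper) shows why cross-entropy rewards matching the data's answer distribution: a model whose probabilities mirror the observed frequencies scores strictly better than one that hedges, and "I don't know" earns no credit unless it literally appears in the data.

```python
import math

# Suppose a prompt ("When is X's birthday?") had these answer frequencies
# in training (hypothetical numbers):
true_dist = {"March 14": 0.6, "March 15": 0.4}

def cross_entropy(true_dist, model_dist):
    """Average negative log-probability the model assigns to the data."""
    return -sum(p * math.log(model_dist[ans]) for ans, p in true_dist.items())

calibrated = {"March 14": 0.6, "March 15": 0.4}  # mirrors the data
hedging = {"March 14": 0.5, "March 15": 0.5}     # spreads mass evenly

print(round(cross_entropy(true_dist, calibrated), 3))  # → 0.673 (lower loss)
print(round(cross_entropy(true_dist, hedging), 3))     # → 0.693 (higher loss)
```

The loss-minimizing move is always to commit probability mass to a concrete answer, which at generation time surfaces as a confident guess.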
Benchmarks typically penalize honesty. Most influential benchmarks (like GPQA, MMLU-Pro, and HLE) use binary (0-1) scoring. In a binary scoring system, a correct answer earns 1 point, while an incorrect answer and an "I don't know" (IDK) response both earn 0 points.
Because being wrong costs no more than staying silent, the mathematically optimal strategy for a model is to always guess when in doubt. On current leaderboards, a model that always guesses will statistically outperform a "more honest" model that admits uncertainty, because guesses are right some of the time.
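The incentive is easy to verify with expected values. The sketch below uses hypothetical numbers: under 0-1 grading, any nonzero confidence makes guessing strictly better than abstaining, while a wrong-answer penalty t (one of the fixes the whitepaper's framing suggests) flips that once confidence drops below t / (1 + t).

```python
def expected_score(p_correct, wrong_penalty=0.0):
    """Expected benchmark score for guessing vs. abstaining.
    Guessing: p * 1 + (1 - p) * (-wrong_penalty).
    Abstaining ("I don't know") always scores 0."""
    guess = p_correct * 1.0 - (1.0 - p_correct) * wrong_penalty
    abstain = 0.0
    return guess, abstain

# Binary (0-1) scoring: a 20%-confident guess still beats silence.
g, a = expected_score(0.2)
print(g > a)  # → True: guessing wins

# With a penalty of 1 point per wrong answer, guess only if p > 1/2,
# so the same 20%-confident guess now loses to "I don't know".
g, a = expected_score(0.2, wrong_penalty=1.0)
print(g > a)  # → False: abstaining wins
```

This is the whole of the "bad test-taker" incentive: the scoring rule, not the model, decides whether honesty pays.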
Users expect modern AI models to "know everything". Even though models are trained on incredibly diverse data sets, this expectation is mostly unreasonable. Any model that attempts to generalize beyond its training data must inherently risk hallucinating; otherwise, it would suffer from mode collapse, failing to produce the full range of valid human responses.
Even advanced techniques like Retrieval-Augmented Generation (RAG) or chain-of-thought reasoning do not eliminate this pressure because the underlying grading system still rewards guessing when these tools fail to find a definitive answer.
Hallucinations are a rewarded behavior. AI models are optimized to be "good test-takers," and in the current world of binary evaluations, hallucination-like guessing is the most successful survival strategy for a model aiming for the top of a leaderboard.
User confidence in the service is also a concern. If a model responds "I don't know" or "I'm not confident in this answer", how many users would abandon the service? Since an informed guess is correct some of the time, guessing is treated as an acceptable compromise to preserve that confidence.