OpenAI says the field has approached AI hallucinations the wrong way. It's not just about bad data or model size, it turns out; it's about the way models get evaluated. When you reward systems for behaving like straight-A test takers, they'll guess rather than say, "I don't know." The solution, according to OpenAI, is disarmingly simple: revise the way evaluations are scored so that uncertainty is a first-class answer.
Why models guess when they shouldn’t
Current language models are tuned to maximize accuracy on benchmarks such as MMLU and GSM8K. Those scoreboards tend to use binary grading, right versus wrong, with no points awarded for calibrated doubt. In that world, a wild guess beats a refusal to answer. If a user asks for something the model cannot know, such as the user's own birthday, a guess still carries a 1-in-365 chance of being scored as correct, while an honest "I don't know" scores zero every time.
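A back-of-the-envelope sketch of that incentive; the numbers are illustrative, not drawn from any particular benchmark:

```python
# Minimal sketch of the incentive under binary (right/wrong) grading.

def expected_binary_score(p_correct: float) -> float:
    """Expected score when the model answers: 1 point if right, 0 if wrong."""
    return 1.0 * p_correct + 0.0 * (1.0 - p_correct)

# Guessing an unknowable birthday: roughly a 1-in-365 chance of being marked correct.
print(expected_binary_score(1 / 365))  # ~0.0027 expected points for a wild guess
print(expected_binary_score(0.0))      # 0.0 points for saying "I don't know"
# Under this grading, guessing strictly dominates abstaining.
```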

Scale that incentive across millions of prompts and you have a persistent statistical nudge toward overconfident answers. Reinforcement learning from human feedback (RLHF) can exacerbate this, because the reward signal prizes helpful, confident answers and rarely accounts for ambiguous ground truth. OpenAI researchers and others have reported this miscalibration for years: models are frequently highly confident in wrong answers on multiple-choice tasks, a gap that studies of expected calibration error have documented repeatedly.
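For readers who want to quantify that gap, here is a minimal sketch of one common binned formulation of expected calibration error; the toy data is invented purely for illustration:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins: int = 10) -> float:
    """Binned ECE: average |accuracy - confidence| per bin, weighted by bin size."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap
    return ece

# An overconfident model: says 0.9 on everything but is right only 60% of the time.
conf = [0.9] * 10
hits = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
print(expected_calibration_error(conf, hits))  # ~0.3 gap between confidence and accuracy
```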
The result is what people experience as hallucinations: fluent, plausible statements that are simply false. It's not that the model is "lying" on purpose; it's optimizing for the rules we set.
The simple fix: pay for honesty
OpenAI's suggestion is to shift incentives, not just data. Instead of scoring on correctness alone, evaluations would penalize confident wrong answers more heavily than expressions of uncertainty, much as some exams deduct points for incorrect guesses. In practice, that means two changes: let models abstain when they're uncertain, and score their stated confidence with proper scoring rules (think Brier or log scores) that penalize misplaced certainty and reward well-calibrated probabilities.
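As a rough illustration of why proper scoring rules reward honesty, the sketch below compares the Brier and log scores a model can expect on a claim it actually gets right about 60% of the time, depending on whether it reports 0.6 honestly or bluffs 0.99; the numbers are illustrative only:

```python
import math

def brier_score(p: float, outcome: int) -> float:
    """Squared error between stated probability and the 0/1 outcome (lower is better)."""
    return (p - outcome) ** 2

def log_score(p: float, outcome: int) -> float:
    """Negative log-likelihood of the outcome (lower is better); confident errors blow up."""
    p = min(max(p, 1e-12), 1 - 1e-12)
    return -math.log(p if outcome == 1 else 1 - p)

# The model is actually right 60% of the time on this kind of claim.
for stated in (0.6, 0.99):  # honest confidence vs. overclaimed certainty
    avg_brier = 0.6 * brier_score(stated, 1) + 0.4 * brier_score(stated, 0)
    avg_log = 0.6 * log_score(stated, 1) + 0.4 * log_score(stated, 0)
    print(f"stated={stated}: Brier={avg_brier:.3f}, log={avg_log:.3f}")
# Both rules give the honest 0.6 a better (lower) expected score than the bluffed 0.99.
```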
Here's an example: on a four-choice question, random guessing yields 25% accuracy. Under a revised scheme that gives partial credit for "I don't know", say worth a calibrated confidence of 20%, the model's rational decision is to abstain unless its internal confidence exceeds that threshold. Over time, this encourages systems to distinguish between what they know, what they can infer with the aid of tools, and where they should defer.
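A minimal sketch of that decision rule, assuming a correct answer earns 1 point, a wrong answer 0, and "I don't know" a fixed partial credit; the 20% credit mirrors the example above:

```python
def best_action(p_correct: float, idk_credit: float = 0.2) -> str:
    """Rational choice when correct = 1, wrong = 0, and abstaining earns a fixed credit."""
    expected_if_answered = p_correct  # 1 * p + 0 * (1 - p)
    return "answer" if expected_if_answered > idk_credit else "abstain"

print(best_action(0.25))  # a random guess on four options just clears the 0.2 credit -> "answer"
print(best_action(0.10))  # weaker than the credit -> "abstain"
```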
That fits with prior evidence. OpenAI researchers have shown that large models can often predict when they are likely to be right and can answer selectively. Stanford and Berkeley work on selective prediction and calibration points to the same conclusion: the fewest errors come not from maximal assertiveness but from calibrated responses paired with the option to abstain.
What the shift means for benchmarks and products
Leaderboards shape behavior. When the Hugging Face Open LLM Leaderboard or common academic suites treat abstentions as wrong, vendors have an incentive to ship models that guess rather than hold back. Changing the metrics to embrace uncertainty, allowing "I don't know," scoring confidence, and testing selective answering, steers competition away from bravado and toward reliability.

The impact for product teams is practical. Calibrated assistants can route uncertain questions to retrieval, tools, or human review. In regulated sectors such as health care, insurance, and finance, that's not just a UX improvement; it's risk management. The NIST AI Risk Management Framework emphasizes transparency about uncertainty, and enterprise customers increasingly ask for confidence indicators and abstention rates during evaluation.
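What such routing might look like in practice, sketched with hypothetical thresholds and handler names rather than any real product API:

```python
# Hypothetical routing policy: the thresholds and handler names are illustrative.

def route(question: str, confidence: float) -> str:
    """Send low-confidence questions to stronger (and more expensive) handlers."""
    if confidence >= 0.85:
        return "answer_directly"
    if confidence >= 0.50:
        return "retrieve_then_answer"   # ground the reply in documents or tools first
    return "escalate_to_human"          # regulated domains: defer rather than guess

print(route("What is our 2024 deductible for out-of-network care?", 0.42))
# -> "escalate_to_human"
```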
Importantly, none of this requires new model architectures. Most systems already produce internal scores (token probabilities, for example) from which a confidence estimate can be derived and an abstention decision made. The heavier lift is updating evaluation harnesses and reward models so that restraint is scored as competence, not failure.
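One way to derive that confidence signal, assuming the serving stack can return per-token log-probabilities for a generated answer; the function names, threshold, and numbers here are illustrative:

```python
import math

def sequence_confidence(token_logprobs: list[float]) -> float:
    """Geometric-mean token probability as a crude answer-level confidence signal."""
    return math.exp(sum(token_logprobs) / len(token_logprobs))

def answer_or_abstain(answer: str, token_logprobs: list[float], threshold: float = 0.7) -> str:
    """Return the drafted answer only if its confidence clears the threshold."""
    return answer if sequence_confidence(token_logprobs) >= threshold else "I don't know."

# Fabricated log-probabilities for illustration only.
print(answer_or_abstain("Paris", [-0.05, -0.02]))              # high confidence -> answer
print(answer_or_abstain("12 March 1987", [-1.6, -2.1, -1.9]))  # shaky -> abstain
```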
Evidence, constraints, and how to get started
OpenAI's researchers write that simply modifying standard evaluations, crediting uncertainty and penalizing overconfidence, reduces hallucinations across a variety of tasks. That complements earlier findings on models "knowing what they know," in which a model's self-assessed confidence tracks how often its chosen answer turns out to be correct.
There are trade-offs. Models that are overly careful can frustrate users, and naive abstention policies can tank coverage. The answer is thresholding: pick confidence thresholds per domain, pair abstention with strong retrieval and tool use, and track both coverage and error. Teams should monitor calibration metrics (such as Brier score and expected calibration error), abstention rates, and end-to-end task success, not just raw accuracy.
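A small sketch of that thresholding exercise: sweep candidate confidence thresholds over a toy set of predictions and report coverage alongside the error rate on the answered subset; the data is invented for illustration:

```python
import numpy as np

def sweep_thresholds(confidences, correct, thresholds):
    """For each threshold, report coverage (share answered) and error on the answered set."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=bool)
    rows = []
    for t in thresholds:
        answered = confidences >= t
        coverage = answered.mean()
        error = 1.0 - correct[answered].mean() if answered.any() else 0.0
        rows.append((t, coverage, error))
    return rows

# Toy predictions: confidence paired with whether the answer was actually right.
conf = [0.95, 0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3]
hits = [1,    1,   1,   0,   1,   0,    0,   0]
for t, cov, err in sweep_thresholds(conf, hits, [0.0, 0.5, 0.75, 0.9]):
    print(f"threshold={t:.2f}  coverage={cov:.2f}  error={err:.2f}")
```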
Three concrete steps for organizations evaluating models: include abstention as an allowed output in test suites, use proper scoring rules for confidence, and report selective accuracy (the model's performance on the questions it chooses to answer). On the training side, reward models and RLHF objectives should be aligned with the same incentives, so systems aren't taught that certainty is always "better."
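A minimal scoring sketch for the first and third of those steps, treating "IDK" as whatever abstention marker a test suite adopts; the demo records are fabricated:

```python
def score_run(records):
    """records: list of (model_output, gold_answer); "IDK" marks an abstention.
    Reports overall accuracy, abstention rate, and selective accuracy
    (accuracy restricted to the questions the model chose to answer)."""
    answered = [(out, gold) for out, gold in records if out != "IDK"]
    correct = sum(out == gold for out, gold in answered)
    return {
        "overall_accuracy": correct / len(records),
        "abstention_rate": 1 - len(answered) / len(records),
        "selective_accuracy": correct / len(answered) if answered else None,
    }

demo = [("Paris", "Paris"), ("IDK", "4 July 1826"), ("Berlin", "Bern"), ("IDK", "42")]
print(score_run(demo))
# {'overall_accuracy': 0.25, 'abstention_rate': 0.5, 'selective_accuracy': 0.5}
```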
A modest change, an outsized impact
Hallucinations aren't an inherent defect of generative AI; they're an artifact of incentives. By allowing models to say "I don't know," and rewarding them when their uncertainty is warranted, the industry can cut down on falsehoods without resorting to exotic new algorithms, OpenAI argues. It's a refreshingly modest prescription: change the rules of the game, and the players will behave differently.
