OpenAI says the industry has been attacking AI hallucinations from the wrong angle. The problem isn’t just messy training data or model size—it’s the way we score models. If you reward systems for acting like straight-A test takers, they’ll guess when they should say, “I’m not sure.” The fix, OpenAI argues, is disarmingly simple: change evaluations so uncertainty is a first-class answer.
Why models guess when they shouldn’t
Modern language models are tuned to maximize test accuracy on benchmarks like MMLU and GSM8K. Those scoreboards typically use binary grading—right or wrong—with no credit for calibrated doubt. In that regime, guessing beats abstaining. If a user asks for something unknowable—say, their birthday—a model that guesses has a 1-in-365 chance of being marked correct. An honest “I don’t know” earns zero every time.
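For a sense of the arithmetic, here is a minimal sketch of that scoring logic; the numbers are illustrative, not drawn from any specific benchmark:

```python
# Illustrative arithmetic: expected score on a question the model cannot know,
# under standard right-or-wrong grading where "I don't know" counts as wrong.
def expected_score(p_correct: float, abstain: bool) -> float:
    return 0.0 if abstain else p_correct

# A blind guess at a birthday has roughly a 1-in-365 chance of being marked correct...
print(expected_score(1 / 365, abstain=False))  # ~0.0027 points
# ...while an honest "I don't know" is guaranteed zero, so guessing always wins.
print(expected_score(1 / 365, abstain=True))   # 0.0 points
```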

Scale that over millions of prompts and you get a quiet, statistical push toward overconfident answers. Reinforcement learning from human feedback (RLHF) can make this worse by training assistants to be helpful and decisive, even when the ground truth is uncertain. Researchers at OpenAI and elsewhere have long observed this miscalibration: models frequently assign high confidence to wrong answers on multiple-choice tasks, a pattern documented by academic groups studying expected calibration error.
The result is what users experience as hallucinations—fluent, plausible statements that simply aren’t true. It’s not that the model is “lying” with intent; it’s playing to the rules we set.
The straightforward fix: pay for honesty
OpenAI’s proposal is to adjust incentives, not just data. Instead of grading only on accuracy, evaluators should also reward appropriate expressions of uncertainty. In practice, that means two changes: let models abstain when they’re unsure, and score their confidence with proper scoring rules (think Brier or log scores) that punish misplaced certainty and reward well-calibrated probabilities.
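As a rough illustration of how a proper scoring rule shifts the incentive, here is a minimal Brier-score comparison; the confidence values are invented for the example, not taken from OpenAI's evaluations:

```python
def brier(confidence: float, correct: bool) -> float:
    """Brier score for a single prediction: squared error between the stated
    confidence and the 0/1 outcome. Lower is better."""
    outcome = 1.0 if correct else 0.0
    return (confidence - outcome) ** 2

# A model that asserts 99% confidence and is wrong pays a heavy penalty...
print(brier(0.99, correct=False))  # 0.9801
# ...while a calibrated 30% hedge on the same miss costs far less.
print(brier(0.30, correct=False))  # 0.09
# Being right with high confidence is still rewarded with a near-zero score.
print(brier(0.99, correct=True))   # 0.0001
```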
A quick example: on a four-option question, random guessing yields 25% accuracy, so under binary grading there is never a reason to abstain. If evaluations instead award partial credit for "I don't know" that beats the chance baseline (say, 0.3 points), the rational strategy flips: abstain unless the model's internal confidence in an answer exceeds that credit. Over time, this nudges systems to differentiate between what they know, what they can infer with tools, and what they should decline to answer.
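A minimal sketch of that decision rule, using the illustrative 0.3-point credit from the example above:

```python
# Decision rule once "I don't know" earns partial credit.
# ABSTAIN_CREDIT is the illustrative 0.3-point value, not a number from any benchmark.
ABSTAIN_CREDIT = 0.3

def best_action(confidence: float) -> str:
    """Answering earns `confidence` points in expectation (1 if right, 0 if wrong);
    abstaining earns a flat ABSTAIN_CREDIT. Choose whichever pays more."""
    return "answer" if confidence > ABSTAIN_CREDIT else "abstain"

print(best_action(0.25))  # "abstain": a pure guess on four options no longer pays
print(best_action(0.80))  # "answer": genuine knowledge still beats hedging
```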
This lines up with prior evidence. OpenAI researchers have shown that large models can predict when they’re likely to be right, enabling “selective answering.” Work from Stanford and Berkeley on selective prediction and calibration supports the same idea: the path to fewer errors is not maximal assertiveness, but calibrated responses and the option to abstain.
What it means for benchmarks and products
Leaderboards shape behavior. When the Hugging Face Open LLM Leaderboard or popular academic suites treat abstentions as wrong, vendors are incentivized to ship models that guess. Switching to metrics that accept uncertainty—allowing “I don’t know,” scoring confidence, and testing selective answering—realigns competition toward reliability, not bravado.
For product teams, the implications are practical. Calibrated assistants can route uncertain questions to retrieval, tools, or human review. In regulated contexts like healthcare and finance, that’s not just a UX upgrade—it’s risk management. The NIST AI Risk Management Framework emphasizes transparency around uncertainty, and enterprise buyers increasingly ask for confidence indicators and abstention rates during evaluation.
Crucially, this doesn’t require rethinking model architectures. Many systems already expose internal logits and can produce confidence estimates or explicit abstentions. The heavier lift is updating evaluation harnesses and reward models so that restraint is scored as competence, not failure.
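A sketch of what that harness-level change might look like; the logits, threshold, and routing below are hypothetical stand-ins for illustration, not any vendor's actual API:

```python
import math

def softmax(logits: list[float]) -> list[float]:
    """Convert raw logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def answer_or_abstain(option_logits: list[float], threshold: float = 0.6) -> dict:
    """Hypothetical wrapper: answer only when the top option's probability clears
    a confidence threshold; otherwise return an explicit abstention that
    downstream code can route to retrieval, tools, or human review."""
    probs = softmax(option_logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    if probs[best] >= threshold:
        return {"action": "answer", "choice": best, "confidence": probs[best]}
    return {"action": "abstain", "confidence": probs[best]}

# Made-up logits for a four-option question: no option is convincing,
# so the wrapper abstains rather than guessing.
print(answer_or_abstain([1.1, 0.9, 1.0, 0.8]))
```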
Evidence, limits, and how to get started
OpenAI’s researchers report that simple tweaks to mainstream evaluations—crediting uncertainty and penalizing overconfidence—reduce hallucinations across tasks. That dovetails with earlier findings from work on “models knowing what they know,” where self-assessed confidence improves the accuracy of answers the model chooses to give.
There are trade-offs. Over-cautious models can frustrate users, and naive abstention policies may tank coverage. The solution is thresholding: set confidence cutoffs by domain, couple abstention with strong retrieval and tool use, and monitor both coverage and error. Teams should track calibration metrics (like Brier score and expected calibration error), abstention rates, and end-to-end task success—not just raw accuracy.
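One way such monitoring could be wired up, assuming each logged prediction records a stated confidence, whether the model abstained, and whether answered items were correct; the record fields and toy log are invented for illustration:

```python
from dataclasses import dataclass

@dataclass
class Record:
    confidence: float   # model's stated confidence in its chosen answer
    abstained: bool     # True if the model said "I don't know"
    correct: bool       # graded outcome; ignored when the model abstained

def report(records: list[Record], bins: int = 10) -> dict:
    answered = [r for r in records if not r.abstained]
    abstention_rate = 1 - len(answered) / len(records)
    # Selective accuracy: accuracy over only the questions the model chose to answer.
    selective_accuracy = sum(r.correct for r in answered) / max(len(answered), 1)
    # Expected calibration error: per-bin gap between average confidence and
    # accuracy, weighted by how many answered items land in each bin.
    ece = 0.0
    for b in range(bins):
        lo, hi = b / bins, (b + 1) / bins
        bucket = [r for r in answered
                  if lo <= r.confidence < hi or (b == bins - 1 and r.confidence == 1.0)]
        if bucket:
            acc = sum(r.correct for r in bucket) / len(bucket)
            conf = sum(r.confidence for r in bucket) / len(bucket)
            ece += (len(bucket) / len(answered)) * abs(acc - conf)
    return {"abstention_rate": abstention_rate,
            "selective_accuracy": selective_accuracy,
            "ece": round(ece, 4)}

# Toy log: two confident hits, one overconfident miss, one abstention.
log = [Record(0.90, False, True), Record(0.80, False, True),
       Record(0.95, False, False), Record(0.40, True, False)]
print(report(log))
```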
For organizations evaluating models, three steps make this concrete: include abstain as an allowed output in test suites, adopt proper scoring rules for confidence, and report selective accuracy (performance when the model chooses to answer). For training, align RLHF and reward models with these same objectives to avoid teaching systems that certainty is always “helpful.”
A small change with outsized impact
Hallucinations aren’t an immutable flaw of generative AI; they’re a byproduct of incentives. By letting models say “I don’t know” and paying them for being right about their own uncertainty, OpenAI argues the industry can cut falsehoods without exotic new algorithms. It’s a refreshingly modest prescription: fix the rules of the game, and the players will behave differently.