Large language models don’t merely make mistakes; they confidently make up facts. An increasing number of researchers say the problem isn’t just model capacity or data quality, but the incentives we’ve created around training, evaluating, and deploying these systems.
OpenAI researchers recently shed light on why models guess rather than opt out, attributing much of the problem to accuracy-only scoreboards that reward lucky shots as much as careful reasoning. If incentives matter, then current incentives are teaching models to bluff.
Why next‑token training encourages confident guesses
At pretraining time, language models are trained to predict the next token rather than to distinguish true from false. That objective rewards fluency and pattern-matching across huge spans of text, yet does little to penalize plausible invention. As OpenAI observes in the paper, syntactic and stylistic quality improve steadily with scale, but rare facts, such as a specialized dissertation title, offer few reliable patterns to generalize from.
This is a brittle foundation: when there is not enough context, models interpolate. The result can sound credible because the same training that disregards truth also optimizes for coherence and confidence. It’s Goodhart’s Law in action: optimize for the proxy (fluency) and you distort the target (truth).
The evaluation trap: the “accuracy-only” scoreboard
Evaluation culture amplifies the problem. Leaderboards that reward percent correct with no penalty for overconfidence incentivize guessing. Just as multiple-choice tests with no deduction for wrong answers reward risk-taking, accuracy-only evaluation treats a confident wrong answer as preferable to an honest “I don’t know.”
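A toy calculation makes the incentive concrete. The numbers below are illustrative assumptions, not taken from any benchmark: a model facing a four-option question can either guess blindly or abstain, and the scoring rule decides which strategy pays.

```python
# Toy numbers, not from any benchmark: expected score for blindly guessing
# among four options versus abstaining, under two scoring rules.

def expected_score(p_correct: float, wrong_penalty: float,
                   abstain_credit: float, guess: bool) -> float:
    """Expected score of guessing (right with probability p_correct) vs. abstaining."""
    if guess:
        return p_correct * 1.0 + (1 - p_correct) * wrong_penalty
    return abstain_credit

p = 1 / 4  # a blind guess on a four-option question is right 25% of the time

# Accuracy-only scoreboard: wrong answers cost nothing, abstaining earns nothing.
print(expected_score(p, wrong_penalty=0.0, abstain_credit=0.0, guess=True))   # 0.25 -> guessing wins
print(expected_score(p, wrong_penalty=0.0, abstain_credit=0.0, guess=False))  # 0.0

# Uncertainty-aware scoreboard: confident errors penalized, abstention gets partial credit.
print(expected_score(p, wrong_penalty=-1.0, abstain_credit=0.25, guess=True))   # -0.5 -> bluffing loses
print(expected_score(p, wrong_penalty=-1.0, abstain_credit=0.25, guess=False))  # 0.25
```

Under accuracy-only scoring, the blind guess strictly dominates abstaining; add a penalty for confident errors and partial credit for abstention, and the ranking flips.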
Reinforcement learning from human feedback may exacerbate this bias. Human raters tend to prefer cooperative, helpful, and agreeable answers. Anthropic has documented this phenomenon of “sycophancy,” where models parrot user beliefs even when those beliefs are false, since agreement earns good ratings. And when we measure the success of a product in terms of answer rate, speed, and user satisfaction, abstentions feel like failures and fabrications slip through.
What the data and deployment reveal
Benchmarks expose the gulf between coherence and correctness. Early general-purpose models answered barely over two-thirds of TruthfulQA questions correctly on average, and while newer systems do much better with careful prompting, meaningful error rates persist. Stanford HAI’s AI Index has repeatedly flagged hallucination as a robust failure mode in summarization, question-answering, and reasoning tasks.
Real-world tests echo this. Medical and legal evaluations have documented fabricated references and imaginary studies in a significant share of outputs when models are not grounded in evidence. Industry experiments with retrieval-augmented generation (from groups like Google DeepMind and Meta) show material reductions in factual errors, but even retrieval can be overridden when incentives still favor fast, confident text over verified answers.
The bottom line: models often know when they might be wrong (work from Anthropic and others shows that language models carry latent uncertainty signals), but our current scoring and product KPIs do not encourage surfacing those known unknowns.
Fix the incentives: punish overconfidence, reward honest doubt
OpenAI’s suggestion is simple: make evals uncertainty-aware. Penalize confident errors more than honest uncertainty, and give partial credit for abstaining or hedging when the evidence is weak. In practice, they say, that means replacing single accuracy numbers with metrics such as calibrated accuracy, precision at high confidence, and selective risk curves.
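As a rough sketch of what one of those metrics could look like in code, here is a selective risk curve computed from per-question confidences and correctness labels; the data format and numbers are hypothetical, not taken from OpenAI’s evals.

```python
# Sketch of a selective risk curve, assuming we have per-question model
# confidences and correctness labels (the data below is made up).

from typing import List, Tuple

def selective_risk_curve(results: List[Tuple[float, bool]]) -> List[Tuple[float, float]]:
    """For each confidence threshold, report (coverage, risk):
    coverage = fraction of questions the model answers (confidence >= threshold),
    risk     = error rate among the questions it chose to answer."""
    points = []
    for threshold in sorted({conf for conf, _ in results}):
        answered = [(conf, ok) for conf, ok in results if conf >= threshold]
        if not answered:
            continue
        coverage = len(answered) / len(results)
        risk = sum(1 for _, ok in answered if not ok) / len(answered)
        points.append((coverage, risk))
    return points

# Illustrative data: (model confidence, was the answer correct?)
results = [(0.95, True), (0.90, True), (0.80, False), (0.60, True), (0.40, False), (0.30, False)]
for coverage, risk in selective_risk_curve(results):
    print(f"coverage={coverage:.2f}  risk={risk:.2f}")
```

A model that bluffs shows high risk even at low coverage; a calibrated model trades coverage for sharply lower risk as the threshold rises.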
Product metrics should follow suit. Rather than optimizing for maximum “answers per query,” teams can track verifiability (the share of claims with references), grounded precision (the accuracy of claims backed by cited evidence), and safe abstention rate (cases in which the model requests tools or declines to answer). The NIST AI Risk Management Framework and guidance from the UK AI Safety Institute both promote measurable uncertainty and calibration as central components of trustworthy AI.
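A minimal sketch of how those product metrics might be computed from a response log follows; the `Response` record shape and its field names are assumptions for illustration, not an existing schema.

```python
# Minimal sketch of the product metrics above, assuming a log of responses
# where each record notes its claims, citations, verification results, and
# whether the model abstained or called a tool. The record shape is hypothetical.

from dataclasses import dataclass, field
from typing import List

@dataclass
class Response:
    claims: List[str] = field(default_factory=list)      # factual claims made
    cited: List[bool] = field(default_factory=list)       # does each claim carry a reference?
    supported: List[bool] = field(default_factory=list)   # does each claim check out against evidence?
    abstained_or_used_tool: bool = False

def product_metrics(log: List[Response]) -> dict:
    all_claims = sum(len(r.claims) for r in log)
    cited = sum(sum(r.cited) for r in log)
    cited_and_supported = sum(
        sum(1 for c, s in zip(r.cited, r.supported) if c and s) for r in log
    )
    return {
        # Verifiability: share of claims that carry a reference.
        "verifiability": cited / all_claims if all_claims else 0.0,
        # Grounded precision: of the cited claims, how many are actually supported.
        "grounded_precision": cited_and_supported / cited if cited else 0.0,
        # Safe abstention rate: responses where the model deferred or reached for a tool.
        "safe_abstention_rate": sum(r.abstained_or_used_tool for r in log) / len(log) if log else 0.0,
    }
```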
There are technical levers to back up those incentives: calibrated confidence scoring, self-consistency and cross-examination prompts (multiple model instances argue against each other, rather than one model going unchallenged), tool use for retrieval and code execution, and chain-of-verification (forcing models to check claims against sources before answering). Systems like DeepMind’s Sparrow led the way with rule-based refusal policies; more recent enterprise deployments gate high-risk answers behind citations or human review.
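As one illustration of how such levers can combine, here is a hedged sketch of self-consistency with an abstention gate: sample several answers and commit only when a clear majority agrees. The `ask_model` callable is a placeholder for whatever generation API is in use, and the threshold is an arbitrary example.

```python
# Illustrative sketch: self-consistency with an abstention gate.
# Sample several answers; only commit when a large enough majority agrees,
# otherwise abstain and route to retrieval, citation, or human review.
# `ask_model` is a placeholder for the generation API in use.

from collections import Counter
from typing import Callable, Optional

def self_consistent_answer(
    ask_model: Callable[[str], str],   # returns one sampled answer per call
    question: str,
    n_samples: int = 5,
    agreement_threshold: float = 0.6,
) -> Optional[str]:
    samples = [ask_model(question) for _ in range(n_samples)]
    answer, count = Counter(samples).most_common(1)[0]
    if count / n_samples >= agreement_threshold:
        return answer   # confident enough to commit
    return None         # abstain: escalate to retrieval, citation, or review
```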
Trade‑offs and limits
Harsher penalties for overconfidence lead to more abstentions and slower responses, which some users will find less helpful. There is also the danger of excess doubt, where models hedge too much, especially on underrepresented topics where retrieval data is thin. Guardrails need to be domain-aware: in creative ideation, some speculation is fine; in finance or medicine, it is not.
And incentives aren’t the only part of the picture. Errors will persist wherever there are knowledge gaps, ambiguous prompts, and distribution shift. No scoring tweak can replace better data curation, sensible tool design, and real provenance. But changing incentives is a high-leverage move that alters how models behave here and now, not after the next moonshot.
Bottom line
Bad incentives do not conjure hallucinations out of nothing, but they make them stick. When leaderboards and KPIs reward lucky guesses and polished prose, models learn to bluff. Flip the incentives (penalize confident wrongness, reward calibrated uncertainty and the willingness to update, insist on concrete, verifiable grounding for claims) and honesty turns from a nice-to-have into the winning strategy.