OpenAI is taking heavy flak from all quarters of the AI and math communities after its researchers touted its new model as having solved a collection of well-known Erdős problems, only to have to admit the “solutions” were already in the literature. The episode has reopened a perennial question with real implications for science and security: when does an AI system really solve a mathematical problem, and when is it just very good at finding other people’s work?
The claim of new Erdős solutions and the swift walkback
A senior OpenAI executive celebrated GPT-5 for making progress on multiple open Erdős problems and finding solutions to others, according to reporting in The Decoder and posts that have since been deleted. Those claims unraveled quickly when mathematician Thomas Bloom, curator of the Erdős Problems site, explained that “open” on his page means he is unaware of a solution, not that none exists. In other words, the model had retrieved existing results from the literature rather than derived new mathematics.
OpenAI researcher Sébastien Bubeck later acknowledged that the model had found the solutions in the existing literature, while adding the caveat that even this is nontrivial because mathematical research is sprawling and fragmented. Competitors were less forgiving: senior figures at Meta and Google DeepMind said publicly that the episode was a self-inflicted wound, arguing that credibility suffers when literature search is confused with the discovery of new knowledge.
Why retrieval alone is not the same as real reasoning
Knowing that a proof exists is useful; producing one is transformative. The distinction matters because language models can be superb at retrieval, summarization, and pattern completion without being any good at the rigorous deductive reasoning that underpins much of mathematics. They can also hallucinate steps that sound as if they should work but that have little chance of surviving scrutiny when no formal proof checker is in the loop.
The distinction is all the harder to police given the enormity of the mathematical corpus. zbMATH Open and MathSciNet jointly index millions of publications, with tens of thousands of new mathematics papers added every year, and preprints across every subfield stream onto arXiv. In that ocean, surfacing the right citation is an honest accomplishment, but it is closer to search than to the creative leap demanded of a mathematician, or of a reasoning engine that claims to have made breakthroughs.
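To make that concrete, here is a minimal sketch of what literature retrieval looks like in practice: a query against the public arXiv API. The endpoint and Atom feed format are real; the query phrase and the function name are illustrative placeholders, not part of any lab’s tooling.

```python
# A minimal sketch of literature retrieval: query the public arXiv API (a real
# Atom-feed endpoint) for preprints matching a phrase. The query phrase and the
# function name are illustrative placeholders only.
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"

def search_arxiv(phrase: str, max_results: int = 5):
    """Return (title, link) pairs for arXiv preprints matching the phrase."""
    query = urllib.parse.urlencode(
        {"search_query": f'all:"{phrase}"', "start": 0, "max_results": max_results},
        safe=":",
    )
    url = f"http://export.arxiv.org/api/query?{query}"
    with urllib.request.urlopen(url) as resp:
        feed = ET.fromstring(resp.read())
    return [
        (entry.find(f"{ATOM}title").text.strip(), entry.find(f"{ATOM}id").text.strip())
        for entry in feed.findall(f"{ATOM}entry")
    ]

if __name__ == "__main__":
    for title, link in search_arxiv("Erdos problem"):
        print(f"{title}\n  {link}")
```

Any model with tool access can perform this kind of lookup; the mathematics only begins once the citation has been found.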
The high bar for validating genuine math breakthroughs
In mathematics, a breakthrough means a new argument that withstands expert scrutiny or mechanical verification. That bar is high by design. The communities around proof assistants like Lean, Isabelle and Coq have demonstrated how computer-checked proofs can ratchet up standards; the Lean-driven formalization of parts of Peter Scholze’s work is a famous example of humans and machines collaborating to raise the level of rigor.
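For readers who have not used a proof assistant, a deliberately trivial Lean 4 sketch (assumed here for illustration, not drawn from any formalization project) shows what mechanical validation means: the kernel either accepts the proof or refuses to compile the file.

```lean
-- A simple statement about natural numbers. The Lean kernel checks the proof;
-- if the justification were wrong or missing, the file would not compile,
-- so the compiled artifact is itself evidence that the claim holds.
theorem sum_comm (a b : Nat) : a + b = b + a := by
  exact Nat.add_comm a b
```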
If a model really is making progress on an open problem, the claim should come with a minimum level of evidence:
- A detailed proof
- Formal artifacts or proof traces
- Expert involvement with specialists in the subject area
A claim about a major conjecture that does not involve those artifacts should be considered a hypothesis, not a headline.
Benchmark bragging and the fragility of reported scores
Some of that confusion comes from how labs sell “reasoning.” Benchmarks such as GSM8K, MATH and AIME-style tests highlight progress but are brittle. Small sample sizes, sensitivity to prompting and sampling, and access to external tools can make scores swing wildly. Research groups such as Stanford’s Center for Research on Foundation Models, along with independent audit communities, have documented contamination risks when training data overlaps with test sets.
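The idea behind those contamination audits can be shown with a simplified sketch; the 13-gram window is a commonly cited heuristic, and the helper names here are hypothetical rather than any lab’s actual pipeline.

```python
# A simplified sketch of a contamination check: flag test items whose word
# n-grams already appear somewhere in the training corpus. Real audits add
# normalization, hashing at scale, and fuzzier matching.
from typing import Iterable, Set, Tuple

def ngrams(text: str, n: int = 13) -> Set[Tuple[str, ...]]:
    """Lower-cased word n-grams; a 13-word window is a commonly cited heuristic."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def flag_contaminated(test_items: Iterable[str],
                      train_docs: Iterable[str],
                      n: int = 13) -> list:
    """Return the test items that share at least one n-gram with any training document."""
    train_grams: Set[Tuple[str, ...]] = set()
    for doc in train_docs:
        train_grams |= ngrams(doc, n)
    return [item for item in test_items if ngrams(item, n) & train_grams]
```

The principle is simple: if a test question already appears verbatim in the training corpus, a high score on it says little about reasoning.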
On popular math benchmarks, recent models regularly exceed 90% with chain-of-thought prompting and careful sampling. That is impressive, but it is not the same as producing a new theorem or an extended original proof that experts would recognize as such. Mistaking leaderboard wins for real discovery invites exactly the sort of backlash OpenAI is now facing.
What actual progress in AI mathematical reasoning looks like
The template is clear for AI labs trying to demonstrate mathematical reasoning. First, preregister claims and evaluation protocols to dampen hype cycles and post hoc rationalization. Second, publish proofs in formal languages whenever feasible, with public verification. Third, work with domain experts early and often rather than announcing results on social media first. Finally, keep the capabilities separate: retrieval-augmented literature discovery, informal heuristic arguments, and mechanically verified proofs are different achievements and should be reported as such.
There is also a chance to turn this stumble into something genuinely useful. Even without solving an Erdős problem on its own, a transparent “AI mathematical research assistant” that can reliably find prior work, suggest related lemmas, and draft formalizations could be of immense value to working mathematicians.
Competitive subtext as AI labs race to claim reasoning wins
The flap comes as OpenAI, Google DeepMind and Meta race to establish themselves as the technical leader in reasoning. That competition can hasten real progress, and it can also deepen the incentive to oversell. The quickest way to reset expectations is simple: let proofs, code, and third-party verification do the talking. Until then, claiming victory on open problems looks less like innovation and more like an own goal.