In a twist tailor-made for the AI era, a review of NeurIPS papers has uncovered fabricated citations embedded in work presented at one of the field’s most prestigious conferences. The findings highlight a growing tension in research: large language models are speeding up writing and literature review, but they are also inserting convincing-sounding references that do not exist.
What the Audit Actually Found in NeurIPS Papers
According to an analysis by GPTZero, first reported by Fortune, investigators confirmed 100 hallucinated references scattered across 51 NeurIPS papers. NeurIPS acknowledged the findings, noting that although about 1.1% of papers contained one or more incorrect references, the scientific content of those papers is not automatically invalidated. Still, context matters: each paper can cite dozens of sources, and across the program there are tens of thousands of references. Proportionally small does not mean insignificant.
GPTZero framed the project as a stress test of conference workflows under a “submission tsunami,” noting that reviewer pipelines have been stretched thin at top venues. That strain is not new. A widely discussed paper from May 2025, “The AI Conference Peer Review Crisis,” warned that rapid submission growth at leading machine learning conferences had outpaced the community’s capacity to review with the thoroughness researchers expect.
Why Fake Citations Are Not a Minor Glitch
Citations are more than academic bookkeeping. They serve as provenance for claims, scaffolding for reproducibility, and a currency for careers. When references are invented—say, a plausible author list attached to a non-existent 2018 workshop paper—they can quietly contaminate citation graphs that feed into tools like Semantic Scholar, OpenAlex, and institutional dashboards. The downstream effects include skewed bibliometrics, misdirected literature searches, and, in the worst case, fragile results that cannot be traced back to real sources.
Hallucinated citations are a known failure mode of generative models. LLMs excel at composing fluent, on-topic prose but, without grounded retrieval, they “complete the pattern” of a citation rather than verify it. In practice that often looks like realistic venues, real-sounding DOIs, and titles that almost—but not quite—match actual papers.
How It Slipped Past Review and Conference Checks
Peer reviewers are not expected to click and validate dozens of references per paper, especially under conference timelines. Their primary remit is novelty, technical correctness, and clarity. When submission volume grows, checks on peripheral but essential details, such as reference accuracy, are among the first to slip. GPTZero acknowledged this, emphasizing that the goal was not to fault reviewers but to quantify where AI-generated errors seep in.
The larger backdrop is the rapid adoption of generative AI in research writing. Multiple academic surveys since 2023 have found widespread experimentation with tools like ChatGPT for editing, summarization, and drafting. Many top venues encourage responsible use and disclosure, but norms and enforcement vary, and paper preparation frequently involves tight deadlines and large author teams—ripe conditions for subtle mistakes to persist.
Concrete Safeguards the Community Can Deploy
This episode is less a scandal than a systems failure—one that can be addressed with better tooling and policy. Practical steps include:
- Mandatory DOI or arXiv identifiers for all references, with automated validation against Crossref, arXiv, or PubMed before camera-ready submission (a validation sketch follows this list).
- Integration of citation-verification services during submission. Tools such as scite, Crossref’s Simple Text Query, and Semantic Scholar’s APIs can flag non-resolving references, mismatched titles, or nonexistent venues in seconds (see the second sketch below).
- Clearer disclosure policies for generative AI assistance, paired with automated screening. If LLMs were used to draft related-work sections, authors should attest that all references were manually verified.
- Targeted reviewer prompts. Rather than asking reviewers to validate every citation, conferences can assign one reviewer a light-touch “reference audit” or run centralized checks that produce a short report per paper.
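To make the first item concrete, here is a minimal Python sketch of DOI validation against the public Crossref REST API (api.crossref.org/works/{doi}), using the third-party requests library. The reference format, the helper names, and the 0.8 similarity threshold are illustrative assumptions, not a NeurIPS requirement.

```python
# Sketch: verify that each cited DOI resolves and that its registered title
# roughly matches the title the paper claims to cite.
import difflib
import requests


def crossref_title(doi: str) -> str | None:
    """Return the title registered for a DOI, or None if it does not resolve."""
    resp = requests.get(f"https://api.crossref.org/works/{doi}", timeout=10)
    if resp.status_code != 200:
        return None  # non-resolving DOI: likely mistyped or fabricated
    titles = resp.json()["message"].get("title", [])
    return titles[0] if titles else None


def audit(references: list[dict]) -> list[str]:
    """Flag references whose DOI is dead or attached to a different title."""
    flags = []
    for ref in references:
        registered = crossref_title(ref["doi"])
        if registered is None:
            flags.append(f"{ref['doi']}: DOI does not resolve")
            continue
        similarity = difflib.SequenceMatcher(
            None, ref["title"].lower(), registered.lower()
        ).ratio()
        if similarity < 0.8:  # assumed threshold for "mismatched title"
            flags.append(f"{ref['doi']}: cited title does not match Crossref record")
    return flags


# Hypothetical usage with a parsed bibliography entry:
# print(audit([{"doi": "10.48550/arXiv.1706.03762", "title": "Attention Is All You Need"}]))
```

A check like this is cheap enough to run centrally at submission time rather than asking reviewers to do it by hand.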
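For references that carry no DOI at all, a similar check can query Semantic Scholar's Graph API paper-search endpoint and score the cited title against the closest indexed papers. This is a sketch under the same assumptions; the candidate limit, threshold, and calling code are illustrative.

```python
# Sketch: flag cited titles that do not closely match anything in a large
# bibliographic index, a common signature of hallucinated references.
import difflib
import requests

SEARCH_URL = "https://api.semanticscholar.org/graph/v1/paper/search"


def best_match(cited_title: str) -> tuple[str, float]:
    """Return the closest indexed title and its similarity to the cited one."""
    resp = requests.get(
        SEARCH_URL,
        params={"query": cited_title, "fields": "title", "limit": 5},
        timeout=10,
    )
    resp.raise_for_status()
    candidates = [p["title"] for p in resp.json().get("data", [])]
    if not candidates:
        return "", 0.0  # nothing indexed even vaguely matches
    scored = [
        (t, difflib.SequenceMatcher(None, cited_title.lower(), t.lower()).ratio())
        for t in candidates
    ]
    return max(scored, key=lambda pair: pair[1])


# A title that "almost but not quite" matches a real paper surfaces here with a
# high-but-imperfect score; anything well below that deserves a human look.
title, score = best_match("Attention Is All You Need")
print(title, round(score, 2))
```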
The Takeaway for AI’s Credibility in Research Workflows
NeurIPS prides itself on rigorous scholarship, and by the numbers this is not an existential crisis. But it is a telling signal. If experts closest to the technology can miss fabricated references, enterprises, policymakers, and the public should calibrate expectations about where LLMs excel and where they demand guardrails. The fix is straightforward: pair generative models with retrieval, require machine checks for citations, and keep human verification in the loop.
The irony may grab headlines, but the lesson is practical. Trust in AI-assisted research will be earned not by banning tools, but by building and enforcing workflows that make hallucinations hard to introduce and easy to catch.