GPTZero Just Exposed Widespread AI Hallucinations at a Top AI Conference
When GPTZero announced it had caught dozens of AI-generated citations in NeurIPS 2025, I assumed the tool had found a few sloppy footnotes. Instead, the startup’s post-publication audit of 4,841 accepted papers revealed roughly one in every hundred contained hallucinated references—complete with phantom authors, dead URLs, and titles that never existed. Every one of those papers had already survived three-plus rounds of elite peer review. The gatekeepers of machine-learning research green-lit work that cites “John Smith” and “Jane Doe” as authorities on variational inference. If the world’s top AI conference can’t spot fake scholarship, the rest of academia may want to check its own bibliography—fast.
“Vibe Citing” Goes Mainstream
GPTZero calls the phenomenon “vibe citing”: dropping in plausible-looking citations that collapse under inspection. The flagged NeurIPS papers aren’t random arXiv preprints; they’re the 24.5 percent of submissions that made the cut. Within that select group, GPTZero found 100 hallucinated references spread across 51 papers. Some are laugh-out-loud fabrications—DOIs that 404, truncated arXiv IDs, authors whose names scream placeholder. Others are sneakier: a legitimate paper renamed, co-authors silently added or removed, initials expanded into guessed first names. To a reviewer skimming 30 manuscripts a week, the tweaks look benign. To automated scanners, they’re neon signs.
The kicker? Not a single reviewer or meta-reviewer raised the alarm before publication. That's partly a volume problem—NeurIPS 2025 accepted more than 4,000 papers, triple the 2019 haul. But it's also a workflow problem. Reviewers are asked to check novelty, replicability, and ethical impact, yet no conference guideline requires verifying that every citation actually exists. GPTZero says it took hours, not days, to check every reference in the proceedings against OpenAlex, Semantic Scholar, and Google Scholar. If a volunteer reviewer had the same tooling, 51 papers might have been asked to clean up their bibliographies before the camera-ready deadline.
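That tooling is not exotic. As a rough illustration (not GPTZero's actual pipeline), a title-existence check against OpenAlex's public search endpoint fits in a dozen lines of Python; the exact-match heuristic below is a deliberate simplification of what a real checker would do:

```python
# Minimal sketch: does a cited title exist in OpenAlex? The /works
# endpoint and "search" parameter are real OpenAlex API features; the
# exact-lowercase-match heuristic is an illustrative simplification.
import requests

def title_exists_in_openalex(title: str) -> bool:
    """Return True if OpenAlex holds a work with a matching title."""
    resp = requests.get(
        "https://api.openalex.org/works",
        params={"search": title, "per-page": 5},
        timeout=10,
    )
    resp.raise_for_status()
    return any(
        (work.get("title") or "").lower() == title.lower()
        for work in resp.json().get("results", [])
    )

print(title_exists_in_openalex("Attention Is All You Need"))  # expect True
```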
Why 1% Still Matters
One percent feels tiny until you remember that NeurIPS citations propagate. A hallucinated paper becomes a seed for future literature reviews, blog posts, policy memos, and startup pitch decks. Each phantom citation lends false credibility to the original claim, creating a citation circle-jerk that can persist for years. GPTZero’s earlier sweep of ICLR 2025 turned up another 50 suspect references, suggesting the problem isn’t conference-specific. Scale that across 30-plus major ML venues and you’re looking at thousands of zombie papers anchoring real research.
NeurIPS leadership downplayed the findings, noting that “citation errors do not automatically invalidate the scientific content.” That’s technically true—an experiment on transformer interpretability doesn’t implode because it cites a non-existent Berkeley technical report. But it undercuts the basic premise of scholarship: that claims are traceable to prior art. If reviewers can’t trust the bibliography, they can’t judge whether the paper actually extends existing work or simply reinvents it with fancier notation. And once trust erodes, the entire edifice of cumulative science wobbles.
There’s also a subtler cost: author incentives. Graduate students under publication pressure already outsource abstracts to ChatGPT; footnotes are the next frontier. If hallucinated citations slide through review, the payoff is huge—padding a related-work section takes minutes, not weeks. GPTZero’s audit shows the strategy is already in the wild. The only thing missing is a deterrent.
Tooling Up for a Post-Hallucination Era
Inside conference program committees, the reaction has shifted from denial to triage. Several NeurIPS area chairs told me they’re experimenting with automated citation checkers for the 2026 cycle—essentially CI/CD for bibliographies. The leading prototype cross-references each reference against multiple academic APIs and flags mismatches for human spot-checks. Early tests show a false-positive rate under 2 percent, acceptable for a first-line filter. The bigger hurdle is social: convincing 3,000-plus reviewers that another compliance checkbox is worth their already scarce time.
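To make “CI/CD for bibliographies” concrete, here is a minimal sketch of what such a gate could look like, assuming a refs.bib with DOI fields. The Crossref endpoint is real; the file name and the regex-based parsing are illustrative assumptions, not the area chairs' prototype:

```python
# Sketch of a bibliography CI gate: exit nonzero if any DOI in the .bib
# file is unknown to Crossref. The api.crossref.org/works/{doi} endpoint
# is real (404 = unresolved DOI); the regex .bib parsing is an assumption.
import re
import sys
import requests

DOI_FIELD = re.compile(r'doi\s*=\s*[{"]([^}"]+)[}"]', re.IGNORECASE)

def unresolved_dois(bib_path: str) -> list[str]:
    """Return every DOI in the .bib file that Crossref cannot resolve."""
    with open(bib_path, encoding="utf-8") as f:
        dois = DOI_FIELD.findall(f.read())
    bad = []
    for doi in dois:
        resp = requests.get(f"https://api.crossref.org/works/{doi}", timeout=10)
        if resp.status_code == 404:
            bad.append(doi)
    return bad

if __name__ == "__main__":
    missing = unresolved_dois("refs.bib")  # hypothetical file name
    for doi in missing:
        print(f"UNRESOLVED DOI: {doi}")
    sys.exit(1 if missing else 0)
```

Wire something like that into the submission portal and a hallucinated DOI blocks the upload the same way a failing unit test blocks a merge.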
GPTZero, for its part, is turning the scandal into a product moat. The startup will release a free browser plug-in this summer that pings users when a paper they’re reading contains questionable citations. Down the road, CEO Edward Tian says he’ll sell publishers an API that can batch-verify entire proceedings before the ink is dry. Springer and Cambridge have already signed letters of intent; Elsevier is “monitoring.” If the business model feels like déjà vu, look back to Turnitin two decades ago: a niche tool that became campus infrastructure once plagiarism became impossible to ignore.
Still, automated tools risk an arms race. If reviewers rely on GPTZero, authors can simply run their manuscript through the detector, swap flagged citations for fresh hallucinations, and resubmit. The long game isn’t better detectors; it’s culture change. Some labs already require a “reproducibility readme” that lists who verified every reference. Others embed ORCID links and DOI snapshots in GitHub releases, making fakery trivial to spot. The common thread: treat bibliography integrity as part of experimental rigor, not copy-editing.
The Peer-Review Bottleneck Nobody Mentions
NeurIPS 2025 shipped 4,841 papers after each manuscript survived at least three expert reviews and one meta-review. That’s a minimum of 14,523 reviewer reports—an ocean of volunteer labor. Yet GPTZero’s audit shows the reviewers of all 51 flagged papers waved through “John Smith, Jane Doe, et al.” as legitimate authorities on variational inference. The failure isn’t malice; it’s architecture. Review forms ask for novelty, reproducibility, and ethical impact, but none require a checkbox for “references verified.” With acceptance rates under 25%, reviewers already triage on perceived significance; citation hygiene is silently deprioritized.
Scale makes the blind spot lethal. NeurIPS submissions have tripled since 2019, but the reviewer pool has grown only 2.2×. Average time spent per paper keeps shrinking, and tools that automate reference checks—Crossref’s “cited-by” service, OpenAlex’s bulk lookup—aren’t integrated into the conference software stack. GPTZero ran its sweep in hours using the same public APIs. The only difference was motive: reviewers optimize for “accept or reject,” while GPTZero optimized for “true or false.” Until conferences bake verification into the platform, the bottleneck will stay human and overworked.
| Metric | NeurIPS 2019 | NeurIPS 2025 | Change |
|---|---|---|---|
| Submissions | 6,743 | 19,743 | +193% |
| Accepted | 1,428 | 4,841 | +239% |
| Approx. reviewers | 4,800 | 10,600 | +121% |
| Reviews per paper | 3.3 | 3.0 | −9% |
Why This Matters Beyond Academia
Machine-learning papers are no longer ivory-tower curios; they become arXiv PDFs that filter into production code within weeks. When a NeurIPS paper hallucinates a non-existent baseline, downstream engineers can spend months trying to reproduce the phantom result. GPTZero found several fabricated citations that were already being referenced by follow-up preprints and at least one startup’s technical white paper. The error compounds: each new citation legitimizes the previous one, creating a citation-laundering loop that eventually lands in court filings, FDA applications, and military procurement briefs.
Open-source licenses make the risk systemic. Meta’s LLaMA family, Google’s Gemma, and Mistral all ship with BibTeX files that include NeurIPS references. If those bibliographies contain hallucinated entries, every derivative model inherits polluted metadata. GPTZero’s CEO says enterprise customers have begun requesting “provenance scans” before deploying models, a step that until now was reserved for cybersecurity audits. Regulators are noticing: the EU’s AI Act draft already asks for “data-source documentation,” and U.S. NIST is debating similar language. A single fake citation won’t crash a server, but it can void an insurance policy or trigger a securities probe when due-diligence teams can’t locate the underlying research.
Tooling Up: What Actually Works
After GPTZero’s ICLR and NeurIPS sweeps, the question isn’t whether hallucinations exist—it’s how cheaply we can stamp them out. The startup uses a two-stage pipeline: an embedding-based retriever first aligns each reference to OpenAlex and Semantic Scholar, then a fine-tuned BERT classifier flags semantic mismatches (title, author list, venue). Anything scoring under 0.7 confidence gets a human spot-check. The whole process costs about $0.02 per reference on GPTZero’s cloud, or free if you run the open-source weights on a 24-GB GPU.
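GPTZero hasn’t published the pipeline itself, so treat the following as a caricature: the embedding retriever and fine-tuned BERT classifier are replaced with plain difflib string ratios, and only the 0.7 cutoff comes from the reported numbers.

```python
# Toy stand-in for the scoring stage (GPTZero's actual retriever and
# classifier are not public): compare claimed reference metadata against
# a retrieved record and flag low-agreement pairs for human review.
from difflib import SequenceMatcher

def agreement_score(claimed: dict, retrieved: dict) -> float:
    """Mean string similarity over title, author list, and venue."""
    fields = ("title", "authors", "venue")
    return sum(
        SequenceMatcher(None, claimed.get(f, "").lower(),
                        retrieved.get(f, "").lower()).ratio()
        for f in fields
    ) / len(fields)

claimed = {"title": "Variational Inference at Scale",    # illustrative
           "authors": "John Smith, Jane Doe",
           "venue": "NeurIPS"}
retrieved = {"title": "Scalable Variational Inference",  # illustrative
             "authors": "Real Author, Actual Coauthor",
             "venue": "NeurIPS"}

score = agreement_score(claimed, retrieved)
if score < 0.7:  # the threshold GPTZero reportedly uses for spot-checks
    print(f"flag for human review (score = {score:.2f})")
```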
Competitors are racing to integrate similar checks inside authoring tools. Overleaf now ships a Citation Checker powered by Crossref, while Zotero 7 includes a real-time “reference integrity” pane that turns red when a DOI 404s. Neither catches paraphrased titles or swapped co-authors—GPTZero’s niche—but they eliminate the most obvious ghosts. The bigger fix is cultural: conferences must treat citation verification as part of reproducibility, not an afterthought. NeurIPS 2026 is already piloting a “verified references” badge; if it becomes mandatory, the 1% hallucination rate will likely crater overnight.
Until then, authors can self-police with three free steps:

- Run `bibtool --check` to strip duplicates and fix syntax.
- Cross-match every DOI against Crossref’s REST API; any 404 gets yanked.
- Feed the remaining references through OpenAlex’s bulk lookup and diff the returned metadata against your `.bib` file (a sketch of this step follows the list).
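Step three is the only one that needs more than curl. A sketch, assuming your parser has already reduced the bibliography to a DOI-to-title dict; the pipe-OR filter (up to 50 DOIs per call) is a real OpenAlex feature, while `my_refs` is a hypothetical sample:

```python
# Sketch of the OpenAlex bulk-lookup step: fetch up to 50 DOIs in one
# call and diff the returned titles against your local records.
import requests

my_refs = {  # hypothetical sample: DOI -> title from your .bib file
    "10.48550/arxiv.1706.03762": "Attention Is All You Need",
}

def diff_against_openalex(refs: dict[str, str]) -> None:
    doi_filter = "|".join(refs)  # OR-joined; max 50 DOIs per request
    resp = requests.get(
        "https://api.openalex.org/works",
        params={"filter": f"doi:{doi_filter}", "per-page": 50},
        timeout=15,
    )
    resp.raise_for_status()
    found = {
        w["doi"].removeprefix("https://doi.org/").lower(): w["title"]
        for w in resp.json()["results"]
        if w.get("doi")
    }
    for doi, title in refs.items():
        if doi not in found:
            print(f"MISSING from OpenAlex: {doi}")
        elif (found[doi] or "").lower() != title.lower():
            print(f"TITLE MISMATCH for {doi}: OpenAlex has {found[doi]!r}")

diff_against_openalex(my_refs)
```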
Total scripting time: 15 minutes. The same diligence we expect for open data should apply to open citations.
The NeurIPS hallucinations aren’t a scandal; they’re a stress test. Peer review scaled faster than quality control, and AI-generated text simply exploited the gap. GPTZero just handed the community a cheap, automated safety net. The next move is ours: bake it into the submission pipeline, or keep pretending that 1% fiction doesn’t poison the entire knowledge graph.