I put six well-known AI assistants through a battery of trick questions to see who could tell fact from fiction under pressure. The result felt like hallucination roulette: flashes of brilliance punctuated by confident nonsense, with the ball landing differently from prompt to prompt.
The lineup included ChatGPT, Google Gemini, Microsoft Copilot, Claude, Meta AI, and Grok. I used free, default settings, identical wording, and no plug-ins or web-browsing boosts. The prompts targeted classic failure modes: false presuppositions, fabricated citations, ambiguous images, and culture-laden symbols that tempt models to guess.

What I Asked And Why These Results Matter For Users
A “gotcha” bibliography test led the way: I asked for four books by a tech author who has only published two. This reveals whether a model challenges the premise or invents titles to satisfy the request. I then tested a notorious legal phantom by prompting about a non-existent case—an echo of high-profile courtroom mishaps where fabricated citations slipped into filings. Pop culture traps followed: the fate of Toro from Marvel’s Golden Age, a black-and-white still of a famous sci-fi automaton, and the meaning of the heartagram symbol. Each probes different instincts: deference to authority, world-knowledge granularity, visual recognition, and safety guardrails.
Winners, Losers, And Weird Misses Across The Board
On the bibliography trick, four models slowed down, questioned the premise, and asked to verify the number of books. Two confidently produced imaginative (and wrong) titles—plausible-sounding but nonexistent. That’s classic confabulation: the model stitches together likely sequences rather than acknowledging uncertainty.
The legal prompt was more stark. Five models flagged the case as unverified or likely fabricated and urged checking an official docket. One model, however, supplied a detailed procedural history for a case that does not exist, complete with invented parties and venue. In a legal context, that’s not a small miss—it’s a liability. This mirrors independent evaluations by legal-tech researchers who have warned that LLMs can generate authoritative-sounding yet false citations unless constrained by verified legal databases.
Marvel lore produced a split decision. Most systems correctly described Toro as a World War II–era sidekick tied to the original Human Torch, later folded into modern continuity. One model insisted he was an android, confusing threads from older retcons. When challenged, it backtracked, a reminder that models can self-correct when nudged but still default to the most statistically likely story, not the true one.
The image tests were revealing. Given a still of Metropolis’s Maschinenmensch, two models nailed the reference immediately. Others guessed “contemporary installation,” “Art Deco sculpture,” or even the Borg Queen—close on vibe, far on fact. With a heartagram image, three models identified the symbol and its musical provenance. One called it an adoption emblem (a lookalike), another labeled it an inverted pentagram, and one escalated with crisis resources, an example of overactive safety filters mistaking symbolism for self-harm risk.

Taken together, the group performed better than doomsayers might expect: most prompts yielded solid answers. But the misses were consequential, and confidence rarely matched accuracy. Across the five tests, at least one model hallucinated specifics on three, an error rate that is unacceptable for high-stakes work.
Why Hallucinations Happen In Today’s AI Assistants
Large language models predict the next token based on training patterns, not ground truth. When a prompt contains a false premise or sparse context, the safest action is to ask clarifying questions or decline. Yet helpfulness training can bias models toward answering anyway. Safety layers further complicate outcomes: they reduce harmful content but can misfire on benign inputs, as the heartagram overreaction showed.
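To make that concrete, here is a toy sketch of why a likelihood-driven decoder volunteers a plausible title instead of admitting uncertainty. The candidate continuations and probabilities below are invented for illustration, not output from any real model; the point is only that greedy decoding picks the highest-scoring continuation, and "I'm not sure" is rarely the statistically favored one.

```python
# Toy illustration of likelihood-driven selection. The continuations and
# probabilities are invented for demonstration, not real model output.

candidate_continuations = {
    '"The Startup Playbook" (2019)': 0.41,            # plausible-sounding, fabricated title
    '"Scaling in Silicon" (2021)': 0.33,               # also fabricated, but fluent
    "I can only verify two published books.": 0.18,    # the honest answer
    "I'm not sure.": 0.08,                             # abstention is rarely the likeliest path
}

# Greedy decoding: take the single most probable continuation.
best = max(candidate_continuations, key=candidate_continuations.get)
print(best)  # prints a fabricated title, because fluency, not truth, drives the score
```

Real decoders operate over individual tokens and usually sample rather than take the maximum, but the incentive is the same: the most likely continuation wins unless training or prompting explicitly rewards abstention.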
Independent benchmarks corroborate the pattern. The Stanford AI Index reported that state-of-the-art models improved on factuality tasks but still produce ungrounded statements, with performance varying widely across domains. On TruthfulQA, top systems reach around 70% truthful responses while human annotators exceed 90%, underscoring persistent gaps. NIST’s AI Risk Management Framework similarly flags hallucination as a reliability risk that must be managed with process, not just model choice.
How To Lower Your Risk From AI Hallucinations Right Now
- Force grounding: ask the model to cite named sources and to separate facts from inferences (see the prompt sketch after this list).
- Explicitly permit “I don’t know,” and reward uncertainty by asking for confidence estimates.
- For legal, medical, or financial queries, route through tools connected to authoritative databases or use retrieval-augmented prompts that include your own documents.
- Ask adversarial follow-ups like “What could be wrong with your answer?” to surface hidden assumptions.
- Always cross-check high-impact claims with a second system—or a human expert.
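As referenced above, here is a minimal sketch of what a grounded, retrieval-augmented prompt can look like in practice. It only assembles the prompt text: the document snippets are stand-ins for your own sources, and `ask_model` is a placeholder for whichever assistant, SDK, or local model you actually use.

```python
# Minimal sketch of a grounded, retrieval-augmented prompt. The snippets and
# the ask_model() stub are placeholders; swap in your own documents and
# whichever assistant or API you actually use.

GROUNDING_RULES = """Answer using ONLY the sources below.
Cite the source ID for every factual claim.
Separate verified facts from your own inferences.
If the sources do not contain the answer, say "I don't know."
End with a confidence estimate (low / medium / high)."""

def build_grounded_prompt(question: str, snippets: dict[str, str]) -> str:
    """Combine the rules, the user's documents, and the question into one prompt."""
    sources = "\n".join(f"[{doc_id}] {text}" for doc_id, text in snippets.items())
    return f"{GROUNDING_RULES}\n\nSOURCES:\n{sources}\n\nQUESTION: {question}"

def ask_model(prompt: str) -> str:
    """Placeholder: call your assistant of choice here (web UI, SDK, or local model)."""
    raise NotImplementedError

if __name__ == "__main__":
    prompt = build_grounded_prompt(
        question="How many books has this author published, and what are they titled?",
        snippets={
            "S1": "Publisher catalog excerpt listing the author's two published books.",
            "S2": "Author bio from the official website.",
        },
    )
    print(prompt)  # paste into an assistant, or route through ask_model() once wired up
```

The code matters less than the contract it encodes: the model is told where facts may come from, given explicit permission to say "I don't know," and asked to rate its own confidence, which is exactly what the checklist above recommends.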
The Takeaway For AI Adoption In High-Stakes Settings
These tests show meaningful progress—and stubborn limits. The best models can be stunningly precise on Tuesday and confidently wrong on Wednesday, depending on the phrasing. For everyday tasks, they’re productive co-pilots. For anything that could impact health, liberty, or money, treat output as a draft, not a verdict.
The industry knows the stakes. Research groups at major labs are investing in retrieval, verification, and reasoning strategies—from Constitutional AI to tool-use orchestration—to curb hallucinations rather than merely apologize for them. Until those advances are standard, the smartest prompt is also the simplest: show your work, cite your sources, and when in doubt, say you don’t know.
