
Gemini News Summaries Found to Be Most Trouble-prone, Study Shows

By Gregory Zuckerman
Last updated: October 26, 2025, 1:14 pm
Technology | 6 Min Read

A new report led by the European Broadcasting Union, building on earlier work by the BBC, found that among several major AI systems tasked with summarizing news, Google’s Gemini produced the most problematic outputs. Evaluators flagged the largest share of significant errors and the weakest sourcing behavior in Gemini’s responses compared with peers such as OpenAI’s ChatGPT, Microsoft’s Copilot and Perplexity. The stakes are rising as younger audiences in particular turn to AI assistants for headlines and context, with 15 percent of under-25s already using such tools to get the news.

How the Study Tested AI News Summaries in Depth

The BBC began with a wide-reaching survey and six focus groups to understand how people engage with, and feel about, AI in relation to news. The EBU then extended the initiative across markets, benchmarking leading systems on a range of news prompts. Reviewers assessed factual accuracy, context, quote integrity and attribution practices, including whether the tools clearly linked to high-quality sources and weighted them appropriately.


No model aced the test. Most sat in the same performance band and made routine, mostly manageable errors. But Gemini stood out as an outlier in both overall issue count and share of significant issues: mistakes that would materially mislead a reader or introduce a fabrication into the record.

Where Gemini Most Consistently Came Up Short

Researchers noted a consistent pattern: thin or missing links to original material, difficulty distinguishing reputable outlets from satirical or low-credibility content, and heavy reliance on secondary aggregators such as Wikipedia. Context-setting on complex, fast-moving stories often fell short, with summaries that omitted key timelines or players.

Quotations were another weak spot. Reviewers recorded summaries that botched direct quotes, clipped key phrases and attributed statements to the wrong person. In a news environment, the stakes of such mistakes are high: a bad quote or misattributed source can reverse the meaning of a story, perpetuate misinformation and undermine trust.

Audiences Trust AI More Than They Should

“Public expectations are complex,” the audience research found. In the UK sample, 42 percent of adults said they trust AI to be accurate at least to some extent, with trust more prevalent among younger people. At the same time, 84 percent said factual errors would seriously damage their trust. The catch: reviewers found that most AI responses contained at least one mistake, suggesting many people overlook or skim past errors in tidy summaries.

That gap between confidence and detection is particularly dangerous during breaking news, when facts shift quickly and sourcing matters most. Without clear citations and click-throughs to primary reporting, users have no easy way to audit claims, or to learn when the AI has gotten it wrong.


Gains Noted, But A Long-Standing Gap In Performance

The study covered two main collection periods roughly six months apart, capturing a moving target as models and retrieval pipelines evolved. All systems improved, and Gemini, in its defense, posted one of the largest accuracy gains. Even so, reviewers concluded that it still lagged its peers on the most serious issues.

Why the gap? “News summarization is a punishing technical cocktail: reliable retrieval, robust source ranking, explicit citations, careful quote handling all have to flow in lockstep,” the team wrote. If a model hallucinates context, overweights tertiary sources or clips quotes without preserving meaning, the result reads well but cannot be trusted. That’s as much a workflow problem as a model problem.

What It Means For Platforms And Publishers

For platforms, the message is that transparency trumps polish. Summaries should surface prominent one-click links to the leading reporting on a topic, cite document titles or descriptions, and direct readers to original sources rather than wikis or unsourced aggregators. Guardrails that recognize satire and low-credibility domains, and flag uncertain claims as such, could cut down the highest-stakes mistakes.

For publishers, the findings illustrate why provenance and good metadata matter. Efforts around content authenticity, in which newsrooms adopt verifiable provenance frameworks as part of their workflow, help AI systems ground outputs in the reporting that deserves credit. And for everyday readers, the cautious posture remains the same: treat AI summaries as a jumping-off point, follow the links and let the original journalism have the last word.

Taken together, the EBU’s report and the BBC’s audience research deliver a blunt verdict: AI can be useful for summarizing news, but accuracy, attribution and accountability ultimately decide whether a summary informs or misinforms. For now, Gemini has the most ground to make up.

By Gregory Zuckerman
Gregory Zuckerman is a veteran investigative journalist and financial writer with decades of experience covering global markets, investment strategies, and the business personalities shaping them. His writing blends deep reporting with narrative storytelling to uncover the hidden forces behind financial trends and innovations. Over the years, Gregory’s work has earned industry recognition for bringing clarity to complex financial topics, and he continues to focus on long-form journalism that explores hedge funds, private equity, and high-stakes investing.
FindArticles © 2025. All Rights Reserved.