Google has pulled AI Overviews from some liver test searches after the feature began surfacing misleading guidance on what counts as “normal,” raising fresh concerns about the health risks of generative search summaries. The move follows reporting that the summaries gave blanket reference ranges for liver function tests without important clinical context, which experts said could delay care or falsely reassure patients.
What Triggered the Removal of Liver Test Overviews
An investigation by The Guardian found that AI Overviews were providing oversimplified reference ranges for common markers such as ALT, AST, ALP, and bilirubin without accounting for variables including age, sex, ethnicity, or the methodology used by the testing laboratory.

That missing context matters. Two people with the same number can sit on very different sides of the clinical line, depending on who they are, how the sample was collected, and which assay the lab uses.
Google said it had acted in line with longstanding policies and removed AI-generated summaries for the offending phrases, among them searches about the “normal range of liver blood tests.” But the restriction is narrow: change the question slightly and AI Overviews can still appear. Patient groups were quick to point out this limitation, with the British Liver Trust telling The BMJ that reworded queries still surfaced the same unhelpful summaries.
Why Liver Test Ranges Aren’t One Size Fits All
Most labs define “normal” using a 95% reference interval, the central range that excludes the highest and lowest 2.5% of values measured in samples from healthy populations. By construction, that means about 5% of healthy individuals will fall outside the range. Reference ranges also vary between labs, depending on the equipment used and the population studied, so a number that is normal by one lab’s method can look abnormal against a range copied from another source.
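To make that arithmetic concrete, here is a minimal Python sketch of how such an interval is derived: take the 2.5th and 97.5th percentiles of measurements from a healthy reference group, and by construction about 5% of healthy people land outside it. The values are made up for illustration, not real laboratory data.

```python
import numpy as np

# Hypothetical ALT values (U/L) drawn for a healthy reference population.
# Real labs use their own validated samples; these numbers are illustrative only.
rng = np.random.default_rng(0)
healthy_alt = np.clip(rng.normal(loc=22, scale=6, size=500), 5, None)

# A central 95% reference interval: the middle 95% of healthy values.
lower, upper = np.percentile(healthy_alt, [2.5, 97.5])
print(f"Reference interval: {lower:.1f}-{upper:.1f} U/L")

# By construction, roughly 5% of healthy people fall outside this range.
outside = ((healthy_alt < lower) | (healthy_alt > upper)).mean()
print(f"Healthy values outside the interval: {outside:.1%}")
```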
Demographics further complicate the picture. Hepatology guidance indicates that the upper limit of normal for alanine aminotransferase (ALT) is lower in women than in men, and that children’s values can differ widely from adults’. The American Association for the Study of Liver Diseases has reflected these differences, with widely cited upper limits roughly in the low 20s (U/L) for women and approximately 30 (U/L) for men, yet individual laboratories may set their own thresholds based on local validation data.
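As a hypothetical illustration of why a single blanket cutoff misleads, the sketch below applies sex-specific ALT limits in line with the figures cited above; the thresholds, function, and sample value are assumptions for illustration, not clinical advice.

```python
# Hypothetical sex-specific ALT upper limits of normal (U/L), roughly in line
# with the widely cited figures above; real labs set and validate their own.
ALT_UPPER_LIMIT = {"female": 25, "male": 30}

def flag_alt(value_u_per_l: float, sex: str) -> str:
    """Illustrative only, not clinical advice: compare an ALT value
    against a sex-specific upper limit of normal."""
    limit = ALT_UPPER_LIMIT[sex]
    return "above reference limit" if value_u_per_l > limit else "within reference limit"

# The same number lands on different sides of the line depending on sex.
print(flag_alt(27, "female"))  # above reference limit
print(flag_alt(27, "male"))    # within reference limit
```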
Clinical context is equally critical. Alkaline phosphatase can run higher in adolescents, especially during a growth spurt. A mildly elevated AST in an endurance athlete may simply reflect recent hard exercise. Conversely, “normal” values can occur in serious, even life-threatening disease, depending on its stage and on individual variation. This is why health care providers don’t interpret results in isolation; they weigh them against a patient’s history, symptoms, medications, imaging findings, and trends over time rather than any single static number.
The stakes are not abstract. Roughly 25% of people worldwide are affected by NAFLD, and the majority are asymptomatic. Overconfident AI summaries that collapse nuance can tip the balance from timely assessment toward false reassurance and delayed care.
Google’s Partial Rollback and Policy Tensions
Pulling AI Overviews for a handful of terms is a practical patch, but it exposes a larger problem: guardrails that rely on pattern-matching specific queries are fragile.

A small change in wording is all it takes to slip past the block and surface the same risky answer. That isn’t good enough for high-stakes subjects.
Google’s search quality guidelines have for years treated medical queries as “Your Money or Your Life” content, recognized as capable of causing real harm and therefore deserving heightened scrutiny of sources. Generative summaries complicate that calculus. The system has already stumbled publicly on relatively low-stakes topics, such as recommending glue on pizza; in medicine, the margin for error is nearly nonexistent.
Experts in clinical informatics argue that health answers should be retrieval-anchored to vetted sources and constrained to structured templates that carry caveats, units, and uncertainty. That means clearer model abstention when confidence is low, explicit prompts to check with a clinician, and stricter unit- and range-based validation. Organizations like the NHS, CDC, and NICE offer structured guidance that could serve as guardrails, if the system can resist oversimplifying it.
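A minimal sketch of what that abstention logic could look like is below. The data structure, allowlist, and confidence threshold are assumptions for illustration; nothing here reflects Google’s actual system.

```python
from dataclasses import dataclass

@dataclass
class HealthAnswer:
    """Assumed shape of a retrieval-anchored health summary; names are illustrative."""
    text: str
    source: str          # domain of the vetted publisher the answer is anchored to
    confidence: float    # retrieval/model confidence in [0, 1]
    has_units: bool
    has_caveats: bool

VETTED_SOURCES = {"nhs.uk", "cdc.gov", "nice.org.uk"}  # illustrative allowlist
CONFIDENCE_FLOOR = 0.8                                 # assumed abstention threshold

def render_or_abstain(answer: HealthAnswer) -> str:
    """Show a summary only if it is anchored to a vetted source, carries units
    and caveats, and clears the confidence floor; otherwise abstain."""
    if (answer.source in VETTED_SOURCES
            and answer.confidence >= CONFIDENCE_FLOOR
            and answer.has_units
            and answer.has_caveats):
        return answer.text + " Reference ranges vary by lab; review results with a clinician."
    return "No summary shown for this query. Please consult a clinician or a trusted medical source."
```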
What Searchers Can Do Now to Interpret Lab Results
If you have lab results, start with the reference interval printed on your report; it reflects how your lab runs the test. Look at trends over repeated tests rather than a single data point, and discuss findings with a clinician who knows your history. For general education, rely on credible sources such as national health services or specialty societies, and treat any AI-generated snippet as a starting point, not a diagnosis.
When searching, be skeptical of absolute statements about “normal” ranges that make no mention of age, sex, assay variation, or clinical context. If a summary doesn’t cite reputable institutions or glosses over caveats, that’s a signal to slow down and cross-check.
The Larger Test for AI in Search: Safety Versus Speed
This episode highlights a fundamental tension in AI search: the push to answer quickly versus the obligation, especially in medicine, to first do no harm. Until models can apply context responsibly and abstain when they should, systems will need stricter triggers, tighter constraints, and clearer uncertainty signals.
So far, Google’s targeted rollback is a tacit acknowledgment that health search is a proving ground the technology has not yet mastered. Regaining trust will take fewer overconfident shortcuts and more humility baked into the product.
