Half of AI health advice is wrong, and all of it sounds right


In brief

  • Nearly half of AI chatbot responses to health questions were rated “somewhat problematic” or “highly problematic” in a BMJ Open review of five major chatbots.
  • Grok produced significantly more “highly problematic” responses than statistically expected, while nutrition and athletic performance questions fared worst across all models.
  • No chatbot produced a completely accurate reference list.

Nearly half of the health and medical answers provided by today’s most popular AI chatbots are wrong, misleading, or seriously incomplete, and they are delivered with complete confidence. That’s the headline finding of a new peer-reviewed study published April 14 in BMJ Open.

Researchers from the University of California, the University of Alberta, and Wake Forest tested five chatbots – Gemini, DeepSeek, Meta AI, ChatGPT, and Grok – on 250 health questions covering cancer, vaccines, stem cells, nutrition, and athletic performance. The results: 49.6% of answers were problematic. Thirty percent were “somewhat problematic,” and 19.6% were “highly problematic” – the kind of answer that could plausibly lead to ineffective or dangerous treatment.

To stress-test the models, the team took an adversarial approach, intentionally crafting questions to nudge the chatbots toward bad advice. Questions included whether 5G causes cancer, which alternative treatments work better than chemotherapy, and how much raw milk to drink for health benefits.

“By default, chatbots do not access real-time data, but instead generate output by inferring statistical patterns from their training data and predicting likely word sequences,” the authors wrote. “They do not consider or weigh evidence, and are unable to make moral or value-based judgments.”

That’s the basic problem. A chatbot isn’t consulting a doctor; it’s matching patterns in text. And pattern matching against the internet, where misinformation spreads faster than corrections, produces exactly these kinds of results.
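To make that concrete, here is a toy sketch of next-word prediction: a bigram model trained on a made-up three-line corpus. Real LLMs are vastly more sophisticated, but the training objective, predicting likely continuations, is the same, and frequency rather than truth drives the output:

```python
from collections import Counter, defaultdict
import random

# Hypothetical mini-corpus: a false claim appears twice, the correction once.
corpus = ("raw milk boosts immunity . raw milk boosts immunity . "
          "raw milk carries pathogens .").split()

# Count how often each word follows each other word (a bigram model).
transitions = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    transitions[prev][nxt] += 1

def predict(word: str) -> str:
    """Sample the next word in proportion to how often it followed `word`."""
    counts = transitions[word]
    return random.choices(list(counts), weights=list(counts.values()))[0]

# After "milk", the model says "boosts" about twice as often as "carries",
# because that is what the training text said most often, not because it is true.
print(Counter(predict("milk") for _ in range(1000)))
```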

“This behavioral limitation means that chatbots can produce responses that appear reliable but are potentially flawed,” the researchers continued. Of the 250 questions, only two were refused, both by Meta AI: one about anabolic steroids and one about alternative cancer treatments. Every other chatbot kept talking.

Performance varied by topic. Vaccine and cancer questions got the best results, in part because high-quality research on those topics is well organized and widely available online. Nutrition had the worst statistical performance of any category in the study, with athletic performance close behind. Ask an AI whether the carnivore diet is healthy, and the answer you get probably isn’t grounded in scientific consensus.

Grok stood out for the wrong reasons. Elon Musk’s chatbot was the worst performer of any model tested. Of its 50 responses, 29 (58%) were rated problematic overall – the highest share across all five chatbots. Fifteen of those (30%) were highly problematic, significantly more than would be expected by chance. The researchers link this directly to Grok’s training data: X is a platform known for spreading health misinformation quickly and widely.

The citations were a separate disaster. Across all models, the average completeness score for references was just 40%, and no chatbot produced a completely accurate reference list. Models hallucinated authors, journals, and titles. Even DeepSeek admitted as much: the model told researchers that its references were generated from training-data patterns and “may not correspond to actual, verifiable sources.”

The readability problem compounds everything else. All of the chatbots’ responses scored in the “difficult” range on the Flesch Reading Ease scale, which corresponds to a college sophomore-to-senior reading level. That is well above the American Medical Association’s recommendation that patient education materials not exceed a sixth-grade reading level.
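For reference, the Flesch score is plain arithmetic over average sentence length and syllables per word; 30 to 50 counts as “difficult,” and patient-facing text should score far higher. Below is a minimal sketch; the syllable counter is a crude vowel-group heuristic of ours, not part of the official formula:

```python
import re

def flesch_reading_ease(text: str) -> float:
    """Flesch Reading Ease:
    206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words).
    Roughly: 60-70 reads as plain English; 30-50 is 'difficult' (college);
    below 30 is very difficult (college graduate)."""
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    n_words = max(1, len(words))

    def syllables(word: str) -> int:
        # Crude heuristic: count vowel groups, drop one for a trailing silent 'e'.
        n = len(re.findall(r"[aeiouy]+", word.lower()))
        if word.lower().endswith("e") and n > 1:
            n -= 1
        return max(1, n)

    total_syllables = sum(syllables(w) for w in words)
    return 206.835 - 1.015 * (n_words / sentences) - 84.6 * (total_syllables / n_words)

print(round(flesch_reading_ease("The cat sat on the mat."), 1))  # easy: well above 100
print(round(flesch_reading_ease(
    "Pharmacokinetic interactions necessitate individualized dosing regimens."), 1))  # far below 30
```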

In other words, these chatbots pull the same trick politicians and professional debaters do: they bombard you with so many technical words in so little time that you come away thinking they know more than they do. The harder something is to understand, the easier it is to misinterpret.

The results echo a February 2026 Oxford study covered by Decrypt, which found that AI-powered medical advice is no better than traditional self-diagnosis methods. They also track broader concerns about AI chatbots giving inconsistent guidance depending on how questions are phrased.

“As the use of AI-powered chatbots continues to expand, our data highlight the need for public education, professional training, and regulatory oversight to ensure that generative AI supports rather than erodes public health,” the researchers concluded.

The study tested only five free chatbots, and the adversarial prompting method may overstate real-world failure rates. But the authors are frank: the problem isn’t fringe cases. The problem is that these models are widely deployed, used by non-experts as search engines, and, by design, configured to never say “I don’t know.”
