A large study has warned of the risks associated with chatbots giving medical advice, due to their tendency to provide inaccurate and inconsistent information.
The study, out of Oxford University,1 found that while large language models (LLMs) now excel at standardised tests of medical knowledge, they present risks to real users seeking help with their own medical symptoms.
In the study, participants used LLMs to identify health conditions and decide on an appropriate course of action, such as seeing a doctor or going to the hospital, based on information provided in a series of specific medical scenarios.
The decisions made by people using LLMs were no better than those made by people who relied on traditional methods, such as online searches or their own judgement.
The study revealed a two-way communication breakdown. Participants often didn’t know what information the LLMs needed to offer accurate advice, and the responses they received frequently combined good and poor recommendations, making it difficult to identify the best course of action.
The study's lead medical practitioner, Dr Rebecca Payne, said: “Despite all the hype, AI (artificial intelligence) just isn’t ready to take on the role of the physician.”
“Patients need to be aware that asking a large language model about their symptoms can be dangerous, giving wrong diagnoses and failing to recognise when urgent help is needed,” she said.
“These findings highlight the difficulty of building AI systems that can genuinely support people in sensitive, high-stakes areas like health,” Dr Payne said.2
Senior author Associate Professor Adam Mahdi said the “disconnect between benchmark scores and real-world performance should be a wake-up call for AI developers and regulators”.
“We cannot rely on standardised tests alone to determine if these systems are safe for public use. Just as we require clinical trials for new medications, AI systems need rigorous testing with diverse, real users to understand their true capabilities in high-stakes settings like healthcare.”2
The study was published in Nature Medicine.1
References
- Bean AM, Payne RE, Mahdi A, et al. Reliability of LLMs as medical assistants for the general public: a randomized preregistered study. Nat Med. 2026. doi: 10.1038/s41591-025-04074-y.
- University of Oxford. New study warns of risks in AI chatbots giving medical advice [news]. 10 Feb 2026. Available at: ox.ac.uk/news/2026-02-10-new-study-warns-risks-ai-chatbots-giving-medical-advice [accessed Feb 2026].
