AI is entering healthcare at breakneck speed. Headlines celebrate above-human performance on medical exams, while tech giants showcase models capable of everything from clinical reasoning to patient counselling. Yet behind the promise lies a grim reality: ChatGPT and its peers are not doctors, dietitians, or therapists. They are probabilistic systems built to generate convincing outputs. Their foundational architecture was never designed to caution, challenge, or withhold when safety is at stake.
Unlike regulated medicines or devices, conversational AI models have been rushed to the public since 2022 with no mandatory safety trials, no disclosed risk audits, and no agreed evaluation benchmarks for harm prevention. Nearly three years on, hazardous errors and misinformation continue to slip through unchecked, often addressed only after they escalate into headline-making cases of serious harm or fatal outcomes.
When AI advice leads to hospitalisation (or worse)
Take the recent case of a man who asked ChatGPT for a sodium chloride substitute and received a recommendation for sodium bromide. Weeks later, he was hospitalised with bromism. When his clinicians repeated the query, ChatGPT again suggested bromide, without any toxicity or harmful-use warning. Omissions like this show how GenAI can mislead users toward decisions that threaten (rather than protect) health.
ChatGPT is no nutrition encyclopaedia either. It failed at allergen-free meal planning, recommending almond milk in nut-free diets and so risking anaphylaxis. ChatGPT, Gemini, Claude, and Copilot have also scored lower than dietitians on accuracy, completeness, reproducibility, and consistency. Most alarmingly, when researchers prompted ChatGPT to analyse 217 retracted or problematic papers, not once did it flag a retraction; it sometimes even praised them as ‘world-leading’. Across 58,800 medical queries, 50–90% of responses were unsupported by, or contradicted, the very sources the models cited.
As conversations lengthen, risks deepen. OpenAI now faces a wrongful-death lawsuit after a teenage boy died by suicide following extended interactions with ChatGPT. Instead of redirecting the 16-year-old to crisis resources, the system allegedly reinforced suicidal ideation. Around the same time, a study emulating 13-year-old users found that over half of ChatGPT’s responses to 1,200 prompts were harmful. It advised on self-harm methods within two minutes, listed medications for overdose within 40 minutes, and generated a suicide plan with farewell letters within 65 minutes.
These recurring lapses expose a central risk: chatbots give the answers we seek but not always the ones we need (the precautionary ones). This compels us to confront a pressing issue: Why are safety guardrails still so poor?
The ‘everything chatbot’ problem
ChatGPT, Gemini, and Claude are viewed as ‘everything chatbots’. Their strength lies in broad fluency: they can generate convincing responses on nearly any topic, from medical advice to holiday planning. Yet this versatility is also their weakness. Built to please rather than to pause, they provide answers even when caution or refusal would be safer. Hazardous outputs are not anomalous but anticipated, particularly in health.
By contrast, purpose-built, domain-specific models can integrate safeguards, limit scope, and align outputs with clinical standards. Examples include healthcare LLMs tuned to summarise patient notes for clinician review, such as Google’s MedLM (now deprecated) with added safety layers, harmful-content filters, and ‘model cards’ documenting vulnerabilities. These narrower systems prioritise risk management guardrails.
The challenge, then, is structural. The ‘everything chatbot’ maximises reach and appeal, but curtails safety. Conversely, domain-specific approaches (though less generalisable) offer a path to AI that is auditable, accountable, and closer to health governance. Still, general-purpose AI dominates public use for health queries, making its evaluation urgent.
Performance without safety isn’t true progress
Often trumpeted in non-peer-reviewed venues such as arXiv, claims about general-purpose AI models typically rest on how well they answer standardised multiple-choice medical tests. But these metrics measure test-taking ability, not whether systems can deliver safe advice in real-world clinical settings. They don’t capture whether AI can flag precautions, indicate perilous omissions, or resist dangerous prompts. Nor do they simulate the ambiguity, uncertainty, and ethical dilemmas of clinical practice.
Microsoft’s recent diagnostic study, which used more than 300 interactive case challenges rather than static questions, introduced new ways to evaluate accuracy. Even so, the company admitted that safety testing is still pending. Claims that AI ‘exceeds’ clinician capabilities therefore remain premature. Beating examinations is one thing; keeping patients safe is another.
In the race for accuracy bragging rights, safety performance barely made it to the starting line. We have pharmacovigilance for drugs, crash testing for cars, and safety standards for food. Why not equivalent frameworks for AI systems that generate health advice? The public is already calling for them: in a survey by the Alan Turing Institute, 88% of UK respondents said the government should halt unsafe AI products, and three-quarters want regulators (not companies) in charge of oversight. Independent, enforceable standards are as much a policy imperative as they are a public trust expectation.
Early steps are underway. The EU AI Act now mandates risk management systems for high-risk and general-purpose AI, while the UK is slowly drafting its own AI Bill. Whether these measures will make a tangible difference remains uncertain.
Towards safety benchmarks for GenAI in health
Google’s safety principles for medical summarisation provide a glimpse of hope. Published in Nature Medicine, this work identifies core hazards: hallucinations, missing information, harmful bias, misleading formatting, offensive terminology, and foreseeable misuse. The authors recommend risk management systems, adversarial testing, fairness evaluations, and constrained deployment. Crucially, they emphasise that model outputs must remain subject to professional oversight and context-specific risk mitigation by healthcare adopters. Safety-first is the right direction of travel.
In my research at UCL, I took this further through the development of the Misinformation Risk Assessment Model (MisRAM; under peer review). Designed and tested as a systematic framework for mitigation at scale, MisRAM applies hazard risk assessment principles to generative AI outputs in health. The model identifies, classifies, and scores four core risk traits: incompleteness, deceptiveness, potential for health harm, and inaccuracy. This enables automated safety evaluations that go beyond mere factuality. By treating generative outputs as stratified exposure risks, we can develop independent, auditable benchmarks for evaluating safety in health AI, which can be integrated into model design, fine-tuning, and governance.
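To make the general idea concrete (and only the general idea: MisRAM itself remains under peer review and its specification is not reproduced here), the sketch below shows how trait-based scoring and exposure-risk stratification of a single AI output might be structured. The four trait names come from the model described above; the 0–1 scales, equal weighting, and tier thresholds are purely illustrative assumptions, not MisRAM’s actual parameters.

```python
from dataclasses import dataclass

# Illustrative sketch only: trait names follow the four core risk traits
# described above, but the scales, equal weights, and thresholds below are
# hypothetical assumptions, not the MisRAM specification.

@dataclass
class RiskProfile:
    incompleteness: float   # 0 = fully complete, 1 = critical omissions
    deceptiveness: float    # 0 = transparent, 1 = actively misleading
    health_harm: float      # 0 = benign, 1 = severe potential for harm
    inaccuracy: float       # 0 = factually sound, 1 = factually wrong

    def composite(self) -> float:
        # Equal weighting is an assumption for illustration; a real
        # framework would calibrate and validate trait weights.
        traits = (self.incompleteness, self.deceptiveness,
                  self.health_harm, self.inaccuracy)
        return sum(traits) / len(traits)

    def stratum(self) -> str:
        # Hypothetical exposure-risk tiers for triaging outputs.
        score = self.composite()
        if score >= 0.7:
            return "high"      # block or escalate for expert review
        if score >= 0.4:
            return "moderate"  # flag with precautionary warnings
        return "low"           # release with routine monitoring


# Example: an answer that omits a toxicity warning scores high on
# incompleteness and potential for health harm.
profile = RiskProfile(incompleteness=0.8, deceptiveness=0.3,
                      health_harm=0.9, inaccuracy=0.5)
print(profile.composite(), profile.stratum())
```

The point of such a structure is that safety evaluation becomes auditable: each trait score, weight, and threshold can be inspected, challenged, and refined by domain experts rather than buried inside an opaque model.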
Preventing harm cannot be left to engineers alone. Domain experts must be part of AI design from the outset. Clinicians, misinformation scholars, ethicists, healthcare organisations, and patient representatives are critical. Without interdisciplinary input, models will continue to mistake toxic chemicals for table salt, allergens for safe substitutes, and retracted research for credible evidence.
Raising the safety bar
The stakes are too high to rely on glittery metrics. Until we establish safety benchmarks for AI in health, and enforce them with the rigour applied to medicines, devices, and public health, we risk more preventable harms, more hospitalisations, and perhaps more deaths.
Innovation is exciting. Without the right safeguards, it edges into recklessness.
Author
Alex Ruani is a health misinformation researcher at University College London, chief science educator at The Health Sciences Academy where she leads large-scale educational and publishing initiatives that have reached over 100,000 health professionals in 170+ countries, elected council member of the Royal Society of Medicine Food & Health Council Forum, honorary member of The True Health Initiative, and part of the World Health Organization’s (WHO) Fides network. Her work brings together digital governance, public health, and the ethical dimensions of innovation, with implications for healthcare systems, regulatory frameworks, and global health policy.
UCL Profile: https://profiles.ucl.ac.uk/63386-alex-ruani
ORCID Profile: https://orcid.org/0000-0002-8191-0166
LinkedIn: https://www.linkedin.com/in/alejandraruani/
Declaration of interests
I have read and understood the BMJ Group policy on declaration of interests and declare the following interests: none