Google’s New AI System Outperforms Physicians in Complex Diagnoses
Published in Nature, Google's new paper advances the future of AI-powered medicine: more automation, lower costs, and a lighter load for doctors so they can attend to the harder cases.

AI’s entry into medicine isn’t entirely new; algorithms (including many AI-based ones) have been aiding clinicians and researchers in tasks such as image analysis for years. More recently we have seen anecdotal and also some documented evidence that AI systems, particularly Large Language Models (LLMs), can assist doctors in their diagnoses, with some claims of comparable accuracy. This case is different, though, because the new work from Google Research introduces an LLM specifically trained on datasets that relate clinical observations to diagnoses. While this is only a starting point and many challenges and considerations lie ahead, as I will discuss, the fact is clear: a powerful new AI-powered player is entering the arena of medical diagnosis, and we had better get prepared for it. In this article I will focus mainly on how this new system works, calling out along the way various considerations that arise, some discussed in Google’s paper in Nature and others debated in the relevant communities: medical doctors, insurance companies, policy makers, and so on.
Meet Google’s Superb New AI System for Medical Diagnosis
The advent of sophisticated LLMs, which as you surely know are AI systems trained on vast datasets to “understand” and generate human-like text, represents a substantial shift of gears in how we process, analyze, condense, and generate information (at the end of this article I list some other posts related to all of this — go check them out!). The latest models in particular bring a new capability: engaging in nuanced, text-based reasoning and conversation, making them potential partners in complex cognitive tasks like diagnosis. In fact, the new work from Google that I discuss here is “just” one more point in a rapidly growing field exploring how these advanced AI tools can understand and contribute to clinical workflows.
The study we are looking into here was published in peer-reviewed form in the prestigious journal Nature, sending ripples through the medical community. In their article “Towards accurate differential diagnosis with large language models”, Google Research presents a specialized LLM called AMIE (short for Articulate Medical Intelligence Explorer), trained specifically on clinical data with the goal of assisting medical diagnosis or even running fully autonomously. The authors of the study tested AMIE’s ability to generate a list of possible diagnoses — what doctors call a “differential diagnosis” — for hundreds of complex, real-world medical cases published as challenging case reports.
Here’s the paper with full technical details:
https://www.nature.com/articles/s41586-025-08869-4
The Surprising Results
The findings were striking. When AMIE worked alone, just analyzing the text of the case reports, its diagnostic accuracy was significantly higher than that of experienced physicians working without assistance! AMIE included the correct diagnosis in its top-10 list almost 60% of the time, compared to about 34% for the unassisted doctors.
Very intriguingly, and in favor of the AI system, AMIE alone slightly outperformed doctors who were assisted by AMIE itself! While doctors using AMIE improved their accuracy significantly compared to using standard tools like Google searches (reaching over 51% accuracy), the AI on its own still edged them out slightly on this specific metric for these challenging cases.
Another “point of awe” I find is that in this study comparing AMIE to human experts, the AI system only analyzed the text-based descriptions from the case reports used to test it. The human clinicians, however, had access to the full reports, that is, the same text descriptions available to AMIE plus images (like X-rays or pathology slides) and tables (like lab results). The fact that AMIE outperformed unassisted clinicians even without this multimodal information is on the one hand remarkable, and on the other underscores an obvious area for future development: integrating and reasoning over multiple data types (text, imaging, possibly also raw genomics and sensor data) is a key frontier for medical AI to truly mirror comprehensive clinical assessment.
AMIE as a Super-Specialized LLM
So, how does an AI like AMIE achieve such impressive results, performing better than human experts, some of whom have spent years diagnosing diseases?
At its core, AMIE builds upon the foundational technology of LLMs, similar to models like GPT-4 or Google’s own Gemini. However, AMIE isn’t just a general-purpose chatbot with medical knowledge layered on top. It was specifically optimized for clinical diagnostic reasoning. As described in more detail in the Nature paper, this involved the following (a toy, purely illustrative sketch follows the list):
- Specialized training data: Fine-tuning the base LLM on a massive corpus of medical literature that includes diagnoses.
- Instruction tuning: Training the model to follow specific instructions related to generating differential diagnoses, explaining its reasoning, and interacting helpfully within a clinical context.
- Reinforcement Learning from Human Feedback: Potentially using feedback from clinicians to further refine the model’s responses for accuracy, safety, and helpfulness.
- Reasoning Enhancement: Techniques designed to improve the model’s ability to logically connect symptoms, history, and potential conditions; similar to those used during the reasoning steps in very powerful models such as Google’s own Gemini 2.5 Pro!
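
To give a flavor of what the first two ingredients above might look like in practice, here is a minimal, purely illustrative sketch of instruction-style fine-tuning on case-to-diagnosis pairs. This is not Google’s pipeline: the base model (a tiny GPT-2 placeholder), the prompt template, and the toy training pair are all my own assumptions.

```python
# Minimal illustrative sketch of instruction-style fine-tuning on
# case -> differential-diagnosis pairs. NOT Google's actual pipeline:
# model choice, prompt template, and data are placeholders/assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

base_model = "gpt2"  # tiny placeholder; AMIE builds on a much larger LLM
tokenizer = AutoTokenizer.from_pretrained(base_model)
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained(base_model)

# Hypothetical training pair: clinical findings -> ranked differential diagnosis
examples = [{
    "case": "54-year-old with fever, weight loss, and a new heart murmur.",
    "ddx": "1. Infective endocarditis\n2. Atrial myxoma\n3. Acute rheumatic fever",
}]

def format_example(ex):
    # Instruction-style prompt; AMIE's real template is not public.
    return ("Instruction: provide a ranked differential diagnosis.\n"
            f"Case: {ex['case']}\nAnswer:\n{ex['ddx']}{tokenizer.eos_token}")

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
model.train()
for ex in examples:  # in practice: many passes over a large curated corpus
    batch = tokenizer(format_example(ex), return_tensors="pt", truncation=True)
    # Standard causal-LM objective: labels are the input ids themselves
    loss = model(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

The real system, of course, starts from a far more capable base model and layers instruction tuning, feedback, and reasoning enhancements on top of this basic recipe, as the paper describes.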
Note that the paper itself indicates that AMIE outperformed GPT-4 on automated evaluations for this task, highlighting the benefits of domain-specific optimization. On the negative side, however, the paper does not compare AMIE’s performance against other recent general-purpose LLMs, not even Google’s own “smart” models like Gemini 2.5 Pro. That’s quite disappointing, and I can’t understand how the reviewers of this paper overlooked it!
Importantly, AMIE’s implementation is designed to support interactive usage, so that clinicians can ask it questions to probe its reasoning — a key difference from conventional diagnostic systems.
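
Just to make that interaction pattern concrete, here is a minimal sketch of what such probing could look like as a multi-turn exchange. Everything here is an assumption of mine: `query_llm` is a hypothetical stand-in for whatever interface a deployed system would expose (AMIE’s is not publicly available), and the case and follow-up question are invented.

```python
# Sketch of clinician-in-the-loop probing of a diagnostic LLM.
# `query_llm` is a hypothetical stand-in, NOT AMIE's real (unpublished) interface.
def query_llm(messages):
    # Placeholder: a real implementation would call a clinical LLM here.
    # We return a canned string so the structure of the exchange is runnable.
    return "(model's reply would appear here)"

history = [
    {"role": "system",
     "content": "You are a diagnostic assistant. Give a ranked differential "
                "diagnosis and justify each item."},
    {"role": "user",
     "content": "62-year-old with progressive dyspnea, bilateral crackles, "
                "and finger clubbing."},
]
initial_ddx = query_llm(history)                 # first differential
history.append({"role": "assistant", "content": initial_ddx})

# The clinician can then probe the model's reasoning with follow-up questions:
history.append({"role": "user",
                "content": "Why rank idiopathic pulmonary fibrosis above heart failure?"})
explanation = query_llm(history)
```

The point is simply that the full conversation history travels with each query, so the model’s answers can be interrogated and refined rather than taken as a one-shot verdict.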
Measuring Performance
Measuring the performance and accuracy of the produced diagnoses isn’t trivial, and it should be interesting to you as a reader with a Data Science mindset. In their work, the researchers didn’t just assess AMIE in isolation; rather, they employed a randomized controlled setup whereby AMIE was compared against unassisted clinicians, clinicians assisted by standard search tools (like Google, PubMed, etc.), and clinicians assisted by AMIE itself (who could also use search tools, though they did so less often).
The analysis of the data produced in the study involved multiple metrics beyond simple accuracy, most notably the top-n accuracy (which asks: was the correct diagnosis in the top 1, 3, 5, or 10?), quality scores (how close was the list to the final diagnosis?), appropriateness, and comprehensiveness — the latter two rated by independent specialist physicians blinded to the source of the diagnostic lists.
This broad evaluation provides a more robust picture than a single accuracy number, and the comparison against both unassisted performance and standard tools helps quantify the actual added value of the AI.
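
For readers who want to map this to code, here is a small sketch of the top-n accuracy idea, assuming each case comes with a ranked list of candidate diagnoses and a single reference diagnosis; simple string matching stands in for the expert judgement actually used in the study.

```python
# Toy top-n accuracy: fraction of cases whose reference diagnosis appears
# among the first n predictions. String matching replaces the expert grading
# used in the actual study, so this is only a schematic illustration.
def top_n_accuracy(ranked_predictions, references, n_values=(1, 3, 5, 10)):
    results = {}
    for n in n_values:
        hits = sum(
            ref.lower() in (p.lower() for p in preds[:n])
            for preds, ref in zip(ranked_predictions, references)
        )
        results[f"top-{n}"] = hits / len(references)
    return results

# Two hypothetical cases with their ranked differentials and true diagnoses:
preds = [["sarcoidosis", "tuberculosis", "lymphoma"],
         ["lupus", "dermatomyositis", "polymyositis"]]
truth = ["lymphoma", "dermatomyositis"]
print(top_n_accuracy(preds, truth, n_values=(1, 3)))  # {'top-1': 0.0, 'top-3': 1.0}
```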
Why Does AI Do so Well at Diagnosis?
Like other specialized medical AIs, AMIE was trained on vast amounts of medical literature, case studies, and clinical data. These systems can process complex information, identify patterns, and recall obscure conditions far faster and more comprehensively than a human brain juggling countless other tasks. AMIE, in particular, was specifically optimized for the kind of reasoning doctors use when diagnosing, akin to other reasoning models but in this case specialized for diagnosis.
For the particularly tough “diagnostic puzzles” used in the study (sourced from the prestigious New England Journal of Medicine), AMIE’s ability to sift through possibilities without human biases might give it an edge. As an observer noted in the vast discussion that this paper triggered over social media, it is impressive that AI excelled not just on simple cases, but also on some quite challenging ones.
AI Alone vs. AI + Doctor
The finding that AMIE alone slightly outperformed the AMIE-assisted human experts is puzzling. Logically, adding a skilled doctor’s judgment to a powerful AI should yield the best results (as previous studies have in fact shown). And indeed, doctors with AMIE did significantly better than doctors without it, producing more comprehensive and accurate diagnostic lists. Yet AMIE alone still worked slightly better than doctors assisted by it.
Why the slight edge for the AI alone in this study? As some medical experts highlighted on social media, this small difference probably doesn’t mean that doctors make the AI worse, or the other way around. Instead, it likely suggests that, not being familiar with the system, the doctors haven’t yet figured out the best way to collaborate with AI systems that possess more raw analytical power than humans for specific tasks and goals, much as many of us have not yet learned to interact optimally with a regular LLM when we need its help.
Again paralleling how we interact with regular LLMs, it might well be that doctors initially stick too closely to their own ideas (an “anchoring bias”) or that they do not yet know how best to “interrogate” the AI to get the most useful insights. It’s all a new kind of teamwork we need to learn — human with machine.
Hold On — Is AI Replacing Doctors Tomorrow?
Absolutely not, of course. And it is crucial to understand the limitations:
- Diagnostic “puzzles” vs. real patients: The study presenting AMIE used written case reports, that is, condensed, pre-packaged information, very different from the raw inputs doctors gather during their interactions with patients. Real medicine involves talking to patients, understanding their history, performing physical exams, interpreting non-verbal cues, building trust, and managing ongoing care — things AI cannot do, at least yet. Medicine also involves human connection, empathy, and navigating uncertainty, not just processing data. Think for example of placebo effects, phantom pain, physical examinations, etc.
- AI isn’t perfect: LLMs can still make mistakes or “hallucinate” information, a major problem. So even if AMIE were to be deployed (which it won’t be!), it would need very close oversight from skilled professionals.
- This is just one specific task: Generating a diagnostic list is only one part of a doctor’s job. The rest of a visit has many other components and stages, none of them handled by such a specialized system, and potentially very difficult for an AI to take on, for the reasons discussed above.
Back-to-Back: Towards conversational diagnostic artificial intelligence
Even more surprisingly, in the same issue of Nature and right after the article on AMIE, Google Research published another paper showing that in diagnostic conversations (that is, not just the analysis of symptoms but actual dialogue between the patient and either the doctor or AMIE) the model ALSO outperforms physicians! So while the first paper found that AMIE produces objectively better diagnoses, the second shows that the AI system also communicates the results to the patient better, in terms of both quality and empathy!
And not by a small margin: in 159 simulated cases, specialist physicians rated the AI superior to primary care physicians on 30 out of 32 metrics, while test patients preferred AMIE on 25 of 26 measures.
This second paper is here:
https://www.nature.com/articles/s41586-025-08866-7
Seriously: Medical Associations Need to Pay Attention NOW
Despite the many limitations, this study and others like it are a loud wake-up call. Specialized AI is rapidly evolving and demonstrating capabilities that can augment, and in some narrow tasks even surpass, human experts.
Medical associations, licensing boards, educational institutions, policy makers, insurers, and indeed anyone who might one day be the subject of an AI-based health assessment, need to get acquainted with this, and the topic must be placed high on the agenda of governments.
AI tools like AMIE, and those to come, could help doctors diagnose complex conditions faster and more accurately, potentially improving patient outcomes, especially in areas lacking specialist expertise. They might also help to quickly assess and discharge healthy or low-risk patients, reducing the burden on doctors who must evaluate more serious cases. All of this could improve the chances of solving health issues for patients with more complex problems, while also lowering costs and waiting times.
Like in many other fields, the role of the physician will evolve, sooner or later, thanks to AI. Perhaps AI could handle more of the initial diagnostic heavy lifting, freeing up doctors for patient interaction, complex decision-making, and treatment planning — potentially also easing burnout from excessive paperwork and rushed appointments, as some hope. As someone noted in the social media discussions of this paper, not every doctor finds it pleasant to see four or more patients an hour while doing all the associated paperwork.
In order to move forward with the imminent application of systems like AMIE, we need guidelines. How should these tools be integrated safely and ethically? How do we ensure patient safety and avoid over-reliance? Who is responsible when an AI-assisted diagnosis is wrong? Nobody has clear, consensus answers to these questions yet.
Doctors, of course, will then need to be trained to use these tools effectively, understanding their strengths and weaknesses and learning what will essentially be a new form of human-AI collaboration. This development will have to happen with medical professionals on board, not by imposing it on them.
Last, a question that always comes back to the table: how do we ensure these powerful tools don’t worsen existing health disparities but instead help bridge gaps in access to expertise?
Conclusion
The goal isn’t to replace doctors but to empower them. Clearly, AI systems like AMIE offer incredible potential as highly knowledgeable assistants, in everyday medicine and especially in complex settings such as disaster areas, pandemics, or remote and isolated places like ships at sea, spacecraft, or extraterrestrial colonies. But realizing that potential safely and effectively requires the medical community to engage proactively, critically, and urgently with this rapidly advancing technology. The future of diagnosis is likely AI-collaborative, so we need to start figuring out the rules of engagement today.
References
The article presenting AMIE:
Towards accurate differential diagnosis with large language models
And here are the results of AMIE’s evaluation by test patients:
Towards conversational diagnostic artificial intelligence
And here are some other posts of mine that you might enjoy:
- Powerful Data Analysis and Plotting via Natural Language Requests by Giving LLMs Access to Libraries