A recent study in Science suggests that LLMs can now outperform doctors at diagnosis. The research found that OpenAI’s o1 model correctly or nearly correctly identified diagnoses in 67% of cases at ER intake, versus roughly 50–55% for physicians.
The model was tasked with analyzing medical profiles, suggesting diagnoses, determining next steps, and estimating the likelihood of future changes in a patient’s health. On all of these tasks, it performed on par with or better than physicians.
On one task, o1 received a perfect clinical reasoning score in 98% of cases for how well it explained its diagnostic reasoning and proposed next steps, whereas physicians scored that well in only 35% of cases. This suggests the model may be more consistent at documenting and articulating medical logic, even under the time pressure of an ER.
Researchers used real patient cases, presenting information to the model in stages that mimic the patient journey: symptoms described at intake, physicians’ evaluations, and clinical decision-making.
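The staged protocol can be sketched roughly as follows. This is a hypothetical illustration, not the study’s actual code: the `query_model` stub, the stage names, and the sample case are all invented for the example, and a real experiment would replace the stub with an actual LLM API call.

```python
# Hypothetical sketch of a staged case-presentation protocol.
# query_model is a stand-in for a real LLM call (e.g., to o1);
# it is stubbed here so the example runs on its own.

def query_model(case_so_far: str) -> str:
    """Placeholder for an LLM call; returns a mock differential diagnosis."""
    return f"Differential based on {len(case_so_far.split())} words of context"

# Each case is revealed cumulatively, mirroring the patient journey.
# The stage names and clinical details below are invented examples.
stages = [
    ("intake", "58-year-old with acute chest pain and diaphoresis"),
    ("physician_eval", "ECG shows ST elevation in leads II, III, aVF"),
    ("decision_making", "Troponin elevated; cath lab activated"),
]

def run_staged_case(stages):
    """Query the model after each stage, with all prior context included."""
    context, responses = "", []
    for name, info in stages:
        context = (context + " " + info).strip()  # accumulate the record
        responses.append((name, query_model(context)))
    return responses

results = run_staged_case(stages)
for stage, answer in results:
    print(stage, "->", answer)
```

The key design point is the accumulating `context`: the model’s answer at intake is based only on presenting symptoms, while later answers see the full record to that point, so its accuracy can be compared stage by stage against clinicians.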
At the earliest stage, when patients check into the ER, o1 identified an exact or close diagnosis 67% of the time, outperforming two physicians by more than 10 percentage points. The model continued to outperform physicians by 2 to 10 percentage points as patient care progressed.
OpenAI’s o1, released in late 2024, is already somewhat outdated in the fast-paced AI industry. Future models are expected to perform even better in similar tests.
The study focused on short ER stays and didn’t assess how the LLM would perform with more comprehensive information, so the results can’t be directly compared with other diagnostic settings. It also relied solely on written case data, with no imaging input.
The team is now conducting new experiments using longer-term, broader real-world data. Further research will clarify whether LLMs can enhance patient care across different clinical environments.