xAI's Grok-4.20 just posted the highest scores ever recorded on MedQA and PubMedQA benchmarks, outperforming GPT-4o, Med-PaLM 2, and Claude Opus on medical question answering. The results reignite the debate about whether AI should play a larger role in healthcare, and whether benchmark performance translates to clinical reliability.
The Benchmark Results
Grok-4.20 scored 94.2% on MedQA (the US Medical Licensing Exam question set), up from 91.1% for the previous best model. On PubMedQA, which tests ability to answer questions based on published medical literature, it scored 82.7%. These scores exceed the performance of the average physician on equivalent tests.
The model also showed strong results on clinical reasoning tasks that require synthesizing patient history, lab values, and symptoms into differential diagnoses. xAI attributes the improvement to a specialized training pipeline that includes structured medical knowledge and case study datasets.
Why Benchmarks Are Not the Full Picture
Medical benchmarks test a specific skill: answering multiple-choice or short-answer questions based on textbook knowledge. Real clinical practice involves ambiguity, incomplete information, patient communication, and decisions where the “correct” answer depends on context that benchmarks cannot capture.
A model that scores 94% on MedQA can still hallucinate drug interactions, misinterpret symptoms in atypical presentations, or provide confident answers that are dangerously wrong. The gap between benchmark performance and real-world reliability remains significant in medicine.
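To make that gap concrete, here is a minimal sketch of the arithmetic behind the headline number. It only converts the accuracy figures quoted above into error rates; the 1,000-question batch is a hypothetical illustration, and the function names are made up for this example. It also assumes benchmark accuracy carries over unchanged to new questions, which is exactly the assumption this section calls into doubt, so treat the output as a best case rather than a prediction.

```python
def error_rate(accuracy_pct: float) -> float:
    """Convert a benchmark accuracy (in percent) into an error rate (as a fraction)."""
    return (100.0 - accuracy_pct) / 100.0


def expected_errors(accuracy_pct: float, n_questions: int) -> float:
    """Expected number of wrong answers, ASSUMING accuracy transfers unchanged."""
    return error_rate(accuracy_pct) * n_questions


def relative_error_reduction(old_pct: float, new_pct: float) -> float:
    """Fraction by which the error rate shrinks when accuracy rises from old to new."""
    old, new = error_rate(old_pct), error_rate(new_pct)
    return (old - new) / old


# Figures quoted in the article: 94.2% (Grok-4.20) vs. 91.1% (previous best) on MedQA.
# The batch size of 1000 is a hypothetical chosen only for illustration.
print(round(expected_errors(94.2, 1000)))              # 58 wrong answers per 1000
print(round(relative_error_reduction(91.1, 94.2), 2))  # 0.35, i.e. about a third fewer errors
```

Read this way, the jump from 91.1% to 94.2% removes roughly a third of the remaining errors, yet still leaves dozens of wrong answers per thousand questions even under the optimistic assumption.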
How Grok-4.20 Handles Health Questions
In practice, Grok-4.20, accessed through the X platform, provides detailed responses to health queries with citations to medical literature. It includes disclaimers about consulting healthcare professionals, but the responses themselves are thorough enough that users might skip that step. This is the core tension: making medical information more accessible while potentially encouraging people to self-diagnose.
Testing by medical professionals found that Grok-4.20 handles common conditions well and provides appropriate urgency signals for serious symptoms. It struggles with rare conditions and complex multi-system diseases, where clinical experience matters more than textbook knowledge.
The Regulatory Question
No general-purpose AI model, including Grok-4.20, is approved by the FDA as a diagnostic tool, so using one for medical advice sits in a regulatory gray zone. The technology outpaces regulation, and health AI companies are lobbying for clearer frameworks that would allow AI-assisted diagnosis under physician supervision.
Several hospital systems are already piloting AI triage tools that use models like Grok and Med-PaLM to prioritize emergency department patients. These clinical applications operate under institutional review board oversight rather than as consumer-facing deployments.
Should You Use AI for Health Questions?
As a starting point for understanding symptoms or researching conditions before a doctor visit, AI models like Grok-4.20 are genuinely useful. They provide faster, more detailed responses than generic web searches and can translate medical jargon into plain language.
As a replacement for professional medical advice, no. Benchmark scores measure knowledge, not judgment. And in healthcare, judgment is what keeps people alive.
