E-ISSN: 1019-5157 ISSN: 2651-5024
Beyond Human 'Eyes' in Neurosurgical Exams: Success of Artificial Intelligence (ChatGPT-4o, Grok, and Gemini) in the Image-Based Questions of Turkish Neurosurgical Society Proficiency Board Exams
Pages: 200-210

AIM: To evaluate the impact of generative artificial intelligence and large language models (LLMs) on medical training and neurosurgical education, specifically focusing on their emerging capabilities in image interpretation.

MATERIAL and METHODS: This study evaluated the performance of three major LLMs (ChatGPT-4o, Grok, and Gemini) on image-based neurosurgical proficiency board questions and compared their latest versions.

RESULTS: Real-life candidates answered 70.75% of the questions correctly, whereas the LLMs answered 47.38% correctly and were significantly outperformed by the candidates. Prompt selection significantly influenced the performance of GPT and Grok, but not Gemini. Matching and significantly outperforming the candidates was possible only by combining the best answers from all three LLMs across four runs.

CONCLUSION: Although previous research has demonstrated strong capabilities of LLMs in text-only questions, the results of the present study revealed that the image analysis abilities of these models need further improvement compared to actual candidates. Furthermore, the impact of prompt selection and repeated questioning should be emphasized, particularly when seeking correlation with real-life exam results.

Keywords: Artificial Intelligence, ChatGPT-4o, Education, Gemini, Grok