Research

Can Artificial Intelligence Evaluate Clinical Practice Guidelines Like Human Experts? A Comparative Study Using the AGREE II Instrument

Mehmet Melih Karaaslan, Pelin Kuzucu, Burak Karaaslan, Tolga Turkmen, Seyma Tastemur, Nadira Zahirovic, Alp Ozgun Borcek, Mesut Emre Yaman

Department of Neurosurgery, Gazi University; Department of Neurosurgery, Guven Hospital; Neurosurgery, Gazi Üniversitesi; Neurosurgery, Gazi University Faculty of Medicine

Article in Press

Corresponding Author: Mehmet Melih Karaaslan (mehmet.karaaslanmd@gmail.com)

Abstract

Aim
Clinical practice guidelines (CPGs) are widely used in neurosurgery to support evidence-based clinical decisions and to promote consistency in patient management. However, their methodological quality and internal consistency vary substantially across publications. In the present study, we evaluated CPGs addressing brain metastases and, for the first time, compared guideline assessments performed by neurosurgical experts with those generated by artificial intelligence (AI) models using the AGREE II instrument.

Material and Methods
A systematic literature search identified five CPGs addressing the use of stereotactic radiosurgery for brain metastases. Each guideline was independently assessed by four neurosurgical experts as well as by two artificial intelligence models (ChatGPT-4.0 and DeepSeek R1) using the AGREE II framework. Domain scores were expressed as percentages, and interrater reliability was examined with the intraclass correlation coefficient (ICC).
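As an illustration of how AGREE II domain scores are expressed as percentages, the sketch below applies the instrument's standard scaling: the summed item ratings for a domain are rescaled between the minimum and maximum totals possible for that number of items and appraisers. The function name and the example ratings are hypothetical, and the abstract does not state whether human and AI ratings were pooled or scaled separately.

```python
def scaled_domain_score(ratings, low=1, high=7):
    """Return the AGREE II scaled domain score as a percentage.

    ratings: one list per appraiser, each holding that appraiser's
    item scores (on the 1-7 AGREE II scale) for a single domain.
    """
    n_appraisers = len(ratings)
    if n_appraisers == 0:
        raise ValueError("at least one appraiser is required")
    n_items = len(ratings[0])
    if n_items == 0 or any(len(r) != n_items for r in ratings):
        raise ValueError("every appraiser must score the same items")
    if any(not (low <= x <= high) for r in ratings for x in r):
        raise ValueError("item scores must lie on the rating scale")

    obtained = sum(sum(r) for r in ratings)
    min_possible = low * n_items * n_appraisers
    max_possible = high * n_items * n_appraisers
    return 100.0 * (obtained - min_possible) / (max_possible - min_possible)


# Hypothetical example: four appraisers rating a three-item domain.
example = [[5, 6, 6], [6, 6, 7], [2, 4, 3], [3, 3, 2]]
print(round(scaled_domain_score(example), 1))
```

With these illustrative ratings the obtained total is 53, the minimum possible is 12, and the maximum possible is 84, giving (53 - 12) / (84 - 12), or about 57%.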

Results
The scoring patterns obtained from human reviewers and AI models were largely comparable. The highest ratings were recorded in the domains of Scope and Purpose and Clarity of Presentation, while Applicability consistently received the lowest scores. Statistical analysis revealed no significant differences between the assessments of the human experts and the AI models (p > 0.05). Interrater agreement ranged from moderate to excellent (ICC 0.491–0.908). In addition, AI models were less inclined to assign extreme scores, indicating a more conservative evaluation tendency.
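For readers unfamiliar with the intraclass correlation coefficient, the sketch below computes one common variant from a table of scores. The abstract does not say which ICC form was used, so the choice here (two-way random effects, absolute agreement, single rater, often written ICC(2,1)) is an assumption, as are the function name and the data layout.

```python
def icc_2_1(scores):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    scores[i][j] is the score given to target i (e.g. a guideline domain)
    by rater j. Every target must be scored by every rater.
    """
    n = len(scores)      # number of targets
    k = len(scores[0])   # number of raters
    if n < 2 or k < 2 or any(len(row) != k for row in scores):
        raise ValueError("need a complete table with >= 2 targets and raters")

    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(scores[i][j] for i in range(n)) / n for j in range(k)]

    # Two-way ANOVA sums of squares (one observation per cell).
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)                 # between-target mean square
    ms_cols = ss_cols / (k - 1)                 # between-rater mean square
    ms_error = ss_error / ((n - 1) * (k - 1))   # residual mean square

    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


# Hypothetical example: three targets scored identically by two raters.
print(icc_2_1([[1, 1], [2, 2], [3, 3]]))
```

Because this form measures absolute agreement, a systematic offset between raters lowers the coefficient even when their rankings are identical; consistency-type ICCs would not penalize that offset.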

Conclusion
AI-based evaluations demonstrated a level of performance comparable to that of human experts. These results indicate that AI could function as a supportive tool in the appraisal of clinical guidelines and may also have broader applications in clinical decision support and related medical tasks. Incorporating AI into guideline development processes may help improve efficiency, promote greater consistency, and enhance transparency within neurosurgical practice.

Keywords

Artificial Intelligence; Clinical Practice Guidelines as Topic; Brain Neoplasms/secondary; ChatGPT; AGREE II
