GPT-3.5-turbo versus GPT-4.0 for medical education

Authors

  • Dawei Gabriel YANG Department of Ophthalmology and Visual Sciences, The Chinese University of Hong Kong, Hong Kong, China
  • Hong Yu Ryan Fong The Chinese University of Hong Kong, Hong Kong, China
  • Sin Kiu Hui The Chinese University of Hong Kong, Hong Kong, China
  • Sum Yuet Ng The Chinese University of Hong Kong, Hong Kong, China
  • Chun Hei Chow The Chinese University of Hong Kong, Hong Kong, China
  • Yuk Ming Lai The Chinese University of Hong Kong, Hong Kong, China
  • Lok Hang To The Chinese University of Hong Kong, Hong Kong, China
  • Ruyue Shen Department of Ophthalmology and Visual Sciences, The Chinese University of Hong Kong, Hong Kong, China
  • Xiaoyan Hu Department of Ophthalmology and Visual Sciences, The Chinese University of Hong Kong, Hong Kong, China
  • Carol Cheung Department of Ophthalmology and Visual Sciences, The Chinese University of Hong Kong, Hong Kong, China

Keywords

Education, Medical; Generative artificial intelligence; Large language models

Abstract

Objectives: To compare the performances of GPT-3.5-turbo and GPT-4.0 for medical education.

Methods: The performances of GPT-3.5-turbo (accessed via Poe) and GPT-4.0 (accessed via Microsoft 365 Copilot) were compared across five tasks: medical terminology translation (283 medical terms from six specialties), situational judgment (30 situations in seven contexts), medical knowledge (80 multiple-choice questions and 80 clinical scenario questions), medical studies (10 case scenarios), and clinical communication (10 case scenarios).
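Because responses were gathered through chat interfaces rather than an API, grading amounts to comparing each recorded answer with a keyed answer and tallying accuracy per task. The following minimal Python sketch illustrates that tallying step only; the items, field names, and matching rule are hypothetical and are not the authors' data or pipeline.

# Illustrative sketch only: items and exact-match grading are hypothetical,
# not the study's materials or scoring procedure.
from dataclasses import dataclass

@dataclass
class GradedItem:
    prompt: str          # e.g. an MCQ stem or a term to translate
    keyed_answer: str    # reference answer
    model_answer: str    # answer recorded from the chat interface

def accuracy(items: list[GradedItem]) -> float:
    """Proportion of items whose recorded answer matches the keyed answer."""
    hits = sum(i.model_answer.strip().lower() == i.keyed_answer.strip().lower()
               for i in items)
    return hits / len(items)

# Hypothetical example: two multiple-choice items answered by one model.
demo = [
    GradedItem("Which cranial nerve innervates the superior oblique?", "CN IV", "CN IV"),
    GradedItem("Most common primary intraocular malignancy in adults?", "Uveal melanoma", "Retinoblastoma"),
]
print(f"Accuracy: {accuracy(demo):.1%}")   # -> Accuracy: 50.0%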

Results: GPT-4.0 outperformed GPT-3.5-turbo in accuracy on medical terminology translation (98.6% vs 91.5%), situational judgment (83.3% vs 63.3%), and medical knowledge (93.8% vs 85.0% on multiple-choice questions and 82.5% vs 72.5% on clinical scenario questions), as well as in medical studies (three of 10 case scenarios) and clinical communication (four of 10 case scenarios).

Conclusions: GPT-3.5-turbo and GPT-4.0 performed reasonably well on medical knowledge and medical studies, with GPT-4.0 performing slightly better in medical terminology translation, situational judgment, and clinical communication. These results are promising for the incorporation of large language models in medical education. Nonetheless, overreliance should be avoided as responses can be inaccurate or irrelevant.

Published

2025-05-06

How to Cite

YANG DG, Fong HYR, Hui SK, Ng SY, Chow CH, Lai YM, To LH, Shen R, Hu X, Cheung C. GPT-3.5-turbo versus GPT-4.0 for medical education. Hong Kong J Ophthalmol [Internet]. 2025 May 6 [cited 2025 May 13];29(1). Available from: https://hkjo.hk/index.php/hkjo/article/view/391

Section

Original Articles
