Artificial Intelligence (Large Language Models) and the Potential for Medical Education

From Clinfowiki
Revision as of 15:19, 1 May 2024 by Djkmd (Talk | contribs)



The rise of artificial intelligence, and more specifically the advent of Large Language Models (LLMs), has generated a great deal of excitement about their potential role in medical education. However, as much as this technology might enhance education, it also raises the prospect of overreliance and dependency, working against the original goal, and it remains at risk of delivering incomplete information or misinformation in the form of ‘hallucinations.’ Moreover, a model is only as good as the information it was trained on, raising important ethical, legal, and philosophical questions about autonomy, privacy, and academic integrity.


What is a Large Language Model?

A Large Language Model (or LLM) is a type of artificial intelligence algorithm trained to predict the likelihood of the next word in a sequence based on the context of the words that come before it. Trained on large amounts of text data, these models can generate novel sequences of words never previously observed by the model, based on learned patterns, that represent plausible natural human language.1,2 With immediate access to medical knowledge, research studies, and clinical guidelines, this technology has sparked a great deal of interest in medical education for offering personalized learning, comprehensive feedback, intelligent tutoring, and enhanced self-learning.3 Examples of commercial models being studied for potential application include ChatGPT by OpenAI, Gemini by Google, and Claude by Anthropic. These are examples of generative pre-trained transformer (GPT)-style LLMs: models designed to generate new text, video, images, and code based on the training data they have been exposed to.4
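As a rough illustration of the next-word-prediction objective described above, consider the toy bigram model below. It is not a real LLM (real models use transformer networks trained on billions of documents), and the miniature "corpus" is invented for this sketch, but the core idea is the same: estimate which word is most likely to follow a given context.

```python
from collections import Counter, defaultdict

# Toy training corpus (invented for illustration only).
corpus = (
    "the patient reports chest pain . "
    "the patient reports shortness of breath . "
    "the patient denies chest pain ."
).split()

# Count how often each word follows each preceding word.
following = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    following[prev][nxt] += 1

def predict_next(word):
    """Return the most likely next word after `word`, or None if unseen."""
    counts = following.get(word)
    return counts.most_common(1)[0][0] if counts else None

print(predict_next("patient"))  # 'reports' (seen twice, vs. 'denies' once)
print(predict_next("chest"))    # 'pain'
```

A model like this can only echo word pairs it has already seen; the scale and architecture of modern LLMs are what let them produce novel, fluent sequences rather than memorized fragments.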


Opportunities and Challenges for Large Language Models in Medical Education

[figure]

Figure from: Lucas HC, Upperman JS, Robinson JR. A systematic review of large language models and their implications in medical education. Med Educ. Published online April 19, 2024:medu.15402. doi:10.1111/medu.15402


Opportunities

Personalized Learning

LLMs are able to process natural language inputs and produce natural language responses, allowing for immediate and personalized feedback that enhances individual learning.5 Because learners are extremely variable, with widely different needs, these programs are able to flex, offering custom tutoring and individualized study plans to achieve learning objectives.4 Examples of how these models can create value include tailored explanations, interactive quizzes, and suggestions for additional references.4 In addition, because learning needs and goals vary by level (from medical student to seasoned attending) and by specialty (dermatology versus neurosurgery, for example), these models allow for domain adaptation, working effectively across disciplines and levels without extensive retraining.6


Synthesis of Information and Access to Medical Knowledge and Research

One of the most significant contributions of LLMs to medical education is their ability to enhance learning resources by providing immediate access to current medical knowledge, research studies, and clinical guidelines.3 As LLM models advance, a large number of studies have shown strong performance on standardized examinations of medical knowledge.3 For example, from GPT-3.5 to GPT-4, these models have progressively and consistently scored higher on the USMLE than their human counterparts.1,7–9 These findings have been reproduced across a wide range of generalized and specialized knowledge in many countries around the world, including the General Surgery Examination in South Korea, the European Exam in Core Cardiology, the German Medical Licensing Examination, the Polish Medical Final Examination, the Plastic Surgery and Orthopaedic In-Service Examinations, and the Canadian Head and Neck Surgery Examination.10–16 Though the relevance of examination performance to long-term clinical performance has been debated in medical education, standardized examination scores have been shown to correlate with success in knowledge expertise.17,18 Using this expertise, these models can generate comprehensive summaries, condensing vast amounts of information into digestible formats, and synthesize key findings across studies. If an efficient system can be developed that prioritizes and learns from highly cited articles, the models can adapt to the most up-to-date standard of care for the question at hand.6

Clinical Decision Support and Reasoning

For novice and early-stage learners especially, LLMs can help facilitate clinical reasoning and skills development by simulating clinical scenarios in a realistic yet risk-free setting.4 For example, these models can simulate realistic virtual patient encounters and generate interactive case studies with a virtual mentor providing real-time feedback.5 Learners can practice history-taking, diagnosis, and communicating treatment plans in a risk-free environment. This supports reinforcement learning, in which a student learns to make decisions by interacting with an environment and receiving feedback in the form of rewards or penalties, and extends it to more learners with fewer resources than expert observation and feedback usually require.6 Early studies also suggest this may help develop communication skills vital in diverse healthcare settings, helping learners craft high-quality and empathetic responses by suggesting helpful vocabulary and phrases.5 These models can also help seasoned professionals through continuing medical education (CME) initiatives, facilitating up-to-date information on medical advancements, treatment guidelines, and best practices.
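The reinforcement-learning idea mentioned above (learning from rewards and penalties through repeated interaction) can be sketched in a few lines. The clinical "environment" here is entirely hypothetical: for this invented scenario, the action 'order_ecg' is defined as correct and yields a reward of +1, while the others yield -1. This is a minimal epsilon-greedy value-learning loop, not how any actual LLM-based tutor is implemented.

```python
import random

# Hypothetical actions and rewards, invented for this sketch.
ACTIONS = ["order_ecg", "order_xray", "discharge"]
REWARDS = {"order_ecg": 1.0, "order_xray": -1.0, "discharge": -1.0}

def train(episodes=500, epsilon=0.1, alpha=0.1, seed=0):
    """Learn a value estimate for each action from trial and feedback."""
    rng = random.Random(seed)
    q = {a: 0.0 for a in ACTIONS}  # current value estimate per action
    for _ in range(episodes):
        # Epsilon-greedy: mostly exploit the best-known action,
        # occasionally explore a random one.
        if rng.random() < epsilon:
            action = rng.choice(ACTIONS)
        else:
            action = max(q, key=q.get)
        reward = REWARDS[action]                   # feedback ("mentor")
        q[action] += alpha * (reward - q[action])  # incremental update
    return q

q = train()
print(max(q, key=q.get))  # the learner converges on 'order_ecg'
```

The appeal in an educational setting is that this feedback loop is automated: a simulated patient or virtual mentor can supply the reward signal, where traditionally an expert observer would have to.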


Course and Assessment Creation, Writing Tool

For educators, LLMs can also be used to assist curriculum development.4 For example, these models can process real-time data and responses to gauge whether learners are absorbing the material and meeting learning objectives. They can analyze the literature on the most effective pedagogical strategies and incorporate innovative teaching strategies while gauging outcomes.5 This technology can also help teachers be more efficient and effective by streamlining curriculum planning and assessment and by helping allocate resources efficiently.3 In addition, it can learn across institutions, drawing upon a spectrum of professional and academic experience to help develop and write the most effective teaching materials and the most helpful feedback for learners.4


Challenges

Incorrect Responses or Hallucinations

Similar to challenges in other fields, LLMs struggle with ‘hallucinations’: the tendency to confidently and authoritatively deliver nonsensical, inaccurate, or blatantly untrue information.5 These models are only as effective as the information they were trained on and can produce definitive responses based on incomplete or biased training data.19 A major issue is that these programs struggle to differentiate between evidence-based and non-evidence-based sources, as well as to delineate the quality of the sources provided.5 Furthermore, medicine is constantly evolving, with practice patterns changing as new data and literature are produced. If a model is not updated in real time and its training data has a static cutoff, it may generate outputs inferior to conventional search engines.5 To be effective as an educational tool, these models must incorporate dynamic training, with continuous updating to incorporate new data and knowledge, reinforced by expert input.6 Similar to the Wikipedia model in its early days, any output will need to be verified and cross-checked for the foreseeable future.3 In addition, though these models perform well on standardized exams and demonstrate medical knowledge, the practice of medicine is often full of nuance, and most plans are not clearly binary. These models often struggle with complex scenarios that lack a clear solution, leaving them vulnerable to additional hallucinations.20 Furthermore, these models are dependent on the prompt they are given and may provide inconsistent responses across prompts, raising additional questions about the validity of their responses.5


Ethical, Legal, and Privacy Questions

The use of LLMs in medical education raises a series of ethical and legal questions and will likely generate many dilemmas that cannot be predicted at the current time. The data used to train these models are the subject of controversy and litigation well beyond the field of medical education.21 For example, the training data are drawn largely from English-language internet sources in high-income countries, which introduces inherent biases (e.g., by ethnicity and income) that the model will reflect in its responses.5 To produce clinical scenarios, there is an assumption that patient data may at some point have been used to train the model; this raises ethical and legal questions about the use of private health data, the role of consent, and the ability to verify data provenance.3,4,21 In addition, if learning and teaching draw on potentially inaccurate hallucinations, who is to be held responsible? The law in this area remains unclear, and responsibility remains an open question, since it will likely be difficult to hold LLMs accountable for their outputs given that these machines do not fulfill the obligations of human authors.2,5,21


Overreliance

For both educators and students, the prospect of generative technology reducing cognitive workload raises concerns about overreliance and excessive trust in the model.3,22 Taking the easier road may hinder problem solving, critical thinking, and the habit of questioning and verifying false, possibly hallucinated outputs.5 This overreliance raises the possibility that these tools work against the goal of enhancing education, instead diminishing clinical reasoning skills and appreciation for the nuance and complexity within medicine.3 Ultimately, the concern is that this dependency may lead to decreased care quality, compromised safety, and adverse patient outcomes.5


Academic Integrity

In tandem with the issues raised above, LLMs raise questions about preserving academic integrity. The ability to generate assignments and reports increases the potential for academic dishonesty and plagiarism.5 As in other areas of education, efforts are underway to detect and combat AI-generated content (for example, “watermarking”); however, this remains a major challenge.5,23 One complaint is the lack of transparency in how outputs are generated, which has intentionally remained hidden to preserve commercial interests and which makes it more difficult to ensure outputs are given appropriate attribution.5


Sources

1. Kung TH, Cheatham M, Medenilla A, et al. Performance of ChatGPT on USMLE: Potential for AI-assisted medical education using large language models. Dagan A, ed. PLOS Digit Health. 2023;2(2):e0000198. doi:10.1371/journal.pdig.0000198

2. Gao Y, Baptista-Hon DT, Zhang K. The inevitable transformation of medicine and research by large language models: The possibilities and pitfalls. MedComm – Future Med. 2023;2(2):e49. doi:10.1002/mef2.49

3. Lucas HC, Upperman JS, Robinson JR. A systematic review of large language models and their implications in medical education. Med Educ. Published online April 19, 2024:medu.15402. doi:10.1111/medu.15402

4. Abd-alrazaq A, AlSaad R, Alhuwail D, et al. Large Language Models in Medical Education: Opportunities, Challenges, and Future Directions. JMIR Med Educ. 2023;9:e48291. doi:10.2196/48291

5. Shorey S, Mattar C, Pereira TLB, Choolani M. A scoping review of ChatGPT’s role in healthcare education and research. Nurse Educ Today. 2024;135:106121. doi:10.1016/j.nedt.2024.106121

6. Karabacak M, Margetis K. Embracing Large Language Models for Medical Applications: Opportunities and Challenges. Cureus. Published online May 21, 2023. doi:10.7759/cureus.39305

7. Mihalache A, Huang RS, Popovic MM, Muni RH. ChatGPT-4: An assessment of an upgraded artificial intelligence chatbot in the United States Medical Licensing Examination. Med Teach. 2024;46(3):366-372. doi:10.1080/0142159X.2023.2249588

8. Brin D, Sorin V, Vaid A, et al. Comparing ChatGPT and GPT-4 performance in USMLE soft skill assessments. Sci Rep. 2023;13(1):16492. doi:10.1038/s41598-023-43436-9

9. Brin D, Sorin V, Konen E, Nadkarni G, Glicksberg BS, Klang E. How Large Language Models Perform on the United States Medical Licensing Examination: A Systematic Review. Published online September 3, 2023. doi:10.1101/2023.09.03.23294842

10. Meyer A, Riese J, Streichert T. Comparison of the Performance of GPT-3.5 and GPT-4 With That of Medical Students on the Written German Medical Licensing Examination: Observational Study. JMIR Med Educ. 2024;10:e50965. doi:10.2196/50965

11. Long C, Lowe K, Santos AD, et al. Evaluating ChatGPT-4 in Otolaryngology–Head and Neck Surgery Board Examination using the CVSA Model. Published online June 1, 2023. doi:10.1101/2023.05.30.23290758

12. Oh N, Choi GS, Lee WY. ChatGPT goes to the operating room: evaluating GPT-4 performance and its potential in surgical education and training in the era of large language models. Ann Surg Treat Res. 2023;104(5):269. doi:10.4174/astr.2023.104.5.269

13. Skalidis I, Cagnina A, Luangphiphat W, et al. ChatGPT takes on the European Exam in Core Cardiology: an artificial intelligence success story? Eur Heart J - Digit Health. Published online April 24, 2023:ztad029. doi:10.1093/ehjdh/ztad029

14. Humar P, Asaad M, Bengur FB, Nguyen V. ChatGPT Is Equivalent to First-Year Plastic Surgery Residents: Evaluation of ChatGPT on the Plastic Surgery In-Service Examination. Aesthet Surg J. 2023;43(12):NP1085-NP1089. doi:10.1093/asj/sjad130

15. Kung JE, Marshall C, Gauthier C, Gonzalez TA, Jackson JB. Evaluating ChatGPT Performance on the Orthopaedic In-Training Examination. JBJS Open Access. 2023;8(3). doi:10.2106/JBJS.OA.23.00056

16. Rosoł M, Gąsior JS, Łaba J, Korzeniewski K, Młyńczak M. Evaluation of the performance of GPT-3.5 and GPT-4 on the Polish Medical Final Examination. Sci Rep. 2023;13(1):20512. doi:10.1038/s41598-023-46995-z

17. Harfmann KL, Zirwas MJ. Can performance in medical school predict performance in residency? A compilation and review of correlative studies. J Am Acad Dermatol. 2011;65(5):1010-1022.e2. doi:10.1016/j.jaad.2010.07.034

18. Casey PM, Palmer BA, Thompson GB, et al. Predictors of medical school clerkship performance: a multispecialty longitudinal analysis of standardized examination scores and clinical assessments. BMC Med Educ. 2016;16(1):128. doi:10.1186/s12909-016-0652-y

19. Bair H, Norden J. Large Language Models and Their Implications on Medical Education. Acad Med. 2023;98(8):869-870. doi:10.1097/ACM.0000000000005265

20. Gunawardene AN, Schmuter G. Teaching the Limitations of Large Language Models in Medical School. J Surg Educ. 2024;81(5):625. doi:10.1016/j.jsurg.2024.01.008

21. Ong JCL, Chang SYH, William W, et al. Ethical and regulatory challenges of large language models in medicine. Lancet Digit Health. Published online April 2024:S258975002400061X. doi:10.1016/S2589-7500(24)00061-X

22. Komorowski M, Del Pilar Arias López M, Chang AC. How could ChatGPT impact my practice as an intensivist? An overview of potential applications, risks and limitations. Intensive Care Med. 2023;49(7):844-847. doi:10.1007/s00134-023-07096-7

23. Oravec JA. Artificial Intelligence Implications for Academic Cheating: Expanding the Dimensions of Responsible Human-AI Collaboration with ChatGPT and Bard.

Submitted by David Kim