JYMS : Journal of Yeungnam Medical Science

Original article
Impact of artificial intelligence in managing musculoskeletal pathologies in physiatry: a qualitative observational study evaluating the potential use of ChatGPT versus Copilot for patient information and clinical advice on low back pain
Christophe Ah-Yan1, Ève Boissonnault2, Mathieu Boudier-Revéret2, Christopher Mares2

DOI: https://doi.org/10.12701/jyms.2024.01151
Published online: November 29, 2024

1Department of Physical Medicine and Rehabilitation, University of Montreal, Montreal, QC, Canada

2Department of Physical Medicine and Rehabilitation, Centre Hospitalier de l’Université de Montréal, Montreal, QC, Canada

Corresponding author: Christophe Ah-Yan, MD, PharmD, Department of Physical Medicine and Rehabilitation, Centre Hospitalier de l’Université de Montréal, 1051 Rue Sanguinet, Montreal, QC, Canada. Tel: +1-438-882-4732. E-mail: Christophe.ah.yan@umontreal.ca
• Received: October 8, 2024   • Revised: October 25, 2024   • Accepted: October 25, 2024

© 2024 Yeungnam University College of Medicine, Yeungnam University Institute of Medical Science

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (https://creativecommons.org/licenses/by-nc/4.0/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

  • Background
    The self-management of low back pain (LBP) through patient information interventions offers significant benefits in terms of cost, reduced work absenteeism, and overall healthcare utilization. Using a large language model (LLM), such as ChatGPT (OpenAI) or Copilot (Microsoft), could potentially enhance these outcomes further. It is therefore important to evaluate ChatGPT and Copilot in providing medical advice for LBP and to assess the impact of clinical context on the quality of their responses.
  • Methods
    This was a qualitative comparative observational study. It was conducted within the Department of Physical Medicine and Rehabilitation, University of Montreal in Montreal, QC, Canada. ChatGPT and Copilot were used to answer 27 common questions related to LBP, with and without a specific clinical context. The responses were evaluated by physiatrists for validity, safety, and usefulness using a 4-point Likert scale (4, most favorable).
  • Results
    Both ChatGPT and Copilot demonstrated good performance across all measures. Validity scores were 3.33 for ChatGPT and 3.18 for Copilot, usefulness scores were 3.19 for ChatGPT and 3.13 for Copilot, and safety scores were 3.60 for ChatGPT and 3.57 for Copilot. The inclusion of clinical context did not significantly change the results.
  • Conclusion
    LLMs, such as ChatGPT and Copilot, can provide reliable medical advice on LBP, irrespective of the detailed clinical context, supporting their potential to aid in patient self-management.
Artificial intelligence (AI) is a rapidly evolving field. One subset of AI, which combines natural language processing, deep learning techniques, and immense volumes of training data, is the large language model (LLM), which has the potential to revolutionize medicine and patient care. An LLM can generate text-based content on a broad range of subjects, including medicine and health [1]. LLMs offer multiple advantages in healthcare, from improving the accuracy and efficiency of medical documentation, to supporting complex diagnostic processes and personalized patient management, to generating medical education materials. These capabilities save time for healthcare professionals and enhance the quality of patient care by providing comprehensive and accurate medical information [2]. Currently, the two most popular LLMs are OpenAI’s ChatGPT, with approximately 100 million weekly users, and Microsoft’s Copilot, which reaches the roughly 100 million daily users of Bing, the search engine that now integrates Copilot as one of its main functionalities [3,4]. Both of these LLMs can generate coherent and contextually appropriate text in response to user questions, which raises important questions about their relevance and usefulness in the medical field. ChatGPT and Copilot can be used as search engines and sources of medical education, but they have some limitations [2].
The advent of these LLMs has shown great promise in enhancing patient counseling. However, their integration into specific medical fields, particularly physiatry, which focuses on musculoskeletal disorders, has not yet been thoroughly explored. One potential use could be serving as an interactive tool for patients seeking medical advice to better understand and manage musculoskeletal disorders such as low back pain (LBP). LBP, the leading cause of disability worldwide, is a significant health problem, affecting millions of people and resulting in substantial economic and social burdens [5]. It affected 619 million people in 2020 and is projected to affect 843 million people by 2050 [6].
The self-management of LBP through patient information interventions can offer significant benefits in terms of cost, reduced work absenteeism, and overall healthcare utilization [7,8]. Using an LLM such as ChatGPT or Copilot, both of which are available free of charge, could potentially enhance these outcomes further. By providing personalized and adapted information, ChatGPT and Copilot could address individual needs more precisely, thereby potentially improving understanding of, and adherence to, self-management practices. The interaction with an LLM is tailored to the user’s responses, allowing the information provided to be adjusted to the user’s specific situation, progress, or lack of understanding.
This study aimed to compare two popular LLMs, ChatGPT and Copilot, in providing information and clinical advice regarding LBP. A second objective was to evaluate whether the addition of a clinical context affected the quality of the answers generated by these models.
1. Study design
This study employed a qualitative, comparative observational design to evaluate the validity, safety, and usefulness of the responses provided by the LLMs ChatGPT and Copilot, with and without a clinical context. A comprehensive survey was distributed to physiatrists, soliciting the most common questions their patients with LBP frequently ask, the typical clinical profile of these patients, and information on the physiatrists’ knowledge and use of LLMs such as ChatGPT and Copilot. A total of 21 responses were received (from 18 physiatrists across Quebec and three physiatry residents). A list of 27 questions was created from the survey responses, addressing different aspects of LBP such as causes, evolution, prevention, treatment, and medications. The questions were selected and developed through consensus among the four authors by consolidating similar items from the survey and ensuring that various aspects of LBP management were covered (Supplementary Material 1). A hypothetical clinical scenario and patient profile were constructed from the survey responses (Supplementary Material 2). Subsequently, an independent patient partner with chronic LBP rigorously reviewed the questions; this step ensured their authenticity by aligning them closely with those that a patient might typically ask. Afterwards, ChatGPT (free version, GPT-3.5) and Copilot (free version) were asked the 27 questions in English, with and without the hypothetical patient’s clinical context, and the responses of each model were evaluated under controlled conditions. Three independent physiatrists from the University of Montreal, who specialize in musculoskeletal pathology, assessed the responses from the LLMs.
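For illustration only: the study posed its questions through the free web interfaces of ChatGPT and Copilot, not through an API. A minimal Python sketch of how the two prompt conditions (question alone vs. question preceded by the clinical context) could be assembled programmatically is shown below; the context string, the sample questions, and the use of the OpenAI API are assumptions for the example, not the study's actual materials or workflow.

```python
# Sketch only: the study used the free web interfaces of ChatGPT (GPT-3.5)
# and Copilot, not an API. This illustrates the two prompt conditions.
# CLINICAL_CONTEXT and QUESTIONS are placeholders, not the study's materials.
from openai import OpenAI  # pip install openai

CLINICAL_CONTEXT = (
    "Hypothetical patient profile: a middle-aged office worker with chronic "
    "nonspecific low back pain and no red flags."  # placeholder text
)
QUESTIONS = [
    "What causes low back pain?",            # placeholder question
    "Should I rest in bed or stay active?",  # placeholder question
    # ... the remaining questions (Supplementary Material 1)
]

client = OpenAI()  # expects OPENAI_API_KEY in the environment


def ask(question: str, with_context: bool) -> str:
    """Send one question, optionally prefixed by the clinical context."""
    prompt = f"{CLINICAL_CONTEXT} {question}" if with_context else question
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content


# Collect answers for both conditions, mirroring the two-condition design.
answers = {(q, ctx): ask(q, ctx) for q in QUESTIONS for ctx in (False, True)}
```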
2. Data collection
The evaluators were provided with the responses from ChatGPT and Copilot to the 27 predefined questions, both with and without clinical context, and rated the responses according to predefined categories of validity, safety, and usefulness. The evaluators were not blinded; however, both chatbots were newly released, and there was no pre-existing bias or preference for one over the other, which likely minimized any impact on the results. This design yielded a total of 108 assessments per category for each evaluator (27 questions × 2 LLMs × 2 context conditions).

1) Measurement tools

Responses were evaluated under three categories: validity, safety, and usefulness, using a Likert scale ranging from 1 (least favorable) to 4 (most favorable) (Table 1).
3. Statistical analysis
Descriptive statistics were computed for the validity, safety, and usefulness scores, comparing ChatGPT with Copilot and clinical with nonclinical contexts. Inter-rater reliability was assessed using the joint probability of agreement and Krippendorff’s alpha. Pairwise Cohen’s kappa was used as a secondary measure because of a kappa paradox observed with Krippendorff’s alpha. Statistical significance was set at p<0.05 for all tests. All analyses were conducted using IBM SPSS ver. 29 (IBM Corp., Armonk, NY, USA). Multivariate analysis of variance (MANOVA) with Pillai’s trace was used to test the significance of score differences between ChatGPT and Copilot across the validity, usefulness, and safety scores, with a score difference of 1 unit considered clinically significant.
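As an illustrative sketch only (the study's analyses were run in IBM SPSS), the reliability measures described above could be reproduced in Python roughly as follows; the rating matrix is randomly generated here, and the krippendorff and scikit-learn packages are assumed to be available.

```python
# Illustrative sketch only (the study used IBM SPSS): inter-rater reliability
# on simulated ordinal ratings from three evaluators. Real data would come
# from the completed evaluation sheets (108 items per category per evaluator).
import numpy as np
import krippendorff                      # pip install krippendorff
from sklearn.metrics import cohen_kappa_score

rng = np.random.default_rng(0)
# Rows = 3 evaluators, columns = 108 rated answers, scores clustered at 3-4.
ratings = rng.choice([2, 3, 4], p=[0.1, 0.5, 0.4], size=(3, 108))

# Krippendorff's alpha for ordinal data (rows = raters, columns = units).
alpha = krippendorff.alpha(reliability_data=ratings,
                           level_of_measurement="ordinal")

# Pairwise Cohen's kappa between the three evaluators.
kappas = {(i + 1, j + 1): cohen_kappa_score(ratings[i], ratings[j])
          for i, j in [(0, 1), (0, 2), (1, 2)]}

# Joint probability of agreement on a score of 3 or 4 among all evaluators,
# the criterion reported in the Results.
joint_3_or_4 = np.mean(np.all(ratings >= 3, axis=0))

print(alpha, kappas, joint_3_or_4)
```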
In this study, Krippendorff’s alpha values were low (overall, 0.11; validity, 0.07; safety, 0.01; usefulness, 0.05). Pairwise Cohen’s kappa was also low (Evaluator 1 vs. 2, 0.02; Evaluator 1 vs. 3, 0.15; Evaluator 2 vs. 3, 0.035), even though the mean scores from all evaluators were similar (3.32–3.35), the variances were low (0.4–0.64), the frequencies of scores of 3 or 4 were high (83%–92%), and the joint probability of all three evaluators assigning a score of 3 or 4 was high (71%).
Across the three assessed categories of validity, usefulness, and safety, no statistically significant differences were observed between the performances of ChatGPT and Copilot. The addition of a clinical context did not significantly affect the validity, usefulness, or safety scores (Fig. 1, Table 2).
Specifically, the average validity scores for ChatGPT and Copilot were 3.33 and 3.18, respectively, with scores for answers from questions without clinical context averaging 3.23 vs. 3.28 with context. These differences were not statistically significant, as evidenced by F-tests (LLM: F=2.43, p=0.13; clinical context: F=0.673, p=0.42; LLM-clinical context interaction: F=3.72, p=0.07) (Table 3).
Similarly, in the usefulness category, ChatGPT and Copilot scored an average of 3.19 and 3.13, respectively, with answers from questions without clinical context scoring 3.15 vs. 3.17 with context. The differences were not statistically significant (LLM: F=0.445, p=0.510; clinical context: F=0.031, p=0.86; LLM-clinical context interaction: F=0.794, p=0.38) (Table 3).
Safety scores also followed this trend, with ChatGPT and Copilot scoring 3.60 and 3.57, respectively, and no significant differences were found between the clinical and nonclinical contexts (3.62 and 3.56, respectively) (LLM: F=0.308, p=0.583; clinical context: F=1.083, p=0.308; LLM-clinical context interaction: F=0.523, p=0.48) (Table 3).
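The per-category F statistics reported above correspond to a two-way design (LLM by clinical context, plus their interaction). As a hedged sketch under assumed column names (score, llm, context), a comparable two-way ANOVA for a single category could be run in Python with statsmodels as follows; the study itself used SPSS, and the toy ratings below are not study data.

```python
# Illustrative two-way ANOVA sketch for one category (e.g., validity):
# score ~ LLM x clinical context, on a hypothetical long-format DataFrame
# with one row per rated answer. Column names and values are assumptions.
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

df = pd.DataFrame({
    "score":   [3, 4, 3, 3, 4, 3, 4, 3],                        # toy ratings
    "llm":     ["ChatGPT", "ChatGPT", "Copilot", "Copilot"] * 2,
    "context": ["without"] * 4 + ["with"] * 4,
})

model = smf.ols("score ~ C(llm) * C(context)", data=df).fit()
# Type II ANOVA table: F and p-values for LLM, context, and their interaction.
print(anova_lm(model, typ=2))
```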
In addition, separate ANOVA tests were conducted for each evaluator, and the findings were the same, indicating consistency in the results across the three physiatrists. The only exception was in the validity category for one evaluator, where ChatGPT scored statistically significantly higher than Copilot (F=5.294, p=0.03) (Supplementary Material 3).
Some limitations of LLMs include their potential to generate inaccurate or biased information and legal and ethical concerns regarding the use of AI-generated content in medicine [2]. These issues require consideration and management to ensure that the integration of AI tools such as ChatGPT and Copilot into healthcare practices does not compromise the quality or integrity of the care provided. The need for transparency and accountability in AI-generated medical advice is critical to mitigate these risks and harness the full potential of this technology [9].
As of May 2024, to the best of our knowledge, no study has compared the effectiveness of Copilot and ChatGPT in providing clinical advice. Furthermore, no study has examined the responses generated by these LLMs when a clinical context is provided compared with generic inquiries. The perception of LLMs among physicians is also mixed. While some healthcare professionals are optimistic about the potential of these tools to enhance efficiency and patient care, others express apprehension about the accuracy of the information provided and the potential of these models to replace human judgment in clinical settings, which raises safety concerns [9]. Therefore, this study undertook a comprehensive evaluation of ChatGPT and Copilot in providing medical advice about LBP to patients and assessed the impact of adding clinical context information on the quality of the answers generated.
The statistical analysis suggests that the low Krippendorff’s alpha values may be due to the kappa paradox. This paradox occurs when most ratings cluster at one end of the scale (e.g., scores of 3 or 4), as in our study; in such cases, even small differences between raters can produce low reliability coefficients despite generally good agreement [10]. In other words, some inter-rater reliability measures may underestimate the actual agreement when ratings are heavily concentrated at one end of the scale (Supplementary Material 3).
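A toy numerical example, not drawn from the study data, illustrates this paradox: two raters who agree on 91 of 100 items can still obtain a very low kappa when nearly all ratings fall at the top of the scale, because chance-expected agreement is then also very high.

```python
# Toy illustration of the kappa paradox (hypothetical ratings, not study data):
# high raw agreement with ratings clustered at 4 yields a low Cohen's kappa.
from sklearn.metrics import cohen_kappa_score

rater_a = [4] * 90 + [4] * 5 + [3] * 4 + [3] * 1   # 95 fours, 5 threes
rater_b = [4] * 90 + [3] * 5 + [4] * 4 + [3] * 1   # 94 fours, 6 threes

raw_agreement = sum(a == b for a, b in zip(rater_a, rater_b)) / len(rater_a)
kappa = cohen_kappa_score(rater_a, rater_b)

print(raw_agreement)    # 0.91 (91 of 100 items identical)
print(round(kappa, 2))  # about 0.13, despite 91% raw agreement
```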
The data from this study also demonstrated good performance regarding the answers from the validity, usefulness, and safety perspectives for both ChatGPT and Copilot, independent of the presence of a clinical context. This suggests that both ChatGPT and Copilot can assist in patient self-management of LBP.
To further validate our conclusions, we conducted separate ANOVA tests for each evaluator to confirm that the findings were consistent across evaluators and to reinforce the robustness of our results. These individual analyses mirrored the pooled outcomes, showing no significant differences between ChatGPT and Copilot or with the inclusion of clinical context in the questions, with the sole exception of the validity category for one evaluator, for whom ChatGPT scored statistically, but not clinically, significantly higher than Copilot (ChatGPT, 3.41 vs. Copilot, 3.17; F=5.294, p=0.03) (Supplementary Material 3).
To the best of our knowledge, no physiatry studies have evaluated the potential of LLMs in musculoskeletal diseases. The few studies that have been performed were mostly in orthopedics and rheumatology and usually evaluated ChatGPT’s performance for various musculoskeletal diseases, including LBP, with varying levels of accuracy; in those studies, the LLM performed poorly in terms of accuracy as a potentially reliable and useful source of information for patients [11-13]. Regarding Microsoft’s Copilot, another popular LLM, no studies evaluating its potential use in musculoskeletal disorders were found. Although Copilot is built on OpenAI’s GPT technology, it runs on the GPT-4 model rather than GPT-3.5, which powers the free version of ChatGPT, and it has not been trained on exactly the same data. The main difference is that GPT-3.5 has a knowledge cutoff of January 2022, whereas Copilot’s model is kept up to date and provides Internet links with its answers. Copilot is also programmed to deliver answers that are structured differently from those of ChatGPT, which is why our study compared these two similar but not identical LLMs [14].
A potential limitation of this study is the static nature of the data collection: it did not account for possible changes in the output of the LLMs over time. Future studies should consider a longitudinal approach in which responses are evaluated at multiple points in time to capture variability and changes in the LLMs’ outputs. Another limitation is that the evaluators were not blinded. In retrospect, although the grading scales were carefully designed, the “safety” ratings left room for interpretation: grades 1 and 4 were clear, but the distinction between categories 2 and 3 (“moderate” vs. “minimal” danger) could have been better clarified. This interpretation relied mainly on the evaluators’ expert clinical judgment.
Regarding the “usefulness” ratings, the evaluators based their assessments on their knowledge of and clinical experience with LBP. These ratings therefore reflected their professional understanding of the condition and their personal approach to patient counseling.
Although there was no head-to-head comparison with human physicians in this study, the rating scale was designed as an absolute measure, where the best possible answer, whether provided by a clinician or an intelligent chatbot, would achieve a score of 4/4 on the Likert scale. Therefore, even without direct comparisons, this study allows for the quantification of the quality of responses provided by the LLMs. Nevertheless, it would indeed have been preferable to include a comparison arm with human physicians, which is an important consideration for future research.
In conclusion, this study showed the potential benefits of ChatGPT and Copilot in delivering medical advice on LBP, with both models consistently achieving high scores for validity, safety, and usefulness. Future studies should be conducted to confirm the safety of these LLMs over time. Furthermore, the observation that including the clinical context did not significantly influence the outcomes is particularly noteworthy, suggesting that these tools can offer high-quality advice even when patients lack detailed knowledge of their conditions. It also raises the question of whether LLMs are not yet sufficiently advanced to tailor their recommendations to specific clinical situations, resulting in advice that is valid but remains generic. These findings suggest that LLMs such as ChatGPT and Copilot could be cautiously recommended to patients who have limited access to medical professionals such as physiatrists, or during the interval between a family practitioner consultation and a physiatry consultation. This could empower patients to begin non-pharmacological interventions and proactively manage their condition, potentially preventing deterioration due to inactivity or fear of pain exacerbation.
Future research should expand on these results by testing other AI models, such as Google’s Gemini and Meta AI, and by broadening the scope to general musculoskeletal diseases. There is also an opportunity to assess these tools directly with patients, to establish with greater certainty the validity, usefulness, and safety of LLMs, and to evaluate their effectiveness as support tools for clinicians.
Supplementary Materials 1 to 3 can be found at https://doi.org/10.12701/jyms.2024.01151.
Supplementary Material 1.
Twenty-seven questions asked to ChatGPT and Copilot
jyms-2024-01151-Supplementary-Material-1.pdf
Supplementary Material 2.
Clinical context added to the questions
jyms-2024-01151-Supplementary-Material-2.pdf
Supplementary Material 3.
Statistics
jyms-2024-01151-Supplementary-Material-3.pdf

Conflicts of interest

Mathieu Boudier-Revéret has been an editorial board member of Journal of Yeungnam Medical Science (JYMS) since 2021. He was not involved in the review process of this manuscript. There are no other conflicts of interest to declare.

Author contributions

Conceptualization: CAY, ÈB; Data curation, Formal analysis, Project administration, Visualization, Validation: CAY; Investigation, Methodology, Software, Supervision: all authors; Resources: CAY, MBR; Writing-original draft: CAY, ÈB; Writing-review & editing: all authors.

Fig. 1.
Average scores of ChatGPT vs. Copilot with and without clinical context (blue, validity; orange, usefulness; green, safety). The Likert scale ranged from 1 to 4 (1, lowest score; 4, highest score).
jyms-2024-01151f1.jpg
Table 1.
Evaluation score per category
Score 1: Validity, completely erroneous information (information not found in medical sources, or inaccurate and incomplete); Safety, significant and certain danger to the patient’s condition; Usefulness, useless for the patient (no useful information)
Score 2: Validity, partially erroneous information (part of the information is not found in medical sources or contains inaccuracies); Safety, potential moderate danger to the patient’s condition; Usefulness, partially useful for the patient (>0% and <50% of the information provided is useful)
Score 3: Validity, reliable but incomplete information (information found in medical sources but the response has incomplete elements); Safety, potential minimal danger to the patient’s condition; Usefulness, moderately useful for the patient (≥50% but not 100% of the information provided is useful)
Score 4: Validity, completely reliable information (information found in medical sources and the response is complete); Safety, no danger; Usefulness, completely useful (100% of the information provided is useful)
Table 2.
Average score of evaluators
Category Evaluator ChatGPT (without context) Copilot (without context) ChatGPT (with context) Copilot (with context)
Validity 1 3.30 3.00 3.26 3.26
2 3.30 3.22 3.30 3.26
3 3.44 3.11 3.37 3.22
Usefulness 1 3.33 3.07 3.15 3.19
2 3.26 3.30 3.26 3.30
3 3.04 2.93 3.11 3.00
Safety 1 3.59 3.48 3.59 3.63
2 3.48 3.48 3.52 3.59
3 3.70 3.63 3.74 3.63

The Likert scale ranged from 1 to 4 (1, lowest score; 4, highest score).

Table 3.
Average score per category
Category ChatGPT (without context) Copilot (without context) ChatGPT (with context) Copilot (with context)
Validity 3.35 3.11 3.31 3.25
Usefulness 3.21 3.10 3.17 3.16
Safety 3.59 3.53 3.62 3.62

The Likert scale ranged from 1 to 4 (1, lowest score; 4, highest score).

  • 1. Alowais SA, Alghamdi SS, Alsuhebany N, Alqahtani T, Alshaya AI, Almohareb SN, et al. Revolutionizing healthcare: the role of artificial intelligence in clinical practice. BMC Med Educ 2023;23:689.
  • 2. Dave T, Athaluri SA, Singh S. ChatGPT in medicine: an overview of its applications, advantages, limitations, future prospects, and ethical considerations. Front Artif Intell 2023;6:1169595.
  • 3. Nerdynav. 107 Up-to-date ChatGPT statistics & user numbers [Internet]. Nerdynav; 2022 Dec 13 [cited 2024 May 8]. https://nerdynav.com/chatgpt-statistics.
  • 4. Nerdynav. 43+ Bing statistics: usage, ad revenue & market share [Internet]. Nerdynav; 2023 Feb 13 [cited 2024 May 8]. https://nerdynav.com/bing-statistics.
  • 5. Lee P, Bubeck S, Petro J. Benefits, limits, and risks of GPT-4 as an AI chatbot for medicine. N Engl J Med 2023;388:1233–9.
  • 6. GBD 2021 Low Back Pain Collaborators. Global, regional, and national burden of low back pain, 1990-2020, its attributable risk factors, and projections to 2050: a systematic analysis of the Global Burden of Disease Study 2021. Lancet Rheumatol 2023;5:e316–29.
  • 7. Rantonen J, Karppinen J, Vehtari A, Luoto S, Viikari-Juntura E, Hupli M, et al. Cost-effectiveness of providing patients with information on managing mild low-back symptoms in an occupational health setting. BMC Public Health 2016;16:316.
  • 8. Hoy D, March L, Brooks P, Blyth F, Woolf A, Bain C, et al. The global burden of low back pain: estimates from the Global Burden of Disease 2010 study. Ann Rheum Dis 2014;73:968–74.
  • 9. Iyengar KP, Yousef MM, Nune A, Sharma GK, Botchu R. Perception of Chat Generative Pre-trained Transformer (Chat-GPT) AI tool amongst MSK clinicians. J Clin Orthop Trauma 2023;44:102253.
  • 10. Zec S, Soriani N, Comoretto R, Baldi I. High agreement and high prevalence: the paradox of Cohen’s kappa. Open Nurs J 2017;11:211–8.
  • 11. Uz C, Umay E. “Dr ChatGPT”: is it a reliable and useful source for common rheumatic diseases? Int J Rheum Dis 2023;26:1343–9.
  • 12. Shrestha N, Shen Z, Zaidat B, Duey AH, Tang JE, Ahmed W, et al. Performance of ChatGPT on NASS clinical guidelines for the diagnosis and treatment of low back pain: a comparison study. Spine (Phila Pa 1976) 2024;49:640–51.
  • 13. Krusche M, Callhoff J, Knitza J, Ruffer N. Diagnostic accuracy of a large language model in rheumatology: comparison of physician and ChatGPT-4. Rheumatol Int 2024;44:303–6.
  • 14. Android Authority. Microsoft Copilot vs ChatGPT: which one is best for you? [Internet]. Android Authority; 2024 Jan 23 [cited 2024 May 8]. https://www.androidauthority.com/chatgpt-vs-bing-chat-3292126/.
