Patients and doctors should not rely too much on OpenAI’s ChatGPT for cancer treatment advice, as a new study reveals the popular artificial intelligence (AI) technology often weaves incorrect and correct information together, rendering its recommendations unreliable.
Researchers from Brigham and Women's Hospital, part of the Mass General Brigham healthcare system, aimed to shed light on ChatGPT's limitations when it comes to recommending cancer treatments. The findings, published in JAMA Oncology, show that the AI chatbot often provides recommendations that do not align with established guidelines, raising concerns about the reliability of its advice not only for cancer treatment but potentially for other medical questions as well.[1]
Co-author Danielle Bitterman, MD, from the department of radiation oncology at Brigham and Women's Hospital and the Artificial Intelligence in Medicine (AIM) Program of Mass General Brigham, cautioned that while ChatGPT's responses seem authoritative and almost human, the technology is no substitute for an actual physician.
“Patients should feel empowered to educate themselves about their medical conditions, but they should always discuss with a clinician, and resources on the Internet should not be consulted in isolation,” Bitterman said in a release. “ChatGPT responses can sound a lot like a human and can be quite convincing. But, when it comes to clinical decision-making, there are so many subtleties for every patient’s unique situation. A right answer can be very nuanced, and not necessarily something ChatGPT or another large language model can provide.”
The study specifically evaluated ChatGPT's adherence to the National Comprehensive Cancer Network (NCCN) guidelines for the treatment of the three most common cancers: breast, prostate, and lung cancer. The researchers prompted ChatGPT to provide treatment recommendations based on varying disease severities.
Out of the total 104 queries, approximately 98% of ChatGPT's responses included at least one treatment approach aligned with the NCCN guidelines. However, 34% of these responses also included one or more recommendations that deviated from the guidelines, interwoven with the sound advice in a way that makes it difficult for users to discern correct advice from erroneous suggestions.
Hallucinations send mixed messages
Importantly, the study delved deeper into the nature of these non-concordant recommendations. Responses were "hallucinated" (not part of any NCCN-recommended treatment) in 12.5% of outputs. Hallucinations were mostly associated with recommendations for localized treatment of advanced disease, targeted therapy, or immunotherapy. This raises concerns not only because hallucinated answers are inaccurate, but also because they could mislead patients and erode their trust in medical professionals.
The researchers themselves did not always agree on how to score ChatGPT's responses: all three annotators agreed on 61.9% of the scores for the unique prompts. Disagreements often stemmed from instances where the chatbot's output was unclear, particularly when it did not specify which of several treatments should be combined. This highlights the challenge of interpreting the descriptive output of large language models (LLMs), the class of technology that includes ChatGPT, and how that ambiguity can lead to different readings of the guidelines.
The study's lead author, Shan Chen, MS, also from the AIM Program, expressed the need to raise awareness about the limitations of LLMs.
“It is an open research question as to the extent LLMs provide consistent logical responses as oftentimes ‘hallucinations’ are observed,” Chen said in the same release. “Users are likely to seek answers from the LLMs to educate themselves on health-related topics – similarly to how Google searches have been used. At the same time, we need to raise awareness that LLMs are not the equivalent of trained medical professionals.”
Moving forward, the researchers plan to investigate the extent to which both patients and clinicians can distinguish advice generated by AI models from advice given by medical professionals. Additionally, they intend to further evaluate ChatGPT's clinical knowledge by employing more detailed clinical cases.
The study utilized GPT-3.5-turbo-0301, one of the largest AI models available at the time of the research. The authors acknowledged that while results may differ with other LLMs and clinical guidelines, many LLMs share similar limitations in their responses.