Abstract
Objective
ChatGPT, an advanced chatbot built on artificial intelligence (AI) and a large language model, is designed to understand inputs and generate responses. This study aims to assess the accuracy of ChatGPT’s responses to questions that patients might ask about thyroid cancer.
Methods
A total of 27 questions in Turkish, relevant to thyroid cancer and likely to be asked by non-healthcare professionals, were prepared under four categories (general information, diagnosis, treatment, follow-up). These questions were posed to the free public version of ChatGPT, version 3.5. Three experts in endocrine surgery (A.C.D., S.T., Ö.M.) were asked to evaluate the responses. The answers were classified into three categories: appropriate, inappropriate, and insufficient/incomplete.
Results
Of the responses given by ChatGPT to the prepared questions across the four categories, 9 (33.3%) were considered “appropriate” by two of the three experts and “insufficient/incomplete” by one expert, and 6 (22.2%) were deemed “appropriate” by two experts and “inappropriate” by one. Overall, 16 responses (59.2%) were considered “appropriate” by at least two experts.
Conclusion
At this stage, AI-based conversational programs such as ChatGPT cannot replace a specialist as a source of medical advice for patients.
Introduction
Large language models (LLMs) are AI products that use deep learning techniques, such as artificial neural networks, to replicate human language processing capabilities(1). They are capable of learning from and processing vast amounts of language data from various sources. ChatGPT, developed by OpenAI (OpenAI, L.L.C., San Francisco, CA, USA), an organization originally founded as a non-profit initiative, and released on November 30, 2022, is an advanced chatbot that uses a text interface to understand inputs and generate responses(2). Free access to the application has encouraged people to use it as a tool for acquiring information on various topics, including health.
The application does not search the internet at query time; it generates responses from patterns learned during training on large volumes of internet text. While the internet contains much accurate information, it also hosts incorrect or misleading data, so the application can reproduce erroneous information alongside accurate data. In sensitive topics such as health, especially concerning diseases like cancer, incorrect information from the application could cause medical and psychological distress to a patient(3, 4). Consequently, the responses given by the application need to be evaluated by expert clinicians.
According to World Health Organization data, thyroid cancer is the second most common cancer among women in Turkey, a region endemic for goitre, after breast cancer, with approximately 13,500 cases annually (5.9% of all cancers)(5). Given this prevalence, there is a clear need for the public to access information in Turkish about thyroid cancer, both from the internet and through the ChatGPT application. Our study aims to evaluate the appropriateness of ChatGPT’s answers to basic questions about thyroid cancer posed in Turkish by non-healthcare professionals.
Materials and Methods
A total of 27 questions in Turkish that non-healthcare professionals might ask about thyroid cancer were prepared in four sections (general information, diagnosis, treatment, and follow-up) (Table 1). Each question was posed twice to the free public version of ChatGPT 3.5 to check the consistency of its responses. The responses obtained were compiled into a survey via Google Forms. Three expert academicians in endocrine surgery, each holding European Board of Endocrine Surgery certification and a Web of Science H-index above 10, were asked to evaluate the responses provided by ChatGPT to the prepared Turkish questions on thyroid cancer.
Ethical approval was obtained for the study from the Ege University Medical Research Ethics Committee (decision no: 23-12T/32, date: 14.12.2023). An informed consent form was presented online to the three academicians, who provided their consent.
Statistical Analysis
The academicians categorized ChatGPT’s responses into three groups: appropriate, inappropriate, and insufficient/incomplete. The responses were recorded in Microsoft Excel (Microsoft 365), and descriptive statistics were compiled from the spreadsheet for ChatGPT’s answers to each question.
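The majority-based tallies reported in the Results (a response counted as “appropriate” when at least two of the three experts rated it so) can be sketched as follows. This is an illustrative sketch only; the category labels, helper function, and example rows are hypothetical and not part of the study’s actual analysis, which was performed in Excel.

```python
from collections import Counter

def majority_rating(ratings):
    """Return the category chosen by at least two of the three experts,
    or None when all three experts disagree."""
    category, count = Counter(ratings).most_common(1)[0]
    return category if count >= 2 else None

# Hypothetical example rows (one tuple of three expert ratings per question):
# A = appropriate, I = inappropriate, X = insufficient/incomplete.
example = [("A", "A", "X"), ("A", "I", "A"), ("A", "I", "X")]
print([majority_rating(r) for r in example])  # ['A', 'A', None]
```

Applied to all 27 rows, such a tally yields the counts of responses rated “appropriate”, “inappropriate”, or “insufficient/incomplete” by at least two experts, plus any response on which all three experts disagreed.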
Results
When the responses provided by ChatGPT to the prepared questions across the four groups were evaluated, of the 27 responses, 9 (33.3%) were considered “appropriate” by two of the three experts and “insufficient/incomplete” by one expert. Six responses (22.2%) were deemed “appropriate” by two experts and “inappropriate” by one expert. Six responses (22.2%) were categorized as “insufficient/incomplete” by two experts and “appropriate” by one. One response was judged “appropriate” by all three experts, while another was rated “insufficient/incomplete” by all three. One response was assessed differently by each of the three experts (Table 2).
The number of responses found “appropriate” by at least two experts was 16 (59.2%), while the responses considered “inappropriate” by at least two experts were only 2 (7.4%). Responses labelled as “insufficient/incomplete” by at least two experts totalled 8 (29.6%).
The response to the question “My ultrasound results mention ‘EU-TIRADS 4’. What does this mean?” under the Diagnosis section was evaluated as “appropriate” by all three experts (Figure 1). The response to “Who performs thyroid cancer surgery?” under the Treatment section was assessed differently by each expert (Figure 2). The response to “How is thyroid cancer surgery performed?”, also under the Treatment section, was rated as “insufficient/incomplete” by all three experts (Figure 3).
Discussion
When the same questions were entered into traditional search engines, advertisements typically occupied the first one or two links, and the information reached through the remaining links was often hosted on health-related websites that generally lacked proper citation of sources. Accessing accurate information through traditional search engines was therefore significantly more challenging and time-consuming than through ChatGPT. While ChatGPT is a highly effective and efficient AI chat program, particularly for scientific research and healthcare professional training because of its ability to quickly access vast information in various languages, ethical issues remain with its use at the community level for health problems, including biases in the data and concerns over the privacy and security of personal data(6).
In the increasingly consumer-focused model of healthcare services, unprecedented access to information may extend to patients using ChatGPT to form opinions on medical questions. A recent study reported that 89% of people in the United States consult Google about health symptoms before visiting a doctor(7). The popularity of ChatGPT, an LLM-based AI chat program, has grown significantly over the past two years, demonstrating its potential as a tool through which patients access health information. However, although ChatGPT can simplify our lives in many ways, its responses to Turkish questions about thyroid cancer, a topic requiring specific expertise, were found appropriate by at least two experts in only about 60% of cases. By contrast, Zalzal et al.(8) reported that ChatGPT’s responses to patient questions about ENT diseases were quite satisfactory. Furthermore, a study examining ChatGPT’s responses to 38 general questions about colon cancer management found that approximately 87% were deemed appropriate and consistent by at least two independent experts(9). Another study evaluating the appropriateness of ChatGPT’s cardiovascular disease prevention recommendations found 84% of them to be consistent and appropriate(10).
Köroğlu et al.(11) found ChatGPT’s responses on the management of thyroid nodules to be mostly accurate and reliable when evaluated by two expert endocrinologists; however, they did not consider it appropriate as a primary source for physicians, suggesting instead that it could guide patients. The lower appropriateness rates in our study compared with others may stem from our use of Turkish: ChatGPT generates answers from the data it was trained on, and Turkish sources are scarcer than sources in more widely used languages such as English, which could contribute to inadequate responses(12). Deiana et al.(13), in a study on vaccination myths, posed questions to both ChatGPT 3.5 and ChatGPT 4.0 and found a 17% difference in clarity between the versions; this, too, could contribute to the unsuitability of responses in our study, since the paid 4.0 version, with its improvements and updates, is likely to provide more detailed and acceptable responses than the free version we used.
Some studies have shown that ChatGPT achieves similar accuracy rates across different languages(14), but our literature review found no study comparing Turkish with other languages; future studies should be designed to include such comparisons to achieve more meaningful results. A study on thyroid nodules by Campbell et al.(15) found the accuracy of ChatGPT’s responses to be 69.2%, in line with our study, suggesting that the lower accuracy in both studies may be topic-dependent: the appropriateness of ChatGPT’s responses on thyroid nodules and their treatment may be lower than for other health topics. This deficiency could be addressed over time by making more databases available to applications like ChatGPT. The readability level of ChatGPT’s responses has been reported to exceed the middle-school level recommended by international standards(16), revealing a gap in reaching the public effectively in health education. Although language-level assessments have been conducted in English, no study evaluating the language level of responses in Turkish was found; future study designs should consider this issue.
ChatGPT, accessible 24/7, has become an increasingly popular AI-based chat program for medical advice; however, it must be remembered that ChatGPT is not a medical application. If used without proper oversight in the health sector, it can be expected to cause medical and paramedical problems.
Study Limitations
This study has several limitations: ChatGPT’s access to accurate data may have been limited because the questions were in Turkish while most resources accessible on the internet are in English; the number of academicians meeting the selection criteria who evaluated the responses was small; and the paid version, ChatGPT 4.0, could not be evaluated.
Conclusion
Although technological advancements are increasingly integrating artificial intelligence into our daily lives, and its use by the public is growing, it is currently not appropriate for AI-based chat programs like ChatGPT to replace medical professionals. Such programs should not provide advice on specific issues requiring professional health services, such as thyroid cancer. As this AI program continues to evolve, it is expected to benefit the health sector greatly, giving doctors the opportunity to save time in clinics and thereby reach more patients effectively, and offering patients 24/7 access to information. However, patients should obtain the most reliable and accurate information about their conditions from specialists in the field.