Effects of temperature settings on information quality of ChatGPT-3.5 responses: A prospective, single-blind, observational cohort study

Akihiko Akamine; Daisuke Hayashi; Atsushi Tomizawa; Yuya Nagasaki; Chikae Akamine; Takahiro Fukawa; Iori Hirosawa; Orie Saigo; Misa Hayashi; Mitsuru Nanaoya; Yuka Odate

doi:10.1101/2024.06.11.24308759

Abstract

Objective The effect of temperature settings on the quality of ChatGPT version 3.5 (OpenAI) responses related to drug information remains unclear. We investigated ChatGPT-3.5’s response quality on apixaban information with and without the temperature being set to 0.

Methods On 6 September 2023, 37 questions regarding apixaban, derived from the frequently asked questions on the Bristol–Myers Squibb’s website, were entered into ChatGPT in Japanese. The primary endpoint was the effect of temperature settings on ChatGPT-3.5’s responses to apixaban-related questions. The response accuracy, clarity, detail, and adequacy were rated on a 5-point Likert scale by 10 pharmacists, with higher scores indicating higher response quality. Cumulative score means were analyzed using the Mann–Whitney U test. In the subgroup analysis, evaluators were limited to pharmacists at university hospitals. Welch’s t-test was employed in sensitivity analysis to validate primary endpoint findings.

Results The mean scores for ChatGPT-3.5’s apixaban-related responses with (13.08) and without (14.40) the temperature being set to 0 were not significantly different (p = 0.064). Accuracy differed significantly (3.15 vs. 3.54, p = 0.045), whereas clarity, detail, and appropriateness were similar. Subgroup analysis (13.30 vs. 14.21, p = 0.394) and sensitivity analysis confirmed similar results (13.45 vs. 14.52, p = 0.105).

Conclusions ChatGPT-3.5 temperature setting does not significantly affect overall responses to apixaban-related inquiries. However, the variance in accuracy suggests that ChatGPT-3.5 is unable to consistently provide precise responses. Hence, it is more suitable as a supplementary tool rather than a primary medical resource.

Introduction

Recent advances in artificial intelligence (AI) have led to the development of sophisticated tools, such as ChatGPT [1], which are increasingly utilized in various fields, including pharmaceutical information services. ChatGPT, developed by OpenAI, has the potential to enhance patient care in the medical field by providing accurate information. Its efficacy in predicting drug–drug interactions highlights its important role in healthcare [2]. Furthermore, AI integration into medical safety education, including drug information services, is being actively explored. These investigations focus on addressing ethical and security issues associated with AI integration, ultimately aiming to provide comprehensive and personalized medical services [3]. In life sciences, AI has facilitated advances in research methods, protocols, and data analysis, enabling medical providers to make more effective decisions [4].

Despite these advances, the application of ChatGPT in drug information services remains challenging. In a previous real-world study, ChatGPT answered the majority of drug-related questions incorrectly or only partly correctly, highlighting the limitations of applying AI in the drug information field due to issues such as inaccurate content and a lack of references [5–9]. This performance inconsistency raises concerns regarding the reliability and robustness of AI-generated drug information. It is necessary to assess the accuracy and reliability of ChatGPT in providing pharmaceutical information, particularly under different operational settings, such as temperature, which can influence the model’s response style and content. A method for adjusting ChatGPT’s response quality by setting the temperature has been reported. ChatGPT temperature is a parameter that controls the diversity of the generated text and can be specified as a value between 0 and 1. At low temperatures, the generated text is more predictable and monotonous, whereas at high temperatures, the generated text may contain more diverse and random words and expressions [10].

The following limitations have been associated with previous studies: (i) a small number of ChatGPT response raters (<10) may have biased the ratings; (ii) drug information validation was lacking; (iii) responses with different temperature settings were not validated; and (iv) despite the evaluation of the response accuracy of ChatGPT, the clarity, detail, and appropriateness of the responses were not verified. Addressing these limitations would facilitate a more comprehensive evaluation of the quality of ChatGPT responses to pharmaceutical information.

This study compared and verified the quality of the answers provided by ChatGPT-3.5 regarding drug-related questions with and without a temperature setting of 0. The aim of this study was to evaluate the effect of temperature settings on the accuracy, clarity, detail, and appropriateness of ChatGPT-3.5 responses and to elucidate its reliability as a drug information tool. The drug of interest was oral apixaban (tablet; Eliquis®; Bristol–Myers Squibb, New York, NY, USA), the highest-selling oral drug in FY2022, excluding COVID-19 prophylaxis treatments and therapeutic modalities [11].

Methods

Ethics approval

This was an observational study. The Institutional Review Board for Observation and Epidemiological Study at the Doujin Hospital confirmed that no ethics approval was required (Date: 20 May 2023).

Consent to participate

Informed consent was not required for this study because it did not involve human subjects.

Study design

This prospective, single-blind, observational cohort study was conducted at eight hospitals, pharmacies, and pharmacy schools in Japan (Level of Evidence IV). Ten evaluators (D.H., A.T., Y.N., C.A., F.T., I.H., M.H., M.N., O.S., and Y.O.) participated in this study; among them, three pharmacists specialized in cancer, heart failure, and perioperative patient management. Data collection, creation of questions for ChatGPT, and analysis of ChatGPT responses were conducted between 1^st July and 30^th November 2023.

Eligibility criteria of the pharmacists

We included pharmacists with at least 3 years of experience in hospital or pharmacy practice, who were employed in a facility utilizing apixaban. We excluded pharmacists who, according to the principal researcher, were deemed incapable of adequately evaluating drug information as well as pharmacists who used ChatGPT to obtain content relevant to the study during the evaluation period.

Creating questions for ChatGPT

The principal researcher developed a total of 37 questions dissecting 35 frequently asked questions regarding apixaban posted in Japanese on the Bristol–Myers Squibb’s website, separating combined questions into individual queries, if necessary [12]. Three core researchers (D.H., A.T., and Y.N.) evaluated and approved the questions. In cases of disagreement, inputs were obtained from a fourth researcher (C.A.), and the principal investigator made the final decision.

Evaluation of ChatGPT’s responses

On 6 September 2023, the primary researcher posted questions to ChatGPT-3.5 in Japanese. To maintain novelty, each question was asked from the same account using a “New Chat.” To maintain the temperature setting of ChatGPT at 0, we noted “Please use temperature of 0” at the end of each question. The reproducibility of responses was not considered in this study, and the first response obtained was evaluated. All questions required written responses, and no multiple-choice questions were included. All the textual prompts are provided in S1 File. The principal researcher used a Google form with one question that did not specify the temperature setting and two options (set and not set). These responses were randomized, blinded, and presented to the evaluators.

The evaluators were provided with answers with and without the temperature setting, in a random order. They were instructed to read the questions and their answers prior to the evaluation. Evaluators were asked to rate 296 responses using a 5-point Likert scale using four dimensions, (accuracy, clarity, detail, and appropriateness) [13]. Each question was answered only once, and the response options were rated on a scale of 1–5, with higher numbers indicating higher quality. The evaluation criteria are presented in S2 File. Evaluators consulted reliable sources of apixaban drug information (e.g., package inserts, interview forms, and Bristol–Myers Squibb’s website), as needed.

Main analysis

The primary endpoint of the study was the quality of ChatGPT-3.5’s answers to apixaban-related questions with and without the temperature being set to 0. The accuracy, clarity, detail, and appropriateness of the responses were rated on a 5-point Likert scale (1–5), and the mean scores for all questions (4–20) were analyzed using Mann–Whitney U test. In addition, as secondary endpoints, the accuracy, clarity, detail, and appropriateness of ChatGPT-3.5’s answers were individually rated on a 5-point Likert scale (1–5) with and without a temperature setting of 0. The mean score for each category for all questions was analyzed using Mann–Whitney U test.

Subgroup analysis

To confirm the robustness of the primary outcomes, we conducted a subgroup analysis that included only pharmacists affiliated with university hospitals, to evaluate the influence of affiliations of pharmacists on outcomes similar to the main analysis. University hospitals have reported higher patient satisfaction than general hospitals [14], but no difference in mortality or readmission rates by disease has been noted. However, since no studies have compared the quality of pharmacists at different institutions, only pharmacists from university hospitals were included in this analysis.

Sensitivity analysis

To determine the effect of the statistical analysis method on the primary outcome, a sensitivity analysis was performed by changing the statistical analysis method to Welch’s t-test.

Statistical analysis

The Shapiro–Wilk test was used to test the normality of the distribution of the age and career data of the participating pharmacists as continuous variables. Continuous variables were expressed as medians and means, whereas categorical data were expressed as absolute values and percentages. Welch’s t-test was used to analyze the means of continuous variables, and Mann–Whitney U test was used to analyze the medians [15,16]. Pharmacists with missing study data were excluded from the univariate analyses. However, when ≥20% of the data were missing, multiple imputations were planned using chained equations to create 100 sets of corresponding data. All statistical analyses were performed using the EZR version 1.36 software (Saitama Medical Center, Jichi Medical University, Saitama, Japan) [17]. All the tests were two-tailed. Statistical significance was set at p < 0.050. As this was an exploratory study, the sample size was not calculated. Nominal p-values were used to account for the multiplicity of analyses.

Results

Characteristics of pharmacists

Ten pharmacists evaluated all the responses, and none of the evaluators met the exclusion criteria. The age (p = 0.649) and career length (p = 0.551) of the pharmacists showed normal distributions. Pharmacist characteristics are listed in Table 1.

View this table:

Table 1. Pharmacist characteristics evaluated at baseline (n = 10).

Primary outcome

All ChatGPT responses were in Japanese, eliminating the need to translate the responses. With the temperature set to 0, the median of the answers was 13.08 (interquartile range: 12.50–14.03), whereas it was 14.40 (interquartile range: 13.84–15.32) without a temperature setting, demonstrating no significant differences (p = 0.064). Answers with the temperature set at 0 had a lower rate of total scores of ≥16 (maximum: 20) than those without a temperature setting (7/37 [18.92%] vs. 17/37 [45.95%]; Fisher’s exact test, p = 0.024). The results of Mann–Whitney U test showed a significant difference between the mean scores for answers with and without the temperature being set to 0 (accuracy: 3.15 [interquartile range: 3.06–3.30] and 3.54 [interquartile range: 3.29–3.65]; p = 0.045). However, clarity, detail, and adequacy of answers were similar between groups (Table 2).

View this table:

Table 2. Comparisons of the scores of each endpoint for questions on apixaban drug information with and without the ChatGPT temperature setting (n = 74).

Results of the subgroup analysis

The university hospital pharmacist subgroup analysis yielded median values of 13.30 (interquartile range; 12.00–15.02) and 14.21 (interquartile range: 13.45–15.32) for answers with and without the temperature being set at 0, respectively, demonstrating no significant differences (p = 0.394). Hence, the subgroup analysis yielded results similar to those of the primary analysis.

Results of the sensitivity analysis

When the primary analysis method was revised to Welch’s t-test, the mean scores of the answers with and without the temperature being set to 0 were 13.45 (standard deviation; 1.51) and 14.52 (standard deviation; 1.28), respectively, which were not significantly different (p = 0.105). Thus, the sensitivity analysis yielded the same results as those of the primary analysis.

Discussion

Summary of key findings

This study yielded two important findings. First, the overall quality of ChatGPT-3.5’s responses in terms of accuracy, clarity, detail, and adequacy was consistent, regardless of the temperature setting, as evidenced by similar results across the primary endpoints, subgroup analyses, and sensitivity analyses. Second, responses with a temperature setting of 0 were less likely to have a total score ≥16 than those with no temperature setting (18.92% vs. 45.95%, Fisher’s exact test, p = 0.024). These findings provide a basis for further discussion on the implications of temperature settings on AI-generated drug information.

Although a previous study has shown that AI-based chatbots, including this version of ChatGPT, have robust search and information integration capabilities, particularly in clinical pharmacy [18], our study found a lower percentage of high-quality responses when the temperature was set to 0 than those without temperature settings. This finding is particularly interesting because it suggests a subtle effect of temperature setting on response quality, which has not been previously explored. Furthermore, it underscores the life-threatening consequences of using medication based on incorrect information. Thus, addressing and resolving this issue promptly is crucial. Additionally, users should ask detailed questions because the quality of ChatGPT answers depends on the phrasing of the questions.

Strengths and weaknesses

This study has several strengths. It contributes markedly to the field of clinical pharmacy and AI-based tools as it provides unique insights into the impact of temperature setting on the quality of pharmaceutical information provided by ChatGPT-3.5. This specific focus on temperature settings and their influence on AI response quality has not been extensively explored in previous research. Second, the study employed a robust methodology, including a clear primary endpoint and comprehensive statistical analyses.

In addition, the study included 10 pharmacists in the evaluation, whose diversity provided a broader perspective, thereby reducing bias and enhancing the representativeness of the ratings. During the evaluation, the evaluators were not informed about whether the responses from ChatGPT-3.5 were temperature-adjusted, thus reducing potential order bias. Subgroup and sensitivity analyses were performed to ensure the consistency of results for the primary endpoints, thereby enhancing result reliability. This enabled us to generalize the results across different settings and rater profiles. Moreover, this study acknowledges ChatGPT-3.5’s limitations, particularly its lack of internet search capabilities and reliance on preexisting datasets. Our critical evaluation highlights the importance of continuous updates and improvements in AI tools to ensure their effective use in healthcare settings, particularly in domains where current and accurate information, such as drug data, is crucial. Finally, this study focused on a specific drug, apixaban, allowing for a detailed and focused analysis of AI performance in providing medication information. Although this approach limits the generalizability of the findings, it enables a more in-depth understanding of AI capabilities and limitations in the context of a single, widely used medication.

The study limitations must also be acknowledged. First, evaluators were limited to pharmacists, primarily those working at university hospitals. This specific professional background may have influenced the perception and evaluation of AI-generated responses. Second, we used the Japanese version of ChatGPT-3.5, and the results may vary for other languages due to differences in language processing and available datasets in the AI model.

Interpretation

Consistent with our findings, the limitations of AI chatbots in effectively handling complex medical information have been highlighted in previous studies that have cited a lack of medicine-specific datasets and challenges in advanced reasoning [19]. Temperature settings designed to control response randomness may inadvertently affect the chatbots’ ability to access and integrate complex medical information effectively.

Although no significant difference was detected in the overall quality of responses from ChatGPT-3.5 across temperature settings, a lower percentage of high-quality responses was observed when the temperature was set to 0, thereby warranting further investigation. This emphasizes the importance of careful consideration of the AI chatbot settings in clinical applications and settings, ensuring they are optimized to provide accurate and relevant information.

The subgroup and sensitivity analyses conducted in our study provided additional insights into the robustness of ChatGPT-3.5’s responses to pharmaceutical inquiries. In the subgroup analysis limited to university hospital pharmacists, our findings remained consistent with the primary outcome. This consistency across different groups of raters reinforces the quality of ChatGPT-3.5 responses in a professional academic setting.

Furthermore, the sensitivity analysis performed using a different statistical test also supported the primary findings, showing non-significant differences in response quality. This methodological robustness enhances the credibility of our results, suggesting that the observed variance in accuracy is not a statistical anomaly but a characteristic of the AI model’s performance. However, the slight variance in accuracy observed in the primary analysis remains a matter of concern. Although this variance is not statistically significant, it could have implications in clinical settings where precise drug information is crucial. Previous studies have also indicated variability in AI responses in clinical scenarios, suggesting the need for the cautious application and continuous monitoring of AI tools in healthcare [2,20].

ChatGPT-3.5 does not have an internet search capability, constricting its ability to provide responses integrating the latest information, which is crucial in the drug information field. For example, although the package insert recommends the administration of Ondexxya in the event of life-threatening or difficult-to-staunch bleeding when consuming apixaban, no responses related to the administration of andexanet alfa (injection) (Ondexxya®; AstraZeneca, London, UK) were noted. This could be attributed to the fact that Ondexxya was not available in Japan until May 2022, and the ChatGPT data only extended until September 2021. This limitation is particularly relevant in clinical pharmacy practice where accurate and up-to-date information is paramount for patient safety. ChatGPT-3.5, however, contains limited learning data, which is a crucial factor to be considered in the field of drug information. Our findings underscore the importance of regularly updating and improving AI chatbots for their effective utilization in clinical pharmacies and healthcare settings.

Future research

Although we found no significant difference in the overall quality of responses from ChatGPT-3.5 across temperature settings, a lower percentage of high-quality responses was observed when the temperature was set to 0, suggesting the need for further investigation and careful consideration of AI chatbot settings in clinical applications. Settings that optimize the information accuracy and relevance should be provided. Future studies should focus on the inclusion of various drugs to obtain a deeper understanding of the capabilities and limitations of ChatGPT-3.5 in relation to various drug classes and their respective complexities.

Additionally, the participation of a diverse group of healthcare professionals is essential for future evaluations to obtain a broader perspective on AI performance and its utility across the healthcare ecosystem. Given the global applicability of AI tools, it is critical to conduct similar studies in various linguistic and cultural contexts. This approach will aid in understanding the impact of language processing and cultural nuances on ChatGPT-3.5, aiming to evaluate the global validity and reliability of this tool. Further investigation is also required to determine the reasons for the variation in AI response quality, particularly under different temperature settings. Understanding the mechanisms that lead to this variability will guide the development of more consistent and reliable AI tools for clinical use.

Research focusing on the impact of AI tools on patient safety is critical. Finally, studies should examine the ethical and legal aspects of AI in healthcare, particularly regarding privacy, data security, and liabilities. Understanding these implications is essential for the responsible incorporation of AI tools into clinical practice.

Conclusions

The use of temperature settings of ChatGPT-3.5 did not result in significant differences in the overall quality of responses to drug queries, specifically those related to apixaban. This suggests that ChatGPT-3.5 responses are not significantly affected by this setting. However, the variability in accuracy highlights the need for careful consideration when using this tool in clinical settings. Despite its potential as a supportive tool in pharmaceutical information retrieval, its limitations, including the lack of real-time internet access and the potential for the use of outdated information, must be acknowledged. Healthcare professionals should use ChatGPT-3.5 as a supplementary source, always verifying its output against current, evidence-based medical literature. Future research should aim to evaluate chatbot performance across a broader range of medications with larger and more diverse groups of healthcare professionals for a comprehensive understanding of the capabilities and limitations of chatbots in the context of clinical pharmacy.

Data Availability

datasets generated and/or analyzed during the current study are available in the Mendeley repository, https://data.mendeley.com/datasets/dw2993xm2n/1

https://data.mendeley.com/datasets/dw2993xm2n/1

Supporting information

S1 File. Questions entered in Japanese were translated into English for this study. S2 File. Criteria for evaluating the quality of responses used in this study.

Acknowledgments

We would like to thank Hiroki Yoshida for advice on statistical analysis and Editage (www.editage.com) for English language editing and journal submission support. The authors have authorized the submission of this manuscript through Editage.

References

1.↵
Hassani H, Silva ES. The role of ChatGPT in data science: How AI-assisted conversational interfaces are revolutionizing the field. Big Data Cogn Comput. 2023;7: 62. doi: 10.3390/bdcc7020062.
OpenUrl CrossRef
2.↵
Al-Ashwal FY, Zawiah M, Gharaibeh L, Abu-Farha R, Bitar AN. Evaluating the sensitivity, specificity, and accuracy of ChatGPT-3.5, ChatGPT-4, Bing AI, and Bard Against Conventional Drug-Drug Interactions Clinical Tools. Drug Healthc Patient Saf. 2023;15: 137–147.
OpenUrl
3.↵
Wang X, Liu XQ. Potential and limitations of ChatGPT and generative artificial intelligence in medical safety education. World J Clin Cases. 2023;11: 7935–7939. doi: 10.12998/wjcc.v11.i32.7935.
OpenUrl CrossRef
4.↵
Heck TG. What artificial intelligence knows about 70 kDa heat shock proteins, and how we will face this ChatGPT era. Cell Stress Chaperones. 2023;28: 225–229. doi: 10.1007/s12192-023-01340-1.
OpenUrl CrossRef
5.↵
Morath B, Chiriac U, Jaszkowski E, Deiß C, Nürnberg H, Hörth K, et al. Performance and risks of ChatGPT used in drug information: An exploratory real-world analysis. Eur J Hosp Pharm. 2023;31: 85–86. doi: 10.1136/ejhpharm-2023-003750.
OpenUrl Abstract
6.
Kusunose K, Kashima S, Sata M. Evaluation of the accuracy of ChatGPT in answering clinical questions on the Japanese Society of Hypertension Guidelines. Circ J. 2023;87: 1030–1033. doi: 10.1253/circj.CJ-23-0308.
OpenUrl CrossRef
7.
Lahat A, Shachar E, Avidan B, Glicksberg B, Klang E. Evaluating the utility of a large language model in answering common patients’ gastrointestinal health-related questions: Are we there yet? Diagnostics (Basel). 2023;13: 1950. doi: 10.3390/diagnostics13111950.
OpenUrl CrossRef
8.
Samaan JS, Yeo YH, Rajeev N, Hawley L, Abel S, Ng WH, et al. Assessing the accuracy of responses by the language Model ChatGPT to questions regarding bariatric surgery. Obes Surg. 2023;33: 1790–1796. doi: 10.1007/s11695-023-06603-5.
OpenUrl CrossRef
9.↵
Wagner MW, Ertl-Wagner BB. Accuracy of information and references using ChatGPT-3 for retrieval of clinical radiological information. Can Assoc Radiol J. 2024;75: 69–73. doi: 10.1177/08465371231171125.
OpenUrl CrossRef
10.↵
Open AI. API reference. [Cited 4 May 2023]. Available from: https://platform.openai.com/docs/api-reference/completions/create#completions/create-temperature.
11.↵
DRUG DISCOVERY: The 50 best-selling pharmaceuticals of 2022: COVID-19 vaccines poised to take a step back. [Cited 4 May 2023]. Available from: https://www.drugdiscoverytrends.com/50-of-2022s-best-selling-pharmaceuticals/.
12.↵
BMS. HEALTHCARE Japan: Inquiries about our products. [Cited 4 May 2023]. Available from: https://www.bmshealthcare.jp/medical/faq/pheq (in Japanese).
13.↵
Ayers JW, Poliak A, Dredze M, Leas EC, Zhu Z, Kelley JB, et al. Comparing physician and artificial intelligence chatbot responses to patient questions posted to a public social media forum. JAMA Intern Med. 2023;183: 589–596. doi: 10.1001/jamainternmed.2023.1838.
OpenUrl CrossRef PubMed
14.↵
Chen AS, Revere L, Ratanatawan A, Beck CL, Allo JA. A comparative analysis of academic and nonacademic hospitals on outcome measures and patient satisfaction. Am J Med Qual. 2019;34: 367–375. doi: 10.1177/1062860618800586.
OpenUrl CrossRef
15.↵
West RM. Best practice in statistics: Use the Welch t-test when testing the difference between two groups. Ann Clin Biochem. 2021;58: 267–269. doi: 10.1177/0004563221992088.
OpenUrl CrossRef PubMed
16.↵
1. MacFarland TW,
2. Yates JM
MacFarland TW, Yates JM. Mann–Whitney U test. In: MacFarland TW, Yates JM, editors. Introduction to nonparametric statistics for the Biological Sciences using R. Cham. Springer International Publishing; 2016. pp. 103–132.
17.↵
Kanda Y. Investigation of the freely available easy-to-use software ‘EZR’ for medical statistics. Bone Marrow Transplant. 2013;48: 452–458. doi: 10.1038/bmt.2012.244.
OpenUrl CrossRef PubMed Web of Science
18.↵
Huang X, Estau D, Liu X, Yu Y, Qin J, Li Z. Evaluating the performance of ChatGPT in clinical pharmacy: A comparative study of ChatGPT and clinical pharmacists. Br J Clin Pharmacol. 2024;90: 232–238. doi: 10.1111/bcp.15896.
OpenUrl CrossRef
19.↵
Alowais SA, Alghamdi SS, Alsuhebany N, Alqahtani T, Alshaya AI, Almohareb SN, et al. Revolutionizing healthcare: The role of artificial intelligence in clinical practice. BMC Med Educ. 2023;23: 689. doi: 10.1186/s12909-023-04698-z.
OpenUrl CrossRef
20.↵
Al-Dujaili Z, Omari S, Pillai J, Al Faraj A. Assessing the accuracy and consistency of ChatGPT in clinical pharmacy management: A preliminary analysis with clinical pharmacy experts worldwide. Res Social Adm Pharm. 2023;19: 1590–1594. doi: 10.1016/j.sapharm.2023.08.012.
OpenUrl CrossRef

View the discussion thread.

Posted June 12, 2024.

Download PDF

Data/Code

Citation Tools

Subject Area

Health Informatics

Subject Areas

All Articles

Addiction Medicine (349)
Allergy and Immunology (668)
Allergy and Immunology (668)
Anesthesia (181)
Cardiovascular Medicine (2648)
Dentistry and Oral Medicine (316)
Dermatology (223)
Emergency Medicine (399)
Endocrinology (including Diabetes Mellitus and Metabolic Disease) (942)
Epidemiology (12228)
Forensic Medicine (10)
Gastroenterology (759)
Genetic and Genomic Medicine (4103)
Geriatric Medicine (387)
Health Economics (680)
Health Informatics (2657)
Health Policy (1005)
Health Systems and Quality Improvement (985)
Hematology (363)
HIV/AIDS (851)
Infectious Diseases (except HIV/AIDS) (13695)
Intensive Care and Critical Care Medicine (797)
Medical Education (399)
Medical Ethics (109)
Nephrology (436)
Neurology (3882)
Nursing (209)
Nutrition (577)
Obstetrics and Gynecology (739)
Occupational and Environmental Health (695)
Oncology (2030)
Ophthalmology (585)
Orthopedics (240)
Otolaryngology (306)
Pain Medicine (250)
Palliative Medicine (75)
Pathology (473)
Pediatrics (1115)
Pharmacology and Therapeutics (466)
Primary Care Research (452)
Psychiatry and Clinical Psychology (3432)
Public and Global Health (6527)
Radiology and Imaging (1403)
Rehabilitation Medicine and Physical Therapy (814)
Respiratory Medicine (871)
Rheumatology (409)
Sexual and Reproductive Health (410)
Sports Medicine (342)
Surgery (448)
Toxicology (53)
Transplantation (185)
Urology (165)

[1] 1.↵
Hassani H, Silva ES. The role of ChatGPT in data science: How AI-assisted conversational interfaces are revolutionizing the field. Big Data Cogn Comput. 2023;7: 62. doi: 10.3390/bdcc7020062.
OpenUrl CrossRef

[2] 2.↵
Al-Ashwal FY, Zawiah M, Gharaibeh L, Abu-Farha R, Bitar AN. Evaluating the sensitivity, specificity, and accuracy of ChatGPT-3.5, ChatGPT-4, Bing AI, and Bard Against Conventional Drug-Drug Interactions Clinical Tools. Drug Healthc Patient Saf. 2023;15: 137–147.
OpenUrl

[3] 3.↵
Wang X, Liu XQ. Potential and limitations of ChatGPT and generative artificial intelligence in medical safety education. World J Clin Cases. 2023;11: 7935–7939. doi: 10.12998/wjcc.v11.i32.7935.
OpenUrl CrossRef

[4] 4.↵
Heck TG. What artificial intelligence knows about 70 kDa heat shock proteins, and how we will face this ChatGPT era. Cell Stress Chaperones. 2023;28: 225–229. doi: 10.1007/s12192-023-01340-1.
OpenUrl CrossRef

[5] 5.↵
Morath B, Chiriac U, Jaszkowski E, Deiß C, Nürnberg H, Hörth K, et al. Performance and risks of ChatGPT used in drug information: An exploratory real-world analysis. Eur J Hosp Pharm. 2023;31: 85–86. doi: 10.1136/ejhpharm-2023-003750.
OpenUrl Abstract

[6] 6.
Kusunose K, Kashima S, Sata M. Evaluation of the accuracy of ChatGPT in answering clinical questions on the Japanese Society of Hypertension Guidelines. Circ J. 2023;87: 1030–1033. doi: 10.1253/circj.CJ-23-0308.
OpenUrl CrossRef

[7] 7.
Lahat A, Shachar E, Avidan B, Glicksberg B, Klang E. Evaluating the utility of a large language model in answering common patients’ gastrointestinal health-related questions: Are we there yet? Diagnostics (Basel). 2023;13: 1950. doi: 10.3390/diagnostics13111950.
OpenUrl CrossRef

[8] 8.
Samaan JS, Yeo YH, Rajeev N, Hawley L, Abel S, Ng WH, et al. Assessing the accuracy of responses by the language Model ChatGPT to questions regarding bariatric surgery. Obes Surg. 2023;33: 1790–1796. doi: 10.1007/s11695-023-06603-5.
OpenUrl CrossRef

[9] 9.↵
Wagner MW, Ertl-Wagner BB. Accuracy of information and references using ChatGPT-3 for retrieval of clinical radiological information. Can Assoc Radiol J. 2024;75: 69–73. doi: 10.1177/08465371231171125.
OpenUrl CrossRef

[10] 10.↵
Open AI. API reference. [Cited 4 May 2023]. Available from: https://platform.openai.com/docs/api-reference/completions/create#completions/create-temperature.

[11] 11.↵
DRUG DISCOVERY: The 50 best-selling pharmaceuticals of 2022: COVID-19 vaccines poised to take a step back. [Cited 4 May 2023]. Available from: https://www.drugdiscoverytrends.com/50-of-2022s-best-selling-pharmaceuticals/.

[12] 12.↵
BMS. HEALTHCARE Japan: Inquiries about our products. [Cited 4 May 2023]. Available from: https://www.bmshealthcare.jp/medical/faq/pheq (in Japanese).

[13] 13.↵
Ayers JW, Poliak A, Dredze M, Leas EC, Zhu Z, Kelley JB, et al. Comparing physician and artificial intelligence chatbot responses to patient questions posted to a public social media forum. JAMA Intern Med. 2023;183: 589–596. doi: 10.1001/jamainternmed.2023.1838.
OpenUrl CrossRef PubMed

[14] 14.↵
Chen AS, Revere L, Ratanatawan A, Beck CL, Allo JA. A comparative analysis of academic and nonacademic hospitals on outcome measures and patient satisfaction. Am J Med Qual. 2019;34: 367–375. doi: 10.1177/1062860618800586.
OpenUrl CrossRef

[15] 15.↵
West RM. Best practice in statistics: Use the Welch t-test when testing the difference between two groups. Ann Clin Biochem. 2021;58: 267–269. doi: 10.1177/0004563221992088.
OpenUrl CrossRef PubMed

[16] 16.↵
MacFarland TW,
Yates JM
MacFarland TW, Yates JM. Mann–Whitney U test. In: MacFarland TW, Yates JM, editors. Introduction to nonparametric statistics for the Biological Sciences using R. Cham. Springer International Publishing; 2016. pp. 103–132.

[17] MacFarland TW,

[18] Yates JM

[19] 17.↵
Kanda Y. Investigation of the freely available easy-to-use software ‘EZR’ for medical statistics. Bone Marrow Transplant. 2013;48: 452–458. doi: 10.1038/bmt.2012.244.
OpenUrl CrossRef PubMed Web of Science

[20] 18.↵
Huang X, Estau D, Liu X, Yu Y, Qin J, Li Z. Evaluating the performance of ChatGPT in clinical pharmacy: A comparative study of ChatGPT and clinical pharmacists. Br J Clin Pharmacol. 2024;90: 232–238. doi: 10.1111/bcp.15896.
OpenUrl CrossRef

[21] 19.↵
Alowais SA, Alghamdi SS, Alsuhebany N, Alqahtani T, Alshaya AI, Almohareb SN, et al. Revolutionizing healthcare: The role of artificial intelligence in clinical practice. BMC Med Educ. 2023;23: 689. doi: 10.1186/s12909-023-04698-z.
OpenUrl CrossRef

[22] 20.↵
Al-Dujaili Z, Omari S, Pillai J, Al Faraj A. Assessing the accuracy and consistency of ChatGPT in clinical pharmacy management: A preliminary analysis with clinical pharmacy experts worldwide. Res Social Adm Pharm. 2023;19: 1590–1594. doi: 10.1016/j.sapharm.2023.08.012.
OpenUrl CrossRef

Effects of temperature settings on information quality of ChatGPT-3.5 responses: A prospective, single-blind, observational cohort study

Abstract

Introduction

Methods

Ethics approval

Consent to participate

Study design

Eligibility criteria of the pharmacists

Creating questions for ChatGPT

Evaluation of ChatGPT’s responses

Main analysis

Subgroup analysis

Sensitivity analysis

Statistical analysis

Results

Characteristics of pharmacists

Primary outcome

Results of the subgroup analysis

Results of the sensitivity analysis

Discussion

Summary of key findings

Strengths and weaknesses

Interpretation

Future research

Conclusions

Data Availability

Supporting information

Acknowledgments

References

Citation Manager Formats

Subject Area