Empowering Personalized Pharmacogenomics with Generative AI Solutions
=====================================================================

* Mullai Murugan
* Bo Yuan
* Eric Venner
* Christie M. Ballantyne
* Katherine M. Robinson
* James C. Coons
* Liwen Wang
* Philip E. Empey
* Richard A. Gibbs

## Abstract

**Objective** This study evaluates an AI assistant developed using OpenAI’s GPT-4 for interpreting pharmacogenomic (PGx) testing results, aiming to improve decision-making and knowledge sharing in clinical genetics, and to enhance patient care with equitable access.

**Methods** The AI assistant employs Retrieval Augmented Generation (RAG) combining retrieval and generative techniques. It employs a Knowledge Base (KB) comprising Clinical Pharmacogenetics Implementation Consortium (CPIC) data, with context-aware GPT-4 generating tailored responses to user queries from this KB, refined through prompt engineering and guardrails.

**Results** Evaluated against a specialized PGx question catalog, the AI assistant showed high efficacy in addressing user queries. Compared with OpenAI’s ChatGPT 3.5, it demonstrated better performance, especially in provider-specific queries requiring specialized data and citations. Key areas for improvement include enhancing accuracy, relevancy, and representative language in responses.

**Discussion** The integration of context-aware GPT-4 with RAG significantly enhanced the AI assistant’s utility. RAG’s ability to incorporate domain-specific CPIC data, including recent literature, proved beneficial. Challenges persist, such as the need for specialized genetic/PGx models to improve accuracy and relevancy and addressing ethical, regulatory, and safety concerns.

**Conclusion** This study underscores generative AI’s potential for transforming healthcare provider support and patient accessibility to complex pharmacogenomic information. While careful implementation of large language models like GPT-4 is necessary, it is clear that they can substantially improve understanding of pharmacogenomic data. With further development, these tools could augment healthcare expertise, provider productivity, and the delivery of equitable, patient-centered healthcare services.

## INTRODUCTION

Clinical Genetics is a burgeoning field that has expanded as a result of technical developments in genomics.[1,2] As a result, clinical genetic testing via the generation of whole genome DNA sequences (WGS), exome sequencing (ES) or targeted gene panels, is now commonplace. These DNA sequence data can provide both definitive diagnoses for specific, acute genetic disorders and additional information related to genetic disease risk and to a predicted response to therapeutics. However, the complexity of genetics and genomics in clinical testing poses challenges for healthcare providers in understanding test results, developing personalized care plans, and effectively communicating implications.[3–5] The shortage of genetic experts further adds to these challenges, underscoring the need for innovative approaches to improve access to and interpretation of genetic information.[6] This is especially important in pharmacogenomics where there is a high proportion of actionable results and broad application beyond specialty clinics.[7,8]

Generative AI (GenAI), comprising advanced language models such as OpenAI’s Generative Pre-trained Transformer 4 (GPT-4) and other large language models (LLMs),[9,10] holds tremendous potential for advancing clinical genetic translation, benefiting both healthcare providers and patients.[11–13] This transformative technology has the capacity to facilitate complex decision making for healthcare providers, enhancing their practice, while empowering patients with comprehensible information about their genetic test results, disease risks, and personalized therapeutic approaches. Applications of LLMs are being developed in many related arenas, including processing electronic health records,[14,15] powering healthcare chat-bots[16,17] and assisting with medical education.[18,19] In such vital contexts, developing approaches for applying LLMs responsibly and appropriately is of the utmost importance.[20]

The primary objective of this study was to explore the feasibility and potential of GenAI, specifically GPT-4, in augmenting genetic counseling and personalized care by improving the accessibility and interpretation of genetic test results. We particularly focused on pharmacogenomic testing (PGx) for predicted response to drug therapies in this study, capitalizing on the availability of open source, curated, evidence-based, peer-reviewed and standardized PGx clinical practice guidelines. Using PGx as a priming example, the study also addresses the critical task of mitigating risks associated with the adoption of GenAI and evaluating the practical implementation of safeguards to ensure patient safety. A comprehensive understanding of how GenAI can enhance personalized care, reduce disparities in accessing genetic information and enhance patient outcomes in the field of clinical genetics, can pave the way for the responsible integration of this innovative technology into clinical practice, promoting equitable access to personalized care.

## METHODS

For this study, GenAI was tailored to address a specific use case in PGx testing, with a focus on genes associated with the pharmacokinetics of statins. The objective was to develop an AI assistant that could fill knowledge and decision-making gaps in personalized care for clinical genetics, leveraging the advanced context-aware capabilities of GPT-4. The Retrieval Augmented Generation (RAG) approach, combining retrieval-based and generative methods, was adopted to provide contextually relevant and accurate answers beyond the capabilities of generative systems alone.[21] The AI assistant served as a proof of concept (POC) for PGx counseling, incorporating domain-specific guidelines.

The dataset for statins included the Clinical Pharmacogenetics Implementation Consortium (CPIC) guideline, the CPIC guideline supplement, and diplotype-phenotype translation tables,[22] the Dutch Pharmacogenomics Working Group recommendations; FDA labeling for rosuvastatin, and a recent review article[23] was used as the contextual knowledge base (KB) for the AI assistant. This KB[24] was transformed into numerical representations using an embedding language model and stored in a vector database. RAG, harnessing this converted dataset, retrieved pertinent information based on user queries from the KB using Maximal Marginal Relevance (MMR) search.[25] The retrieved information, along with the user’s question and appropriate prompts, were used to generate responses with GPT-4. The dataset, technical implementation details, code, results, and related data can be found on GitHub[26] and are represented in Figure 1.

![Figure 1:](http://medrxiv.org/http://medrxiv.stage.highwire.org/content/medrxiv/early/2024/02/27/2024.02.21.24302946/F1.medium.gif)

[Figure 1:](http://medrxiv.org/content/early/2024/02/27/2024.02.21.24302946/F1)

Figure 1: 
Relevant data corresponding to the user’s query is extracted from a dedicated knowledge base utilizing Maximal Marginal Relevance search. This information is subsequently supplied to GPT-4 as contextual data, conjoined with the user’s question and suitable prompts. GPT-4 is prompted to generate responses to the user’s inquiry based on the provided context.

Multiple strategies were employed to ensure the accuracy, relevance, language and safety of the AI assistant. A curated catalog of questions tailored to PGx testing, specifically focused on the *SLCO1B1*, *ABCG2*, and *CYP2C9* datasets and statins, was created. This catalog covered various aspects of patient care, including fundamental information, dosing guidelines, and addressing patient concerns. Utilizing this question catalog and the responses generated by GPT-4, iterative refinement and continuous evaluation were performed to fine-tune the AI assistant, particularly in the areas of prompt engineering, context management, and setting guardrails.

Prompt engineering was used to optimize the language, tone, safety, and security of the AI-generated responses. Attention to the design of prompts facilitated accuracy, personalization, and adaptability to the user’s role.

For context management, we leveraged GPT-4’s context-aware capabilities. OpenAI’s "text-embedding-ada-002" embedding model was used for similarity search of the user’s query against the KB, enabling the retrieval of appropriate context for response generation.[27] This enabled GPT-4 to generate responses that were aligned with the retrieved context. Responses were assessed for accuracy and relevancy. Additional guardrails were set by optimizing parameters such as temperature and token count. The temperature parameter was set to zero, prioritizing accuracy over novelty, ensuring that the AI-generated responses were closely aligned with the given context. Furthermore, managing the token count prevented truncation and incomplete responses, enhancing the overall reliability of the AI assistant.

To evaluate the effectiveness of these strategies and their real-world applicability, an assessment of the AI assistant’s performance was conducted. This assessment was segmented to cater to two main user groups: patients/laypersons and healthcare providers, with customized questionnaires designed to reflect the spectrum of PGx inquiries related to statin therapy from both groups. The questionnaires covered a breadth of topics such as general PGx guidance, adherence to CPIC guidelines, therapeutic implications, and the delivery of unbiased communication. To establish a baseline for evaluation, responses to these questionnaires were gathered from both the AI assistant and OpenAI’s ChatGPT 3.5, utilizing ChatGPT 3.5 as a generative model benchmark.

The evaluation was conducted by a panel of four experts, who are also co-authors (PE, CB, KR, JC), with specialized expertise in pharmacogenomics, pharmaceutical sciences, lipid metabolism, and cardiology. Utilizing a Likert scale, the panel judged responses on accuracy, relevancy, risk management, language clarity, bias neutrality, empathetic sensitivity, citation support, and hallucination limitation. The evaluation involved two distinct survey sets—one for each user group—to methodically compare the AI assistant’s responses against those from ChatGPT 3.5. The completed surveys are available as supplementary materials.

## RESULTS

### 1. Context Management

Contextual accuracy and relevance are pivotal for the AI assistant’s responses, which are significantly influenced by GPT-4’s context-awareness and its adept use of relevant information. For context retrieval, we utilized OpenAI’s "text-embedding-ada-002" embedding model, conducting a similarity search of the user’s query against the KB to source context for GPT-4. Given GPT-4’s reliance on precise context for accurate responses, the integrity of this input was paramount. A significant challenge is that, while the embedding model was largely accurate and performed exceedingly well in general language searches, it was limited in recognizing PGx terminology. For example, diplotype terms like "\*1/\*1" were not recognized as distinct genetic entities, leading to inconsistent search results and occasionally unreliable contexts.

To evaluate the embedding model, we compared its performance against a well-established CPIC ground truth[22] for PGx queries, with a focus on diplotype and phenotype recognition. This evaluation aimed to ascertain the model’s capability to accurately identify and retrieve specialized PGx information. Through the analysis of similarity and MMR searches, we assessed the model’s performance by retrieving the top 5, 10, and 20 results—referred to as ’k’ values—from the KB. These varying ’k’ values allowed us to benchmark the retrieved context against the established ground truth at different levels of search depth. The results, included in the supplementary file ’Context Retrieval Recall Metrics’, disclosed challenges in recall accuracy, especially in diplotype recognition, with recall rates ranging from 0.61 to 0.72, highlighting the embedding model’s limitations in consistently interpreting complex biomedical terms.

However, the flexibility of GPT-4’s prompt settings partially mitigated these limitations, reducing the likelihood of inaccurate or irrelevant responses.

Additional information, related data, results and code for the ground truth evaluation is available in GitHub.[28]

### 2. Impact of Prompt Engineering

To establish a baseline for performance and to assess the need for prompt engineering to ensure the accuracy, safety, and comprehensibility of the AI assistant’s responses, we first performed an initial assessment on the responses generated by context-aware GPT-4 to inquiries from healthcare providers and patients/laypersons, devoid of any additional prompts. While the model’s responses aligned well with the provided context and were accurate, there were notable deficiencies, as illustrated in the exchanges shown in Figure 2. Specifically, the responses lacked essential guardrails indicating that they were generated by an AI assistant and that they should not be directly interpreted as constituting medical advice. The inclusion of dosing guidelines in the patient’s response raised concerns about the potential for harm. Moreover, the responses did not account for the user’s role, lacked simplicity and clarity of language, neglected other relevant patient factors, and lacked reference sources for information verification. These deficiencies highlighted the need for additional methodological enhancements, to improve the safety, comprehensibility, and accuracy of the AI assistant’s responses. To bridge these gaps and improve response comprehensiveness and safety, we introduced prompts that encompassed the following key aspects:[29] 

1.  *Role and instructions for the AI assistant*: OpenAI’s system and user roles were utilized to define behavior and boundaries, with instructions tailored to the user’s role.

2.  *Context-based responses*: Emphasis was placed on using the provided context or reference text to ensure accurate and relevant responses.

3.  *Citing sources*: Relevant citations were included to promote transparency and enable users to validate the information provided by the AI assistant.

4.  *Safety measures*: Guardrails were implemented to limit hallucination and reduce risk. Other factors that could impact care were also taken into consideration.

![Figure 2:](http://medrxiv.org/http://medrxiv.stage.highwire.org/content/medrxiv/early/2024/02/27/2024.02.21.24302946/F2.medium.gif)

[Figure 2:](http://medrxiv.org/content/early/2024/02/27/2024.02.21.24302946/F2)

Figure 2: 
This figure illustrates individual queries posed to the GPT-4 AI assistant by a healthcare provider and a patient, along with its responses. The AI assistant, without role-specific prompts and guidance, shows limitations such as the absence of necessary disclaimers clarifying that the responses are not medical advice and are AI-generated. Additionally, the need for tailored language and the inclusion of potentially harmful dosing information in the patient’s response underscores the importance of context-sensitive AI communication in healthcare scenarios.

The inclusion of such tailored prompts resulted in significant improvements in the AI assistant’s responses. Notably, prompt engineering had a substantial impact on improving the responses for both provider and patient/layperson questions, as evidenced by the enhanced responses showcased in Figure 3 following the inclusion of additional prompts. The inclusion of explicit language indicating that the information provided by the AI assistant does not constitute medical advice, along with the inclusion of literature citations for healthcare providers, and the utilization of patient-friendly language, such as mapping the statin atorvastatin to its brand name Lipitor and providing clear explanations of SLCO1B1 decreased function and its effects on the patient’s prescription, exemplify the effectiveness of prompts. It should also be noted that the patient prompt instruction "You should not provide information such as prescription or dosing guidance." ensures that such information is not displayed in the patient’s response, mitigating potential harm (see Figure 3). Moreover, prompts were utilized to ensure adherence to designated roles and for safety and reliability.

![Figure 3:](http://medrxiv.org/http://medrxiv.stage.highwire.org/content/medrxiv/early/2024/02/27/2024.02.21.24302946/F3.medium.gif)

[Figure 3:](http://medrxiv.org/content/early/2024/02/27/2024.02.21.24302946/F3)

Figure 3: 
AI assistant’s response to questions from a healthcare provider and a patient, respectively, after the inclusion of appropriate roles and instructions in the prompt.

### 2.1 Language, Sensitivity and Bias

The accessibility of the AI assistant to users from diverse backgrounds, including different age groups, educational levels, genders, races, and ethnicities, was of paramount consideration. The objective was to ensure that GPT-4’s responses, encompassing language and sentiment, exhibited attributes such as friendliness, clarity, understandability, supportiveness, and empathy, while explicitly clarifying that it does not constitute medical advice. Conducting a comprehensive language and sentiment analysis on the results was beyond the scope of this study and we primarily relied on manual assessment and iteratively modified the prompt to improve the language, sensitivity, and empathy of the generated responses. Figure 4 showcases a GPT-4 response with an updated prompt, resulting in a more tailored and empathic answer in response to Patient1’s question. It is important to note that refining the prompt involved multiple iterations to elicit the desired response. This iterative process, coupled with the collection of multiple responses from GPT-4 for the same question to facilitate comparison, proved instrumental in shaping the tone and language to align with the best match to the chosen requirements. Figure 4 further underscores the nuanced sensitivity and linguistic adaptability of the responses, showcasing the AI assistant’s capability to communicate in Spanish in accordance with Patient2’s preference. Significantly, the assistant’s recognition of the patient’s distress, translated into English here for readability as “Hello! I understand that you are going through a difficult time”, manifests sensitivity, exemplifying successful empathetic prompting. This approach ensured cultural sensitivity and impartial information, while avoiding stereotyping and medical advice, and encouraging professional consultation.

![Figure 4:](http://medrxiv.org/http://medrxiv.stage.highwire.org/content/medrxiv/early/2024/02/27/2024.02.21.24302946/F4.medium.gif)

[Figure 4:](http://medrxiv.org/content/early/2024/02/27/2024.02.21.24302946/F4)

Figure 4: 
AI assistant’s responses to the questions posed by Patient1 and Patient2, following an updated prompt, resulting in more tailored and empathic responses.

## 3. Performance Evaluation

The AI assistant’s performance, post-enhancements, was critically analyzed against ChatGPT 3.5’s responses to the same set of PGx-related questionnaires. This comparison, carried out by the expert panel, focused on key criteria: 

*   ● Accuracy: The degree to which the responses align with CPIC guidelines, indicative of the reliability of information for PGx decision-making.

*   ● Relevancy: Tailored and contextually appropriate responses, meeting the nuanced needs of healthcare providers and patients/laypersons.

*   ● Risk Management: Effective incorporation of risk mitigation strategies, emphasizing patient safety.

*   ● Language & Bias: The clarity and neutrality of the responses, ensuring that the content was understandable and devoid of biases.

*   ● Sensitivity: Ability to engage with patient concerns in an empathetic manner, fostering a supportive interaction.

*   ● Citations and Guidelines: References to established publications, guidelines and research that support the responses.

*   ● Hallucination Mitigation: Limiting hallucinations (information that is fabricated, or unsupported by evidence) in the responses.

The results of the evaluation were processed by converting individual Likert scale responses for each expert into numerical values - 5 for ‘Strongly Agree’, 4 for ‘Agree’, 3 for ‘Neutral’, 2 for ‘Disagree and 1 for ‘Strongly Disagree’ - and calculating a median response for every question to represent the expert panel’s consensus. Median responses were then aggregated for each Likert scale category across criteria, creating a dataset that encapsulated response distribution for patient/layperson and provider groups, as represented in Figure 5 for both the AI assistant and ChatGPT 3.5. Weighted scores for each criterion were computed by multiplying the frequency of responses within each Likert category by their corresponding weights, ranging from 5 (’Strongly Agree’) to 1 (’Strongly Disagree’). The maximum attainable score was computed by multiplying the aggregate number of responses by the highest Likert value of 5. These scores were then normalized to percentages by dividing the weighted scores by the maximum possible score and multiplying by 100, yielding a percentage-based overview that summarized both overall and specific category performances.

![Figure 5:](http://medrxiv.org/http://medrxiv.stage.highwire.org/content/medrxiv/early/2024/02/27/2024.02.21.24302946/F5.medium.gif)

[Figure 5:](http://medrxiv.org/content/early/2024/02/27/2024.02.21.24302946/F5)

Figure 5: 
This figure presents the quantitative distribution of performance by the AI assistant (top) and ChatGPT 3.5 (bottom) in answering questions from healthcare providers and patients/laypersons. Evaluation criteria encompass accuracy, relevancy, risk management, language clarity, bias neutrality, citation support, and hallucination mitigation, assessed on a Likert scale-based rubric by an expert panel.

The performance of the AI assistant was evaluated and compared with ChatGPT 3.5 using these weighted scores, as depicted in Figure 6. For provider-focused queries (n=47), the AI assistant significantly outperformed ChatGPT 3.5, achieving 85% effectiveness versus 69%. This significant difference, underscored by a Wilcoxon Signed-Rank Test p-value of 8.11×10−20 and a Cohen’s d effect size of 0.84, indicates a large effect size.[30] Notably, the AI assistant scored higher in accuracy (85% vs. 58%), citations (80% vs. 40%), and relevancy (81% vs. 62%).

![Figure 6:](http://medrxiv.org/http://medrxiv.stage.highwire.org/content/medrxiv/early/2024/02/27/2024.02.21.24302946/F6.medium.gif)

[Figure 6:](http://medrxiv.org/content/early/2024/02/27/2024.02.21.24302946/F6)

Figure 6: 
Performance comparison of the AI assistant and ChatGPT 3.5 on key criteria for healthcare provider (top) and patient/layperson (bottom) questions. Criteria include accuracy, relevancy, risk management, language clarity, bias neutrality, citation support, and hallucination mitigation. Percentages reflect performance levels, with higher values indicating superior performance. The AI assistant demonstrates enhanced performance relative to ChatGPT 3.5 across both query types, with a particularly marked improvement in provider-specific questions.

For patient/layperson queries (n=33), the AI assistant’s performance was marginally better at 82% compared to ChatGPT 3.5’s 78%, with a smaller yet significant statistical difference (Wilcoxon Signed-Rank Test p-value of 0.000643; Cohen’s d effect size: 0.26). The AI assistant showed a slight improvement in accuracy and relevancy, but both systems performed similarly in patient communications.

Overall weighted scores for the AI assistant were 85% for providers and 82% for patients/laypersons, revealing potential areas for enhancement in accuracy, relevancy and inclusion of citations. Strengths were noted in risk assessment, language and a low incidence of hallucinations, indicating the AI assistant’s reliability in clinical communication.

Related code, input/output files, results, and visualizations, including data for Figures 5 and 6 and statistical calculations are available on GitHub.[31]

It should be noted that although GPT-4 inherently operates in a deterministic manner, the platforms facilitating GPT-4 may introduce variability. Therefore, responses used in this study might vary in subsequent queries. We also note that all data employed for the purposes of this research are synthetic; no real-time patient data were utilized.

## DISCUSSION

This study aimed to assess the potential of GenAI, specifically GPT-4, in enhancing access to and interpretation of genetic test results. We employed innovative GenAI approaches, including the integration of context-aware GPT-4 using the RAG approach, prompt engineering, and the implementation of guardrails.

The RAG approach, blending retrieval-based and generative methods, was a significant innovation that greatly enhanced the performance of the AI assistant. This method allowed the AI assistant to utilize specialized knowledge bases, such as CPIC guidelines, and to access current publications beyond the confines of GPT-4’s initial training dataset, thereby ensuring the delivery of more accurate and contextually relevant answers. In comparison, ChatGPT 3.5, primarily a generative model, lacks the capability to integrate updates or external databases after its initial training, highlighting the added value of RAG in delivering tailored and current responses.

Prompt engineering was another key innovation that greatly contributed to the effectiveness of the AI assistant. By tailoring information delivery based on user roles, such as providing detailed dosing guidelines for healthcare providers and information tailored to the understanding and needs of patients, the AI assistant facilitated more accurate, personalized, and effective interactions. Prompt engineering emphasized the importance of patient safety and the involvement of human expertise in clinical decision- making. The incorporation of guardrails further enhanced the language, tone, and safety of the AI assistant’s responses, ensuring a higher level of reliability.

The integration of these innovative approaches collectively contributed to significant improvements in the effectiveness of the AI assistant. As evidenced in Figure 6, expert evaluations showed that the AI assistant outperformed ChatGPT 3.5, particularly for healthcare provider queries, achieving an 85% overall effectiveness rating—substantially higher than ChatGPT 3.5’s 69%. Notably, there was also a reduction in hallucinations—a common challenge with AI responses—demonstrating the AI assistant’s reliability in delivering accurate information. This is attributed to RAG’s ability to draw upon specialized, up-to-date knowledge bases, yielding responses with greater accuracy, relevance, and well-supported citations. Such materials, often not included in the pre-trained data of language models such as GPT-4 or GPT-3.5, contributed to the enhanced accuracy and relevancy of the responses.

For patient/layperson queries, though exhibiting a statistically significant difference (p-value: 0.000643) the AI assistant’s performance closely paralleled that of ChatGPT 3.5, showing only marginal gains across all evaluation criteria. This outcome of near parity suggests inherent challenges in addressing a broad spectrum of general patient inquiries, particularly in the context of limited domain-specific knowledge within the KB. However, achieving outcomes comparable to ChatGPT 3.5—a chatbot developed from the GPT-3 model family, which is specifically trained and fine-tuned for conversational contexts—in areas like language clarity, risk management, and the reduction of hallucinations, underscores the AI assistant’s capability to effectively adapt to healthcare communication needs, despite the constraints posed by the existing KB.

The contrast in performance between provider-focused and patient-oriented queries further illustrates the importance of domain-specific information. Provider queries benefit from the AI assistant’s access to detailed responses supported by CPIC guidelines, enhancing its accuracy and relevancy. In contrast, the broader nature of patient queries, often lacking detailed information in the KB, leads both systems to rely on their general training data, sometimes resulting in inaccuracies or hallucinations. For instance, the expert panel noted discrepancies like the *SLCO1B1* being incorrectly identified as a metabolism gene, and not as a transporter gene – an error that could be mitigated by enriching the knowledge base with more comprehensive publications on PGx testing and gene function data.

Expert feedback emphasized the need to enhance the AI assistant’s medical terminology to be more patient/layperson-friendly. Terms like ’liver toxicity’, ’drug exposure’, and ’genotypes’ among others, were not sufficiently accessible to patients/laypersons, underscoring the importance of fine-tuning the model to better suit typical inquiries and responses. Furthermore, the AI assistant’s reading level for patient/layperson queries, documented at a Flesch-Kincaid grade of 8.5 (see GitHub for data and results),[32] approaches but does not meet the American Medical Association’s (AMA) recommended 6th to 7th-grade reading level.[33] While this represents an improvement over ChatGPT’s college-level reading grade of 13.5 for similar queries, it highlights an opportunity for further language optimization to enhance comprehension and accessibility for patients.

The evaluation also underscored the need to improve accuracy and relevance, with the AI assistant scoring in the 70s and 80s percentage range. Challenges including gaps in context retrieval and the GPT-4 model’s inherent limitations regarding specialized biomedical data highlight the importance of developing specialized biomedical language models, fine-tuned with relevant data to bolster contextual understanding and response precision.[34–37] Other limitations relate to the precise safety guardrails that are appropriate for AI tools in general. While efforts were made to implement safety guardrails for AI responses, defining and enforcing these boundaries remains complex and proper constraint outside of drug dose recommendations can be much more challenging.[38–40]

Ethical considerations and regulatory frameworks are additional, well recognized challenges for AI deployment in health care, that need to be addressed.[41–45] Here, we applied methods to reduce the propensity for language biases, inaccuracies, and potential for hallucinations; however, they will nevertheless occur at some frequency. When combined with privacy considerations that arise when data are shared in non-restricted environments in order to enable the language models to function, there are clear needs to develop additional approaches to protect patient rights and data security, and maintaining the overall safety and effectiveness of AI applications in healthcare.[46–49]

Incorporating these insights, the results of our study highlight the significant potential of the AI assistant in genetic counseling and personalized care, enhancing information accessibility for both healthcare providers and patients/laypersons. Despite the need for improvement, these findings support the AI assistant’s role in enriching patient care through advanced technology.

## CONCLUSION

This study underscores the immense potential of GenAI, particularly GPT-4, for augmenting genetic counseling and personalized care. It also highlights the challenges of improving language models and their practical performance by modulating methods and setting boundaries, in order that providers and patients are served with relevant and accurate information that is both palatable and does not overstep any ethical or regulatory boundaries.[50] Overall, it shows that these technologies can provide valuable support by addressing the challenges encountered by healthcare providers and improving accessibility for patients. While GenAI technologies are not currently ready for widespread clinical deployment, with additional development they can serve as invaluable tools that complement and enhance human expertise in delivering high-quality, equitable, and patient-centric healthcare services.

## Supporting information

Completed Expert Panel Surveys [[supplements/302946_file09.zip]](pending:yes)

Context Retrieval Recall Metrics [[supplements/302946_file10.pdf]](pending:yes)

## Data Availability

The data supporting the findings of this article can be accessed within the article, through the referenced GitHub links, and in the supplementary materials.

[https://github.com/BCM-HGSC/PGx4Statins-AI-Assistant](https://github.com/BCM-HGSC/PGx4Statins-AI-Assistant) 

*   Received February 21, 2024.
*   Revision received February 21, 2024.
*   Accepted February 27, 2024.


*   © 2024, Posted by Cold Spring Harbor Laboratory

This pre-print is available under a Creative Commons License (Attribution-NonCommercial-NoDerivs 4.0 International), CC BY-NC-ND 4.0, as described at [http://creativecommons.org/licenses/by-nc-nd/4.0/](http://creativecommons.org/licenses/by-nc-nd/4.0/)

## References

1.  Manolio TA, Chisholm RL, Ozenberger B, et al. Implementing genomic medicine in the clinic: the future is here. Genet Med 2013;15:258–67. doi:10.1038/gim.2012.157
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1038/gim.2012.157&link_type=DOI) 
    
    [PubMed](http://medrxiv.org/lookup/external-ref?access_num=23306799&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2024%2F02%2F27%2F2024.02.21.24302946.atom) 

2.  Manolio TA, Narula J, Bult CJ, et al. Genomic Medicine Year in Review: 2022. Am J Hum Genet 2022;109:2101–4. doi:10.1016/j.ajhg.2022.11.003
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1016/j.ajhg.2022.11.003&link_type=DOI) 

3.  Donohue KE, Gooch C, Katz A, et al. Pitfalls and challenges in genetic test interpretation: An exploration of genetic professionals experience with interpretation of results. Clin Genet 2021;99:638–49. doi:10.1111/cge.13917
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1111/cge.13917&link_type=DOI) 

4.  Berrios C, Hurley EA, Willig L, et al. Challenges in genetic testing: clinician variant interpretation processes and the impact on clinical care. Genet Med 2021;23:2289– 99. doi:10.1038/s41436-021-01267-x
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1038/s41436-021-01267-x&link_type=DOI) 

5.  Farmer MB, Bonadies DC, Pederson HJ, et al. Challenges and Errors in Genetic Testing: The Fifth Case Series. Cancer J 2021;27:417–22. doi:10.1097/PPO.0000000000000553
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1097/PPO.0000000000000553&link_type=DOI) 

6.  Amendola LM, Golden-Grant K, Scollon S. Scaling Genetic Counseling in the Genomics Era. Annu Rev Genomics Hum Genet 2021;22:339–55. doi:10.1146/annurev-genom-110320-121752
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1146/annurev-genom-110320-121752&link_type=DOI) 

7.  Hicks JK, El Rouby N, Ong HH, et al. Opportunity for Genotype-Guided Prescribing Among Adult Patients in 11 US Health Systems. Clin Pharmacol Ther 2021;110:179–88. doi:10.1002/cpt.2161
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1002/cpt.2161&link_type=DOI) 

8.  Verma SS, Keat K, Li B, et al. Evaluating the frequency and the impact of pharmacogenetic alleles in an ancestrally diverse Biobank population. J Transl Med 2022;20:550. doi:10.1186/s12967-022-03745-5
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1186/s12967-022-03745-5&link_type=DOI) 

9.  OpenAI. GPT-4 Technical Report. arXiv [cs.CL]. 2023.[http://arxiv.org/abs/2303.08774](http://arxiv.org/abs/2303.08774)
    
    
10. Zhao WX, Zhou K, Li J, et al. A Survey of Large Language Models. arXiv [cs.CL]. 2023.[http://arxiv.org/abs/2303.18223v11](http://arxiv.org/abs/2303.18223v11)
    
    
11. Aslam MS, Nisar S. Artificial Intelligence Applications Using ChatGPT in Education: Case Studies and Practices: Case Studies and Practices. IGI Global 2023. [https://play.google.com/store/books/details?id=4ZnUEAAAQBAJ](https://play.google.com/store/books/details?id=4ZnUEAAAQBAJ)
    
    
12. Uprety D, Zhu D, West HJ. ChatGPT-A promising generative AI tool and its implications for cancer care. Cancer 2023;129:2284–9. doi:10.1002/cncr.34827
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1002/cncr.34827&link_type=DOI) 

13. Wachter RM, Brynjolfsson E. Will Generative Artificial Intelligence Deliver on Its Promise in Health Care? JAMA 2024;331:65–9. doi:10.1001/jama.2023.25054
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1001/jama.2023.25054&link_type=DOI) 

14. Yang X, Chen A, PourNejatian N, et al. A large language model for electronic health records. NPJ Digit Med 2022;5:194. doi:10.1038/s41746-022-00742-2
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1038/s41746-022-00742-2&link_type=DOI) 

15. Jiang LY, Liu XC, Nejatian NP, et al. Health system-scale language models are all-purpose prediction engines. Nature Published Online First: 7 June 2023. doi:10.1038/s41586-023-06160-y
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1038/s41586-023-06160-y&link_type=DOI) 

16. Sezgin E, Sirrianni J,  Linwood SL. Operationalizing and  Implementing Pretrained, Large Artificial Intelligence Linguistic Models in the US Health Care System: Outlook of Generative Pretrained Transformer 3 (GPT-3) as a Service Model. JMIR Med Inform 2022;10:e32875. doi:10.2196/32875
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.2196/32875&link_type=DOI) 

17. Lee P, Bubeck S,  Petro J. Benefits, Limits, and Risks of GPT-4 as an AI Chatbot for Medicine. N Engl J Med 2023;388:1233–9. doi:10.1056/NEJMsr2214184
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1056/NEJMsr2214184&link_type=DOI) 
    
    [PubMed](http://medrxiv.org/lookup/external-ref?access_num=36988602&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2024%2F02%2F27%2F2024.02.21.24302946.atom) 

18. Lee H. The rise of ChatGPT: Exploring its potential in medical education. Anat Sci Educ Published Online First: 14 March 2023. doi:10.1002/ase.2270
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1002/ase.2270&link_type=DOI) 
    
    [PubMed](http://medrxiv.org/lookup/external-ref?access_num=. PMID: 3691&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2024%2F02%2F27%2F2024.02.21.24302946.atom) 

19. Khan RA, Jawaid M, Khan AR, et al. ChatGPT - Reshaping medical education and clinical management. Pak J Med Sci Q 2023;39:605–7. doi:10.12669/pjms.39.2.7653
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.12669/pjms.39.2.7653&link_type=DOI) 

20. Harrer S. Attention is not all you need: the complicated case of ethically using large language models in healthcare and medicine. EBioMedicine 2023;90:104512. doi:10.1016/j.ebiom.2023.104512
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1016/j.ebiom.2023.104512&link_type=DOI) 

21. Lewis P, Perez E, Piktus A, et al. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. arXiv [cs.CL]. 2020.[http://arxiv.org/abs/2005.11401](http://arxiv.org/abs/2005.11401)
    
    
22. CPIC® guideline for statins and SLCO1B1, ABCG2, and CYP2C9. [https://cpicpgx.org/guidelines/cpic-guideline-for-statins/](https://cpicpgx.org/guidelines/cpic-guideline-for-statins/) (accessed 4 Jul 2023).
    
    
23. Lamoureux F, Duflot T, French Network of Pharmacogenetics (RNPGX). Pharmacogenetics in cardiovascular diseases: State of the art and implementation-recommendations of the French National Network of Pharmacogenetics (RNPGx). Therapie 2017;72:257–67. doi:10.1016/j.therap.2016.09.017
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1016/j.therap.2016.09.017&link_type=DOI) 

24. PGx Statins KB. [https://github.com/BCM-HGSC/PGx4Statins-AI-Assistant/tree/main/data/slco1b1](https://github.com/BCM-HGSC/PGx4Statins-AI-Assistant/tree/main/data/slco1b1) (accessed 9 Dec 2023).
    
    
25. Carbonell J, Goldstein J. The Use of MMR, Diversity-Based Reranking for Reordering Documents and Producing Summaries. Published Online First: 27 June 1999. doi:10.1145/290941.291025
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1145/290941.291025&link_type=DOI) 

26. GitHub - PGx AI Assistant. [https://github.com/BCM-HGSC/PGx4Statins-AI-Assistant](https://github.com/BCM-HGSC/PGx4Statins-AI-Assistant)
    
    
27. Neelakantan A, Xu T, Puri R, et al. Text and Code Embeddings by Contrastive Pre- Training. arXiv [cs.CL]. 2022.[http://arxiv.org/abs/2201.10005](http://arxiv.org/abs/2201.10005)
    
    
28. GitHub OpenAI Ada Embedding ground truth evaluation. [https://github.com/BCM-HGSC/PGx4Statins-AI-Assistant/tree/main/groundtruth-eval/openai](https://github.com/BCM-HGSC/PGx4Statins-AI-Assistant/tree/main/groundtruth-eval/openai)
    
    
29. OpenAI platform. [https://platform.openai.com/docs/guides/gpt-best-practices](https://platform.openai.com/docs/guides/gpt-best-practices) (accessed 5 Jul 2023).
    
    
30. Sullivan GM, Feinn R. Using Effect Size-or Why the P Value Is Not Enough. J Grad Med Educ 2012;4:279–82. doi:10.4300/JGME-D-12-00156.1
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.4300/JGME-D-12-00156.1&link_type=DOI) 
    
    [PubMed](http://medrxiv.org/lookup/external-ref?access_num=23997866&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2024%2F02%2F27%2F2024.02.21.24302946.atom) 

31. PGx AI and ChatGPT 3.5 Survey Results Analysis and Visualization. Github [https://github.com/BCM-HGSC/PGx4Statins-AI-Assistant/tree/main/pgxai\_chatgpt\_results\_evaluation](https://github.com/BCM-HGSC/PGx4Statins-AI-Assistant/tree/main/pgxai_chatgpt_results_evaluation) (accessed 1 Feb 2024).
    
    
32. PGx AI assistant reading level results. [https://github.com/BCM-HGSC/PGx4Statins-AI-Assistant/tree/main/gpt4-eval/patient\_reading\_level\_assessment](https://github.com/BCM-HGSC/PGx4Statins-AI-Assistant/tree/main/gpt4-eval/patient_reading_level_assessment)
    
    
33. AMA Health Literacy. [http://www.hhvna.com/files/Courses/HealthLiteracy/Health\_Literacy\_Manual\_AMA\_Revised.pdf](http://www.hhvna.com/files/Courses/HealthLiteracy/Health\_Literacy_Manual_AMA_Revised.pdf)
    
    
34. Jin Q, Yang Y, Chen Q, et al. GeneGPT: Augmenting Large Language Models with Domain Tools for Improved Access to Biomedical Information. ArXiv Published Online First: 16 May 2023.[https://www.ncbi.nlm.nih.gov/pubmed/37131884](https://www.ncbi.nlm.nih.gov/pubmed/37131884)
    
    
35. Mahbub M, Srinivasan S, Begoli E, et al. BioADAPT-MRC: adversarial learning-based domain adaptation improves biomedical machine reading comprehension task. Bioinformatics 2022;38:4369–79. doi:10.1093/bioinformatics/btac508
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1093/bioinformatics/btac508&link_type=DOI) 

36. Lai TM, Zhai C, Ji H. KEBLM: Knowledge-Enhanced Biomedical Language Models. J Biomed Inform 2023;143:104392. doi:10.1016/j.jbi.2023.104392
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1016/j.jbi.2023.104392&link_type=DOI) 

37. Peng K, Yin C, Rong W, et al. Named Entity Aware Transfer Learning for Biomedical Factoid Question Answering. IEEE/ACM Trans Comput Biol Bioinform 2022;19:2365–76. doi:10.1109/TCBB.2021.3079339
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1109/TCBB.2021.3079339&link_type=DOI) 

38. Johnson KB, Wei W-Q, Weeraratne D, et al. Precision Medicine, AI, and the Future of Personalized Health Care. Clin Transl Sci 2021;14:86–93. doi:10.1111/cts.12884
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1111/cts.12884&link_type=DOI) 

39. Zhang Z, Wei X. Artificial intelligence-assisted selection and efficacy prediction of antineoplastic strategies for precision cancer therapy. Semin Cancer Biol 2023;90:57–72. doi:10.1016/j.semcancer.2023.02.005
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1016/j.semcancer.2023.02.005&link_type=DOI) 

40. Guo J, Hu J, Zheng Y, et al. Artificial intelligence: opportunities and challenges in the clinical applications of triple-negative breast cancer. Br J Cancer 2023;128:2141–9. doi:10.1038/s41416-023-02215-z
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1038/s41416-023-02215-z&link_type=DOI) 

41. Murdoch B. Privacy and artificial intelligence: challenges for protecting health information in a new era. BMC Med Ethics 2021;22:122. doi:10.1186/s12910-021-00687-3
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1186/s12910-021-00687-3&link_type=DOI) 

42. Haupt CE, Marks M. AI-Generated Medical Advice-GPT and Beyond. JAMA 2023;329:1349–50. doi:10.1001/jama.2023.5321
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1001/jama.2023.5321&link_type=DOI) 

43. Pujari S, Reis A, Zhao Y, et al. Artificial intelligence for global health: cautious optimism with safeguards. Bull World Health Organ 2023;101:364–364A. doi:10.2471/BLT.23.290215
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.2471/BLT.23.290215&link_type=DOI) 

44. Gerke S, Minssen T, Cohen G. Chapter 12 - Ethical and legal challenges of artificial intelligence-driven healthcare. In: Bohr A, Memarzadeh K, eds. Artificial Intelligence in Healthcare. Academic Press 2020. 295–336. doi:10.1016/B978-0-12-818438-7.00012-5
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1016/B978-0-12-818438-7.00012-5&link_type=DOI) 

45.  Redrup Hill E, Mitchell C, Brigden T, et al. Ethical and legal considerations influencing human involvement in the implementation of artificial intelligence in a clinical pathway: A multi-stakeholder perspective. Front Digit Health 2023;5:1139210. doi:10.3389/fdgth.2023.1139210
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.3389/fdgth.2023.1139210&link_type=DOI) 

46. Challen R, Denny J, Pitt M, et al. Artificial intelligence, bias and clinical safety. BMJ Qual Saf 2019;28:231–7. doi:10.1136/bmjqs-2018-008370
    
    [FREE Full Text](http://medrxiv.org/lookup/ijlink/YTozOntzOjQ6InBhdGgiO3M6MTQ6Ii9sb29rdXAvaWpsaW5rIjtzOjU6InF1ZXJ5IjthOjQ6e3M6ODoibGlua1R5cGUiO3M6NDoiRlVMTCI7czoxMToiam91cm5hbENvZGUiO3M6MzoicWhjIjtzOjU6InJlc2lkIjtzOjg6IjI4LzMvMjMxIjtzOjQ6ImF0b20iO3M6NTA6Ii9tZWRyeGl2L2Vhcmx5LzIwMjQvMDIvMjcvMjAyNC4wMi4yMS4yNDMwMjk0Ni5hdG9tIjt9czo4OiJmcmFnbWVudCI7czowOiIiO30=) 

47. Gudis DA, McCoul ED, Marino MJ, et al. Avoiding bias in artificial intelligence. Int Forum Allergy Rhinol 2023;13:193–5. doi:10.1002/alr.23129
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1002/alr.23129&link_type=DOI) 

48. Blumenthal-Barby J. An AI Bill of Rights: Implications for Health Care AI and Machine Learning-A Bioethics Lens. Am J Bioeth 2023;23:4–6. doi:10.1080/15265161.2022.2135875
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1080/15265161.2022.2135875&link_type=DOI) 

49. Ellahham S, Ellahham N, Simsekler MCE. Application of Artificial Intelligence in the Health Care Safety Context: Opportunities and Challenges. Am J Med Qual 2020;35:341–8. doi:10.1177/1062860619878515
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1177/1062860619878515&link_type=DOI) 

50. Wornow M, Xu Y, Thapa R, et al. The Shaky Foundations of Clinical Foundation Models: A Survey of Large Language Models and Foundation Models for EMRs. arXiv [cs.LG]. 2023.[http://arxiv.org/abs/2303.12961](http://arxiv.org/abs/2303.12961)