Large-language-model-based 10-year risk prediction of cardiovascular disease: insight from the UK biobank data
==============================================================================================================

* Changho Han
* Dong Won Kim
* Songsoo Kim
* Seng Chan You
* SungA Bae
* Dukyong Yoon

## Abstract

**Background** Conventional cardiovascular risk prediction models provide insights into population-level risk factors and have been widely adopted in clinical practice. However, these models have limited generalizability and flexibility. Large language models (LLMs) have demonstrated remarkable proficiency for use in various industries.

**Methods** In this study, we have investigated the feasibility of Large Language Models (LLMs) such as ChatGPT-3.5, ChatGPT-4, and Bard for predicting 10-year cardiovascular risk of a patient. We used data from the UK Biobank Cohort, a major biomedical database in the UK, and the Korean Genome and Epidemiology Study (KoGES), a large-scale prospective study in Korea, for additional validation and multi-institutional research. These databases provided a wide array of information including age, sex, medical history, lipid profile, blood pressure, and physical measurement. Based on these data, we generated language sentences for individual analysis and input these into the LLM to derive results. The performance of the LLMs was then compared with the Framingham Risk Score (FRS), a conventional risk prediction model, using this real-world data. We confirmed the model performance of both the LLMs and FRS, evaluating their accuracy, sensitivity, specificity, Positive Predictive Value (PPV), Negative Predictive Value (NPV), and F1 score. Their performance in predicting 10-year cardiovascular risk was compared through Kaplan-Meier survival analysis and Cox-hazard ratio analysis.

**Findings** GPT-4 achieved performance comparable to the FRS in cardiovascular risk prediction in both the UK Biobank {accuracy (0·834 vs· 0·773) and F1 score (0·138 vs· 0·132)} and KoGES {accuracy (0·902 vs· 0·874)}. The Kaplan–Meier survival analysis of GPT-4 demonstrated distinct survival patterns among groups, which revealed a strong association between the GPT risk prediction output and survival outcomes. The additional analysis of limited variables using GPT-3·5 indicated that ChatGPT’s prediction performance was preserved despite the omission of a few variables in the prompt, especially without physical measurement data

**Interpretation** This study proposed that ChatGPT can achieve performance comparable to conventional models in predicting cardiovascular risk. Furthermore, ChatGPT exhibits enhanced accessibility, flexibility, and the ability to provide user-friendly outputs. With the evolution of LLMs, such as ChatGPT, studies should focus on applying LLMs to various medical scenarios and subsequently optimizing their performance.

## Introduction

Cardiovascular disease (CVD) is the leading cause of morbidity and mortality worldwide, accounting for a considerable proportion of healthcare costs and posing a substantial public health risk1. The accurate and timely prediction of an individual’s risk of developing CVD can facilitate early intervention and prevention strategies, which reduces the incidence and impact of CVD2. Although conventional CVD risk prediction models, such as the Framingham risk score (FRS)3, American College of Cardiology/American Heart Association (ACC/AHA) Pooled Cohort Equations4, and the QRISK3 score5 provide insights into population-level risk factors and have been widely adopted in clinical practice, these models have several limitations. First, these models have limited generalizability to diverse populations with varying demographic, clinical, and genetic characteristics6. Second, conventional models may not incorporate novel risk factors or consider complex interactions between risk factors, leading to potential underestimation or risk overestimation7. Third, the implementation of conventional risk models can be challenging because of complex calculations and variable requirements, which hinders their use in clinical settings8. Finally, these risk models do not have adequate personalization and rely on population-level data, which may not accurately capture individual-level variations in risk factors. This phenomenon limits their ability to provide tailored risk assessments for patients9.

Advances in artificial intelligence (AI) can overcome these limitations10. Large language models (LLMs), particularly the generative pretrained transformer 4 (GPT-4) model developed by OpenAI, exhibit remarkable proficiency in producing human-like languages, and have potential for application in other industries11, 12. However, despite their growing popularity, reliability concerns severely affect the use of LLMs in the medical field, which requires precise and accurate information. A potential problem with LLMs is “AI hallucination.” This phenomenon occurs when AI confidently generates an impressive-sounding response that may not be justified by its training data or may even be factually incorrect13. The presence of AI hallucinations raises reliability and accuracy concerns on information produced by these models, particularly in domains such as medicine, which requires precise and trustworthy information. Although such problems have been reduced in GPT412, 14, only a few studies have quantified or analyzed this topic.

Although the use of language models in the medical field has attracted considerable attention, limited quantitative evaluation of their performance and accuracy has been conducted in specific medical tasks15. Therefore, we evaluated LLMs in predicting CVD risk for 10 years and compared their performance with that of conventional risk prediction models using UK Biobank and Korean Genome and Epidemiology Study (KoGES) data16, 17.

## Methods

### Data source and outcome assessment

We used data from the UK Biobank cohort, a large-scale biomedical database of UK general population. Established in 2006, the UK Biobank cohort is one of major international health resources that has collected extensive data and biological samples from approximately 500,000 participants aged between 40 and 69 years at the time of assessment. We used the UK Biobank database to extract data pertaining to age, sex, diabetes diagnosed by a doctor, blood pressure medication, smoking status, total cholesterol, high density lipoprotein (HDL) cholesterol, low density lipoprotein (LDL) direct, triglycerides, systolic blood pressure, diastolic blood pressure, standing height, weight, date of attending the assessment center, and date of death (Supplementary Table 1). We excluded cases with any missing values among these variables, except death date and ethnicity. After exclusion, 10,000 individuals were selected through random sampling. Additionally, individuals who previously experienced major cardiovascular adverse events (MACE) were excluded from the study (Fig. 1).

![Fig. 1 |](http://medrxiv.org/http://medrxiv.stage.highwire.org/content/medrxiv/early/2023/05/24/2023.05.22.23289842/F1.medium.gif)

[Fig. 1 |](http://medrxiv.org/content/early/2023/05/24/2023.05.22.23289842/F1)

Fig. 1 | Selection of study population.
MACE: major adverse cardiovascular events.

The FRS was originally developed for predicting coronary heart disease (CHD) but has since been evolved for use in forecasting not only CHD but also cerebrovascular disease, peripheral artery disease (PAD), and heart failure (HF)3. Using this approach, we assessed patient outcomes using MACE, which represents the most fatal and predominant occurrence of CVD. MACE is defined as follows: it was defined by the earliest recorded event of myocardial infarction (International Classification of Diseases [ICD]9 codes 410, 411·0, 412, 429·79, or ICD10 codes I21, I22, I23, I24.1, I25.2 or UK Biobank Self Report field 20002 codes 1075) or ischemic stroke (ICD9 codes 434, 436 or ICD10 codes I63, I64 or UK Biobank Self Report field 20002 codes 1583)18. The outcome was obtained through a category called an algorithmically defined outcome, which included information on the likely instances of specific health issues, derived from the algorithmic integration of coded data from the UK Biobank’s initial assessment data compilation (incorporating data from participants regarding their self-reported medical conditions, surgeries, and medications), in conjunction with associated data from hospital admissions (diagnoses and procedures) and death records.

In addition to the UK Biobank dataset, we used KoGES data as an additional validation cohort. The KoGES is a large-scale prospective study designed to investigate the genetic and environmental factors contributing to chronic diseases in the Korean population16. We used baseline data from the KoGES cohort collected between 2001 and 2002 to extract variables analogous to those used in the UK Biobank. These variables included age, sex, diagnosis of diabetes by a physician, blood pressure medication use, smoking status, total cholesterol, HDL, triglycerides, systolic blood pressure, diastolic blood pressure, height, and weight. LDL levels were calculated based on total cholesterol, HDL, and triglyceride levels (Supplementary Table 2). In this cohort, we defined patients as those who experienced a disease event (myocardial infarction or cerebrovascular disease) at least once during the 10-year follow up, resulting in 176 patients. Detailed inclusion and exclusion criteria on KoGES population selection are described in Supplementary figure 1.

### Cardiovascular risk calculation—the conventional score (Framingham Risk Score, FRS)

The FRS is a widely recognized and well-established algorithm that is used to estimate an individual’s 10-year risk of developing CVD.3 This score considered various factors, including age, sex, blood pressure, cholesterol levels, smoking status, and diabetes. The details are listed in Supplementary Table 1 and Supplementary Table 2. In this study, we used the FRS to calculate the cardiovascular risk percentage for each individual. Based on these percentages, we classified their risks into distinct categories (low, moderate, or high) to facilitate a comprehensive understanding of their potential for developing CVD.

### Cardiovascular risk calculation—LLMs (GPT-3·5/GPT-4/bard)

To predict the incidence of CVDs using LLMs (ChatGPT and Bard), we reformatted the variables into a sentence structure, as detailed in Fig. 2, prior to feeding them into LLMs. The decision to use this conversion was based on the inherent language model nature of LLMs. This approach allowed us to specify our output to represent individual risk percentages rather than extensive text narratives. Furthermore, the approach enabled the systematic classification of risk into low, moderate, and high categories within the provided output. Information on each participant (age, sex, diabetes, hypertension, smoking status, total cholesterol, LDL cholesterol, HDL cholesterol, triglycerides, systolic blood pressure, diastolic blood pressure, and body mass index [BMI, calculated from height and weight]) was provided to the LLMs, and the 10-year CVD risk percentage was extracted using regular expressions from the corresponding answers. Based on the 10-year CVD risk percentage, 10% or less was classified as low risk, 10%–20% as moderate risk, and > 20% as high risk.

![Fig. 2 |](http://medrxiv.org/http://medrxiv.stage.highwire.org/content/medrxiv/early/2023/05/24/2023.05.22.23289842/F2.medium.gif)

[Fig. 2 |](http://medrxiv.org/content/early/2023/05/24/2023.05.22.23289842/F2)

Fig. 2 | Example of a ChatGPT prompt and response for risk stratification.
Tabular data extracted from the UK biobank and KoGES were organized and queried into a sentence format like the example above. The 10-year CVD risk percentage was extracted using regular expressions from the corresponding answers.

For GPT-3·5, we used the OpenAI ChatGPT API (GPT-3.5-turbo, March 23 version) in a Python environment to streamline the extraction of results. However, in the cases of GPT-4 and Bard, the lack of open-source APIs comparable to GPT-3·5 necessitated an alternative approach. GPT-4 and Bard enable accounts operating in online environments to iteratively input data and generate output text for each new chat instance.

### Model comparison between scoring systems

In the proposed methodology, the statistical significance of ChatGPT-3·5, ChatGPT-4, Bard, and FRS was investigated by calculating Pearson correlation coefficients. This numerical measure was used to assess the linearity between the output of these models and the observed data, providing an indication of both the strength and direction of these relationships.

### Performance evaluation

To assess the performance of each model, we calculated sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), and F1 score. Sensitivity details the ability of the model to correctly identify true-positive cases, whereas specificity evaluates the accuracy of the model in identifying true-negative cases. PPV represents the proportion of true-positive cases among predicted positives, whereas NPV denotes the proportion of true-negative cases among predicted negatives. The F1 score is the harmonic mean of sensitivity and PPV, which provides a single metric for model performance, particularly in situations with imbalanced class distributions.

### LLM model performance using limited information

To evaluate the robustness of the LLMs in an environment where all input data cannot be investigated, we conducted additional experiments by constructing prompts using limited information and then querying GPT-3·5 with the UK Biobank cohort. This evaluation involved the use of an-omitting section, in which particular categories of patient data were excluded from the existing prompt. In the first experiment, data related to patient history (history of diabetes, blood pressure medication, and smoking) were excluded. In the second experiment, the data related to lipid profiles (total cholesterol, HDL cholesterol, LDL cholesterol, and triglycerides) were excluded. In the third experiment, data related to physical measurements (blood pressure and BMI) were excluded. These defined groupings are the foundation for analyzing the model performance under various conditions.

### Statistical analysis

To assess the statistical significance of differences in baseline characteristics among the risk groups, first, we performed a normality test using the Shapiro–Wilk method. After determining that our data did not satisfy the conditions of normality, we proceeded with nonparametric tests. We used the chi-squared test and Kruskal–Wallis test. By incorporating these tests into our analysis, we evaluated the statistical significance of the differences in baseline characteristics among the risk groups. Here, P-value < 0·05 was considered as significant in all tests.

The Kaplan–Meier method was applied to plot survival curves for the low-, moderate-, and high-risk groups based on 10-year mortality data from the UK Biobank. Furthermore, we used the Cox proportional hazards model to compare the survival function associated with MACE within each risk group in both the LLMs and FRS. The predictors in the model were the risk categories (low, moderate, and high) derived from the scoring systems, with the low-risk group used as a reference.

## Results

The UK Biobank study included 502,396 participants aged 40–69 years at the time of assessment, recruited between 2006 and 2010 (Fig. 1). A total of 103,817 participants with missing data were excluded. Of the remaining participants, after randomly selecting 10,000 participants, 274 patients who had previously experienced MACE (Major Adverse Cardiovascular events) were excluded, leaving 9726 subjects for the analysis.

Table 1 shows the baseline characteristics of the participants and the cardiovascular risk scores derived from the LLMs when grouped by GPT-4 category. Among a total of 9726 individuals for analysis, the participants had an overall median age of 58 years (IQR 50–63) with 4359 (44·8%) men and 331 (3·4%) experienced MACE within 10 years. When grouped by the GPT-4 category, 4222 individuals were classified as low-risk, 3957 as moderate risk, and 1547 as high risk. The higher-risk group had older individuals, a higher proportion of men, higher incidence of 10-year MACE, more diabetes mellitus, received more antihypertensive treatment, smoked more, more unfavorable lipid profiles, and higher blood pressure and BMI (all p < 0·001).

View this table:
[Table 1 |](http://medrxiv.org/content/early/2023/05/24/2023.05.22.23289842/T1)

Table 1 | 
Baseline characteristics (grouped by ChatGPT-4.0 category) and derived cardiovascular risk scores from large language models (LLMs)

Table 2 presents the performance comparison of the scores (derived from the GPT-4, GPT-3·5, Bard, and Framingham risk scores) in predicting 10-year MACE. The 10-year MACE performance prediction for people classified as high risk in each scoring system were detailed. In the UK Biobank cohort, theGPT-4 score had the highest accuracy of 0·834, specificity of 0·849, PPV of 0·084, and F1 score of 0·849, 0·084, and 0·138, respectively. The GPT-3·5 score had the highest sensitivity (0·598) and NPV (0·980). Overall, the performance of the GPT-4 score was comparable to that of the Framingham risk score, whereas the Bard score exhibited the worst performance. In the KoGES cohort, the GPT-4 score had the highest accuracy of 0·902 and specificity of 0·926. Table 3 details the clinical examples of the participants’ data, Framingham risk scores calculated using the data, and risk scores derived from LLMs. Table 4 shows the LLM model performance using limited information. GPT-3·5’s prediction performance was preserved despite omitting a few variables in the prompt, particularly without physical measurement data.

View this table:
[Table 2 |](http://medrxiv.org/content/early/2023/05/24/2023.05.22.23289842/T2)

Table 2 | 
Performance comparison of Framingham, Bard, and ChatGPT Risk Score

View this table:
[Table 3 |](http://medrxiv.org/content/early/2023/05/24/2023.05.22.23289842/T3)

Table 3 | 
Examples of participants’ data and the scores derived from LLMs

View this table:
[Table 4 |](http://medrxiv.org/content/early/2023/05/24/2023.05.22.23289842/T4)

Table 4 | 
LLM model performance using limited information

Fig. 3 displays scatterplots and Pearson correlation coefficients (Pearson’s r) for the different scoring systems. GPT-4 had the highest correlation with the Framingham risk score (Pearson’s r = 0·753), followed by GPT-3·5 (Pearson’s r = 0·709), and Bard (Pearson’s r = 0·446). Pearson’s r between GPT-4 and GPT-3·5 was 0·626.

![Fig. 3 |](http://medrxiv.org/http://medrxiv.stage.highwire.org/content/medrxiv/early/2023/05/24/2023.05.22.23289842/F3.medium.gif)

[Fig. 3 |](http://medrxiv.org/content/early/2023/05/24/2023.05.22.23289842/F3)

Fig. 3 | Scatterplots and Pearson correlation coefficient for various scoring methods.
GPT-4 and Framingham risk score exhibit satisfactory correlation between each other. All pairs show statistically significant correlation.

Fig. 4 displays the Kaplan–Meier curves stratified by risk using the different scoring systems. All pairwise comparisons (with or without Bonferroni correction) between curves using the log-rank test were statistically significant. Fig. 5 and Supplementary Table 3 show the hazard ratios (HRs) for 10-year MACE of the moderate- and high-risk groups compared to the low-risk group in each scoring system using the Cox proportional hazards model. The HRs of GPT-4 were comparable to that of the Framingham risk score (GPT-4 moderate risk HR 2·94, 95% CI 2·15-4·02, GPT-4 high risk HR 6·81, 95% CI 4·96-9·36, Framingham moderate risk HR 3·17, 95% CI 2·27-4·45, Framingham high risk HR 6·96, 95% CI 5·05-9·60). The HRs of GPT-3·5 were 2·44 (95% CI 1·70-3·50) for moderate risk and 5·05 (95% CI 3·64-7·00) for high risk, and the HRs of Bard were 1·80 (95% CI 1·32-2·47) for moderate risk and 2·84 (95% CI 2·09-3·87) for high risk.

![Fig. 4 |](http://medrxiv.org/http://medrxiv.stage.highwire.org/content/medrxiv/early/2023/05/24/2023.05.22.23289842/F4.medium.gif)

[Fig. 4 |](http://medrxiv.org/content/early/2023/05/24/2023.05.22.23289842/F4)

Fig. 4 | Kaplan–Meier curves stratified by cardiovascular risk on LLMs and Framingham risk scoring models.
GPT-4 demonstrated distinct survival patterns among the groups, which revealed a strong association between the GPT’s risk prediction output and survival outcomes. All pairwise comparisons between curves in other models with the log-rank test were statistically significant.

![Fig. 5 |](http://medrxiv.org/http://medrxiv.stage.highwire.org/content/medrxiv/early/2023/05/24/2023.05.22.23289842/F5.medium.gif)

[Fig. 5 |](http://medrxiv.org/content/early/2023/05/24/2023.05.22.23289842/F5)

Fig. 5 | Comparison of hazard ratio between different scoring systems using Cox proportional hazards model.
The hazard ratios for 10-year MACE of moderate risk and high risk were compared with the low-risk group in each scoring system using the cox proportional hazards model. The HRs of GPT-4 were comparable to that of the Framingham risk score. MACE, major adverse cardiovascular event; HR, hazard ratio; CI, confidence interval.

## Discussion

### Summary of findings

This study compared the performance of LLMs in cardiovascular risk prediction with that of the Framingham risk model and validated the output using real-world data. The findings of this study are as follows. GPT-4 achieved performance comparable to the FRS in cardiovascular risk prediction in both the UK Biobank {accuracy (0·834 vs· 0·773) and F1 score (0·138 vs· 0·132)} and KoGES {accuracy (0·902 vs· 0·874)}. The Kaplan–Meier survival analysis of GPT-4 demonstrated distinct survival patterns among groups, which revealed a strong association between the GPT risk prediction output and survival outcomes. The additional analysis of limited variables using GPT-3·5 indicated that ChatGPT’s prediction performance was preserved despite the omission of a few variables in the prompt, especially without physical measurement data (Table 4).

### Advantages for using LLM-based cardiovascular risk prediction

Since its release, ChatGPT has attracted considerable attention worldwide because of its exceptional ability to generate plausible responses across various topics. In some cases, ChatGPT has outperformed existing prediction models, encouraging studies on the potential of ChatGPT for use in various applications11. For instance, in the financial sector, compared with conventional analysis methods, ChatGPT has demonstrated superior performance in predicting stock prices19. However, limited studies have been conducted on the use of language models in healthcare. To the best of our knowledge, this study is the first to reveal that ChatGPT exhibited performance comparable with the conventional risk score model in predicting cardiovascular risk using large real-world medical data. These findings provide insights into the potential applicability of ChatGPT in medical practice.

Despite being a language-generation model, ChatGPT exhibits performance similar to the conventional model in predicting the cardiovascular risk of patients. Conventional prediction algorithms typically rely on multivariate regression of well-established CVD risk factors3, 4, typically limiting the number of risk factors and assuming linear relationships between them with minimal or no interaction between various factors20. By contrast, ChatGPT, which derives its answers by learning from large amounts of textual datasets to generate the most probable human-like responses rather than through mathematical calculations, achieves similar values and performances as linear computation-based regression models. Furthermore, conventional models frequently rely on old cohorts and have only been validated for specific cohort groups within individual countries, which limit their applicability to a broad population21. Furthermore, widely used calculators for renowned models such as the FRS have slightly distinct formulas based on different references, which results in heterogeneous prediction methods22–24. By contrast, ChatGPT can learn from multiple guidelines and select the most suitable guideline for prediction, which enhances its generalizability across a diverse population. ChatGPT has achieved satisfactory performance in the UK Biobank, Western database, KoGES, and Asian database, which highlights the generalizability and robustness of the model in diverse populations. ChatGPT provides improved accessibility and flexibility compared with conventional prediction models. Users can easily access ChatGPT by visiting a website, without requiring any particular application. The model rapidly delivered satisfactory outcomes without strict input constraints by accepting numerous input formats. ChatGPT can understand the context and range of the input values, even when precise units are not provided25. Unlike conventional prediction models, ChatGPT provides answers with limited input data. Because ChatGPT is a model that learns existing texts and derives results from them, cardiovascular risk can be determined from existing text data that may not contain the patient’s blood test results, medical history, or physical measurements. Therefore, unlike existing models, GPT can train any combination of variables and produce outputs through various types of input combinations. Furthermore, by estimating risks without requiring physical measurements such as height, weight, or systolic blood pressure data, ChatGPT predictions could potentially decrease hospital visits, enhance convenience, and promote advancements in telemedicine.

Conventional risk-stratification guidelines tend to be complex and require precise numerical values for each risk stratification, rendering the guidelines unsuitable for use in brief outpatient settings. However, with ChatGPT, users can request outputs in a specific format, which allows faster interpretation of medical records. For instance, users can rapidly access only the necessary information by asking ChatGPT to provide a patient’s cardiovascular risk score, utilized guidelines, and corresponding risk group.

### Future impact on LLM-based research

The potential benefits of ChatGPT, as demonstrated in the present analysis, may facilitate large-scale and adaptable cardiovascular risk assessments in the general population in the near future. The findings of this study indicate that ChatGPT can compute cardiovascular risk with reasonable accuracy using only facts expressed in natural language, even in the absence of certain data. Consequently, this approach facilitates the monitoring of CVD risk in a larger population, which promotes earlier interventions and management of at-risk patients. Moreover, the features of ChatGPT observed in this study have considerable implications in both clinical practice and research. ChatGPT can enable the semantic extraction of targets of interest. The operational definitions for determining the study subjects vary across institutions and studies26. Therefore, selecting patients with consistent meanings from various institutions using conventional methods is challenging given the discrepancies in data formats and meanings. However, ChatGPT can function independently of data formats and adherence to standards. Thus, by converting existing data into text, semantically extracting the patients of interest becomes feasible. This approach facilitated the integration and analysis of disparate data from multiple institutions. For example, by leveraging the findings of this study, individuals with specific levels of cardiovascular risk can be promptly identified.

Finally, this study has a few limitations. First, the unavailability of API for GPT-4 and Bard limited our analysis to a subset of 10,000 UK Biobank participants. However, this result was sufficient to validate the findings of this study. Second, the inner workings of GPT-4 remain challenging because the model and the code of ChatGPT have not been fully disclosed, and because of the complex structure of LLM, fully explaining the working principle becomes difficult. Third, the performance of GPT-4 is yet to be extensively validated for various medical conditions, necessitating additional research to generalize our findings to other conditions such as diabetes or cancer. Studies are required to optimize the performance of GPT-4 through fine-tuning and prompt engineering of specific tasks.

LLMs, such as ChatGPT, can achieve performance comparable to conventional models in predicting cardiovascular risk. Furthermore, ChatGPT exhibits enhanced accessibility, flexibility, and ability to provide user-friendly outputs. With the continuous evolution of LLMs, such as ChatGPT, future studies should focus on applying the models to various medical scenarios and optimizing their performance.

## Supporting information

Supplementary Materials [[supplements/289842_file02.docx]](pending:yes)

## Data Availability

All data produced in the present work are contained in the manuscript.

## Funding

This research was supported by a grant of the Korea Health Technology R&D Project through the Korea Health Industry Development Institute (KHIDI), funded by the Ministry of Health & Welfare, Republic of Korea (grant number : HI22C0452).

## Contributors

Conceptualization: SungA Bae,

Methodology: Dukyong Yoon,

Validation: Seng Chan You,

Investigation: Changho Han, Dong Won Kim, Songsoo Kim,

Writing—original draft preparation: Changho Han, Dong Won Kim, Songsoo Kim,

Writing—review and editing: Dukyong Yoon, SungA Bae, Seng Chan You,

Supervision: Dukyong Yoon, SungA Bae

## Declaration of interests

None to declare

## Data sharing

Deidentified data, from both the UK biobank and KoGES studies, is publicly available on request. Further details of UK biobank and KoGES studies are available online.

## Acknowledgements

This study was conducted using data from the UK Biobank (application number: 85037). Data in this study were from the Korean Genome and Epidemiology Study (KOGES; 6635-302), National Research Institute of Health, Korea Disease Control and Prevention Agency, Republic of Korea.

*   Received May 22, 2023.
*   Revision received May 22, 2023.
*   Accepted May 24, 2023.


*   © 2023, Posted by Cold Spring Harbor Laboratory

This pre-print is available under a Creative Commons License (Attribution-NoDerivs 4.0 International), CC BY-ND 4.0, as described at [http://creativecommons.org/licenses/by-nd/4.0/](http://creativecommons.org/licenses/by-nd/4.0/)

## References

1.  1.Cardiovascular diseases (CVDs). [https://www.who.int/news-room/fact-sheets/detail/cardiovascular-diseases-(cvds)](https://www.who.int/news-room/fact-sheets/detail/cardiovascular-diseases-(cvds)).
    
    
2.  2.Piepoli, M. F. et al. 2016 European Guidelines on cardiovascular disease prevention in clinical practice. Rev. Esp. Cardiol. 69, 939 (2016).
    
    
3.  3.D’Agostino, R. B., Sr. et al. General cardiovascular risk profile for use in primary care: the Framingham Heart Study. Circulation 117, 743–753 (2008).
    
    [Abstract/FREE Full Text](http://medrxiv.org/lookup/ijlink/YTozOntzOjQ6InBhdGgiO3M6MTQ6Ii9sb29rdXAvaWpsaW5rIjtzOjU6InF1ZXJ5IjthOjQ6e3M6ODoibGlua1R5cGUiO3M6NDoiQUJTVCI7czoxMToiam91cm5hbENvZGUiO3M6MTQ6ImNpcmN1bGF0aW9uYWhhIjtzOjU6InJlc2lkIjtzOjk6IjExNy82Lzc0MyI7czo0OiJhdG9tIjtzOjUwOiIvbWVkcnhpdi9lYXJseS8yMDIzLzA1LzI0LzIwMjMuMDUuMjIuMjMyODk4NDIuYXRvbSI7fXM6ODoiZnJhZ21lbnQiO3M6MDoiIjt9) 

4.  4.Goff, D. C., Jr. et al. 2013 ACC/AHA guideline on the assessment of cardiovascular risk: a report of the American College of Cardiology/American Heart Association Task Force on Practice Guidelines. Circulation 129, S49–73 (2014).
    
    [FREE Full Text](http://medrxiv.org/lookup/ijlink/YTozOntzOjQ6InBhdGgiO3M6MTQ6Ii9sb29rdXAvaWpsaW5rIjtzOjU6InF1ZXJ5IjthOjQ6e3M6ODoibGlua1R5cGUiO3M6NDoiRlVMTCI7czoxMToiam91cm5hbENvZGUiO3M6MTQ6ImNpcmN1bGF0aW9uYWhhIjtzOjU6InJlc2lkIjtzOjE4OiIxMjkvMjVfc3VwcGxfMi9TNDkiO3M6NDoiYXRvbSI7czo1MDoiL21lZHJ4aXYvZWFybHkvMjAyMy8wNS8yNC8yMDIzLjA1LjIyLjIzMjg5ODQyLmF0b20iO31zOjg6ImZyYWdtZW50IjtzOjA6IiI7fQ==) 

5.  5.Hippisley-Cox, J., Coupland, C. & Brindle, P. Development and validation of QRISK3 risk prediction algorithms to estimate future risk of cardiovascular disease: prospective cohort study. BMJ 357, j2099 (2017).
    
    [Abstract/FREE Full Text](http://medrxiv.org/lookup/ijlink/YTozOntzOjQ6InBhdGgiO3M6MTQ6Ii9sb29rdXAvaWpsaW5rIjtzOjU6InF1ZXJ5IjthOjQ6e3M6ODoibGlua1R5cGUiO3M6NDoiQUJTVCI7czoxMToiam91cm5hbENvZGUiO3M6MzoiYm1qIjtzOjU6InJlc2lkIjtzOjE3OiIzNTcvbWF5MjNfNS9qMjA5OSI7czo0OiJhdG9tIjtzOjUwOiIvbWVkcnhpdi9lYXJseS8yMDIzLzA1LzI0LzIwMjMuMDUuMjIuMjMyODk4NDIuYXRvbSI7fXM6ODoiZnJhZ21lbnQiO3M6MDoiIjt9) 

6.  6.Kavousi, M. et al. Comparison of application of the ACC/AHA guidelines, Adult Treatment Panel III guidelines, and European Society of Cardiology guidelines for cardiovascular disease prevention in a European cohort. JAMA 311, 1416–1423 (2014).
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1001/jama.2014.2632&link_type=DOI) 
    
    [PubMed](http://medrxiv.org/lookup/external-ref?access_num=24681960&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2023%2F05%2F24%2F2023.05.22.23289842.atom) 
    
    [Web of Science](http://medrxiv.org/lookup/external-ref?access_num=000333935000017&link_type=ISI) 

7.  7.Damen, J. A. A. G. et al. Prediction models for cardiovascular disease risk in the general population: systematic review. BMJ 353, i2416 (2016).
    
    [Abstract/FREE Full Text](http://medrxiv.org/lookup/ijlink/YTozOntzOjQ6InBhdGgiO3M6MTQ6Ii9sb29rdXAvaWpsaW5rIjtzOjU6InF1ZXJ5IjthOjQ6e3M6ODoibGlua1R5cGUiO3M6NDoiQUJTVCI7czoxMToiam91cm5hbENvZGUiO3M6MzoiYm1qIjtzOjU6InJlc2lkIjtzOjE3OiIzNTMvbWF5MTZfOC9pMjQxNiI7czo0OiJhdG9tIjtzOjUwOiIvbWVkcnhpdi9lYXJseS8yMDIzLzA1LzI0LzIwMjMuMDUuMjIuMjMyODk4NDIuYXRvbSI7fXM6ODoiZnJhZ21lbnQiO3M6MDoiIjt9) 

8.  8.Smolina, K., Wright, F. L., Rayner, M. & Goldacre, M. J. Determinants of the decline in mortality from acute myocardial infarction in England between 2002 and 2010: linked national database study. BMJ 344, d8059 (2012).
    
    [Abstract/FREE Full Text](http://medrxiv.org/lookup/ijlink/YTozOntzOjQ6InBhdGgiO3M6MTQ6Ii9sb29rdXAvaWpsaW5rIjtzOjU6InF1ZXJ5IjthOjQ6e3M6ODoibGlua1R5cGUiO3M6NDoiQUJTVCI7czoxMToiam91cm5hbENvZGUiO3M6MzoiYm1qIjtzOjU6InJlc2lkIjtzOjE3OiIzNDQvamFuMjVfMi9kODA1OSI7czo0OiJhdG9tIjtzOjUwOiIvbWVkcnhpdi9lYXJseS8yMDIzLzA1LzI0LzIwMjMuMDUuMjIuMjMyODk4NDIuYXRvbSI7fXM6ODoiZnJhZ21lbnQiO3M6MDoiIjt9) 

9.  9.Van Calster, B. et al. Calibration: the Achilles heel of predictive analytics. BMC Med. 17, 230 (2019).
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1186/s12916-019-1466-7&link_type=DOI) 
    
    [PubMed](http://medrxiv.org/lookup/external-ref?access_num=31842878&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2023%2F05%2F24%2F2023.05.22.23289842.atom) 

10. 10.Rajkomar, A., Dean, J. & Kohane, I. Machine Learning in Medicine. N. Engl. J. Med. 380, 1347–1358 (2019).
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1056/NEJMra1814259&link_type=DOI) 
    
    [PubMed](http://medrxiv.org/lookup/external-ref?access_num=http://www.n&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2023%2F05%2F24%2F2023.05.22.23289842.atom) 

11. 11.Brown, T. et al. Language models are few-shot learners. Adv. Neural Inf. Process. Syst. 33, 1877–1901 (2020).
    
    
12. 12.GPT-4. [https://openai.com/research/gpt-4](https://openai.com/research/gpt-4).
    
    
13. 13.Ji, Z. et al. Survey of Hallucination in Natural Language Generation. ACM Comput. Surv. 55, 1–38 (2023).
    
    
14. 14.Lee, P., Bubeck, S. & Petro, J. Benefits, limits, and risks of GPT-4 as an AI chatbot for medicine. N. Engl. J. Med. 388, 1233–1239 (2023).
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1056/NEJMsr2214184&link_type=DOI) 
    
    [PubMed](http://medrxiv.org/lookup/external-ref?access_num=36988602&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2023%2F05%2F24%2F2023.05.22.23289842.atom) 

15. 15.Sallam, M. ChatGPT Utility in Healthcare Education, Research, and Practice: Systematic Review on the Promising Perspectives and Valid Concerns. Healthcare (Basel) 11, (2023).
    
    
16. 16.Kim, Y., Han, B.-G. & KoGES group. Cohort Profile: The Korean Genome and Epidemiology Study (KoGES) Consortium. Int. J. Epidemiol. 46, 1350 (2017).
    
    [PubMed](http://medrxiv.org/lookup/external-ref?access_num=http://www.n&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2023%2F05%2F24%2F2023.05.22.23289842.atom) 

17. 17.Sudlow, C. et al. UK biobank: an open access resource for identifying the causes of a wide range of complex diseases of middle and old age. PLoS Med. 12, e1001779 (2015).
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1371/journal.pmed.1001779&link_type=DOI) 
    
    [PubMed](http://medrxiv.org/lookup/external-ref?access_num=25826379&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2023%2F05%2F24%2F2023.05.22.23289842.atom) 

18. 18.Steinfeldt, J. et al. Neural network-based integration of polygenic and clinical information: development and validation of a prediction model for 10-year risk of major adverse cardiac events in the UK Biobank cohort. Lancet Digit Health 4, e84–e94 (2022).
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1016/S2589-7500(21)00249-1&link_type=DOI) 
    
    [PubMed](http://medrxiv.org/lookup/external-ref?access_num=35090679&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2023%2F05%2F24%2F2023.05.22.23289842.atom) 

19. 19.Lopez-Lira, A. & Tang, Y. Can ChatGPT Forecast Stock Price Movements? Return Predictability and Large Language Models. (2023) doi:10.2139/ssrn.4412788.
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.2139/ssrn.4412788&link_type=DOI) 

20. 20.Alaa, A. M., Bolton, T., Di Angelantonio, E., Rudd, J. H. F. & van der Schaar, M. Cardiovascular disease risk prediction using automated machine learning: A prospective study of 423,604 UK Biobank participants. PLoS One 14, e0213653 (2019).
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1371/journal.pone.0213653&link_type=DOI) 
    
    [PubMed](http://medrxiv.org/lookup/external-ref?access_num=31091238&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2023%2F05%2F24%2F2023.05.22.23289842.atom) 

21. 21.Hemann, B. A., Bimson, W. F. & Taylor, A. J. The Framingham Risk Score: an appraisal of its benefits and limitations. Am. Heart Hosp. J. 5, 91–96 (2007).
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1111/j.1541-9215.2007.06350.x&link_type=DOI) 
    
    [PubMed](http://medrxiv.org/lookup/external-ref?access_num=17478974&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2023%2F05%2F24%2F2023.05.22.23289842.atom) 

22. 22.Framingham risk score for hard coronary heart disease. MDCalc [https://www.mdcalc.com/calc/38/framingham-risk-score-hard-coronary-heart-disease](https://www.mdcalc.com/calc/38/framingham-risk-score-hard-coronary-heart-disease).
    
    
23. 23.UpToDate. [https://www.uptodate.com/contents/calculator-cardiovascular-risk-assessment-10-year-men-patient-education](https://www.uptodate.com/contents/calculator-cardiovascular-risk-assessment-10-year-men-patient-education).
    
    
24. 24.Bouchard, B. Framingham risk calculator - objective health. [https://objectivehealth.ca/clinicians/framingham/](https://objectivehealth.ca/clinicians/framingham/).
    
    
25. 25.Dai, D. et al. Why Can GPT Learn In-Context? Language Models Secretly Perform Gradient Descent as Meta-Optimizers. arXiv [cs.CL] (2022).
    
    
26. 26.Brant, E. B. et al. Developing a shared sepsis data infrastructure: a systematic review and concept map to FHIR. NPJ Digit Med 5, 44 (2022).