PT - JOURNAL ARTICLE
AU - Sharabiani, Mansour
AU - Mahani, Alireza
AU - Bottle, Alex
AU - Srinivasan, Yadav
AU - Issitt, Richard
AU - Stoica, Serban
TI - GenAI Exceeds Clinical Experts in Predicting Acute Kidney Injury following Paediatric Cardiopulmonary Bypass
AID - 10.1101/2024.05.14.24307372
DP - 2024 Jan 01
TA - medRxiv
PG - 2024.05.14.24307372
4099 - http://medrxiv.org/content/early/2024/09/02/2024.05.14.24307372.short
4100 - http://medrxiv.org/content/early/2024/09/02/2024.05.14.24307372.full
AB - The emergence of large language models (LLMs) offers new opportunities to leverage the often unused information in clinical text. This study examines the utility of text embeddings generated by LLMs in predicting postoperative acute kidney injury (AKI) in paediatric cardiopulmonary bypass (CPB) patients using electronic health record (EHR) text, and explores methods for explaining their output. AKI is a significant complication in paediatric CPB, and predicting it can improve patient outcomes by enabling timely interventions. We evaluate various text embedding algorithms such as Doc2Vec, top-performing sentence transformers on Hugging Face, and commercial LLMs from Google and OpenAI. We benchmark the out-of-sample predictive performance of these ‘AI models’ against a ‘baseline model’ as well as an established clinically-defined ‘expert model’. The baseline model includes patient gender, age, height, body mass index and length of operation. The majority of AI models surpass not only the baseline model but also the expert model. An ensemble of AI and clinical-expert models improves discriminative performance by nearly 23% compared to the baseline model. Consistency of patient clusters formed from AI-generated embeddings with clinical-expert clusters (measured via the adjusted Rand index and adjusted mutual information metrics) illustrates their medical validity.
We use text-generating LLMs to explain the output of embedding LLMs, e.g., by summarising the differences between AI and expert clusters, and/or by providing descriptive labels for the AI clusters. Such ‘explainability’ can increase medical practitioners’ trust in the AI applications, and help generate new hypotheses, e.g., by correlating cluster memberships with outcomes of interest.

Highlights
LLMs outperform clinical experts in predicting risk of AKI after paediatric CPB.
LLMs generate clinically plausible explanations and hypotheses using embeddings.
Successful application of LLMs in paediatric CPB suggests potential in other specialised fields.
Fine-tuning LLMs on domain data and forming ensembles of AI and clinical experts may boost accuracy.

Competing Interest Statement
The authors have declared no competing interest.

Funding Statement
This study did not receive any funding.

Author Declarations
I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained. Yes
The details of the IRB/oversight body that provided approval or exemption for the research described are given below: Ethics committee of Great Ormond Street Hospital for Children, London gave ethical approval for this work (audit number 3045).
I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals. Yes
I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov.
I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance). Yes
I have followed all appropriate research reporting guidelines, such as any relevant EQUATOR Network research reporting checklist(s) and other pertinent material, if applicable. Yes

All data produced in the present study are available upon reasonable request to the authors.

Acute Kidney Injury (AKI): A sudden decrease in kidney function, often occurring after surgery, particularly in paediatric patients undergoing cardiopulmonary bypass (CPB).
Adjusted Mutual Information (AMI): A measure of agreement between two clusterings, adjusted for chance, based on the mutual information between the clusterings.
Adjusted Rand Index (ARI): A metric used to measure the similarity between two data clusterings, adjusted for the chance grouping of elements.
Area Under the Receiver Operating Characteristic Curve (AUC): A performance measurement for classification models at various threshold settings, indicating the ability of the model to distinguish between classes.
Bag-of-Codes (BoC): A text embedding technique where each medical code in a patient’s record is represented as a binary indicator in a vector.
Cardiopulmonary Bypass (CPB): A technique used during heart surgery where a machine temporarily takes over the function of the heart and lungs, allowing surgeons to operate on a still heart.
Cross-Validation (CV): A statistical method used to estimate the performance of machine learning models, where the data is split into multiple folds, and the model is trained and validated on different folds.
Doc2Vec: A text embedding technique that learns distributed representations of documents, allowing for the transformation of entire documents into fixed-length vectors.
Ensemble Model: A machine learning technique that combines the predictions of multiple models to improve accuracy and robustness.
Explainability: Techniques used to interpret and understand the predictions made by complex machine learning models, often to increase trust and provide insights into the decision-making process.
Fine-Tuning: The process of adjusting a pre-trained model on a new dataset, typically with a smaller learning rate, to adapt the model to a specific task or domain.
Hyperparameters: Parameters of a machine learning model that are set before training and control the learning process, such as the number of clusters in k-means or the learning rate in neural networks.
KDIGO: Kidney Disease Improving Global Outcomes; a set of guidelines used to define and classify the severity of acute kidney injury.
Large Language Models (LLMs): Advanced machine learning models, often based on transformer architectures, that are trained on vast amounts of text data and can perform a variety of natural language processing tasks.
Partial Risk Adjustment in Surgery (PRAiS): A model used in the UK to predict 30-day mortality risk after paediatric heart surgery, incorporating various clinical variables.
Spherical K-Means: A variant of the k-means clustering algorithm that uses cosine distance instead of Euclidean distance, making it suitable for clustering high-dimensional data like text embeddings.
Text Embedding: A method of converting text into numeric vectors that capture the semantic meaning of the text, used in machine learning models for various predictive tasks.
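
The Bag-of-Codes and Adjusted Rand Index entries above can be illustrated with a minimal sketch. It is not the authors' implementation, which this record does not describe. The function names `bag_of_codes` and `adjusted_rand_index`, the placeholder codes (`CODE_A` and so on) and the cluster labels are all hypothetical, chosen only to show how a binary code vector is built and how chance-adjusted agreement between an AI clustering and an expert clustering could be computed.

```python
from collections import Counter
from math import comb


def bag_of_codes(records, vocabulary):
    """Return one binary vector per record: 1 if the code is present, else 0."""
    vectors = []
    for record in records:
        present = set(record)
        vectors.append([1 if code in present else 0 for code in vocabulary])
    return vectors


def adjusted_rand_index(labels_a, labels_b):
    """Chance-adjusted agreement between two clusterings of the same items."""
    if len(labels_a) != len(labels_b):
        raise ValueError("label lists must have equal length")
    n = len(labels_a)
    if n < 2:
        return 1.0  # no pairs to compare, so treat as perfect agreement
    # Pairs of items that share a cluster in both clusterings.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    # Pairs of items that share a cluster within each clustering separately.
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)  # agreement expected by chance
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        return 1.0  # degenerate case: both clusterings are trivial
    return (sum_cells - expected) / (max_index - expected)


# Hypothetical placeholder codes, not real clinical codes.
records = [["CODE_A", "CODE_B"], ["CODE_B"], ["CODE_C"]]
vocabulary = ["CODE_A", "CODE_B", "CODE_C"]
print(bag_of_codes(records, vocabulary))

# Hypothetical cluster assignments for four patients.
ai_clusters = [0, 0, 1, 1]
expert_clusters = [1, 1, 0, 0]  # same grouping, different label names
print(adjusted_rand_index(ai_clusters, expert_clusters))
```

Because the index depends only on which patients are grouped together, relabelling the clusters does not change the score. In practice a maintained library implementation would usually be preferred over a hand-written one.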