Abstract
The emergence of large language models (LLMs) offers new opportunities to leverage the often unused information in clinical text. This study examines the utility of LLM-generated text embeddings for predicting postoperative acute kidney injury (AKI) in paediatric cardiopulmonary bypass (CPB) patients from electronic health record (EHR) text, and explores methods for explaining their output. AKI is a significant complication in paediatric CPB, and predicting it can substantially improve patient outcomes by enabling timely interventions. We evaluate various text embedding algorithms, including Doc2Vec, top-performing sentence transformers on Hugging Face, and commercial LLMs from Google and OpenAI. We benchmark the out-of-sample predictive performance of these ‘AI models’ against a ‘baseline model’ as well as an established, clinically defined ‘expert model’. The baseline model includes patient gender, age, height, body mass index and length of operation. The majority of AI models surpass not only the baseline model but also the expert model. An ensemble of AI and clinical-expert models improves discriminative performance by nearly 23% compared to the baseline model. The consistency of patient clusters formed from AI-generated embeddings with clinical-expert clusters, measured via the adjusted Rand index and adjusted mutual information metrics, illustrates their medical validity. We use text-generating LLMs to explain the output of embedding LLMs, e.g., by summarising the differences between AI and expert clusters and/or by providing descriptive labels for the AI clusters. Such ‘explainability’ can increase medical practitioners’ trust in AI applications and help generate new hypotheses, e.g., by correlating cluster memberships with outcomes of interest.
Highlights
LLM-based models outperform the clinical-expert model in predicting risk of AKI after paediatric CPB.
LLMs generate clinically plausible explanations and hypotheses using embeddings.
Successful application of LLMs in paediatric CPB suggests potential in other specialised fields.
Fine-tuning LLMs on domain data and forming ensembles of AI and clinical experts may boost accuracy.
Competing Interest Statement
The authors have declared no competing interest.
Funding Statement
This study did not receive any funding.
Author Declarations
I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.
Yes
The details of the IRB/oversight body that provided approval or exemption for the research described are given below:
Ethics committee of Great Ormond Street Hospital for Children, London gave ethical approval for this work (audit number 3045).
I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals.
Yes
I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).
Yes
I have followed all appropriate research reporting guidelines, such as any relevant EQUATOR Network research reporting checklist(s) and other pertinent material, if applicable.
Yes
Footnotes
2 AKI: Acute Kidney Injury; CPB: Cardiopulmonary Bypass; KDIGO: Kidney Disease Improving Global Outcomes; BDG: Broad Diagnosis Grouping; TSP: Transformed Specific Procedure
- Added sentence transformers and Google LLMs to list of benchmarked embedding algorithms
- Expanded explainability section
Data Availability
All data produced in the present study are available upon reasonable request to the authors.
10. Glossary
- Acute Kidney Injury (AKI)
- A sudden decrease in kidney function, often occurring after surgery, particularly in paediatric patients undergoing cardiopulmonary bypass (CPB).
- Adjusted Mutual Information (AMI)
- A measure of agreement between two clusterings, adjusted for chance, based on the mutual information between the clusterings.
- Adjusted Rand Index (ARI)
- A metric used to measure the similarity between two data clusterings, adjusted for the chance grouping of elements.
- Area Under the Receiver Operating Characteristic Curve (AUC)
- A performance measurement for classification models at various threshold settings, indicating the ability of the model to distinguish between classes.
- Bag-of-Codes (BoC)
- A text embedding technique where each medical code in a patient’s record is represented as a binary indicator in a vector.
- Cardiopulmonary Bypass (CPB)
- A technique used during heart surgery where a machine temporarily takes over the function of the heart and lungs, allowing surgeons to operate on a still heart.
- Cross-Validation (CV)
- A statistical method used to estimate the performance of machine learning models, where the data is split into multiple folds, and the model is trained and validated on different folds.
- Doc2Vec
- A text embedding technique that learns distributed representations of documents, allowing for the transformation of entire documents into fixed-length vectors.
- Ensemble Model
- A machine learning technique that combines the predictions of multiple models to improve accuracy and robustness.
- Explainability
- Techniques used to interpret and understand the predictions made by complex machine learning models, often to increase trust and provide insights into the decision-making process.
- Fine-Tuning
- The process of adjusting a pre-trained model on a new dataset, typically with a smaller learning rate, to adapt the model to a specific task or domain.
- Hyperparameters
- Parameters of a machine learning model that are set before training and control the learning process, such as the number of clusters in k-means or the learning rate in neural networks.
- KDIGO
- Kidney Disease Improving Global Outcomes; a set of guidelines used to define and classify the severity of acute kidney injury.
- Large Language Models (LLMs)
- Advanced machine learning models, often based on transformer architectures, that are trained on vast amounts of text data and can perform a variety of natural language processing tasks.
- Partial Risk Adjustment in Surgery (PRAiS)
- A model used in the UK to predict 30-day mortality risk after paediatric heart surgery, incorporating various clinical variables.
- Spherical K-Means
- A variant of the k-means clustering algorithm that uses cosine distance instead of Euclidean distance, making it suitable for clustering high-dimensional data like text embeddings.
- Text Embedding
- A method of converting text into numeric vectors that capture the semantic meaning of the text, used in machine learning models for various predictive tasks.
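Three of the glossary entries above (Bag-of-Codes, Adjusted Rand Index and AUC) can be made concrete with a few lines of code. The following is a minimal sketch using only the Python standard library; it is not the study's implementation. The function names, the placeholder code strings and the toy labels are assumptions introduced purely for illustration.

```python
from collections import Counter
from math import comb


def bag_of_codes(patient_codes, vocabulary):
    """Binary indicator vector: 1 if a code occurs in the record, else 0."""
    present = set(patient_codes)
    return [1 if code in present else 0 for code in vocabulary]


def adjusted_rand_index(labels_a, labels_b):
    """Chance-adjusted agreement between two clusterings of the same items."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same items")
    total_pairs = comb(len(labels_a), 2)
    if total_pairs == 0:
        return 1.0  # fewer than two items: trivially identical
    # Pairs of items placed together in both clusterings (contingency cells).
    together_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    # Pairs placed together within each clustering separately.
    together_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    together_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = together_a * together_b / total_pairs
    maximum = (together_a + together_b) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. a single cluster in both labelings
    return (together_both - expected) / (maximum - expected)


def roc_auc(y_true, scores):
    """AUC as the probability that a positive case outscores a negative one."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative case")
    # Ties between a positive and a negative score count as half a win.
    wins = sum((p > q) + 0.5 * (p == q) for p in positives for q in negatives)
    return wins / (len(positives) * len(negatives))


if __name__ == "__main__":
    # Hypothetical code vocabulary; these are placeholders, not real codes.
    vocabulary = ["code_a", "code_b", "code_c"]
    print(bag_of_codes(["code_c", "code_a"], vocabulary))

    # Toy cluster labels standing in for AI and expert clusterings.
    ai_clusters = [0, 0, 1, 1]
    expert_clusters = [1, 1, 0, 0]
    print(adjusted_rand_index(ai_clusters, expert_clusters))

    # Toy outcome labels (1 = event) and predicted risk scores.
    print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

The pairwise AUC loop is quadratic in the number of cases, which is fine for illustration; in practice a library routine would normally be used for all three quantities.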