RT Journal Article
SR Electronic
T1 Mimicking Clinical Trials with Synthetic Acute Myeloid Leukemia Patients Using Generative Artificial Intelligence
JF medRxiv
FD Cold Spring Harbor Laboratory Press
SP 2023.11.08.23298247
DO 10.1101/2023.11.08.23298247
A1 Eckardt, Jan-Niklas
A1 Hahn, Waldemar
A1 Röllig, Christoph
A1 Stasik, Sebastian
A1 Platzbecker, Uwe
A1 Müller-Tidow, Carsten
A1 Serve, Hubert
A1 Baldus, Claudia D.
A1 Schliemann, Christoph
A1 Schäfer-Eckart, Kerstin
A1 Hanoun, Maher
A1 Kaufmann, Martin
A1 Burchert, Andreas
A1 Thiede, Christian
A1 Schetelig, Johannes
A1 Sedlmayr, Martin
A1 Bornhäuser, Martin
A1 Wolfien, Markus
A1 Middeke, Jan Moritz
YR 2023
UL http://medrxiv.org/content/early/2023/11/08/2023.11.08.23298247.abstract
AB Clinical research relies on high-quality patient data; however, obtaining big data sets is costly and access to existing data is often hindered by privacy and regulatory concerns. Synthetic data generation holds the promise of effectively bypassing these boundaries, allowing for simplified data accessibility and the prospect of synthetic control cohorts. We employed two different methodologies of generative artificial intelligence – CTAB-GAN+ and normalizing flows (NFlow) – to synthesize patient data derived from 1606 patients with acute myeloid leukemia, a heterogeneous hematological malignancy, who were treated within four multicenter clinical trials. Both generative models accurately captured distributions of demographic, laboratory, molecular and cytogenetic variables, as well as patient outcomes, yielding high performance scores regarding fidelity and usability of both synthetic cohorts (n=1606 each). Survival analysis demonstrated close resemblance of survival curves between original and synthetic cohorts. Inter-variable relationships were preserved in univariable outcome analysis, enabling explorative analysis in our synthetic data.
Additionally, training sample privacy is safeguarded, mitigating possible patient re-identification, which we quantified using Hamming distances. We provide not only a proof-of-concept for synthetic data generation in multimodal clinical data for rare diseases, but also full public access to synthetic data sets to foster further research.
Competing Interest Statement: The authors have declared no competing interest.
Funding Statement: This study did not receive any funding.
Author Declarations:
I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained. Yes
The details of the IRB/oversight body that provided approval or exemption for the research described are given below: All studies were previously approved by the Institutional Review Board of the Technical University Dresden (EK 98032010).
I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals. Yes
I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov.
I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance). Yes
I have followed all appropriate research reporting guidelines, such as any relevant EQUATOR Network research reporting checklist(s) and other pertinent material, if applicable. Yes
The final synthetic data sets generated and analyzed for the purpose of this study are publicly available at https://zenodo.org/record/8334265