Combining structured and unstructured data for predictive models: a deep learning approach

Dongdong Zhang, Changchang Yin, Jucheng Zeng, Xiaohui Yuan, Ping Zhang
doi: https://doi.org/10.1101/2020.08.10.20172122
Dongdong Zhang
1 Department of Biomedical Informatics, The Ohio State University, 1800 Cannon Drive, 43210 Columbus, Ohio, USA
2 School of Computer Science and Technology, Wuhan University of Technology, 430070 Wuhan, Hubei, China
Changchang Yin
3 Department of Computer Science and Engineering, The Ohio State University, 2015 Neil Ave, 43210 Columbus, Ohio, USA
Jucheng Zeng
1 Department of Biomedical Informatics, The Ohio State University, 1800 Cannon Drive, 43210 Columbus, Ohio, USA
2 School of Computer Science and Technology, Wuhan University of Technology, 430070 Wuhan, Hubei, China
Xiaohui Yuan
2 School of Computer Science and Technology, Wuhan University of Technology, 430070 Wuhan, Hubei, China
Ping Zhang
1 Department of Biomedical Informatics, The Ohio State University, 1800 Cannon Drive, 43210 Columbus, Ohio, USA
3 Department of Computer Science and Engineering, The Ohio State University, 2015 Neil Ave, 43210 Columbus, Ohio, USA
For correspondence: zhang.10631@osu.edu

Abstract

Background The broad adoption of Electronic Health Records (EHRs) provides great opportunities to conduct health care research and solve various clinical problems in medicine. With recent advances and success, methods based on machine learning and deep learning have become increasingly popular in medical informatics. However, while many research studies use temporal structured data for predictive modeling, they typically neglect potentially valuable information in unstructured clinical notes. Integrating heterogeneous data types across EHRs through deep learning techniques may help improve the performance of prediction models.

Methods In this research, we propose two general-purpose multi-modal neural network architectures that enhance patient representation learning by combining sequential unstructured notes with structured data. The proposed fusion models use document embeddings to represent long clinical note documents, either convolutional neural networks or long short-term memory networks to model the sequential clinical notes and temporal signals, and one-hot encoding to represent static information. The concatenated representation is the final patient representation, which is used to make predictions.
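
As a rough illustration of the concatenation step described in the Methods, the sketch below builds one patient vector from static information, temporal signals, and per-note document embeddings. It is not the authors' implementation (see the repository linked under Availability for that): the function names, field names, and toy values are hypothetical, and simple mean pooling over time stands in for the convolutional or long short-term memory encoders so that the example stays dependency-free.

```python
from statistics import fmean


def one_hot(value, vocabulary):
    """Return a one-hot vector marking the position of `value` in `vocabulary`."""
    return [1.0 if value == item else 0.0 for item in vocabulary]


def mean_pool(sequence):
    """Average equal-length vectors over the sequence axis.

    Assumption: this is only a stand-in for the CNN or LSTM sequence
    encoders named in the abstract, not a reproduction of them.
    """
    return [fmean(column) for column in zip(*sequence)]


def fuse_patient(static, vocabularies, temporal_signals, note_embeddings):
    """Concatenate static, temporal, and note representations into one vector."""
    static_vec = []
    for field, vocabulary in vocabularies.items():
        static_vec += one_hot(static[field], vocabulary)
    temporal_vec = mean_pool(temporal_signals)  # structured time series
    notes_vec = mean_pool(note_embeddings)      # sequential clinical notes
    return static_vec + temporal_vec + notes_vec


# Hypothetical toy patient; none of these values come from MIMIC-III.
vocabularies = {
    "gender": ["F", "M"],
    "admission_type": ["ELECTIVE", "EMERGENCY", "URGENT"],
}
static = {"gender": "F", "admission_type": "EMERGENCY"}
temporal_signals = [[80.0, 98.0], [90.0, 96.0]]          # 2 time steps x 2 signals
note_embeddings = [[0.25, 0.5, 0.75], [0.75, 0.0, 0.25]]  # 2 notes x 3 dimensions

patient_vector = fuse_patient(static, vocabularies, temporal_signals, note_embeddings)
print(len(patient_vector))
```

In a trained model the pooled vectors would come from learned encoders and the fused vector would feed a prediction layer; the point here is only that each modality is encoded separately and then joined by concatenation.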

Results We evaluate the performance of the proposed models on three risk prediction tasks (in-hospital mortality, 30-day hospital readmission, and long length of stay prediction) using data derived from the publicly available Medical Information Mart for Intensive Care III (MIMIC-III) dataset. Our results show that, by combining unstructured clinical notes with structured data, the proposed models outperform other models that use either unstructured notes or structured data alone.
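
To make the three prediction targets concrete, the following sketch derives one binary label per task for a single admission. It rests on stated assumptions: the record fields are hypothetical rather than actual MIMIC-III column names, the next admission is assumed to start after discharge, and the 7-day cutoff for a long stay is a placeholder parameter because the abstract does not state the threshold used.

```python
from datetime import datetime, timedelta


def derive_labels(admission, next_admit_time=None, long_stay_days=7):
    """Derive the three binary task labels for one hospital admission.

    `long_stay_days` is an assumed, adjustable cutoff; the abstract does
    not specify the threshold that defines a long length of stay.
    """
    stay = admission["discharge_time"] - admission["admit_time"]
    readmitted = (
        next_admit_time is not None
        and next_admit_time - admission["discharge_time"] <= timedelta(days=30)
    )
    return {
        "in_hospital_mortality": int(admission["died_in_hospital"]),
        "readmission_30d": int(readmitted),
        "long_length_of_stay": int(stay > timedelta(days=long_stay_days)),
    }


# Hypothetical admission record for illustration only.
admission = {
    "admit_time": datetime(2020, 1, 1, 8, 0),
    "discharge_time": datetime(2020, 1, 10, 8, 0),
    "died_in_hospital": False,
}

labels = derive_labels(admission, next_admit_time=datetime(2020, 1, 25, 8, 0))
labels_no_return = derive_labels(admission)
print(labels)
```

Each task is then an independent binary classification problem over the same patient representation.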

Conclusions The proposed fusion models learn better patient representations by combining structured and unstructured data. Integrating heterogeneous data types across EHRs helps improve the performance of prediction models and reduce errors.

Availability The code for this paper is available at: https://github.com/onlyzdd/clinical-fusion.

Competing Interest Statement

PZ is a member of the editorial board of BMC Medical Informatics and Decision Making. The authors declare that they have no other competing interests.

Funding Statement

This project was funded in part under a grant with Lyntek Medical Technologies, Inc.

Author Declarations

I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.

Yes

The details of the IRB/oversight body that provided approval or exemption for the research described are given below:

This study uses the MIMIC-III dataset. We are using the MIMIC IRB. This study was approved by the Institutional Review Boards of Beth Israel Deaconess Medical Center (Boston, MA, USA) and the Massachusetts Institute of Technology (Cambridge, MA, USA). The requirement for individual patient consent was waived because the study did not impact clinical care and all protected health information was de-identified. De-identification was performed in compliance with Health Insurance Portability and Accountability Act (HIPAA) standards in order to facilitate public access to MIMIC-III. Deletion of protected health information (PHI) from structured data sources (e.g., database fields that provide patient name or date of birth) was straightforward.

All necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived.

Yes

I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).

Yes

I have followed all appropriate research reporting guidelines and uploaded the relevant EQUATOR Network research reporting checklist(s) and other pertinent material as supplementary files, if applicable.

Yes

Data Availability

The MIMIC-III database analyzed in this study is available from the PhysioNet repository.

https://mimic.physionet.org/about/mimic

Copyright 
The copyright holder for this preprint is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license.
Posted August 14, 2020.
Combining structured and unstructured data for predictive models: a deep learning approach
Dongdong Zhang, Changchang Yin, Jucheng Zeng, Xiaohui Yuan, Ping Zhang
medRxiv 2020.08.10.20172122; doi: https://doi.org/10.1101/2020.08.10.20172122

Subject Area

  • Health Informatics