medRxiv

Multi-faceted Semantic Clustering With Text-derived Phenotypes

Luke T Slater, John A Williams, Andreas Karwath, Hilary Fanning, Simon Ball, Paul Schofield, Robert Hoehndorf, Georgios V Gkoutos
doi: https://doi.org/10.1101/2021.05.26.21257830
Luke T Slater
1College of Medical and Dental Sciences, Institute of Cancer and Genomic Sciences, University of Birmingham
2Institute of Translational Medicine, University Hospitals Birmingham, NHS Foundation Trust
6MRC Health Data Research UK (HDR UK) Midlands
9University Hospitals Birmingham NHS Foundation Trust, Edgbaston, Birmingham, UK
  • For correspondence: l.slater.1{at}bham.ac.uk
John A Williams
1College of Medical and Dental Sciences, Institute of Cancer and Genomic Sciences, University of Birmingham
2Institute of Translational Medicine, University Hospitals Birmingham, NHS Foundation Trust
9University Hospitals Birmingham NHS Foundation Trust, Edgbaston, Birmingham, UK
Andreas Karwath
1College of Medical and Dental Sciences, Institute of Cancer and Genomic Sciences, University of Birmingham
2Institute of Translational Medicine, University Hospitals Birmingham, NHS Foundation Trust
6MRC Health Data Research UK (HDR UK) Midlands
9University Hospitals Birmingham NHS Foundation Trust, Edgbaston, Birmingham, UK
Hilary Fanning
2Institute of Translational Medicine, University Hospitals Birmingham, NHS Foundation Trust
9University Hospitals Birmingham NHS Foundation Trust, Edgbaston, Birmingham, UK
Simon Ball
2Institute of Translational Medicine, University Hospitals Birmingham, NHS Foundation Trust
6MRC Health Data Research UK (HDR UK) Midlands
9University Hospitals Birmingham NHS Foundation Trust, Edgbaston, Birmingham, UK
Paul Schofield
7Dept of Physiology, Development, and Neuroscience, University of Cambridge
Robert Hoehndorf
8Computer, Electrical and Mathematical Sciences & Engineering Division, Computational Bioscience Research Center, King Abdullah University of Science and Technology
Georgios V Gkoutos
1College of Medical and Dental Sciences, Institute of Cancer and Genomic Sciences, University of Birmingham
2Institute of Translational Medicine, University Hospitals Birmingham, NHS Foundation Trust
3NIHR Experimental Cancer Medicine Centre
4NIHR Surgical Reconstruction and Microbiology Research Centre
5NIHR Biomedical Research Centre
6MRC Health Data Research UK (HDR UK) Midlands
9University Hospitals Birmingham NHS Foundation Trust, Edgbaston, Birmingham, UK

Abstract

Identification of ontology concepts in clinical narrative text enables the creation of phenotype profiles that can be associated with clinical entities, such as patients or drugs. Constructing patient phenotype profiles using formal ontologies enables their analysis via semantic similarity, in turn enabling the use of background knowledge in clustering or classification analyses. However, traditional semantic similarity approaches collapse complex relationships between patient phenotypes into a single similarity score for each pair of patients. Moreover, single scores may be based only on matching terms with the greatest information content (IC), ignoring other dimensions of patient similarity. This process necessarily leads to a loss of information in the resulting representation of patient similarity, a loss that is especially apparent when using very large, text-derived, and highly multi-morbid phenotype profiles. It also makes finding a biological explanation for similarity very difficult: the black box problem. In this article, we explore the generation of multiple semantic similarity scores for patients based on different facets of their phenotypic manifestation, which we define through different sub-graphs in the Human Phenotype Ontology. We further present a new methodology for deriving sets of qualitative class descriptions for groups of entities described by ontology terms. Leveraging this strategy to obtain meaningful explanations for our semantic clusters alongside other evaluation techniques, we show that semantic clustering with ontology-derived facets enables the representation, and thus the identification, of clinically relevant phenotype relationships not easily recoverable using overall clustering alone. In this way, we demonstrate the potential of faceted semantic clustering for gaining a deeper and more nuanced understanding of text-derived patient phenotypes.
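
As a rough illustration of the faceted idea described above, the following Python sketch computes one similarity score per facet for each pair of patients, rather than a single overall score. Everything specific here is an assumption made for illustration: the toy hierarchy and its term names are hypothetical (they are not Human Phenotype Ontology identifiers), and the choice of Resnik similarity with a best-match average is one common option that the abstract does not name. It is not the authors' implementation.

```python
import math
from itertools import combinations

# Toy is-a hierarchy (child -> parents). Hypothetical terms, NOT real HPO classes.
PARENTS = {
    "phenotype": [],
    "cardiac": ["phenotype"],
    "renal": ["phenotype"],
    "arrhythmia": ["cardiac"],
    "heart_failure": ["cardiac"],
    "aki": ["renal"],
    "ckd": ["renal"],
}

def ancestors(term):
    """Return the term together with all of its superclasses."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def information_content(profiles):
    """IC(t) = -log(fraction of profiles annotated with t or a subclass of t)."""
    counts = {t: 0 for t in PARENTS}
    for terms in profiles.values():
        closure = set().union(*(ancestors(t) for t in terms))
        for t in closure:
            counts[t] += 1
    n = len(profiles)
    return {t: -math.log(c / n) for t, c in counts.items() if c > 0}

def resnik(a, b, ic):
    """IC of the most informative common ancestor (assumed measure)."""
    common = ancestors(a) & ancestors(b)
    return max((ic.get(t, 0.0) for t in common), default=0.0)

def facet_terms(terms, facet_root):
    """Keep only the terms that fall inside the sub-graph under facet_root."""
    return {t for t in terms if facet_root in ancestors(t)}

def best_match_average(terms_a, terms_b, ic):
    """Symmetric best-match average of pairwise Resnik scores."""
    if not terms_a or not terms_b:
        return 0.0
    a_to_b = sum(max(resnik(a, b, ic) for b in terms_b) for a in terms_a) / len(terms_a)
    b_to_a = sum(max(resnik(a, b, ic) for a in terms_a) for b in terms_b) / len(terms_b)
    return (a_to_b + b_to_a) / 2

def faceted_similarity(profiles, facet_roots):
    """Map each patient pair to a dict of per-facet similarity scores."""
    ic = information_content(profiles)
    scores = {}
    for p, q in combinations(sorted(profiles), 2):
        scores[(p, q)] = {
            root: best_match_average(
                facet_terms(profiles[p], root), facet_terms(profiles[q], root), ic
            )
            for root in facet_roots
        }
    return scores

# Hypothetical patient phenotype profiles.
PROFILES = {
    "p1": {"arrhythmia", "aki"},
    "p2": {"arrhythmia", "ckd"},
    "p3": {"heart_failure", "aki"},
}

scores = faceted_similarity(PROFILES, ["cardiac", "renal"])
for pair, per_facet in scores.items():
    print(pair, per_facet)
```

In this toy data, p1 resembles p2 only in the cardiac facet and resembles p3 only in the renal facet, a distinction that a single overall score per pair would blur. The per-facet score matrices could then be clustered separately.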

Competing Interest Statement

The authors have declared no competing interest.

Funding Statement

GVG and LTS acknowledge support from the NIHR Birmingham ECMC, NIHR Birmingham SRMRC, Nanocommons H2020-EU (731032), the NIHR Birmingham Biomedical Research Centre, and the MRC HDR UK (HDRUK/CFC/01), an initiative funded by UK Research and Innovation, Department of Health and Social Care (England) and the devolved administrations, and leading medical research charities. The views expressed in this publication are those of the authors and not necessarily those of the NHS, the National Institute for Health Research, the Medical Research Council or the Department of Health. RH, PNS and GVG were supported by funding from King Abdullah University of Science and Technology (KAUST) Office of Sponsored Research (OSR) under Award No. URF/1/3790-01-01. AK was supported by the Medical Research Council (MR/S003991/1) and the MRC HDR UK (HDRUK/CFC/01). PNS and GVG acknowledge the support of the Alan Turing Institute, UK.

Author Declarations

I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.

Yes

The details of the IRB/oversight body that provided approval or exemption for the research described are given below:

This work makes use of the MIMIC-III dataset, which was approved for construction, de-identification, and sharing by the BIDMC and MIT institutional review boards (IRBs). Further details on MIMIC-III ethics are available from its original publication (DOI:10.1038/sdata.2016.35). Work was undertaken in accordance with the MIMIC-III guidelines.

All necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived.

Yes

I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).

Yes

I have followed all appropriate research reporting guidelines and uploaded the relevant EQUATOR Network research reporting checklist(s) and other pertinent material as supplementary files, if applicable.

Yes

Data Availability

Patient data are available via MIMIC-III. Software is available at the link below.

https://github.com/reality/facetsim

Copyright 
The copyright holder for this preprint is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license.
Posted May 29, 2021.

Subject Area

  • Health Informatics