medRxiv

Gene Sequence to 2D Vector Transformation for Virus Classification

Ignacio Sanchez-Gendriz, Karolayne S. Azevedo, Luísa C. de Souza, Matheus G. S. Dalmolin, Marcelo A. C. Fernandes
doi: https://doi.org/10.1101/2024.03.12.24304158
Ignacio Sanchez-Gendriz
1Federal University of Rio Grande do Norte Natal/RN, Brazil 59078-970
  • For correspondence: ignaciogendriz{at}gmail.com
Karolayne S. Azevedo
2InovAI Lab, nPITI/IMD, Federal University of Rio Grande do Norte, Natal/RN, Brazil 59078-970
Luísa C. de Souza
Matheus G. S. Dalmolin
Marcelo A. C. Fernandes
3InovAI Lab, nPITI/IMD, Bioinformatics Multidisciplinary Environment (BioME), Department of Computer Engineering and Automation, Federal University of Rio Grande do Norte, Natal/RN, Brazil 59078-970

ABSTRACT

Background DNA sequences harbor vital information regarding various organisms and viruses. The ability to analyze extensive DNA sequences using methods amenable to conventional computer hardware has proven invaluable, especially for timely responses to global pandemics such as COVID-19.

Objectives This study introduces a new representation that encodes DNA sequences as unit vector transitions in a 2D space, using sequences drawn from the 2019 Novel Coronavirus Resource (2019nCoVR) repository. The main objective is to show that this method can support virus classification with minimal hardware resources. It also aims to demonstrate the feasibility of the technique through dimensionality reduction and the application of machine learning models.

Methods DNA sequences were transformed into two-nucleotide base transitions (referred to as ‘transitions’). Each transition was represented as a corresponding unit vector in 2D space. This coding scheme allowed DNA sequences to be efficiently represented as dynamic transitions. After a moving average and resampling were applied, the transitions underwent dimensionality reduction, such as Principal Component Analysis (PCA). Conventional machine learning approaches were then applied to the reduced data, yielding a multiclass classification among six species of viruses belonging to the Coronaviridae family, including SARS-CoV-2.
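The encoding steps described in the Methods can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not say which unit vector is assigned to which transition, so the mapping below (16 ordered base pairs spaced evenly around the unit circle) is hypothetical, as are the function names, the window length, and the number of resampled points. The PCA and classification stages are omitted.

```python
import math
from itertools import product

BASES = "ACGT"

# ASSUMPTION: the 16 ordered base pairs are spread evenly around the unit
# circle. The actual transition-to-vector mapping is not given in the abstract.
TRANSITION_ANGLE = {
    a + b: 2 * math.pi * k / 16
    for k, (a, b) in enumerate(product(BASES, repeat=2))
}


def encode_transitions(seq):
    """Return one (x, y) unit vector per overlapping two-base transition."""
    seq = seq.upper()
    vectors = []
    for a, b in zip(seq, seq[1:]):
        angle = TRANSITION_ANGLE.get(a + b)
        if angle is None:  # skip transitions with ambiguous bases such as N
            continue
        vectors.append((math.cos(angle), math.sin(angle)))
    return vectors


def moving_average(vectors, window):
    """Average each coordinate over a sliding window of `window` vectors."""
    out = []
    for i in range(len(vectors) - window + 1):
        chunk = vectors[i:i + window]
        out.append((sum(v[0] for v in chunk) / window,
                    sum(v[1] for v in chunk) / window))
    return out


def resample(points, n):
    """Pick n evenly spaced points so every sequence yields the same length."""
    if not points:
        return []
    if n == 1:
        return [points[0]]
    last = len(points) - 1
    return [points[round(i * last / (n - 1))] for i in range(n)]


def featurize(seq, window=4, n=8):
    """Flatten the smoothed, resampled 2D trajectory into one feature row."""
    points = resample(moving_average(encode_transitions(seq), window), n)
    return [coord for point in points for coord in point]
```

Each sequence then becomes a fixed-length row of `2 * n` numbers, which is the shape a dimensionality reduction step such as PCA and a conventional classifier would expect as input.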

Results and Discussions The implemented method produced a representation of the sequences that allowed visual differentiation between six types of viruses from the Coronaviridae family through direct plotting. Evaluated with stratified cross-validation, the technique achieved accuracy, sensitivity, specificity and F1-score values equal to or greater than 99%. This performance is comparable, if not superior, to that of the computationally intensive methods discussed in the state of the art.
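For reference, the four reported metrics can be computed from a multiclass confusion matrix in a one-vs-rest fashion, as in the sketch below. The function names are illustrative, and the small matrix at the end contains made-up counts for three hypothetical classes; it is not data or a result from the study. How the authors aggregated metrics across classes and folds is not stated in the abstract.

```python
def accuracy(conf):
    """Fraction of samples on the diagonal of a square confusion matrix."""
    total = sum(sum(row) for row in conf)
    correct = sum(conf[i][i] for i in range(len(conf)))
    return correct / total if total else 0.0


def per_class_metrics(conf, k):
    """One-vs-rest metrics for class k.

    conf[i][j] counts samples of true class i predicted as class j.
    """
    total = sum(sum(row) for row in conf)
    tp = conf[k][k]
    fn = sum(conf[k]) - tp                   # class k predicted as another class
    fp = sum(row[k] for row in conf) - tp    # other classes predicted as k
    tn = total - tp - fn - fp
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    denom = precision + sensitivity
    f1 = 2 * precision * sensitivity / denom if denom else 0.0
    return {"sensitivity": sensitivity, "specificity": specificity, "f1": f1}


# Made-up counts for illustration only (NOT results from the study).
toy_conf = [
    [9, 1, 0],
    [0, 10, 0],
    [1, 0, 9],
]
toy_accuracy = accuracy(toy_conf)
toy_class0 = per_class_metrics(toy_conf, 0)
```

In a stratified cross-validation, a matrix like this would typically be accumulated over the held-out folds before the metrics are computed.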

Conclusions The proposed coding method emerges as a computationally efficient and promising addition to contemporary DNA sequence coding techniques. Its merits lie in its simplicity, visual interpretability and ease of implementation, making it a potential resource for complementing existing strategies in the field.

Competing Interest Statement

The authors have declared no competing interest.

Funding Statement

This study did not receive any funding.

Author Declarations

I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.

Yes

The details of the IRB/oversight body that provided approval or exemption for the research described are given below:

The study uses DNA sequences extracted from the 2019 Novel Coronavirus Resource (2019nCoVR), which is an open repository accessible at https://bigd.big.ac.cn/ncov/.

I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals.

Yes

I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).

Yes

I have followed all appropriate research reporting guidelines, such as any relevant EQUATOR Network research reporting checklist(s) and other pertinent material, if applicable.

Yes

Footnotes

  • ignaciogendriz{at}dca.ufrn.br

  • This version of the manuscript has undergone revision to enhance certain elements. Specifically, figure captions have been amended to rectify initial incompleteness, additional references have been incorporated to bolster the foundational literature of the study, and minor textual revisions have been implemented throughout the manuscript.

Data Availability

All data produced in the present study are available upon reasonable request to the authors.

Copyright 
The copyright holder for this preprint is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC 4.0 International license.
Posted April 01, 2024.
Subject Area

  • Health Informatics