PT - JOURNAL ARTICLE
AU - Krauer, Fabienne
AU - Schmid, Boris V.
TI - Mapping the plague through natural language processing
AID - 10.1101/2021.04.27.21256212
DP - 2022 Jan 01
TA - medRxiv
PG - 2021.04.27.21256212
4099 - http://medrxiv.org/content/early/2022/07/20/2021.04.27.21256212.short
4100 - http://medrxiv.org/content/early/2022/07/20/2021.04.27.21256212.full
AB - Pandemic diseases such as plague have produced a vast amount of literature providing information about the spatiotemporal extent of past epidemics, circumstances of transmission, symptoms, or countermeasures. However, the manual extraction of such information from running text is a tedious process, and much of this information has therefore remained locked in a narrative format. Natural language processing (NLP) is a promising tool for the automated extraction of epidemiological data from texts, and can facilitate the establishment of datasets. In this paper, we explore the utility of NLP to assist in the creation of a plague outbreak dataset. We first produced a gold standard list of toponyms by manual annotation of a German plague treatise published by Sticker in 1908. We then investigated the performance of five pre-trained NLP libraries (Google NLP, Stanford CoreNLP, spaCy, germaNER and Geoparser.io) for the automated extraction of location data from the text, compared to the gold standard. Of all tested algorithms, spaCy performed best (sensitivity 0.92, F1 score 0.83), followed closely by Stanford CoreNLP (sensitivity 0.81, F1 score 0.87). Google NLP had a slightly lower performance (sensitivity 0.78, F1 score 0.72). Geoparser and germaNER had poor sensitivity (0.41 and 0.61, respectively). From the gold standard list we produced a plague dataset by linking dates and outbreak places with GIS coordinates. We then evaluated how well automated geocoding services such as Google geocoding, Geonames and Geoparser located these outbreaks.
All geocoding services performed poorly, returning the correct GIS information in only 60.4%, 52.7% and 33.8% of cases, respectively. The rate of correct matches was particularly low for historical regions and places. Finally, we compared our newly digitized plague dataset to a re-digitized version of the plague treatise by Biraben and provide an update of the spatio-temporal extent of the second pandemic plague outbreaks. We conclude that NLP tools have their limitations, but they are potentially useful to accelerate the collection of data and the generation of a global plague outbreak database.
Competing Interest Statement: The authors have declared no competing interest.
Funding Statement: This work was supported by funding from the Centre for Ecological and Evolutionary Synthesis (CEES), University of Oslo, and the Research Council of Norway (FRIMEDBIO project 288551).
Author Declarations: This study does not contain clinical or person-related data and is exempt from IRB approval.
The R code and the digitised plague datasets are available in a public repository. https://doi.org/10.5281/zenodo.6587267
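
The abstract reports sensitivity and F1 score for each library. As a rough illustration of how such figures relate to a gold standard list, the sketch below scores a set of extracted toponyms against a manually annotated set. It is not the authors' R code: the function names, the toponym lists, and the choice to match unique place names case-insensitively (rather than individual mentions in the text) are all assumptions made for illustration.

```python
def evaluate_toponyms(gold, predicted):
    """Score extracted toponyms against a gold standard list.

    Assumption: matching is done on unique, case-insensitive names,
    not on individual mentions in the text.
    """
    gold_set = {name.strip().lower() for name in gold}
    pred_set = {name.strip().lower() for name in predicted}

    true_pos = len(gold_set & pred_set)    # found and in the gold standard
    false_neg = len(gold_set - pred_set)   # in the gold standard but missed
    false_pos = len(pred_set - gold_set)   # extracted but not in the gold standard

    sensitivity = true_pos / (true_pos + false_neg) if gold_set else 0.0
    precision = true_pos / (true_pos + false_pos) if pred_set else 0.0
    denom = precision + sensitivity
    f1 = 2 * precision * sensitivity / denom if denom else 0.0
    return {"sensitivity": sensitivity, "precision": precision, "f1": f1}


def implied_precision(sensitivity, f1):
    """Solve F1 = 2PR / (P + R) for precision P, given sensitivity R."""
    return f1 * sensitivity / (2 * sensitivity - f1)


# Hypothetical toponym lists, invented for this example.
gold = ["Marseille", "Venedig", "Basel", "Wien"]
found = ["Marseille", "Basel", "Wien", "Pest"]
scores = evaluate_toponyms(gold, found)
print(scores)

# Precision implied by a reported (sensitivity, F1) pair; approximate,
# because the reported values are rounded to two decimals.
print(round(implied_precision(0.92, 0.83), 3))
```

Because F1 is the harmonic mean of precision and sensitivity, a reported pair of sensitivity and F1 fixes the precision, which is what `implied_precision` recovers. Any value derived this way from the rounded figures in the abstract is only approximate.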