Developing a research ready population-scale linked data ethnicity-spine in Wales

Ashley Akbari, Fatemeh Torabi, Stuart Bedston, Emily Lowthian, Hoda Abbasizanjani, Richard Fry, Jane Lyons, Rhiannon K Owen, Kamlesh Khunti, Ronan A Lyons
doi: https://doi.org/10.1101/2022.11.28.22282810
Ashley Akbari, MSc (1); Fatemeh Torabi, MSc (1); Stuart Bedston, PhD (1); Emily Lowthian, PhD (1); Hoda Abbasizanjani, PhD (1); Richard Fry, PhD (1); Jane Lyons, MSc (1); Rhiannon K Owen, PhD (1); Kamlesh Khunti, FMedSci (2); Ronan A Lyons, MD (1)
1 Population Data Science, Swansea University Medical School, Faculty of Medicine, Health & Life Science, Swansea University, Swansea, Wales, UK
2 University of Leicester, Leicester, England, UK
For correspondence: A.Akbari{at}Swansea.ac.uk

Abstract

Introduction Ethnicity information is recorded routinely in electronic health records (EHRs); however, to date, there is no national standard or framework for harmonisation of the existing records.

Methods and analysis The national ethnicity-spine uses anonymised individual-level population-scale ethnicity data from 26 EHR and administrative data sources available through the Secure Anonymised Information Linkage (SAIL) Databank. A total of 46 million ethnicity records for 4,297,694 individuals in Wales, UK, over 22 years (between 2000 and 2021) have been compiled in a harmonised, deduplicated longitudinal research ready data asset. We serialised these data and compared the distribution of records over time for four selection approaches (Latest, Mode, Weighted-Mode and Composite) across age bands, sex, deprivation quintiles, health board, and residential location, against the ONS census 2011. The distribution of the dominant group (White) is minimally affected by the choice among the four selection approaches. Across all other ethnicity categorisations, the Mixed group was most susceptible to variation in distribution depending on the selection approach used, varying from a 0.6% prevalence across the Latest and Mode approaches to a 1.1% prevalence for the Weighted-Mode, compared to the 3.1% prevalence for the Composite approach. Substantial alignment was observed with the ONS census with the Latest method (kappa = 0.68, 95% CI [0.67, 0.71]) across all sub-groups.

Conclusion We provide a reproducible EHR-based resource enabling the investigation and evaluation of health inequalities related to ethnic groups in Wales. This generalisable method informs opportunities for the transferability of this methodology across the UK to platforms with comparable routine data sources.

Ethics and dissemination This work was supported by the Con-COV team funded by Medical Research Council, Health Data Research UK, ADR Wales funded by ADR UK through the Economic and Social Research Council, and the Wales COVID-19 Evidence Centre, funded by Health and Care Research Wales.

Strengths and limitations of this study

  • This is the first comprehensive approach for retrieving ethnicity records from linked EHR.

  • The methodology presented here creates an anonymised longitudinal, population-scale individual-level linked ethnicity spine for the population of Wales.

  • The spine enables investigation and evaluation of health inequalities related to ethnic groups in Wales.

  • The ethnicity spine is derived from 26 data sources across 22 years, complementing existing ethnicity data in the census by providing a reproducible, maintainable research ready data asset (RRDA).

  • As this is a data driven approach, the algorithm is limited to the reported ethnicity and actual diversity of the population.

Introduction

Describing disease dynamics across and within different subgroups of a population and addressing ethnic health disparities are of high importance in population health and biomedical research, because they can reveal areas of health inequality and inform decisions towards a fair distribution of services. Recent debates around how diversity in data can enable generalisable research findings highlight the value that internal and external data diversity can provide to research (1). Improving existing ethnicity measures and strategies for enriching these data for research has become an urgent public health priority, specifically following the disparities observed during the COVID-19 pandemic (2,3). This improvement can be made by harnessing the ethnicity information available in routinely collected health data. While reports of COVID-19 outcomes demonstrated severe outcomes for minority ethnic groups (4–8), ethnicity is not always consistently recorded, nor is it available across all data sources and for the entire population. The availability of ethnicity data directly affects our ability to describe differences for a variety of health and social outcomes (2). Understanding the dynamics of COVID-19 and its outcomes has further revealed the enduring disparities experienced by ethnic minority groups and the need for improvements in the readiness of these data (9).

The importance of research readiness for investigating the effectiveness of interventions across and within different ethnic groups emphasises the value of ethnicity data. The availability of ethnicity data across the UK has been documented (10). The need for ethnicity-level characterisation of the population has been highlighted across various disease groups (11), in previous pandemics (12) and amongst the elderly (13). Research into the inequalities faced by different ethnic groups has continually referenced the diverse manifestation of coded ethnicity in data, including clinical and administrative data sources, and its impact on researchers’ ability to estimate disparities (9,14). While the census provides a snapshot of the geographical distribution of ethnicity records across the population, the census data date from 2011; recapturing ethnicity through linkage of routinely collected information in EHR data can instead provide a current measure at any given time point. There is current evidence of crude approaches to processing ethnicity data: for example, the label ‘South Asian’ often combines subpopulation groups that are culturally and behaviourally different from each other, e.g. Bangladeshi, Pakistani and Indian (9,15). Stakeholder groups have also emphasised that the data collected on ethnicity should be ‘strengthened’ towards enabling the inclusion of all ethnicities in our research (7).

Data diversity provides a unique opportunity to access recorded ethnicity across multiple data sources and permits the harmonisation of approaches to examine the consistency of research outputs and conclusions, which contributes to an improved understanding of health outcomes in marginalised ethnicities. Here, we aimed to address the urgent need for improved harmonised ethnicity data. We assessed the availability, record completeness and distribution of ethnicity records across EHRs and administrative data sources for various approaches in grouping and assignment of ethnicity. Our work is a direct response to the national need for these data, specifically in the context of health (NuffieldTrust, 2020; Pareek et al., 2020).

Methods

Study design and data sources

All anonymised individual-level, population-scale data sources within the Secure Anonymised Information Linkage (SAIL) Databank trusted research environment (TRE) (16,17) were assessed for the availability and completeness of ethnicity records. We used longitudinal records spanning 22 years, with the earliest records starting from the 1st January 2000 and the most complete range of data providing full-year data up to the 31st December 2021 (see Supplementary Figure 2 for data source range). Ethnicity records were extracted from 26 EHRs and administrative data sources based on their available coverage within the 22-year longitudinal study window (see Table 1 for the number and distribution of ethnicity records in each data source). Contributing data sources had various update frequencies ranging from daily flows to quarterly flows (see Figure 1, Supplementary Table 1 and Supplementary Figure 2 for further information on contributing data sources).

Table 1

Data source range, number of individual and total ethnicity records in each contributing data source (breakdown of ethnic group is presented using ONS categorisation).

Figure 1

data sources used to create the harmonised ethnicity spine for Wales (data sources presented in update frequency).

Our methodology extracts ethnicity data from these 26 data sources and compiles the longitudinal, varied ethnicity values into a harmonised, research ready data asset (RRDA) using a Common Data Model (CDM). Creating the population-spine ethnicity RRDA involved three stages. First, record extraction: all existing ethnicity records for each individual are extracted from the 26 data sources over time and gathered in a longitudinal dataset. Second, categorisation: records are cleaned and harmonised into ethnic groups using two different categorisation methods within each data source (Figure 2.A). Third, record selection: the records are serialised into a single row per individual using four different selection approaches to assign a single ethnicity to each individual. All longitudinal ethnicity records are preserved and changes over time for each individual are accessible, enabling the allocation of an individual-level ethnic grouping at any point in time to any anonymised linkable individual records. Approximately 4% of the Welsh population had two contradicting ethnicity records over time. In the record extraction stage, only records that meet the required linkage quality (i.e. a match percentage of 95% or higher) and whose date falls within the study window are extracted from each data source; from these, all rows are retained that correspond to appropriate ethnicity values as reviewed by clinical and domain experts.
Once extracted, the available records in each data source are organised longitudinally and all values are harmonised into two different ethnicity categorisations consisting of five and nine ethnicity categories: I) the five ethnic group categorisation consists of ‘White’, ‘Asian’, ‘Black’, ‘Mixed’ and ‘Other’, as proposed by the Office for National Statistics (18) (referred to as ONS hereafter), and II) the nine ethnic group categorisation consists of ‘White’, ‘Chinese’, ‘Bangladeshi’, ‘Pakistani’, ‘Indian’, ‘Black African’, ‘Black Caribbean’, ‘Mixed’ and ‘Other’, as proposed for use by the New and Emerging Respiratory Virus Threats Advisory Group (NERVTAG) (referred to as NER hereafter) (9) (Figure 2.A and Supplementary 4 for categorisation and harmonisation rules in each data source). Where we found multiple ethnicity recordings for an individual in various sources or over time, all records are ordered sequentially based on the date of recorded ethnicity. Finally, to reshape and process the data into one row per person, in the selection stage, a single ethnic group is assigned to each individual based on one of the following four selection approaches:

Figure 2

Ethnic group categorisations, stages of methodology, and cohorts used for comparing records over time.

  1. Latest: The most recent ethnic group available in any of the available data sources is assigned to an individual as their ethnic group.

  2. Mode: The most frequent recorded ethnic group (the mode) is extracted from all available data sources and is assigned to an individual as their ethnic group.

  3. Weighted-mode: The mode is extracted using appropriate weights derived from the ONS 2011 Census population composition and is assigned to an individual as their ethnic group.

  4. Composite: The ethnicity of those individuals who have declared more than one distinct ethnic group over time is assigned to the Mixed ethnic group (due to the nature of the composite method and no hierarchy or prioritisation given to which ethnic group should be assigned, where different categories are found for an individual over time, the composite method would assign all of the 4% of individuals with contradicting records into the Mixed ethnic category).
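The four selection approaches listed above can be illustrated with a minimal sketch. It is not the authors' implementation: the record layout (a list of `(date, group)` tuples per individual), the function names, the tie-breaking rule (most recent record wins) and the way weights are applied (a per-group multiplier on record counts) are all assumptions made for illustration. The text does not specify how the census-derived weights enter the Weighted-Mode calculation, so the weights used here are hypothetical.

```python
from collections import Counter
from datetime import date


def latest(records):
    """Latest: the group on the most recently dated record."""
    return max(records, key=lambda r: r[0])[1]


def mode(records, weights=None):
    """Mode: the most frequently recorded group.

    If `weights` is given (hypothetical per-group multipliers), each record
    counts for weights[group] instead of 1, giving a weighted mode.
    Ties are broken by the most recent record (an assumption).
    """
    weights = weights or {}
    score = Counter()
    last_seen = {}
    for day, group in records:
        score[group] += weights.get(group, 1.0)
        last_seen[group] = max(day, last_seen.get(group, day))
    return max(score, key=lambda g: (score[g], last_seen[g]))


def composite(records):
    """Composite: 'Mixed' if more than one distinct group was ever recorded."""
    groups = {group for _, group in records}
    return next(iter(groups)) if len(groups) == 1 else "Mixed"


# One individual's longitudinal records (illustrative values only).
person = [
    (date(2005, 3, 1), "White"),
    (date(2010, 6, 15), "White"),
    (date(2020, 12, 2), "Asian"),
]

print(latest(person))                   # Asian
print(mode(person))                     # White
print(mode(person, {"Asian": 3.0}))     # Asian (hypothetical weight)
print(composite(person))                # Mixed
```

The example shows why the Composite approach inflates the Mixed group: any individual with two distinct recorded groups is reassigned, whereas Latest and Mode each keep one of the recorded groups.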

Post assessment, harmonisation and creation of a CDM for population-scale data on ethnicity, we summarise all this information in our ethnicity spine RRDA, to provide harmonised ethnic group information suitable for data linkage, maintained over time, which is available to researchers using the SAIL Databank. All available records within and across data sources are harmonised based on clinical and domain expertise into a CDM (see Figure 2.B for the steps involved in the methodology and Supplementary Table 3 for the metadata of RRDA tables).
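As a rough sketch of the extraction and categorisation stages described above, the following filters records on linkage quality and study window, then attaches both categorisations and orders records by date. The field names (`person_id`, `record_date`, `match_quality`, `ner_group`) are hypothetical, and the sketch assumes source values have already been mapped to a nine-group NER label. The roll-up from nine groups to five is also an assumption; the actual per-source harmonisation rules are given in the supplementary material.

```python
from datetime import date

STUDY_START = date(2000, 1, 1)
STUDY_END = date(2021, 12, 31)
MIN_MATCH_QUALITY = 0.95  # "match percentage of 95% or higher"

# Assumed roll-up from the nine-group (NER) to the five-group (ONS) labels.
NER_TO_ONS = {
    "White": "White",
    "Chinese": "Asian",
    "Bangladeshi": "Asian",
    "Pakistani": "Asian",
    "Indian": "Asian",
    "Black African": "Black",
    "Black Caribbean": "Black",
    "Mixed": "Mixed",
    "Other": "Other",
}


def extract(records):
    """Keep records with adequate linkage quality, a date inside the study
    window, and a recognised ethnicity value."""
    return [
        r for r in records
        if r["match_quality"] >= MIN_MATCH_QUALITY
        and STUDY_START <= r["record_date"] <= STUDY_END
        and r["ner_group"] in NER_TO_ONS
    ]


def harmonise(records):
    """Attach the five-group label and order records per person by date."""
    labelled = [dict(r, ons_group=NER_TO_ONS[r["ner_group"]]) for r in records]
    return sorted(labelled, key=lambda r: (r["person_id"], r["record_date"]))


raw = [
    {"person_id": 1, "record_date": date(2015, 5, 1), "match_quality": 0.99, "ner_group": "Indian"},
    {"person_id": 1, "record_date": date(2003, 2, 1), "match_quality": 0.97, "ner_group": "Indian"},
    {"person_id": 2, "record_date": date(1999, 12, 31), "match_quality": 0.99, "ner_group": "White"},
    {"person_id": 3, "record_date": date(2012, 1, 1), "match_quality": 0.80, "ner_group": "Chinese"},
]

spine_rows = harmonise(extract(raw))
for row in spine_rows:
    print(row["person_id"], row["record_date"], row["ner_group"], row["ons_group"])
```

Only the two records for person 1 survive here: person 2's record predates the study window and person 3's record falls below the linkage-quality threshold.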

Statistical Analysis

Three cohorts consisting of individuals alive and living in Wales in 2011 (census year), 2016 (mid-study) and 2021 (end-of-study) (Figure 2.C) were used to compare record completeness over time from the year the census was completed in 2011. Using the ONS and NER ethnic group categorisations, we compare the distribution of records for the four different selection approaches across age bands, sex, deprivation quintiles, geographic health board and residential areas. Cohen’s kappa coefficient was calculated to assess the inter-rater reliability between the Latest ethnicity approach and census records for the 2011 population (19). We report the inter-rater coefficients for those with known ethnicity status in both the census and the Latest approach, where 0 is defined as no agreement between the two sources and 1 is the maximal agreement of records across the two sources (see Supplementary Figure 1 for detail on agreement scales). Age is calculated from each individual’s week of birth as of 1st January for each population, or later if born afterwards. Week of birth, sex, health board and residential information of all individuals are sourced from the Welsh Demographic Dataset (20) and the COVID-19 e-cohort known as C20 (21).
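For readers unfamiliar with the statistic, Cohen's kappa compares the observed proportion of agreement p_o with the agreement p_e expected by chance from each source's marginal distribution: kappa = (p_o - p_e) / (1 - p_e). The following is a minimal sketch of that standard formula, not the code used in the study; the function name and the toy inputs are illustrative only.

```python
from collections import Counter


def cohen_kappa(source_a, source_b):
    """Cohen's kappa for two equal-length sequences of category labels."""
    if len(source_a) != len(source_b) or not source_a:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(source_a)
    # Observed agreement: share of individuals given the same label by both.
    observed = sum(a == b for a, b in zip(source_a, source_b)) / n
    # Chance agreement: product of the two marginal proportions per category.
    freq_a, freq_b = Counter(source_a), Counter(source_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return float("nan")  # undefined when both sources use one category
    return (observed - expected) / (1.0 - expected)


# Toy example: census label vs. Latest-approach label for four individuals.
census = ["White", "White", "Asian", "Asian"]
latest_label = ["White", "White", "Asian", "White"]
print(cohen_kappa(census, latest_label))  # 0.5
```

In the toy example the two sources agree on three of four individuals (p_o = 0.75) and chance agreement is 0.5, giving kappa = 0.5.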

Results

Cohort curation

A total of 46 million ethnicity records were recorded between 2000 and 2021 across the 26 EHRs and administrative data sources for 4,297,694 individuals (see Table 1, Supplementary Figure 2 for distribution of existing records in each contributing data source). We summarised the data for the population at three inception dates: 1st January 2011 (C11) for 3,110,462 individuals, 1st January 2016 (C16) for 3,110,375 and 1st January 2021 (C21) for 3,457,697. Each cohort contains records of individuals alive and residing in Wales on the 1st January of each year.

Evaluation of different selection approaches for ethnic group categorisation

A comparison of the four selection approaches (Latest, Mode, Weighted-Mode, and Composite) showed that the distribution of C21 records varied by approach. For the Mixed ethnic group, the Latest and Mode methods each assigned 1.3% of the population to the Mixed group and Weighted-Mode assigned 2.0%, rising to 6.0% under the Composite approach, in which anyone with more than one distinct ethnic group category is assigned to the Mixed group (ONS-Mixed-Latest=1.3%, ONS-Mixed-Mode=1.3%, ONS-Mixed-Weighted-Mode=2.0%, ONS-Mixed-Composite=6.0%). Similar patterns were observed for the NER categorisation (NER-Mixed-Latest=1.3%, NER-Mixed-Mode=1.3%, NER-Mixed-Weighted-Mode=1.9%, NER-Mixed-Composite=6.3%) (Table 2). Breaking down the records for each approach by sex and age for the ONS and NER categorisations showed that the White ethnic group was recorded as the highest proportion, and its distribution was minimally affected by the approach used (Female-White-ONS-Latest=44%, Female-White-ONS-Mode=44%, Female-White-ONS-Weighted-Mode=43%, Female-White-ONS-Composite=43% & Female-White-NER-Latest=44%, Female-White-NER-Mode=44%, Female-White-NER-Weighted-Mode=43%, Female-White-NER-Composite=43%). Across all the minority ethnic groups, the Mixed group was the group for which the record distribution varied between the Latest and Mode approaches and the Composite approach (Female-Mixed-ONS-Latest=0.6%, Female-Mixed-ONS-Mode=0.6%, Female-Mixed-ONS-Weighted-Mode=1.1%, Female-Mixed-ONS-Composite=3.1% & Female-Mixed-NER-Latest=0.6%, Female-Mixed-NER-Mode=0.6%, Female-Mixed-NER-Weighted-Mode=1.0%, Female-Mixed-NER-Composite=3.2%).
For the rest of the ethnic groups, the distribution of records across age bands and sex was similar for the four approaches across both ONS and NER categorisations; these are illustrated across all age bands in Figures 3 and 4 (Supplementary Table 2A & 2B for distribution of records over each age band and selection method). We observed an improved categorisation achieved by the Latest method, followed by the Mode selection method, in the nine-group NER categorisation.

Table 2

characteristics of the cohort based on ONS and NER categorisation.

Figure 3

distribution of ethnic groups across sex & age groups for ONS categorisation for four different retrieval method approaches.

Figure 4

distribution of ethnic groups across sex & age groups for NER categorisation for four different retrieval method approaches.

In 2011, census data provided the most complete distribution of ethnicity records: 2,546,403 (82%) of the C11 population had a recorded ethnicity in the census, leaving 18% of the population with an unknown ethnic group across both ONS and NER categorisations. The contribution of EHR data over time increased from 10% in 2011 to 31% in 2021, resulting in a 13.6% reduction in missingness of an available ethnic group over time for both ONS and NER categorisations with the Latest method. This is against a background of changes in the population, including births and moving in and out of Wales (Table 3 and Supplementary Figure 3). A comparison of individual-level category assignment among those who had a record both in the ethnicity-spine and in the census (excluding 1,663,775 individuals who had an unknown ethnicity either in the census or in the C11 cohort) showed substantial agreement between the census and the ethnicity-spine RRDA (Cohen’s kappa=0.68, 95% CI [0.67, 0.71]). Post 2011, record completeness improved by 2.8% from 2016 to 2021, with an agreement level of 0.46 across the two years (Cohen’s kappa=0.46, 95% CI [0.44, 0.49]) (see Figure 5 and Supplementary Figure 4 for Cohen’s kappa matrix).
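The completeness figure and the agreement labels above can be checked with a short sketch. The counts are those reported in the text; the helper names are illustrative. The agreement bands follow the widely used Landis and Koch convention, which is assumed here and may differ in detail from the scale shown in Supplementary Figure 1.

```python
def completeness_pct(known: int, total: int) -> float:
    """Percentage of a cohort with a known ethnic group."""
    return 100.0 * known / total


def agreement_band(kappa: float) -> str:
    """Label a kappa value using the Landis and Koch bands (assumed scale)."""
    if kappa < 0:
        return "poor"
    for upper, label in (
        (0.20, "slight"),
        (0.40, "fair"),
        (0.60, "moderate"),
        (0.80, "substantial"),
    ):
        if kappa <= upper:
            return label
    return "almost perfect"


c11_total = 3_110_462      # C11 cohort size reported in the text
census_known = 2_546_403   # individuals with a census-recorded ethnicity

known = completeness_pct(census_known, c11_total)
print(f"known: {known:.0f}%, unknown: {100 - known:.0f}%")  # known: 82%, unknown: 18%
print(agreement_band(0.68))  # substantial
print(agreement_band(0.46))  # moderate
```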

Table 3

record comparison for ONS categorisation between the census and the Latest method approaches (Note: ethnic group results are presented specifically for ONS and NER categorisation).

Figure 5

inter-rater coefficients between existing ethnic groups in Census 2011 (CENW), Welsh Longitudinal General Practice (WLGP), Patient Episode Dataset for Wales (PEDW), Maternity Indicator DataSet (MIDS), National Community Child Health data (NCCH), COVID-19 Lateral Flow data (CVLF) and COVID Vaccine Data (CVVD).

Discussion

Our development of the population-scale ethnicity spine provides robust ethnicity measures to healthcare research in Wales and a template which can be deployed in other TREs in the UK and beyond. The development of the ethnicity CDM presented in this study involved categorising and cleaning data and values recorded across 26 longitudinal data sources over 22 years. This improved record completeness and recaptured the current ethnic group distribution for the population. Our approach also provides a CDM in which variable names, contributing data sources, and categorisations are readily accessible to users, enabling transferability of the final output tables across projects. We report a 2.8% improvement in record completeness from 2016 to 2021, with a major contributor to this being the regular update and population-scale nature of recent COVID-19 related data sources, such as COVID-19 vaccination (CVVD) and lateral flow testing (CVLF) data sources, which have provided the opportunity of capturing ethnicity records across the entire Welsh population. While multiple data sources have contributed to achieving an accurate ethnicity status for each individual, we can see which data sources have contributed longitudinally and across the whole population; no prioritisation was applied, and records from every data source have equal value (see Supplementary Figure 2 for contributing datasets and the ranges over time).

We present a methodology for creating a longitudinal population-scale individual-level linked data ethnicity spine for the population of Wales. Additionally, to assess the robustness of our proposed methodology, we provide a comparison of different categorisation and selection approaches. We further summarised the contribution level of each data source and compared the EHR ethnicity with the ONS 2011 census over time, considering changes in the population dynamics due to births, deaths and migration. Our methodology showed that the Latest method provides a measure of ethnicity in the population that is in substantial agreement with the census when compared directly (Cohen’s kappa=0.68). Our methodology has demonstrated the benefits of including ethnic group information when understanding clinical outcomes at a population scale.

The continuous recapturing of records across EHRs ensures the underrepresentation of ethnic minority (EM) groups over time is addressed, leading to fewer erroneous results in other studies using these data. As we have shown during the COVID-19 pandemic, the availability of various data sources resulted in an opportunity for recapturing ethnicity records. The method proposed in this manuscript has proved useful in various research, enabling investigations around health inequalities for minority ethnic groups (22–28). There are further recent reports on variations in outcomes for EM groups, with documented effects on vaccine hesitancy (29) as well as post-vaccination COVID-19 deaths that were found to be higher among those of Pakistani or Indian background (30). Our study provides a population-scale national ethnicity CDM in the form of an RRDA to support ongoing activities for the assessment of health risks, behaviours and outcomes across the less represented subgroups of the population of Wales, enabling the evaluation of potential inequalities related to ethnic group across the population of Wales. To date, this methodology has enabled national level reporting on vaccination uptake in the general population of Wales (24) as well as supporting strategic vaccination decisions for subgroups of the population such as healthcare workers (23).

We explored the effect of categorising people with inconsistent records throughout time into the Mixed ethnic group category through the Composite method. We have shown that this method skews records for descriptive analysis across the COVID-19 measure. A further limitation of the Composite method, as demonstrated throughout, is that the availability of more data over time contributes to it only by increasing the proportion of records in the Mixed group and decreasing the number of individuals assigned to a unique ethnic group category. As such, the Composite approach is only appropriate in certain study design circumstances: a fixed point in time, no updates in data coverage, and a need to retain, as a single-row-per-individual variable, the knowledge that an individual has declared more than one ethnic group over time. We acknowledge that although we had access to a collection of longitudinal data sources within the SAIL Databank, the scope of our work is limited by the quality of records available in each data source. Further improvement in primary data collection and encouragement of accurate reporting of ethnicity are needed, as is further work to understand those people who do not disclose their ethnicity or who are not known in data sources to be interacting with services, especially in groups where inequalities are thought to be higher or groups who are potentially at greater risk of COVID-19 infection or outcomes, such as health care providers or social workers. Our study demonstrates that the Latest, Mode and Weighted-Mode approaches all provide suitable categorisations for ethnic group data; the choice of approach for any study is down to its study design, with the standard choice in current studies across the UK being the Latest (Figures 3 & 4 - Table 3).

Our population-level CDM and methodology, utilising 22 years of longitudinal data in 26 data sources, provide grounds for investigating existing associations and health inequalities at the population level in Wales. They also inform opportunities for the transferability of this methodology across the UK to other data sources, TREs, platforms and systems which hold comparable routine data sources containing ethnicity information. These can take the learning, harmonisation and implementation of this methodology and replicate it for wider use, further harmonisation and UK-wide inequalities research. The ethnicity spine RRDA is currently available to all COVID-19 researchers within the SAIL Databank for use in COVID-19 research and evaluations. Further discussion with data owners is planned on the opportunities to share this RRDA directly with all researchers in SAIL, following appropriate governance and application approvals.

Data Availability

The anonymised individual-level data sources used in this study are available in the SAIL Databank at Swansea University, Swansea, UK, but as restrictions apply, they are not publicly available. All proposals to use SAIL data are subject to review by the independent Information Governance Review Panel (IGRP). Before any data can be accessed, approval must be given by the IGRP. The IGRP gives careful consideration to each project to ensure proper and appropriate use of SAIL data. When access has been granted, it is gained through a privacy-protecting safe haven and remote access system referred to as the SAIL Gateway. SAIL has established an application process to be followed by anyone who would like to access data via SAIL at: https://www.saildatabank.com/application-process/

Author contributions

AA, RAL and KK were responsible for the conceptualisation, data curation, and development of the methodology. KK provided the ethnicity harmonisation of the various ethnicity values into the harmonised ONS and NER ethnic group categories. AA developed the methodology. All results are generated by FT and SB and reviewed by AA and RO. FT and AA developed the first draft of the paper. All authors have reviewed the work in various stages and approved the final version of this manuscript. The corresponding author confirms that all authors have reviewed and approved the final manuscript. AA and FT are the guarantors and had full access to the raw and derived study data.

Declaration of interest

KK is Director of the University Centre for Ethnic Health Research, Trustee of the South Asian Health Foundation, Chair of the Ethnicity subgroup of the UK Scientific Advisory Group for Emergencies (SAGE) and member of SAGE. RAL is a member of the Welsh Government COVID-19 Technical Advisory Group. RKO is a member of the National Institute for Health and Care Excellence (NICE) Technology Appraisal Committee, a member of the NICE Decision Support Unit (DSU), and an associate member of the NICE Technical Support Unit (TSU). She has provided unrelated methodological advice as a paid consultant to the pharmaceutical industry. She reports teaching fees from the Association of British Pharmaceutical Industry (ABPI) and the University of Bristol.

Funding

This work was supported by the Con-COV team funded by the Medical Research Council (grant number: MR/V028367/1). This work was supported by Health Data Research UK, which receives its funding from HDR UK Ltd (HDR-9006) funded by the UK Medical Research Council, Engineering and Physical Sciences Research Council, Economic and Social Research Council, Department of Health and Social Care (England), Chief Scientist Office of the Scottish Government Health and Social Care Directorates, Health and Social Care Research and Development Division (Welsh Government), Public Health Agency (Northern Ireland), British Heart Foundation (BHF) and the Wellcome Trust.

This work was supported by the ADR Wales programme of work. The ADR Wales programme of work is aligned to the priority themes identified in the Welsh Government’s national strategy: Prosperity for All.

ADR Wales brings together data science experts at Swansea University Medical School, staff from the Wales Institute of Social and Economic Research, Data and Methods (WISERD) at Cardiff University and specialist teams within the Welsh Government to develop new evidence which supports Prosperity for All by using the SAIL Databank at Swansea University, to link and analyse anonymised data. ADR Wales is part of the Economic and Social Research Council (part of UK Research and Innovation) funded ADR UK (grant ES/S007393/1).

This work was supported by the Wales COVID-19 Evidence Centre, funded by Health and Care Research Wales.

KK is supported by the National Institute for Health and Care Research (NIHR) Applied Research Collaboration East Midlands (ARC-EM) and the NIHR Biomedical Research Centre.

Public and patient involvement

This project was undertaken under a proposal submitted to the independent Information Governance Review Panel (IGRP), which includes members of the public (IGRP Project: 0911). Two public members contributed to the project’s approval as part of the IGRP panel.

Acknowledgement

This work uses data provided by patients and collected by the NHS as part of their care and support. We would also like to acknowledge all data providers who make anonymised data available for research.

We wish to acknowledge the collaborative partnership that enabled the acquisition and access to the de-identified data, which led to this output. The collaboration was led by the Swansea University Health Data Research UK team under the direction of the Welsh Government Technical Advisory Cell (TAC) and includes the following groups and organisations: the Secure Anonymised Information Linkage (SAIL) Databank, Administrative Data Research (ADR) Wales, Digital Health and Care Wales (DHCW), Public Health Wales, NHS Shared Services Partnership and the Welsh Ambulance Service Trust (WAST). All research conducted has been completed under the permission and approval of the SAIL independent Information Governance Review Panel (IGRP) project number 0911.

Footnotes

  • Email: Fatemeh.Torabi{at}Swansea.ac.uk, Email: stuart.bedston{at}Swansea.ac.uk, Email: e.m.lowthian{at}Swansea.ac.uk, Email: Hoda.Abbasizanjani{at}Swansea.ac.uk, Email: R.J.Fry{at}Swansea.ac.uk, Email: J.Lyons{at}Swansea.ac.uk, Email: r.k.owen{at}Swansea.ac.uk, Email: kk22{at}leicester.ac.uk, Email: R.A.Lyons{at}Swansea.ac.uk

  • Senior Authors

References

  1. The Lancet Digital Health. New resolutions for equity. Lancet Digit Health [Internet]. 2022 Jan 1 [cited 2022 Oct 7];4(1):e1. Available from: http://www.thelancet.com/article/S2589750021002806/fulltext
  2. Pareek M, Bangash MN, Pareek N, Pan D, Sze S, Minhas JS, et al. Ethnicity and COVID-19: an urgent public health research priority. Lancet [Internet]. 2020 May 2 [cited 2022 Jun 10];395(10234):1421–2. Available from: http://www.thelancet.com/article/S0140673620309223/fulltext
  3. Routen A, Akbari A, Banerjee A, Katikireddi SV, Mathur R, McKee M, et al. Strategies to record and use ethnicity information in routine health data. Nat Med [Internet]. 2022 May 31 [cited 2022 Jul 20];1–4. Available from: https://www.nature.com/articles/s41591-022-01842-y
  4. Nafilyan V, Islam N, Ayoubkhani D, Gilles C, Katikireddi SV, Mathur R, et al. Ethnicity, Household Composition and COVID-19 Mortality: A National Linked Data Study. medRxiv [Internet]. 2020 Dec 6 [cited 2022 Jun 10];2020.11.27.20238147. Available from: https://www.medrxiv.org/content/10.1101/2020.11.27.20238147v3
  5. Yan S-L, Hu C-J, Tsai L-Y, Hsu C-Y. Postoperative Lifestyle of Patients with Liver Cancer: An Exploratory Study in a Single Center in Taiwan. Int J Environ Res Public Health [Internet]. 2022 Aug 11 [cited 2022 Aug 31];19(16):9883. Available from: /pmc/articles/PMC9408761/
  6. Maenner MJ, Shaw KA, Baio J, Washington A, Patrick M, DiRienzo M, et al. Prevalence of Autism Spectrum Disorder Among Children Aged 8 Years — Autism and Developmental Disabilities Monitoring Network, 11 Sites, United States, 2016. MMWR Surveill Summ [Internet]. 2020 Mar 3 [cited 2022 Aug 18];69(4):1. Available from: /pmc/articles/PMC7119644/
  7. Public Health England. Beyond the Data: Understanding the Impact of COVID-19 on BAME Communities. 2020 [cited 2022 Jun 10]; Available from: https://www.facebook.com/PublicHealthEngland
  8. Yates T, Summerfield A, Razieh C, Banerjee A, Chudasama Y, Davies MJ, et al. A population-based cohort study of obesity, ethnicity and COVID-19 mortality in 12.6 million adults in England. Nat Commun [Internet]. 2022 Dec 1 [cited 2022 Aug 18];13(1). Available from: /pmc/articles/PMC8810846/
  9. Khunti K, Routen A, Banerjee A, Pareek M. The need for improved collection and coding of ethnicity in health research. J Public Health (Oxf) [Internet]. 2021 Jun 7 [cited 2022 Jun 10];43(2):e270–2. Available from: https://academic.oup.com/jpubhealth/article/43/2/e270/6024749
  10. Mathur R, Smeeth L, Grundy E. Availability and use of UK based ethnicity data for health research. 2013.
  11. Tillin T, Hughes AD, Mayet J, Whincup P, Sattar N, Forouhi NG, et al. The Relationship Between Metabolic Risk Factors and Incident Cardiovascular Disease in Europeans, South Asians, and African Caribbeans: SABRE (Southall and Brent Revisited)—A Prospective Population-Based Study. J Am Coll Cardiol [Internet]. 2013 Apr 30 [cited 2022 Jun 10];61(17):1777–86. Available from: https://www.jacc.org/doi/10.1016/j.jacc.2012.12.046
  12. Zhao H, Harris RJ, Ellis J, Pebody RG. Ethnicity, deprivation and mortality due to 2009 pandemic influenza A(H1N1) in England during the 2009/2010 pandemic and the first post-pandemic season. Epidemiol Infect [Internet]. 2015 Dec 1 [cited 2022 Jun 10];143(16):3375–83. Available from: https://www.cambridge.org/core/journals/epidemiology-and-infection/article/ethnicity-deprivation-and-mortality-due-to-2009-pandemic-influenza-ah1n1-in-england-during-the-20092010-pandemic-and-the-first-postpandemic-season/682FDBAED79924F1965ADB4D8F40A3B3
  13. Evandrou M, Falkingham J, Feng Z, Vlachantoni A. Ethnic inequalities in limiting health and self-reported health in later life revisited. J Epidemiol Community Health [Internet]. 2016 Jul 1 [cited 2022 Jun 10];70(7):653. Available from: /pmc/articles/PMC4941192/
  14. Iqbal G, Johnson MRD, Szczepura A, Wilson S, Gumber A, Dunn JA. UK ethnicity data collection for healthcare statistics: The South Asian perspective. BMC Public Health [Internet]. 2012 Mar 27 [cited 2022 Jun 10];12(1):1–9. Available from: https://bmcpublichealth.biomedcentral.com/articles/10.1186/1471-2458-12-243
  15. Bhopal R. Glossary of terms relating to ethnicity and race: for reflection and debate. J Epidemiol Community Health [Internet]. 2004 Jun 1 [cited 2022 Jun 10];58(6):441–5. Available from: https://jech.bmj.com/content/58/6/441
  16. Lyons RA, Jones KH, John G, Brooks CJ, Verplancke J-P, Ford DV, et al. The SAIL databank: linking multiple health and social care datasets. BMC Med Inform Decis Mak [Internet]. 2009 Dec 16 [cited 2017 Sep 12];9(1):3. Available from: http://bmcmedinformdecismak.biomedcentral.com/articles/10.1186/1472-6947-9-3
  17. Ford DV, Jones KH, Verplancke JP, Lyons RA, John G, Brown G, et al. The SAIL Databank: Building a national architecture for e-health research and evaluation. BMC Health Serv Res. 2009;9.
  18. ONS. Ethnic group, national identity and religion - Office for National Statistics [Internet]. 2011 [cited 2022 Jun 10]. Available from: https://www.ons.gov.uk/methodology/classificationsandstandards/measuringequality/ethnicgroupnationalidentityandreligion
  19. McHugh ML. Interrater reliability: the kappa statistic. Biochem Medica [Internet]. 2012 [cited 2022 Oct 26];22(3):276. Available from: /pmc/articles/PMC3900052/
  20. WDSD. Welsh Demographic Service Dataset (WDSD) [Internet]. 2022 [cited 2022 Feb 6]. Available from: https://web.www.healthdatagateway.org/dataset/8a8a5e90-b0c6-4839-bcd2-c69e6e8dca6d
  21. Lyons J, Akbari A, Torabi F, Davies GI, North L, Griffiths R, et al. Understanding and responding to COVID-19 in Wales: protocol for a privacy-protecting data platform for enhanced epidemiology and evaluation of interventions. BMJ Open [Internet]. 2020 Oct 21 [cited 2020 Nov 7];10(10):e043010. Available from: https://bmjopen.bmj.com/lookup/doi/10.1136/bmjopen-2020-043010
  22. Thomas DR, Orife O, Plimmer A, Williams C, Karani G, Evans MR, et al. Ethnic variation in outcome of people hospitalised during the first COVID-19 epidemic wave in Wales (UK): an analysis of national surveillance data using Onomap, a name-based ethnicity classification tool. BMJ Open [Internet]. 2021 Aug 1 [cited 2022 Jun 10];11(8):e048335. Available from: https://bmjopen.bmj.com/content/11/8/e048335
  23. Bedston S, Akbari A, Jarvis CI, Lowthian E, Torabi F, North L, et al. COVID-19 vaccine uptake, effectiveness, and waning in 82,959 health care workers: A national prospective cohort study in Wales. Vaccine [Internet]. 2022 Feb 2 [cited 2022 Jun 10];40(8):1180. Available from: /pmc/articles/PMC8760602/
  24. Perry M, Akbari A, Cottrell S, Gravenor MB, Roberts R, Lyons RA, et al. Inequalities in coverage of COVID-19 vaccination: A population register based cross-sectional study in Wales, UK. Vaccine. 2021 Oct 8;39(42):6256–61.
  25. Agrawal U, Bedston S, McCowan C, Oke J, Patterson L, Robertson C, et al. Severe COVID-19 outcomes after full vaccination of primary schedule and initial boosters: pooled analysis of national prospective cohort studies of 30 million individuals in England, Northern Ireland, Scotland, and Wales. Lancet [Internet]. 2022 Oct 15 [cited 2022 Oct 20];400(10360):1305–20. Available from: http://www.thelancet.com/article/S0140673622016567/fulltext
  26. Knight R, Walker V, Ip S, Cooper JA, Bolton T, Keene S, et al. Association of COVID-19 With Major Arterial and Venous Thrombotic Diseases: A Population-Wide Cohort Study of 48 Million Adults in England and Wales. Circulation [Internet]. 2022 Sep 20 [cited 2022 Oct 20];146(12):892–906. Available from: https://www.ahajournals.org/doi/abs/10.1161/CIRCULATIONAHA.122.060785
  27. Kerr S, Joy M, Torabi F, Bedston S, Akbari A, Agrawal U, et al. First dose ChAdOx1 and BNT162b2 COVID-19 vaccinations and cerebral venous sinus thrombosis: A pooled self-controlled case series study of 11.6 million individuals in England, Scotland, and Wales. PLOS Med [Internet]. 2022 [cited 2022 Feb 27];19(2):e1003927. Available from: https://journals.plos.org/plosmedicine/article?id=10.1371/journal.pmed.1003927
  28. Abbasizanjani H, Torabi F, Bedston S, Bolton T, Davies G, Denaxas S, et al. Harmonising electronic health records for reproducible research: challenges, solutions and recommendations from a UK-wide COVID-19 research collaboration. 2022 Sep 28 [cited 2022 Oct 20]; Available from: https://www.researchsquare.com
  29. Ekezie W, Czyznikowska B, Campbell-Morris P, Rohit S, Miah N, Harrison J, et al. Public Perceptions Towards Vaccine Trial Research within Ethnic Minority and Vulnerable Communities. 2020.
  30. Hippisley-Cox J, Coupland CAC, Mehta N, Keogh RH, Diaz-Ordaz K, Khunti K, et al. Risk prediction of covid-19 related death and hospital admission in adults after covid-19 vaccination: national prospective cohort study. BMJ [Internet]. 2021 Sep 17 [cited 2022 Jun 10];374:2244. Available from: https://www.bmj.com/content/374/bmj.n2244
Posted November 29, 2022.

Subject Area

  • Public and Global Health