Abstract
Importance Serological assays can help diagnose and determine the rate of SARS-CoV-2 infections in a population.
Objective We characterized and compared 11 different lateral flow assays for their performance in diagnostic or epidemiological settings.
Design, Setting, Participants We used two cohorts to determine the specificity: (i) up to 350 blood donor samples from past influenza seasons and (ii) up to 110 samples which tested PCR negative for SARS-CoV-2 during the first wave of SARS-CoV-2 infections in Switzerland. The sensitivity was determined using up to 370 samples which tested PCR positive for SARS-CoV-2 during the same time and is representative for age distribution and severity.
Main Outcome We found a single test usable for epidemiological studies in the current low-prevalence setting, all other tests showed lacking sensitivity or specificity for a usage in either epidemiological or diagnostic setting. However, orthogonal testing by combining two tests without common cross-reactivities makes testing in a low-prevalence setting feasible.
Results Nine out of the eleven tests showed specificities below 99%, only five of eleven tests showed sensitivities comparable to established ELISAs, and only one fulfilled both criteria. Contrary to previous results from lab assays, five tests measured an IgM response in >80% of the samples. We found no common cross-reactivities, which allows orthogonal testing schemes for five tests of sufficient sensitivities.
Conclusions and Relevance This study emphasizes the need for large and diverse negative cohorts when determining specificities, and for diverse and representative positive samples when determining sensitivities of lateral flow assays for SARS-CoV-2 infections. Failure to adhere to statistically relevant sample sizes or cohorts exclusively made up of hospitalised patients fails to accurately capture the performance of these assays in epidemiological settings. Our results allow a rational choice between tests for different use cases.
Introduction
Antibodies are a hallmark of the human adaptive immune response to viral infections,1 and the immune system produces a heterogeneous population of different types of antibodies with distinct response kinetics and binding affinities.2 For instance, IgM is the first type of antibody induced upon infection, but has the shortest residence time and 2weakest antigen interaction. In contrast, IgA and IgG antibodies are produced later, but show orders of magnitude higher affinity and can provide long-term protection. Hence, IgM levels are sometimes used for diagnostic testing, while IgA and especially IgG levels can confirm a previous infection or vaccination.3
Scalable and accurate serological tests are fundamental to the understanding of the cumulative incidence of infections in a population.4 According to the WHO such knowledge can help in establishing the occurrence of infection in a population,5 which is crucial to determine the true extent of the disease, the distribution of severe, mild or asymptomatic cases, the infection fatality ratio of a population, the number of cases missed using routine disease surveillance methods, and the proportion of the population potentially protected against future infection.
A SARS-CoV-2 infection can lead to a strong IgG and IgA response for all tested epitopes in people with severe symptoms, while oligosymptomatic cases showed a diverse response in both antibody levels and epitope recognition.6-8 Most studies also point to a fast IgA and IgG response and a rather weak IgM response.
Lateral flow assays (LFA) are scalable and affordable tests9 that can potentially be used as both diagnostic and epidemiological assays. To be of use, those tests need to have a specificity >99.5% and a sensitivity >90%.10 The use and approval of a test by the FDA therefore requires a rigorous characterisation of its specificity, the probability that a negative sample yields a negative test result, and its sensitivity, the probability that a positive sample yields a positive test result. These characteristics also provide a means to compare different assays in the absence of established clinical decision points.
An accurate and precise determination of the specificity hinges on sufficiently many and sufficiently diverse samples to check for unknown cross-reactivities, such as a large, diverse panel of samples from blood donors including samples from flu seasons of the previous years, as performed for the Roche assay.11 Similarly but using a much smaller panel, post-market validations in Australia12 and the US13 are conducted by the regulators to check the products for their actual performance. Most LFA did not reach the manufacturers specification and were deemed “should not be used” in the US.14 Such cohorts also mimic the setting of a seroprevalence study, an important feature for epidemiological application of a test,15 where current low prevalence settings require high precision of specificity estimates.16
On the other hand, the sensitivity needs to be assayed using a sufficiently large and representative sample of the antibody response in a population for time post infection, (but especially also in disease severity) i don’t know what you mean. 17 A diverse standard panel representative for the SARS-CoV-2 response in a population yields a sufficiently precise estimate. However, most published clinical characterisations of LFA and claims from vendors are not based on such representative samples, but instead use small negative and/or positive cohorts predominantly comprised of hospitalised patients with severe cases of the disease.18-22 They thereby neglect the approximately 80% oligosymp-tomatic cases observed in the population23 and likely overestimate a the sensitivity of a test when applied population-wide rather than for severe cases only.
Here we present the clinical characterisation of eleven commercially available SARS-CoV-2 lateral flow assays (Table 1). All tests were characterised using the same positive cohort consisting of all or part of a collection of 366 convalescent samples24 and the same negative cohort of up to 500 blood donor samples from the influenza seasons 2016/17 and 2017/18. This allows a direct comparison of different characteristics of the tests.
The manufacturer, buffer volume, volumes for serum(S), plasma (P), whole blood (WB), and incubation time are listed. Epitopes and production hosts are taken from the manufacturers documentation if available.
Overall, most tests appear unsuitable for general use: only two tests, Hightop and Augurix, achieved a specificity higher than 99%. Several tests, including Hightop, showed sensitivities greater than two previously characterized Euroimmun and Epitope Diagnostics ELISA (>92%), while Augurix showed a sensitivity only slightly above 50%. However, our analysis and data provide evidence that orthogonal testing strategies combining several tests with high sensitivity can compensate for the individual low specificities to achieve a combined specificity of more than 98.5% with sensitivities between 85-95% for IgG. Interestingly, we found that five LFA were able to detect an IgM response early after symptom onset in more than 80% of the samples. In contrast to previous findings, this indicates a robust IgM response in SARS-CoV-2 infections. Taken together, our study highlights the need for standardized testing of these assays and suggests applications for those that perform better performing.
Material & methods
Clinical specificity & sensitivity
The clinical specificity Sp=TN/(TN+FP) characterizes the qualitative performance of a dichotomous test based on the number of true negatives (TN) and false positives (FP) and is the probability that a sample with no or very low antibodies yields a negative test result.
Conversely, clinical sensitivity Sp=TP/(TP+FN) characterizes the probability that a patient with detectable level of antibodies yields a positive test result based on the number of true positives (TP) and false negatives (FN).
Cohorts
To accurately calculate the specificity, we used the plasma of a blood donor cohort composed of donations from December 2016, February 2017, and February 2018. Additionally, we used the serum of our previously described positive (SERO-BL-positive) and negative (SERO-BL-negative) cohorts of study participants testing PCR-positive (resp. - negative) for SARS-CoV-2 during the initial wave of COVID-19 infections in the canton of Basel-Landschaft,24 Switzerland. We also recorded sample characteristics in regard to lipophilic appearance and hemolysis.
Assay procedure
We characterised eleven different commercially available lateral flow assays (LFA) for detection of SARS-CoV-2 specific IgM and IgG with serum samples from the three cohorts (Table 1).
All LFA were performed according to their respective manual. In brief, test components were brought to room temperature, sera or plasma aliquots were completely thawed before testing. The test cassette was removed from the sealed pouch and the required amount of sample was pipetted into the specimen well (Table 1), followed by addition of two or three drops of sample buffer to the specimen or, if present, buffer well. Results were read within the specified time window stated in Table 1. Lot numbers and expiration dates are given in (Supp. Table 2)
Presence of bands was visually inspected, and each test was imaged with a digital camera (different models) under standardized lightning conditions. We considered a test valid if its control band was present, and we considered a valid test positive for the respective antibody if the SARS-CoV-2 specific IgM, IgG or IgM/IgG band was detected in the sample.
We assayed the Hightop test using whole blood,24 serum and plasma, while all other tests were assayed using serum and plasma. The Hightop and MEDSan assays were characterised at the SwissTPH using the identical biobank and experimental setup as outlined previously.24 Eight tests were characterised simultaneously at the KUSPO Miinchenstein and the Biotime at the FHNW, Muttenz. These latter nine tests were characterised using the experimental design outlined below.
Experimental design
We prepared ten 96-well plates to distribute the samples of the SERO-BL-negative, SERO-BL-positive and blood donor cohorts. We aimed at a roughly equal distribution of IgG levels (as previously established using ELISA) on each plate to avoid plate-specific biases and to this end divided the positive samples into five strata of different IgG level, each strata occurring on each plate roughly the same number of times. Assignment of samples of each stratum and the negative cohorts to each plate was fully randomized. We selected two samples from the SERO-BL-negative cohort, and two samples each from medium and high-level IgG patients and replicated each of these samples on each plate to estimate between-plate variation. We also selected one patient sample at random for each plate and replicated it five times on that plate to provide an estimate of within-plate variation. Finally, we randomly selected one patient sample with high IgG level for each plate, and added a ten-fold 1:2 dilution series on the same plate to be able to establish detection limits for each test. Assignment of patient samples to wells was fully randomized individually for each plate and the same plate layouts were used for all tests.
Statistical analysis
Data analysis and creation of figures and tables was carried out using in-house scripts in R;25 binomial confidence intervals are 95%-Clopper-Pearson intervals calculated using exactci provided by the package PropCIs.26
Data and code availability
Data, results, and R scripts are available on GitLab: https://gitlab.com/csb.ethz/serobl-covid19-lta.
Results
Specificity
We calculated the specificities for all 11 LFA separately for the IgM and IgG responses (Table 2 and Figure 1). Note that the TAmiRNA assay uses a combined IgM/IgG response using a single band and values for IgM and IgG therefore coincide for this test.
Number of true negatives (TN), false positives (FP), and resulting specificity Sp with 95%-confidence interval for each test for IgM and IgG based on all negative samples (left) and samples from blood donors only (right). Note that TAmiRNA uses a combined IgM-IgG readout, resulting in identical values for TN/FP/Sp.
Specificities and sensitivities for all tests for IgM (left) and IgG (right). A: Sensitivities stratified by days after onset of symptoms. B: Sensitivities stratified by severity of disease.
Overall, the assays from Hightop and Augurix show specificities >99% for both IgG and IgM, while the assays by CTK, SureBiotech and Mexacare have specificities between 97% and 99% for IgG and IgM. The assay by Biotime also showed a specificity between 97% and 99%, but only for IgM. All other assays have specificities below 95% for both IgG and IgM. The combined IgG/IgM specificity of TAmiRNA is 97.8%.
To address potential cross-reactivity with different viruses circulating in the population, we additionally characterized each assay based only on blood donor samples from previous flu seasons. We found no significant differences for any test based on IgM, and only Lumiratek and MEDsan showed markedly worse characteristics for samples from earlier flu seasons.
We next checked whether samples are cross-reactive in multiple tests, which could be indicative for common epitopes, membranes or impurities. We found no noticeable shared cross-reactivities: Only 11 samples were positive in two or more tests and only two tests showed common cross-reactive samples for IgG (Supp. Table 5). Similarly, only six samples were positive in two or more IgM tests and only two tests showed common cross-reactive samples for IgM (Supp. Table 6). Only two samples were cross-reactive in two or more tests for both IgM and IgG.
Sensitivity
The sensitivity of an assay depends on the time post infection and on the disease severity. To arrive at a comprehensive characterization of the sensitivity of each assay, we used samples from our previously established biobank of the canton of Basel-Landschaft. It contains samples from people tested during the first wave of the pandemic in Switzerland, with a wide range of days post symptom onset and of disease severity; these samples are representative for symptomatic and oligosymptomatic cases.24 Days post symptoms and disease severity were established with a doctor’s interview, and we categorized the severity in ‘bedridden’, ‘help needed’, and ‘no restriction’. An overview of the results stratified by days after onset of symptoms (Figure 1A) and severity of the disease (Figure 1B).
A positive IgG assay is considered as indicative of a past infection, and IgG seroconversion was previously reported measurable >14d post symptom onset and completed >21d (Figure 2, Supp Table 3). We therefore estimated the sensitivity of each test for IgG based on samples with more than 21 days post symptom onset (Table 3).
The LFA results of all samples from the SERO-BL-positive cohort displayed by days post symptom onset for both IgG and IgM. TAmiRNA is identical in both panels, as it detects the combination of the two antibody type
Results for all positive samples and samples stratified by disease severity.
The assays from SureBiotech, Lumiratek, and Hightop showed sensitivities >94%, a value exceeding the sensitivities of two previously characterised ELISAs (Epitope Diagnostics and Euroimmun),24 while the assay from Biotime showed a sensitivity of about 91%, comparable to these ELISAs. Tests generally showed higher sensitivities with increasing days post onset of symptoms, while Augurix and NTbio surprisingly detected more cases at >14d, although on a low level. The assays from SureBiotech and Lumiratek additionally showed a response for about 2/3 of the samples with less than 14 days post symptom onset.
Levels of IgM are expected to increase early during a disease and then decrease at later timepoints. We also observe this general trend for the 11 tests, even though each test seems to react differently at different stages (Supp Table 3). Detection of IgM levels is most important early in the disease. Between 7 and 28 days post onset of symptoms, only the assay by Lumiratek showed a sensitivity >90%, while tests from Sure Biotech, Hightop and Biotime still showed >80% sensitivity (Table 4). Considering the whole range of time points, the Lumiratek assay consistently showed sensitivities of about 90%, and tests by CTK, SureBiotech, Mexacare, Biotime, and Hightop had sensitivity >83% in at least one of the time windows. The combined detection of IgM and IgG for TAmiRNA results in a monotone increase of the sensitivity with time, starting from about 78%.
Results for all positive samples and samples stratified by disease severity.
Levels of both IgG and IgM are also expected to increase with disease severity, leading to higher sensitivity for increased severity (Figure 2, Supp. Table 4). Sensitivities for IgG in the bedridden cohort >21d was >92% for all tests except Augurix. For IgM, all tests except Augurix and Biozek had sensitivities >90% for IgM in the bedridden cohort <21d. Sample size are low in these cases, precluding a more robust analysis. For oligosymptomatic cases—the majority for COVID-19—only five assay (SureBiotech, Biozek, Lumiratek, Biotime, Hightop) showed sensitivities above 90% for IgG. Moreover, only the Lumiratek assay showed a sensitivity above 90% for IgM in the ‘no restriction’ cohort at less than 21 days post symptoms.
Predictive value and usage
Sensitivity and specificity describe the probabilities of a test correctly recognizing a positive or negative sample. For applications, especially serological studies, we are more interested in the converse conclusion: given a positive or negative test result, how likely is it that the underlying sample is truly positive or negative, respectively? These probabilities are given by the positive predictive value (PPV) and the negative predictive value (NPV), respectively. Their values depend on the sensitivity and specificity of a test, but importantly also on the true prevalence of the disease in the population. They are calculated as

To evaluate the suitability of each test for applications, we calculated the PPV and NPV for an assumed prevalence between 0% and 25% (Figure 3 and Supp. Figure 8).
All negative samples have been included in the analysis. For IgM (IgG), only positive samples with 7-28 (>21) days after onset of symptoms were considered.
The PPV is mainly determined by the specificity of the test, and a low prevalence can severely influence the PPV due to the high proportion of false positive test results (Figure 3).
For IgG at >21d post onset of symptoms, only the assays by Hightop and Augurix showed sufficient PPV for low prevalence, while all other assays showed poor performance, often far below 50% PPV (meaning at least one in two positive tests is incorrect). Even at prevalence as high as 5%, the tests by MEDSan, Biotime, Biozek, and NTBio have PPVs below 50%. Results for IgM for 7-28 days post symptoms are generally worse.
Conversely, the NPV is largely determined by the sensitivity of a test and usually decreases with increasing prevalence, as proportionally more positive cases are observed; this decrease in NPV is therefore more pronounced for tests with low sensitivity, especially Augurix and CTK (Figure 3).
The PPV can be substantially increased by a strategy of orthogonal testing, where results of two or more tests are combined.10 For maximum effect, this strategy requires combining tests of reasonable sensitivity without shared cross-reactivities. We therefore selected the six individual tests with >92% sensitivity for IgG at >21 days post symptoms, respectively the four tests with IgM sensitivity >82% for 7-28 days post symptoms. The respective orthogonal tests based on pairs of tests almost all showed specificities >99.3%, the exceptions are Lumiratek combined with either SureBiotech or Biotime, which both showed specificities of ~98.5%. Naturally, the sensitivities decreased compared to individual tests, but remained >90% for some combinations. Overall, the specificities of the orthogonal tests are similar or better than the most specific single test—Hightop with 99.6% specificity. For IgM, we found combined specificities of around 99% with sensitivities of around 80%, still yielding workable PPVs (Table 5).
The tests with a sensitivity for IgG >92% and the IgM >82% were analysed for their combined specificty and sensitivity
Discussion
Accurate and precise estimation of seroprevalence in a population is crucial for helping public health officials make informed decisions for targeting affected areas. Suitable serological tests have to accurately capture the spectra of the human antibody response in a population. Here, we assessed the performance of 11 commercially available lateral flow assays. We used samples from a previously established biobank of symptomatic and oligosymptomatic patients representative for the disease spectrum observed in western Europe, augmented by samples from blood donors of previous influenza seasons.
Our point estimates of specificities differ from those previously reported in studies or by manufacturers. However, the corresponding interval estimates show compatible estimates with all but one previous study (Supp Table 7). Our comparatively large number of negative samples resulted in a narrower interval estimation, indicating higher precision.
We did not observe clusters of samples showing cross-reactivity, these false positive samples are unlikely to come from a common recognition of the employed epitopes. This also indicates that few common cross-reactivities exist and specificities are therefore inherent to each lateral flow assay. Consequently, orthogonal testing strategies become viable options for increasing specificity beyond a single test. Our results indicate that specificity increases dramatically while the sensitivity remains usable, most often by employing a combination of the Spike and NCP as epitopes.
Our point estimates of the sensitivities differ substantially from previous reports for some tests (Supp Table 7), but interval estimates are again compatible. In particular, our estimated sensitivities of samples of bedridden patients with >21 days post symptoms are comparable to most previous studies. This subcohort includes hospitalised patients and might therefore be comparable to the previous studies, but no definite statement is possible due to the small size of this subcohort.
Importantly, our biobank allows stratifying the sensitivity estimates by days post symptoms and disease severity, thereby providing a more detailed picture of test performance for oligosymptomatic patients which comprise about 80% of cases in a population. We found that for the ‘no restriction’ cohort, three assays-SureBiotech, Lumiratek, Hightop-perform similarly to previously characterised ELISAs.
In contrast to previous reports, we were able to determine an early IgM response with several assays. This hints at the possibility of using IgM for diagnostic purposes in early infection, but our cohort is again for a conclusive statement. Additionally, the sensitive IgM assays show low specificity resulting in a high false positive rate, and their results can therefore only serve as an additional input for a clinical decision. At the moment, there is no diagnostic value in the IgG measurement as a comparison to neutralising titers is missing in all studies as well as the vendors specifications.
We anticipate that our characterisation of 11 LFA on common and diverse samples provides a basis to establish clinical and epidemiological decision points from which analytical sensitivities can be established. This requires standard panels of antibodies representative for the immune response to a SARS-CoV-2 infection, which are still under development.10 In contrast, the clinical specificity will always remain the sole meaningful one and future specificity tests have to be performed on rather large, diverse panels of negative samples, especially as only few common cross-reactivities were detected.
Data Availability
Data are available upon request
Acknowledgment
We thank Christian Kahlert for careful reading of the manuscript. The Swiss Red Cross financed all the used LFA except for the Hightop and Biotime assays. The Hightop was purchased by the canton Basel-Landschaft and the Biotime was provided by the Swiss importer. Thomas Büeler organised all the purchases, and helped in organising the logistics for the testing at the KUSPO Münchenstein site. Biolytix AG, Witterswil, Switzerland rearranged the whole biobank according to the experimental design used at the testing site. FR is funded by the NCCR ‘Molecular Systems Engineering’.
Footnotes
Table 1 and Figure 3 corrected