Exploring the Accuracy of Differentiation-Based Regressive Models in Disease Forecasting
========================================================================================

* Rojina Karimirad

## ABSTRACT

Predictive models have been used to foresee outbreaks of mosquito-borne diseases such as malaria and to map Ebola outbreaks.<sup>1</sup> This has allowed health organizations to plan the amount of resources and the number of healthcare workers needed more effectively, and to identify other useful information such as the locations most vulnerable to the disease and the demographics most affected. It is therefore reasonable to expect that predictive analytics can reduce the economic and non-economic burden caused by other epidemics as well, with COVID-19 being an obvious example.

To explore the use of predictive analytics in disease forecasting, and in COVID-19 specifically, I tested the accuracy of a differentiation-based regression model on data provided by the Ontario Data Catalogue<sup>2</sup> and then compared its performance to other methods of calculating regression. To make the prediction more personal, I used data pertaining to the Ontario Health region closest to me, which is Central Ontario. The original dataset provided the daily number of hospitalizations since the beginning of the outbreak; however, the data from 2020 was discarded on the assumption that the overwhelming surge in hospital admissions at the beginning of the pandemic would skew the data and hence the regression model. The reduced raw dataset covers COVID-19 cases in hospital from January 1, 2021 to December 31, 2022, where the date is the independent variable and the number of hospitalizations is the dependent variable. It can be found in *Appendix A*.

To display data spanning two years clearly in a single table, the numbers of hospitalizations for each ten days were put into one group, and each ten-day group was assigned a numerical value for processing: January 1, 2021 to January 10, 2021 was assigned 1, January 11 to January 20 was assigned 2, and so on. Since there are 730 days in the two years, there are 73 groups of 10 days in total. The new data table can be seen in *Appendix B*.
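The grouping step can be sketched in Python as follows. This is a minimal illustration, not the procedure actually used to build *Appendix B*: it assumes that the daily values within each ten-day group are summed (the text does not state the aggregation), and the function name `group_into_bins` and the placeholder input are hypothetical.

```python
def group_into_bins(daily_counts, bin_size=10):
    """Return (group_number, total) pairs for consecutive runs of bin_size days.

    Assumption: each group is aggregated by summing its daily values.
    """
    groups = []
    for start in range(0, len(daily_counts), bin_size):
        chunk = daily_counts[start:start + bin_size]
        group_number = start // bin_size + 1  # groups are numbered from 1
        groups.append((group_number, sum(chunk)))
    return groups


# Placeholder input only: 730 made-up daily values, not the Ontario data.
daily = [100] * 730
groups = group_into_bins(daily)
print(len(groups))  # 73 ten-day groups for 730 days
```

With 730 daily values and a group size of 10, the sketch yields the same 73 group numbers described above.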

The scatterplot showing the number of hospitalizations due to COVID-19 in the years 2021 and 2022 and their corresponding ten-day groups is shown in *Graph 1*.

[Graph 1:](http://medrxiv.org/content/early/2023/10/28/2023.10.26.23297654/F1) Time series plot showing the number of COVID-19 hospitalizations in 2021 and 2022 in groups of 10 days

Looking at the scatterplot alone, we notice certain outliers within the data. Outliers can negatively impact the accuracy of a regression model, so eliminating them would be beneficial. Since the independent variable is groups of ten days, a unit of time, the data can be categorized as a time series and the above scatterplot can be considered a time series plot. This allows us to use methods typically applied to single-variable data, such as the quartile values, the interquartile range, and the lower and upper inner fences of the number of hospitalizations, to identify the outliers, because the *x* or time values, which increase consistently by 1 group or 10 days, cannot produce outliers on their own.<sup>3</sup>

The lower and upper inner fences of the dataset can be used to find the set’s outliers, with any value that lies beyond these two points being an outlier. The formulae for the upper and lower inner fences are, respectively:

![Formula][1]

Where *Q*<sub>3</sub> is Quartile 3 and *IQR* is the interquartile range.

![Formula][2]

Where *Q*<sub>1</sub> is Quartile 1 and *IQR* is the interquartile range.

And the formula for the interquartile range, or *IQR*, is:

![Formula][3]

Where *IQR* is the interquartile range, *Q*<sub>1</sub> is Quartile 1 and *Q*<sub>3</sub> is Quartile 3.

The values for Quartiles 1 and 3 need to be calculated. To determine the quartile values, the numbers of hospitalizations for each group of 10 days were placed in increasing order and assigned term numbers based on their place in the newly ordered list, as shown in *Appendix C*. The formula for calculating the term number for the value of Quartile 1 is:

![Formula][4]

Where *n* is the number of terms, which in this case is 73.

Substituting *n* = 73 into this formula, we get:

![Formula][5]

Since there is no 18.5th term, the mean of the values of the 18th and 19th terms, which are obtained from *Appendix C*, is used to determine *Q*<sub>1</sub>.

![Formula][6]

The term number of Quartile 3 can be calculated using a similar formula:

![Formula][7]

Where *n* is the number of terms.

Substituting *n* = 73 once again,

![Formula][8]

Again, since there is no 55.5th term, the mean of the 55th and 56th terms’ values from *Appendix C* is used to calculate *Q*<sub>3</sub>.

![Formula][9]

Having calculated Quartiles 1 and 3, the values can be substituted into the previously stated formula for the interquartile range,

![Formula][10]

The interquartile range, along with *Q*<sub>1</sub> and *Q*<sub>3</sub>, can then be used to find the upper inner fence, as shown below,

![Formula][11]

And similarly, the lower inner fence,

![Formula][12]

Since the lower inner fence was calculated to be a negative number, and the number of hospitalizations cannot be negative, it can be concluded that no *y* values in the time series are outliers due to being too small.
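The quartile and fence calculations above can be sketched in Python. This is a minimal illustration under stated assumptions: the term numbers are taken as (*n* + 1)/4 and 3(*n* + 1)/4, which is consistent with the 18.5th and 55.5th terms quoted for *n* = 73, and the inner fences use the conventional 1.5 × *IQR* multiplier. Averaging the two neighbouring terms for every fractional position is a simplification, and the function names and sample data are hypothetical rather than taken from *Appendix C*.

```python
def value_at_position(sorted_values, position):
    """Value at a 1-based term number.

    A fractional position (e.g. 18.5) is resolved by averaging the two
    neighbouring terms, as done in the text for Quartiles 1 and 3.
    """
    lower = int(position)  # floor, since positions are positive
    if position == lower:
        return sorted_values[lower - 1]
    return (sorted_values[lower - 1] + sorted_values[lower]) / 2


def inner_fences(values):
    """Return (Q1, Q3, lower inner fence, upper inner fence)."""
    ordered = sorted(values)
    n = len(ordered)
    q1 = value_at_position(ordered, (n + 1) / 4)
    q3 = value_at_position(ordered, 3 * (n + 1) / 4)
    iqr = q3 - q1
    # Assumption: inner fences sit 1.5 * IQR beyond the quartiles.
    return q1, q3, q1 - 1.5 * iqr, q3 + 1.5 * iqr


# Placeholder data: the integers 1..73, not the hospitalization counts.
q1, q3, lower_fence, upper_fence = inner_fences(range(1, 74))
print(q1, q3, lower_fence, upper_fence)
```

Any value above the returned upper fence, or below the lower fence, would be flagged as an outlier under this rule.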

The upper inner fence, however, provides a limit for how large the values for the number of hospitalizations can be without skewing the data and therefore the regression model that will be produced. The groups listed in *Table 1*, with their corresponding numbers of hospitalizations, were taken from *Appendix B* and noted as outliers on the basis of being larger than the upper inner fence, 7831.75.

The outliers were removed from the dataset and replaced with the mean of the nearest inlier values before and after each of them in *Appendix B*, to stop them from affecting the accuracy of the regression, as shown in *Table 2*.

The data excluding the outliers and instead including their newly assigned values, which will be used for the regression model, is shown in *Appendix D*. The graph visualizing the new data on the same scale can be seen in *Graph 2*.
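The replacement of outliers can be sketched as follows. This is a hedged illustration that assumes each outlier is replaced by the mean of the nearest non-outlier value on either side of it, which is one reading of the description above and of the caption of *Table 2*; the function name `replace_outliers` and the sample values are hypothetical.

```python
def replace_outliers(values, upper_fence):
    """Replace each value above upper_fence with the mean of its nearest inliers.

    Assumption: the nearest inlier before and the nearest inlier after are
    averaged; at the ends of the series only the available side is used.
    The series must contain at least one inlier.
    """
    is_outlier = [v > upper_fence for v in values]
    cleaned = list(values)
    for i, flagged in enumerate(is_outlier):
        if not flagged:
            continue
        before = next(
            (values[j] for j in range(i - 1, -1, -1) if not is_outlier[j]), None
        )
        after = next(
            (values[j] for j in range(i + 1, len(values)) if not is_outlier[j]), None
        )
        neighbours = [v for v in (before, after) if v is not None]
        cleaned[i] = sum(neighbours) / len(neighbours)
    return cleaned


# Made-up values; 7831.75 is the upper inner fence reported in the text.
print(replace_outliers([10, 20, 9000, 40], 7831.75))
```

Consecutive outliers all receive the same replacement value under this reading, because they share the same nearest inliers.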

Having removed the outliers from the data, we have a dataset that should produce a more reliable regression model. However, there needs to be a method of checking the accuracy of the regression model, which is where Test Train Split is used. Test Train Split is a model validation procedure that checks how accurately a regression model performs on new data, using interpolation within the data already available.<sup>4</sup> The Split refers to dividing the data into Train, which is 80% of the total data and is used to calculate the regression equation, and Test, which is the remaining 20% and is used to test the accuracy of the regression model. Since 80% of 73, the total number of data points, is 58.4 and not a whole number, it is rounded to 58. Similarly, 20% of 73, which is 14.6, is rounded to 15. The fifteen points that are used only to test the accuracy of the regression equation, and not to derive it, were selected at random using a Java program I coded myself, linked in *Appendix E*. The program randomly printed the *x* values that can be seen in *Table 3* with their corresponding *y* values.

[Table 1:](http://medrxiv.org/content/early/2023/10/28/2023.10.26.23297654/T1) Outliers determined based on the upper inner fence value of 7831.75

[Table 2:](http://medrxiv.org/content/early/2023/10/28/2023.10.26.23297654/T2) Outliers replaced by the mean of the inlier values nearest to them

[Table 3:](http://medrxiv.org/content/early/2023/10/28/2023.10.26.23297654/T3) Fifteen randomly generated values making up the Test split, in increasing order of time

The final version of the processed data, excluding the outliers and only containing the Train split, can be seen in *Appendix F*.
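The random selection of the Test split can be sketched in Python. This is not the Java program linked in *Appendix E*; it is a minimal stand-in that assumes the fifteen group numbers are drawn without replacement from 1 to 73. The function name `split_test_train` and the `seed` parameter are hypothetical.

```python
import random


def split_test_train(n_points=73, n_test=15, seed=None):
    """Randomly choose n_test group numbers for Test; the rest form Train."""
    rng = random.Random(seed)
    # sample() draws without replacement, so no group is picked twice.
    test_x = sorted(rng.sample(range(1, n_points + 1), n_test))
    test_set = set(test_x)
    train_x = [x for x in range(1, n_points + 1) if x not in test_set]
    return train_x, test_x


train_x, test_x = split_test_train(seed=0)
print(len(train_x), len(test_x))  # 58 15
```

Fixing the seed makes the split reproducible, which helps when the same Test points need to be reused to compare several regression models.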

To find the most accurate regression equation for this data, we can use the concept of the loss or cost function. The loss function is a measure of how poorly a regression model estimates the relationship between *x* and *y*, and it can be written using sigma notation, signifying summation.<sup>5</sup> The loss function measures the performance of the model by calculating the distance between the expected and actual values of *y* at each *x*, with *x* and *y* being the group and number of hospitalizations recorded in *Appendix F*. The loss function for linear regression is written as:

![Formula][13]

Where *S* is the loss function, *ŷ*<sub>*i*</sub> is the expected *i*th value of *y*, and *y*<sub>*i*</sub> is the actual *i*th value of *y*.

The difference between the expected and actual value of *y* is squared to avoid negative error values. This could also be avoided by taking the absolute value of the difference; however, that would make the function non-differentiable at some points, which would prevent us from minimizing the error using derivatives. Squaring the error also further penalizes the regression model for making errors: even a small error, such as one of 20 units, contributes 400 to the loss.

Now, the goal is to find the regression model that achieves the lowest possible loss. To do this, we need to identify the unknown coefficient and constant in the equation of a linear regression model, which is:

![Formula][14]

Where *a* is the slope of the regression model, or the coefficient of *x*, and *b* is the *y*-intercept, or the constant.

Substituting the equation of the regression model into the loss function, we get:

![Formula][15]

To find the values of *a* and *b* that minimize the loss, we need to partially differentiate the loss function with respect to each of the two unknowns. We can start with *b*, using the chain rule:

![Formula][16]

To minimize the loss with respect to *b*, we find the “critical numbers” of the loss function by setting the partial derivative to 0 and isolating *b*:

![Formula][17]

Breaking the summation up and factoring out *a*,

![Formula][18]

Expressing ![Graphic][19] in terms of *n*,

![Formula][20]

Adding *nb* to both sides,

![Formula][21]

Dividing both sides by *n*,

![Formula][22]

Looking at the resulting equation closely, we notice that the summation of *y* values divided by *n*, which is the number of terms, is equal to the mean of *y* values, or ![Graphic][23]. The same can be said for the summation of *x* values divided by *n*, which is equal to ![Graphic][24].
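For readers without access to the equation images, the steps for *b* can be written out as a worked sketch. It assumes the loss is the sum of squared differences between actual and expected values with *ŷ* = *ax* + *b*; the order of subtraction inside the square is an assumption, and it does not affect the result because the difference is squared.

```latex
\begin{aligned}
S(a, b) &= \sum_{i=1}^{n} \bigl(y_i - (a x_i + b)\bigr)^2 \\
\frac{\partial S}{\partial b} &= -2 \sum_{i=1}^{n} \bigl(y_i - a x_i - b\bigr) = 0 \\
\sum_{i=1}^{n} y_i - a \sum_{i=1}^{n} x_i - \sum_{i=1}^{n} b &= 0 \\
\sum_{i=1}^{n} y_i - a \sum_{i=1}^{n} x_i - nb &= 0 \\
\sum_{i=1}^{n} y_i - a \sum_{i=1}^{n} x_i &= nb \\
\frac{1}{n}\sum_{i=1}^{n} y_i - \frac{a}{n} \sum_{i=1}^{n} x_i &= b \\
b &= \bar{y} - a \bar{x}
\end{aligned}
```

The fourth line uses the fact that summing the constant *b* over *n* terms gives *nb*.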

Substituting in ![Graphic][25] and ![Graphic][26], we get,

![Formula][27]

We will leave *b* for now, and partially differentiate the loss function with respect to *a* this time:

![Formula][28]

Substituting in ![Graphic][29],

![Formula][30]

Isolating *a*,

![Formula][31]

Having found the equations for both *a* and *b*, the means of the *x* and *y* values from *Appendix F* were calculated in order to solve for *a* and *b*, using the following formula,

![Formula][32]

The mean values and the values for *x* and *y* obtained from *Appendix F* were substituted into the equation for *a*.

![Formula][33]

*a* was rounded to five significant digits. Substituting *a* into the equation for *b*,

![Formula][34]

*b* was also rounded to five significant digits. Substituting the values of *a* and *b* into the equation for the line of best fit,

![Formula][35]

Having found the equation of the linear regression model, we can graph the time series plot representing the data along with the regression, as seen in *Graph 3*. The fifteen Test values are also on the graph, represented by a different shade of grey to signify that they did not influence the regression line.
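The closed-form result of this derivation can be sketched in Python. The sketch uses the standard least-squares expressions *a* = Σ(*x* − x̄)(*y* − ȳ) / Σ(*x* − x̄)² and *b* = ȳ − *a*x̄, which are assumed to be algebraically equivalent to the formulas shown in the equation images; the function name `fit_line` and the sample points are hypothetical and are not the values from *Appendix F*.

```python
def fit_line(xs, ys):
    """Least-squares slope a and intercept b for the model y = a*x + b."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Numerator and denominator of the slope obtained by setting dS/da = 0.
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    a = sxy / sxx
    # Intercept obtained by setting dS/db = 0.
    b = y_mean - a * x_mean
    return a, b


# Made-up points that lie exactly on y = 2x + 1.
a, b = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(a, b)
```

Applied to the Train split, the two returned numbers would play the roles of *a* and *b* in the line of best fit.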

Visually, the regression line passes through some of the Test points and lies far from others. The difference between each actual Test point and the value predicted by the regression can be found by substituting the point’s time value into the regression equation, subtracting the result from the point’s number of hospitalizations, and taking the absolute value of the difference. A sample calculation is shown below for the first Test point at *x* = 5,

![Formula][36]

The sum of the differences can then be divided by 15, the total number of Test points, to find the Mean Absolute Error of the regression model.<sup>6</sup> This process is shown in *Table 4*.

[Table 4:](http://medrxiv.org/content/early/2023/10/28/2023.10.26.23297654/T4) Test Train Split calculations to determine the Mean Absolute Error and evaluate the accuracy of the regression model
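The Mean Absolute Error calculation summarized in *Table 4* can be sketched as follows. The function name `mean_absolute_error` and the sample points are hypothetical; the real inputs would be the fifteen Test points from *Table 3* and the fitted values of *a* and *b*.

```python
def mean_absolute_error(test_points, a, b):
    """Average absolute gap between actual y and the prediction a*x + b."""
    errors = [abs(y - (a * x + b)) for x, y in test_points]
    return sum(errors) / len(errors)


# Made-up Test points and coefficients, for illustration only.
points = [(1, 3), (2, 6), (3, 5)]
mae = mean_absolute_error(points, a=2, b=0)
print(mae)
```

Because the error is averaged over the Test points only, it reflects how the model performs on data that did not influence the regression line.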

A Mean Absolute Error of 867.0314 is high for a dataset with values that range from 309 to 6630, suggesting that the regression model is not a good fit for the data. This result led me to look back at my process and attempt to identify the limitations that caused the calculated regression model to represent the data poorly.

The main limitation I found was the shape of the regression model. The loss function I optimized minimized the error of a linear regression model, but data pertaining to a pandemic may not have a linear trend, because the rate at which the number of hospitalizations drops decreases over time as the total number of cases decreases. Data of such nature can be represented by a logarithmic or polynomial regression model. As an extension, the accuracy of the calculated linear regression model can be compared with that of logarithmic and polynomial regression models via the *R*<sup>2</sup> value, or the coefficient of determination. The *R*<sup>2</sup> value ranges from 0 to 1, with 0 being the least accurate and 1 being the most accurate. It is calculated from the ratio of the residual sum of squares, which measures the deviation between the actual data and the data predicted by the regression model, to the total sum of squares, which measures the deviation between the actual data and the mean.<sup>7</sup> The formula for the *R*<sup>2</sup> value, therefore, is:

![Formula][37]

Where *RSS* is the residual sum of squares and *TSS* is the total sum of squares.
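The coefficient of determination described above can be sketched in Python, assuming the usual definition *R*<sup>2</sup> = 1 − *RSS*/*TSS*. The function name `r_squared` and the sample values are hypothetical; the values reported in this paper were generated in Excel, not with this code.

```python
def r_squared(actual, predicted):
    """Coefficient of determination, computed as 1 - RSS / TSS."""
    mean_y = sum(actual) / len(actual)
    # Residual sum of squares: actual values versus model predictions.
    rss = sum((y - p) ** 2 for y, p in zip(actual, predicted))
    # Total sum of squares: actual values versus their mean.
    tss = sum((y - mean_y) ** 2 for y in actual)
    return 1 - rss / tss


print(r_squared([1, 2, 3], [1, 2, 3]))  # perfect predictions
print(r_squared([1, 2, 3], [2, 2, 2]))  # predicting the mean everywhere
```

A model that predicts every point exactly scores 1, while a model that does no better than the mean scores 0. When the function is applied to data a model was not fitted on, the result can fall below 0.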

It is worth noting that the fit of a simple linear regression is often reported using the *r* value, or the correlation coefficient, whose square equals the *R*<sup>2</sup> value for a least-squares line with an intercept. However, for the sake of comparing a linear regression model with non-linear models, the *R*<sup>2</sup> value will be used.

*Graph 4* contains the same data as *Graph 3*, but with a logarithmic regression curve and its *R*<sup>2</sup> value generated via Excel.

[Graph 2:](http://medrxiv.org/content/early/2023/10/28/2023.10.26.23297654/F2) Time series plot showing the number of COVID-19 hospitalizations in 2021-2022 in groups of 10 days, without outliers

[Graph 3:](http://medrxiv.org/content/early/2023/10/28/2023.10.26.23297654/F3) Time series plot showing the number of COVID-19 hospitalizations in 2021-2022 in groups of 10 days, including the linear regression model, separated into the Test and Train splits

The *R*<sup>2</sup> value of the linear regression drawn in *Graph 3* was calculated to be 0.011, again via Excel. The *R*<sup>2</sup> value of the logarithmic regression model seen in *Graph 4*, 0.1115, is around ten times greater than 0.011, supporting my identification of the model’s shape as the biggest limitation and showing that a logarithmic regression fits the data better.

[Graph 4:](http://medrxiv.org/content/early/2023/10/28/2023.10.26.23297654/F4) Time series plot showing the number of COVID-19 hospitalizations in 2021-2022 in groups of 10 days, including a logarithmic regression model, separated into the Test and Train splits

[Graph 5:](http://medrxiv.org/content/early/2023/10/28/2023.10.26.23297654/F5) Time series plot showing the number of COVID-19 hospitalizations in 2021-2022 in groups of 10 days, including a polynomial regression model, separated into the Test and Train splits

The polynomial regression model and its *R*<sup>2</sup> value were also found using Excel and can be seen in *Graph 5*.

The *R*<sup>2</sup> value of the polynomial regression model, 0.4114, is even higher, being around four times greater than that of the logarithmic regression model and around forty times greater than that of the linear model. This supports the conclusion that the shape of the model was the cause of the lack of accuracy shown by Test Train Split in *Table 4*.

It is important to deduce the reason for the linear regression model’s inaccuracy in order to answer my original research question: can differentiation-based regressive models provide accurate disease forecasting? The answer is not necessarily no, because the limitation was found to be the linear model’s shape and not the method by which its equation was derived. Since differentiation was used to minimize the error calculated by the loss function, we can be confident that the derived linear equation was the best possible linear model for the data in the least-squares sense. So, as an even further extension, if differentiation-based regression were applied to non-linear models, it could be used to forecast the progression of diseases such as COVID-19.
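As a sketch of that extension, the same closed-form solution can be reused for a logarithmic model *y* = *a* ln(*x*) + *b* by substituting *u* = ln(*x*), which makes the model linear in *u*. This is an illustration only and was not part of the analysis above: the model form, the function names `fit_line` and `fit_logarithmic`, and the sample points are assumptions.

```python
import math


def fit_line(us, ys):
    """Least-squares slope and intercept for y = a*u + b."""
    n = len(us)
    u_mean = sum(us) / n
    y_mean = sum(ys) / n
    suy = sum((u - u_mean) * (y - y_mean) for u, y in zip(us, ys))
    suu = sum((u - u_mean) ** 2 for u in us)
    a = suy / suu
    return a, y_mean - a * u_mean


def fit_logarithmic(xs, ys):
    """Fit y = a*ln(x) + b by applying the linear solution to u = ln(x)."""
    # Requires x > 0, which holds for group numbers starting at 1.
    us = [math.log(x) for x in xs]
    return fit_line(us, ys)


# Made-up points that lie exactly on y = 3*ln(x) + 5.
xs = [1, 2, 4, 8]
ys = [3 * math.log(x) + 5 for x in xs]
a, b = fit_logarithmic(xs, ys)
print(round(a, 6), round(b, 6))
```

The loss being minimized is still a sum of squared errors, so the partial derivatives with respect to *a* and *b* lead to the same formulas as in the linear case, with ln(*x*) in place of *x*.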

## Data Availability

All data produced are available online at [https://data.ontario.ca/dataset/covid-19-cases-in-hospital-and-icu-by-ontario-health-region](https://data.ontario.ca/dataset/covid-19-cases-in-hospital-and-icu-by-ontario-health-region).

[Appendix A:](http://medrxiv.org/content/early/2023/10/28/2023.10.26.23297654/T5) Raw data from January 1, 2021 to December 31, 2022, obtained from the Ontario Data Catalogue

[Appendix B:](http://medrxiv.org/content/early/2023/10/28/2023.10.26.23297654/T6) Data grouped into 73 groups with dates assigned numerical values

[Appendix C:](http://medrxiv.org/content/early/2023/10/28/2023.10.26.23297654/T7) Number of hospitalizations in increasing order and with term numbers

[Appendix D:](http://medrxiv.org/content/early/2023/10/28/2023.10.26.23297654/T8) Data grouped into 73 groups with dates assigned numerical values and outliers replaced by the mean of the two closest numbers of hospitalizations

**Appendix E:** Link to the Code Randomizing Fifteen Numbers from 1 to 73 [https://docs.google.com/document/d/1iTKYf4wEY5faM6ikTFJ7JEy7ghZVIxSkIOOK15JgwPo/edit?usp=sharing](https://docs.google.com/document/d/1iTKYf4wEY5faM6ikTFJ7JEy7ghZVIxSkIOOK15JgwPo/edit?usp=sharing)

[Appendix F:](http://medrxiv.org/content/early/2023/10/28/2023.10.26.23297654/T9) Data grouped into 73 groups with dates assigned numerical values, with outliers replaced, only including the Train split

## Footnotes

*   1 Meisa Salaita, “10 Ways We’re Using Data to Fight Disease,” HowStuffWorks, August 20, 2020, [https://science.howstuffworks.com/life/genetic/10-ways-were-using-data-fight-disease.htm](https://science.howstuffworks.com/life/genetic/10-ways-were-using-data-fight-disease.htm).

*   2 “COVID-19 Cases in Hospital and ICU, by Ontario Health (OH) Region - Ontario Data Catalogue,” n.d., [https://data.ontario.ca/dataset/covid-19-cases-in-hospital-and-icu-by-ontario-health-region](https://data.ontario.ca/dataset/covid-19-cases-in-hospital-and-icu-by-ontario-health-region).

*   3 Mark LeBoeuf, “Time Series Outlier Detection,” The Code Forest, July 29, 2017, [https://thecodeforest.github.io/post/time\_series\_outlier\_detection.html](https://thecodeforest.github.io/post/time_series_outlier_detection.html).

*   4 Michael Galarnyk, “Understanding Train Test Split,” Built In, July 28, 2022, [https://builtin.com/data-science/train-test-split](https://builtin.com/data-science/train-test-split).

*   5 Conor Mack, “Machine Learning Fundamentals (I): Cost Functions and Gradient Descent,” Medium, April 4, 2021, [https://towardsdatascience.com/machine-learning-fundamentals-via-linear-regression-41a5d11f5220](https://towardsdatascience.com/machine-learning-fundamentals-via-linear-regression-41a5d11f5220).

*   6 Jason Brownlee, “Train-Test Split for Evaluating Machine Learning Algorithms,” Machine Learning Mastery, August 26, 2020, [https://machinelearningmastery.com/train-test-split-for-evaluating-machine-learning-algorithms/](https://machinelearningmastery.com/train-test-split-for-evaluating-machine-learning-algorithms/).

*   7 Wallstreetmojo Team, “Residual Sum of Squares,” WallStreetMojo, June 18, 2022, [https://www.wallstreetmojo.com/residual-sum-of-squares/](https://www.wallstreetmojo.com/residual-sum-of-squares/).

*   Received October 26, 2023.
*   Revision received October 26, 2023.
*   Accepted October 28, 2023.


*   © 2023, Posted by Cold Spring Harbor Laboratory

This pre-print is available under a Creative Commons License (Attribution 4.0 International), CC BY 4.0, as described at [http://creativecommons.org/licenses/by/4.0/](http://creativecommons.org/licenses/by/4.0/)

## Bibliography

1.  “COVID-19 Cases in Hospital and ICU, by Ontario Health (OH) Region - Ontario Data Catalogue,” n.d. [https://data.ontario.ca/dataset/covid-19-cases-in-hospital-and-icu-by-ontario-health-region](https://data.ontario.ca/dataset/covid-19-cases-in-hospital-and-icu-by-ontario-health-region).
    
    
2.  Bank of Canada. “Inflation Calculator.” Accessed January 5, 2023. [https://www.bankofcanada.ca/rates/related/inflation-calculator/](https://www.bankofcanada.ca/rates/related/inflation-calculator/).
    
    
3.  Brownlee, Jason. “Train-Test Split for Evaluating Machine Learning Algorithms.” Machine Learning Mastery, August 26, 2020. Accessed January 5, 2023. [https://machinelearningmastery.com/train-test-split-for-evaluating-machine-learning-algorithms/](https://machinelearningmastery.com/train-test-split-for-evaluating-machine-learning-algorithms/).
    
    
4.  Galarnyk, Michael. “Understanding Train Test Split.” Built In, July 28, 2022. Accessed January 5, 2023. [https://builtin.com/data-science/train-test-split](https://builtin.com/data-science/train-test-split).
    
    
5.  Hopper, Tristin. “More than the Second World War: Here’s the Eyewatering Debt Canada Is Racking Up.” Nationalpost, March 17, 2021. Accessed January 5, 2023. [https://nationalpost.com/news/canada/heres-the-eyewatering-debt-canada-is-racking-up](https://nationalpost.com/news/canada/heres-the-eyewatering-debt-canada-is-racking-up).
    
    
6.  LeBoeuf, Mark. “Time Series Outlier Detection.” The Code Forest, July 29, 2017. Accessed January 5, 2023. [https://thecodeforest.github.io/post/time\_series\_outlier\_detection.html](https://thecodeforest.github.io/post/time_series_outlier_detection.html).
    
    
7.  Lorinc, Jacob. “How Much — Exactly — Has the Pandemic Cost Canada? Star Analysis Finds Toll Is More than $1.5 Billion a Day.” thestar.com, May 29, 2021. Accessed January 5, 2023. [https://www.thestar.com/business/2021/05/29/how-much-exactly-has-the-pandemic-cost-canada-star-analysis-finds-toll-is-more-than-15-billion-a-day.html](https://www.thestar.com/business/2021/05/29/how-much-exactly-has-the-pandemic-cost-canada-star-analysis-finds-toll-is-more-than-15-billion-a-day.html).
    
    
8.  Mack, Conor. “Machine Learning Fundamentals (I): Cost Functions and Gradient Descent.” Medium, April 4, 2021. Accessed January 5, 2023. [https://towardsdatascience.com/machine-learning-fundamentals-via-linear-regression-41a5d11f5220](https://towardsdatascience.com/machine-learning-fundamentals-via-linear-regression-41a5d11f5220).
    
    
9.  Microsoft Corporation. Microsoft Excel. [https://office.microsoft.com/excel](https://office.microsoft.com/excel).
    
    
10. Salaita, Meisa. “10 Ways We’re Using Data to Fight Disease.” HowStuffWorks, August 20, 2020. Accessed January 5, 2023. [https://science.howstuffworks.com/life/genetic/10-ways-were-using-data-fight-disease.htm](https://science.howstuffworks.com/life/genetic/10-ways-were-using-data-fight-disease.htm).
    
    
11. Team, Wallstreetmojo. “Residual Sum of Squares.” WallStreetMojo, June 18, 2022. Accessed January 5, 2023. [https://www.wallstreetmojo.com/residual-sum-of-squares/](https://www.wallstreetmojo.com/residual-sum-of-squares/).

 [1]: /embed/graphic-2.gif
 [2]: /embed/graphic-3.gif
 [3]: /embed/graphic-4.gif
 [4]: /embed/graphic-5.gif
 [5]: /embed/graphic-6.gif
 [6]: /embed/graphic-7.gif
 [7]: /embed/graphic-8.gif
 [8]: /embed/graphic-9.gif
 [9]: /embed/graphic-10.gif
 [10]: /embed/graphic-11.gif
 [11]: /embed/graphic-12.gif
 [12]: /embed/graphic-13.gif
 [13]: /embed/graphic-18.gif
 [14]: /embed/graphic-19.gif
 [15]: /embed/graphic-20.gif
 [16]: /embed/graphic-21.gif
 [17]: /embed/graphic-22.gif
 [18]: /embed/graphic-23.gif
 [19]: /embed/inline-graphic-1.gif
 [20]: /embed/graphic-24.gif
 [21]: /embed/graphic-25.gif
 [22]: /embed/graphic-26.gif
 [23]: /embed/inline-graphic-2.gif
 [24]: /embed/inline-graphic-3.gif
 [25]: /embed/inline-graphic-4.gif
 [26]: /embed/inline-graphic-5.gif
 [27]: /embed/graphic-27.gif
 [28]: /embed/graphic-28.gif
 [29]: /embed/inline-graphic-6.gif
 [30]: /embed/graphic-29.gif
 [31]: /embed/graphic-30.gif
 [32]: /embed/graphic-31.gif
 [33]: /embed/graphic-32.gif
 [34]: /embed/graphic-33.gif
 [35]: /embed/graphic-34.gif
 [36]: /embed/graphic-35.gif
 [37]: /embed/graphic-37.gif