COVID-19 regression

Polynomial regression and the COVID-19 epidemic

In this article, Hari Singh and Seema Bawa developed a polynomial regression model designed to predict COVID-19 deaths in the US. This was an early project and only used 140 days of data. We will use a more extensive data set in an attempt to replicate their results. Our work will be based on a freely-available CDC data set.

Cumulative US deaths attributed to COVID-19 by day since January 22, 2020.

Goal: Develop a polynomial regression model to predict the number of deaths in the United States during the beginning of the COVID-19 epidemic. We will analyze its predictive ability by splitting the data into training and validation sets.

The data available to us is simple: it lists the cumulative number of deaths in the US attributed to COVID-19 for each day since January 22, 2020. By now we know that polynomial models are very good at overfitting data, so part of our task is to estimate the quality of our model’s predictions. We will use a validation set to do this, but we will form the validation set in two different ways. As always, there is a Mathematica notebook with all the necessary computational tools. Follow these exercises as you work.

Random validation set

In this exercise we will use a random selection of 80% of our data points as the training set and the other 20% as the validation set. This split is fairly standard for a data set of this size, although when working with smaller data sets a larger percentage is often used for validation.
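The lab's notebook is in Mathematica and is not reproduced here. As a rough illustration of the same idea, here is a minimal Python sketch of a random 80/20 split. The function name `random_split`, the fixed seed, and the placeholder data are all assumptions made for this example; they are not taken from the notebook or the CDC file.

```python
import random

def random_split(points, train_fraction=0.8, seed=0):
    """Shuffle a copy of the data and split it into training and validation lists."""
    rng = random.Random(seed)          # fixed seed so the split is reproducible
    shuffled = list(points)            # copy, so the caller's list is left untouched
    rng.shuffle(shuffled)
    cut = round(train_fraction * len(shuffled))
    return shuffled[:cut], shuffled[cut:]

# Placeholder data: (day, cumulative deaths) pairs with made-up counts.
data = [(day, 10 * day) for day in range(1, 101)]

train, validation = random_split(data)
print(len(train), len(validation))     # 80 20
```

Because the points are shuffled before the cut, validation days end up scattered among the training days rather than grouped at the end of the time range.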

Exercise:

Use the Mathematica notebook to load and split the CDC data into a training and a validation set. Fit polynomials of a variety of degrees to the training data and compare the training MSE to the validation MSE. Which degree polynomial would you use as a model?
Assess the size of the error you would expect if you used this model to estimate the cumulative number of COVID-19 deaths for a given day. Remember, the validation MSE is the average squared error your model makes on the validation set.
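For readers working outside Mathematica, the comparison of training and validation MSE across degrees might look roughly like the Python sketch below. Everything in it is an assumption made for illustration: the data are synthetic (a cubic plus noise, not the CDC counts), days are rescaled to the interval [0, 1] to keep the arithmetic well behaved, and the least-squares fit is done by hand with the normal equations so that the example needs only the standard library. In practice one would more likely call a library routine such as `numpy.polyfit` or Mathematica's `Fit`.

```python
import random

def fit_polynomial(xs, ys, degree):
    """Least-squares fit via the normal equations; coefficients are lowest degree first."""
    m = degree + 1
    # Normal equations A c = b with A[i][j] = sum of x^(i+j), b[i] = sum of y * x^i.
    A = [[sum(x ** (i + j) for x in xs) for j in range(m)] for i in range(m)]
    b = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(m)]
    # Gaussian elimination with partial pivoting.
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for row in range(col + 1, m):
            factor = A[row][col] / A[col][col]
            for c in range(col, m):
                A[row][c] -= factor * A[col][c]
            b[row] -= factor * b[col]
    # Back substitution.
    coeffs = [0.0] * m
    for i in reversed(range(m)):
        tail = sum(A[i][j] * coeffs[j] for j in range(i + 1, m))
        coeffs[i] = (b[i] - tail) / A[i][i]
    return coeffs

def evaluate(coeffs, x):
    """Evaluate the polynomial at x using Horner's rule."""
    result = 0.0
    for c in reversed(coeffs):
        result = result * x + c
    return result

def mse(coeffs, xs, ys):
    """Mean squared error of the polynomial's predictions on (xs, ys)."""
    return sum((evaluate(coeffs, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Synthetic stand-in for the real data: a cubic trend plus Gaussian noise.
rng = random.Random(1)
xs = [i / 99 for i in range(100)]                      # days rescaled to [0, 1]
ys = [1000 * x ** 3 + rng.gauss(0, 20) for x in xs]    # made-up counts

# Random 80/20 split of the indices.
indices = list(range(100))
rng.shuffle(indices)
train_x = [xs[i] for i in indices[:80]]
train_y = [ys[i] for i in indices[:80]]
val_x = [xs[i] for i in indices[80:]]
val_y = [ys[i] for i in indices[80:]]

for degree in range(1, 6):
    coeffs = fit_polynomial(train_x, train_y, degree)
    print(degree, round(mse(coeffs, train_x, train_y), 1), round(mse(coeffs, val_x, val_y), 1))
```

Each printed row gives the degree, the training MSE, and the validation MSE. The normal equations become badly conditioned as the degree grows, so this hand-rolled fit is only suitable for low degrees.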

Solution

I tried polynomials up to degree thirty, and the training and validation MSE tracked closely. There is an initial MSE drop at degree three and another one at degree twelve. After this, the MSE decreases slowly.

The validation MSE for a degree three polynomial is roughly 24700. Since this is the average of the squares of the errors of our model’s predictions, we can expect, very roughly, the error of a prediction to be around \(\sqrt{24700} \approx 160\). For the degree twelve polynomial, the validation MSE is roughly 6300, and \(\sqrt{6300} \approx 80\). A more accurate estimate of our error would be the average of the absolute values of our prediction errors. This is called the MAE, or mean absolute error.
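To make the difference between these error summaries concrete, here is a small Python sketch. The list of errors is hypothetical, chosen only to show how a single large miss affects each measure; it does not come from the lab's data.

```python
import math

def mse(errors):
    """Mean squared error."""
    return sum(e * e for e in errors) / len(errors)

def rmse(errors):
    """Root mean squared error: the square root of the MSE."""
    return math.sqrt(mse(errors))

def mae(errors):
    """Mean absolute error."""
    return sum(abs(e) for e in errors) / len(errors)

# Square roots of the two validation MSE values quoted in the solution.
print(round(math.sqrt(24700), 1))   # 157.2
print(round(math.sqrt(6300), 1))    # 79.4

# Hypothetical prediction errors: three small misses and one large one.
errors = [10, -10, 10, -300]
print(mse(errors))                  # 22575.0
print(round(rmse(errors), 1))       # 150.2
print(mae(errors))                  # 82.5
```

The square root of the MSE is never smaller than the MAE, and the gap widens when a few errors are much larger than the rest, because squaring gives those errors extra weight.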

Time-sequenced validation set

Now, let us form the validation set a little differently. We will pick some number of days, say \(n=500\), for training and the next few days, say \(k=10\), for validation. We will experiment with different values of \(n\) and \(k\).
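A time-sequenced split is just a pair of consecutive slices. The following Python sketch shows one way to write it; the function name `time_split` and the placeholder series are assumptions for this example, not part of the Mathematica notebook.

```python
def time_split(series, n, k):
    """Use the first n days for training and the following k days for validation."""
    if n + k > len(series):
        raise ValueError("not enough days for this choice of n and k")
    return series[:n], series[n:n + k]

# Placeholder series: (day, cumulative deaths) pairs with made-up counts.
series = [(day, 10 * day) for day in range(1, 601)]

train, validation = time_split(series, n=500, k=10)
print(train[-1][0], validation[0][0], validation[-1][0])   # 500 501 510
```

Unlike the random split, no shuffling happens here: every validation day lies after every training day.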

Exercise:

Choose values of \(n\) and \(k\) and use the Mathematica notebook to load and split the CDC data into a training and a validation set in this new time-sequenced way. Fit polynomials of a variety of degrees to the training data and compare the training MSE to the validation MSE. Which degree polynomial would you use as a model?
Repeat with a variety of values of \(n\) and \(k\). The original article used \(n=140\).
Based on your experiments, would you be comfortable using a polynomial model to predict COVID-19 deaths?
Singh and Bawa used days 80 through 220 as their training set and, as one of their models, suggest the polynomial \[p(x) = 20261 + 2482x - 9.9x^2 - 0.23x^3 + 0.003x^4 - 0.000009x^5.\] Plot this polynomial against our data set. How well does their model predict the future course of the COVID-19 epidemic?
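Before plotting, it can help to tabulate a few values of the polynomial. The Python sketch below evaluates \(p(x)\) with Horner's rule, using the coefficients exactly as printed above. The lab text does not say which day corresponds to \(x = 0\), so the sample inputs here are arbitrary and the alignment with calendar days is left as an assumption for you to settle.

```python
# Coefficients of the Singh-Bawa polynomial as printed in the exercise, lowest degree first.
SINGH_BAWA = [20261, 2482, -9.9, -0.23, 0.003, -0.000009]

def evaluate(coeffs, x):
    """Evaluate a polynomial at x using Horner's rule."""
    result = 0.0
    for c in reversed(coeffs):
        result = result * x + c
    return result

# Arbitrary sample inputs; what day x = 0 represents is an assumption you must fix.
for x in (0, 50, 100, 140):
    print(x, round(evaluate(SINGH_BAWA, x)))
```

To compare against the CDC data, evaluate the polynomial over the full range of days in the data set, after shifting \(x\) by whatever offset matches the article's convention.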

Solution

While polynomial models are really good at filling in missing data, as we saw above, one has to tread carefully when using them to extrapolate what happens in the future, or outside the range of the input values in the data set. When I used \(n=100\) and \(k=10\) with a degree five polynomial, my model predicted that after some point, cumulative COVID-19 deaths will decline. Hmmmm.
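Cumulative counts can never go down, which gives a simple sanity check for any fitted model: scan its predictions and report the first day on which the predicted total drops. Here is a minimal Python sketch of that check. The helper name `first_decrease` and the toy polynomial are made up for illustration; the toy is not a fitted model.

```python
def evaluate(coeffs, x):
    """Evaluate a polynomial (coefficients lowest degree first) using Horner's rule."""
    result = 0.0
    for c in reversed(coeffs):
        result = result * x + c
    return result

def first_decrease(coeffs, last_day):
    """Return the first day whose prediction is below the previous day's, or None."""
    previous = evaluate(coeffs, 0)
    for day in range(1, last_day + 1):
        current = evaluate(coeffs, day)
        if current < previous:
            return day
        previous = current
    return None

# Toy polynomial 100x - 0.5x^2, which peaks at x = 100 and then falls.
toy = [0, 100, -0.5]
print(first_decrease(toy, 365))      # 101

# A straight line with positive slope never decreases.
print(first_decrease([0, 1], 365))   # None
```

Running the same check on your own fitted coefficients shows how far past the training window the model stays plausible.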

Summarize your findings

There are two scenarios in which one can use a model trained on time-series data such as the one used in this exercise.

In the first, we can ask our model to fill in, or interpolate, values for missing data. For instance, our data may contain values for only 80% of the days scattered throughout some period of time and we are asked to estimate the values for the remaining twenty percent. This scenario often occurs when there are data collection problems.
In the second, we can ask our model to extrapolate what will happen in the future, or what happened before our data set was collected.

Homework exercise: Suppose you have a time-series data set like the COVID-19 data set from this lab. In a paragraph, describe how you would choose a validation set to assess the predictive error of your model in each of the two scenarios described above.

Further thoughts

This lab raises some fundamental issues. The first is about trying to model something as complex as the progression of a disease using a single input: time. The second is the use of polynomial models to make predictions about the future. If you have the inclination, ponder the following questions:

Challenge question: The Singh-Bawa model uses just one input: time. Some phenomena can be modeled from such simple data, for instance the distance traveled by a falling object. Is it reasonable to expect that the progress of an epidemic can be modeled using only time as input? What other features would be helpful?

Challenge question: Polynomials are interesting things. When the input \(x\) is large, a polynomial is dominated by its highest-degree term, especially when that degree is high. Explain why a polynomial model trained on time-series data from a particular time range is unlikely to make good predictions for the future.
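As a starting point, the dominance claim can be made precise. Write a degree-\(d\) polynomial as \(p(x) = a_d x^d + a_{d-1} x^{d-1} + \cdots + a_0\) with \(a_d \neq 0\) (generic notation introduced here, not taken from the article). Dividing by the leading term gives

\[\frac{p(x)}{a_d x^d} = 1 + \frac{a_{d-1}}{a_d}\cdot\frac{1}{x} + \cdots + \frac{a_0}{a_d}\cdot\frac{1}{x^d} \longrightarrow 1 \quad \text{as } x \to \infty.\]

So for large \(x\), \(p(x)\) behaves like \(a_d x^d\), whose magnitude grows without bound and whose sign is fixed by \(a_d\).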