COVID-19 regression

Polynomial regression and the COVID-19 epidemic: solutions

Summarize your findings

There are two scenarios in which one can use a model trained on time-series data such as the one used in this exercise.

In the first, we can ask our model to fill in, or interpolate, values for missing data. For instance, our data may contain values for only 80% of the days, scattered throughout some period of time, and we are asked to estimate the values for the remaining 20%. This scenario often occurs when there are data collection problems.
In the second, we can ask our model to extrapolate: to predict what will happen in the future, or to reconstruct what happened before our data set was collected.

Homework exercise: Suppose you have a time-series data set like the COVID-19 data set from this lab. In a paragraph, describe how you would choose a validation set to assess the predictive error of your model in each of the two scenarios described above.

Solution

The model’s performance on the validation data set is intended to estimate the accuracy of its predictions on new data. The validation set should therefore be chosen to resemble the points on which the model will actually be asked to predict.

If the goal of our model is to make predictions for values of \(x\) that lie between the values of \(x\) observed in our data set (that is, interpolating), then the correct way of structuring the validation set is to leave out points chosen at random throughout \(\mathcal{D}\) and see how well our model recovers them.
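As a rough illustration of the random hold-out just described, here is a minimal sketch in Python with NumPy. The helper name `random_holdout`, the 20% validation fraction, the quadratic model, and the synthetic series are all assumptions made for this example; they are not taken from the lab's code or data.

```python
import numpy as np

def random_holdout(n_points, val_fraction=0.2, seed=0):
    """Split indices 0..n_points-1 into train/validation sets, with the
    validation points scattered at random throughout the series.
    (Hypothetical helper, not part of the lab.)"""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n_points)
    n_val = int(round(val_fraction * n_points))
    val_idx = np.sort(perm[:n_val])
    train_idx = np.sort(perm[n_val:])
    return train_idx, val_idx

# Synthetic stand-in for a time series (NOT the lab's COVID-19 data).
noise_rng = np.random.default_rng(1)
x = np.arange(50, dtype=float)
y = 0.05 * x**2 + 3.0 + noise_rng.normal(scale=1.0, size=x.size)

train_idx, val_idx = random_holdout(len(x))

# Fit a degree-2 polynomial on the training points only.
coeffs = np.polyfit(x[train_idx], y[train_idx], deg=2)

# Validation error estimates how well the model fills in missing days.
val_pred = np.polyval(coeffs, x[val_idx])
val_rmse = np.sqrt(np.mean((val_pred - y[val_idx]) ** 2))
print(f"interpolation validation RMSE: {val_rmse:.3f}")
```

Because the held-out points are surrounded by training points, the resulting error reflects interpolation accuracy only; it says little about how the model behaves beyond the ends of the series.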

On the other hand, if the goal is to make predictions outside the range of observed values of \(x\) (that is, extrapolating), then the best way to form a validation set is to use the points from the end of the time-series (if trying to predict the future) or from the beginning of the time-series (if trying to reconstruct the past).
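The end-of-series (or start-of-series) hold-out can be sketched in the same way. The helper name `temporal_holdout`, its `predict` argument, the 20% fraction, and the synthetic series below are assumptions made for this example, not part of the lab.

```python
import numpy as np

def temporal_holdout(n_points, val_fraction=0.2, predict="future"):
    """Split indices 0..n_points-1 into train/validation sets using a
    contiguous block at one end of the series.
    (Hypothetical helper, not part of the lab.)

    predict="future": validate on the last points of the series.
    predict="past":   validate on the first points of the series.
    """
    n_val = int(round(val_fraction * n_points))
    idx = np.arange(n_points)
    if predict == "future":
        return idx[: n_points - n_val], idx[n_points - n_val :]
    if predict == "past":
        return idx[n_val:], idx[:n_val]
    raise ValueError("predict must be 'future' or 'past'")

# Synthetic stand-in for a time series (NOT the lab's COVID-19 data).
noise_rng = np.random.default_rng(1)
x = np.arange(50, dtype=float)
y = 0.05 * x**2 + 3.0 + noise_rng.normal(scale=1.0, size=x.size)

# Train on the first 80% of days, validate on the final 20%.
train_idx, val_idx = temporal_holdout(len(x), predict="future")
coeffs = np.polyfit(x[train_idx], y[train_idx], deg=2)

# Validation error now estimates how well the model extrapolates forward.
val_pred = np.polyval(coeffs, x[val_idx])
val_rmse = np.sqrt(np.mean((val_pred - y[val_idx]) ** 2))
print(f"extrapolation validation RMSE: {val_rmse:.3f}")
```

Keeping the validation block contiguous and outside the training range means the model never sees points on both sides of a held-out day, which is the situation it faces when extrapolating.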