Training and validation sets
In the following images, I had Mathematica compute the polynomials of best fit for a fixed data set \(\mathcal{D}\). The degrees represented are 3, 10, 20, and 25. As the degree of the polynomial increases, the mean squared error decreases and, strictly speaking, the polynomial “fits” the data more accurately.
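If you want to try this kind of experiment yourself, here is a minimal Wolfram Language sketch. The data set it generates is synthetic, only a stand-in for \(\mathcal{D}\) rather than the points in the images, and the helpers polyFit and mse are defined just for this sketch.

```wolfram
(* Synthetic stand-in for the data set D; not the points shown in the images above *)
data = Table[{x, Sin[2 x] + RandomReal[{-0.3, 0.3}]}, {x, 0., 3., 0.05}];

(* Least-squares polynomial of a given degree, and its mean squared error on a set of points *)
polyFit[pts_, deg_] := Fit[pts, Table[x^i, {i, 0, deg}], x];
mse[poly_, pts_] := Mean[(pts[[All, 2]] - (poly /. x -> pts[[All, 1]]))^2];

(* The error on the data used for fitting shrinks as the degree grows *)
Table[{deg, mse[polyFit[data, deg], data]}, {deg, {3, 10, 20, 25}}]
```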
But you should be concerned about the predictive ability of the models created by the higher degree polynomials. For instance, I would be reluctant to use the degree 25 polynomial to estimate the \(y\)-value when \(x = 1.5\): the prediction would not look like it came from the rest of the data set. Here is the fundamental question: how can we tell whether a model will make accurate predictions on inputs it has not yet seen?
Provided that the data set \(\mathcal{D}\) has sufficiently many points, machine learning practitioners have come up with a clever solution. What we would like our model to do is make accurate predictions, not just replicate the data we used to fit it. But how could we estimate how well our model would do at making predictions? The solution is to pretend that we are unaware of some of the points in our data set when we fit our model!
Here are the details. Suppose our data set \(\mathcal{D}\) consists of one hundred points. Let us split it randomly into two parts, called the training and validation sets:
\[ \mathcal{D}_{training} : (x_1, y_1), (x_4, y_4), \ldots, (x_{99}, y_{99})\] \[ \mathcal{D}_{validation} : (x_2, y_2), (x_3, y_3), \ldots, (x_{100}, y_{100})\]
The idea is to use the first part of the data, \(\mathcal{D}_{training}\), to fit, or train, the model; and then validate it using \(\mathcal{D}_{validation}\). If the model we computed still has small mean squared error for points in the validation set, it is likely to make accurate predictions on yet unseen data.
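In Mathematica, a split like this takes only a couple of lines. The sketch below continues the one above (reusing polyFit and mse) on a synthetic one-hundred-point data set; it illustrates the idea rather than reproducing the data used in this section.

```wolfram
(* Synthetic 100-point data set, shuffled and split into training and validation halves *)
data = Table[{x, Sin[2 x] + RandomReal[{-0.3, 0.3}]}, {x, RandomReal[{0, 3}, 100]}];
{training, validation} = TakeDrop[RandomSample[data], 50];

(* Fit using only the training points, then evaluate the error on both halves *)
model = polyFit[training, 15];
{mse[model, training], mse[model, validation]}
```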
Example: Here is an example of how to do this in practice. I took a data set of points and split it evenly into training and validation points. Below, the training points are in red and the validation points are in blue. Then I used only the training points to compute a degree 15 polynomial; the fit is pretty good, with a low mean squared error of 1.25. But how accurate would its predictions be on unseen data? We can estimate this by looking at how good the predictions are on the validation data. It looks like they would be substantially worse, with a mean squared error on the validation set of roughly 21.71.
So it seems that the degree 15 polynomial I computed fits the training data much better than it fits the validation data. That is, it overfits.
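The same comparison can be repeated across many degrees at once. The sweep below, again on the synthetic split from the previous sketch, tabulates training and validation error for each degree; picking the degree with the smallest validation error is one natural way to read such a table.

```wolfram
(* Compare training and validation error across a range of degrees;
   overfitting shows up as validation MSE growing while training MSE keeps shrinking *)
errors = Table[
   With[{model = polyFit[training, deg]},
    {deg, mse[model, training], mse[model, validation]}],
   {deg, 1, 20}];

(* One natural choice of degree: the one with the smallest validation MSE *)
First[MinimalBy[errors, Last]]
```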
The goal of the following exercises is to figure out how to effectively use training and validation data to find a good model for a data set. As always, there is a Mathematica notebook to work with as you consider these exercises.
- Create a synthetic data set of input-output pairs and divide it into training and validation data sets.
- Use the training data set to fit a polynomial. Do this for a variety of degrees. Check how well your polynomial model fits the training and validation data sets.
- Repeat this exercise, generating increasingly complex data sets; the notebook will tell you how to do this.
- Which degree polynomial should you use if you would like to make the most accurate predictions on yet unseen data?
Outline how you would use training and validation sets to choose the degree for a polynomial model. Illustrate this process using an example from your Mathematica work above. Feel free to import images to support your work.
Bonus question: How can you use the validation mean squared error to estimate how good your predictions would be?