Overfitting

Overfitting #

Suppose that we have a data set of \(k\) input-output pairs:

\[ \mathcal{D} : (x_1, y_1), (x_2, y_2), \ldots, (x_k, y_k)\]

My minimizing the mean squared loss (MSE), we have developed a way of finding a polynomial of any degree that “best” fits that data set \(\mathcal{D}\). The higher the degree, the more intricate our optimization problem will be, but if a computer is doing the heavy lifting, this is not something we should worry about. So suppose we plot the points in \(\mathcal{D}\) and obtain:

Question: What degree of polynomial should we use as a model for the data set \(\mathcal{D}\)?

We will use a Mathematica notebook to generate a synthetic data set consisting of a few dozen points and then compute polynomials of a variety of degrees. Use the notebook as you work through the following exercises.

Exercise:

Generate a data set \(\mathcal{D}\) and compute the quadratic polynomial of best fit, then a cubic polynomial of best fit.

What happened to the mean squared error as you increased the polynomial’s degree?
What do you expect to happen to the mean squared error as you increase the degree of the polynomial even more? Do this for a number of degrees and test your hypothesis.
Can you explain why mean squared error behaves this way as the degree of the polynomial increases?

Solution

In general, increasing the degree of the polynomial will give a smaller MSE. Simply put, there is more flexibility when choosing a degree three polynomial instead of a degree two polynomial; and with more flexibility you should should expect the error to be smaller. In fact, a degree two polynomial is just a degree three polynomial whose first coefficient if forced to be zero!

This tends to suggest that we should always use high degree polynomials! Hmmmm…

Exercise: Are high degree polynomials good models for data? Use the results from the exercise above to inform your answer. Given a data set, how could you figure out what degree polynomial you should use to model it?

Solution

If you look at a graph of a high degree polynomial optimized for a particular data set \(\mathcal{D}\), it will fit the data nicely. But what often happens is that the polynomial suggests erratic predictions for other nearby points. This phenomenon is known as overfitting. We will learn how to deal with it in the next section.