Regression #
Suppose that we have been asked to understand the relationship between outdoor temperature and ice cream sales. We have been given access to a (very small) data set and would like to make predictions about the sales numbers that should be expected at a variety of temperatures.
Temperature (°F) | Sales ($1000) |
---|---|
60 | 2.5 |
70 | 3.5 |
80 | 4.2 |
How should we proceed? We can summarize the entire data set \(\mathcal{D}\) as a collection of ordered pairs and in this form our problem is to estimate the \(y\) coordinate from the \(x\) coordinate:
\[\mathcal{D}: \;\;\; (60,2.5), \;\;\; (70,3.5), \; \text{and } \;\; (80,4.2).\]
If we plot these points in the plane, they appear to lie close to a line \(y=mx+b\). If we can estimate the slope \(m\) and \(y\)-intercept \(b\), then the resulting function \(f(x) = mx + b\) can be used as a model for sales predictions at a variety of temperatures.
Finding an appropriate function \(f(x)\) and using it to make predictions is an example of a regression problem. The \(x\)-coordinates are often called inputs, predictors, covariates, explanatory variables, or features. The \(y\)-coordinates are known as outputs, outcomes, or response variables.
Regression: There are two fundamental challenges in regression. Given a data set consisting of \(n\) input-output pairs of the form: \[\mathcal{D}: \; \; \; (x_1,y_1), \; \; \; (x_2,y_2), \; \; \; \ldots \; \; \;(x_n,y_n)\]
- identify a function \(f(x)\), or a model, that fits the data. That is, \(f(x_k) \) should approximately equal \( y_k\) for each of the data points, and
- estimate the error of the predictions made by this model.
We will mostly focus on the first question, leaving a detailed treatment of the second to a course in statistics or machine learning.
There is quite a bit of choice as to which functions one can use as regression models. The most common form is linear regression, where \(f\) is a linear function. But polynomials, trigonometric polynomials, and neural networks are also used.
Defining “best” fit #
Suppose, as in the example above, that we have decided to use a linear function to make a model for a data set. Being greedy, we are not interested in just any line, but one that fits our data “best.” But what exactly is a line of “best” fit? I suppose we could graph a variety of lines through our data and pick the one that fits the data in some aesthetically pleasing way, but there is a better and more precise way to do this. Suppose first that our data set consists of \(n\) input-output pairs:
\[\mathcal{D}: \; \; \; (x_1,y_1), \; \; \; (x_2,y_2), \; \; \; \ldots \; \; \;(x_n,y_n)\]
and that we would like to evaluate a candidate function \(f(x)\) to see how well it “fits” the data. The crucial idea is that we can use \(f(x)\) to predict the \(y\) coordinates and then compute the error in each prediction. So when we plug \(x_1\) into \(f\), the prediction is \(f(x_1)\), while the value our model should have predicted is \(y_1\), resulting in a squared prediction error of:
\[(f(x_1) - y_1)^2\]
for the first data point. Hold on, why did we square the difference? We would like the error to count against us regardless of whether the prediction \(f(x_1)\) underestimates or overestimates the true value \(y_1\), and squaring the difference is a simple way of making sure that we get a non-negative value. If we do this for all the points in our data set and average:
The mean squared error of the model \(f(x)\) on a data set \(\mathcal{D}\) is
\[\text{MSE} = \tfrac{1}{n} \sum_{k=1}^n (f(x_k) - y_k)^2,\]
or just the averaged square error for all the points in our data set. Note the fancy \(\Sigma\) notation. It is shorthand for the sum of squared errors our model makes for all the data points.
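To make the definition concrete, here is a minimal sketch in Python of how one might compute the mean squared error of a candidate model on the ice cream data from above; the candidate line \(f(x) = 0.085x - 2.6\) is just a plausible guess, not necessarily the best fit.

```python
def mean_squared_error(f, xs, ys):
    """Average of the squared prediction errors (f(x_k) - y_k)^2."""
    n = len(xs)
    return sum((f(x) - y) ** 2 for x, y in zip(xs, ys)) / n

# Ice cream data: temperatures (°F) and sales (in $1000).
temperatures = [60, 70, 80]
sales = [2.5, 3.5, 4.2]

# Evaluate a hypothetical candidate line f(x) = 0.085x - 2.6.
print(mean_squared_error(lambda x: 0.085 * x - 2.6, temperatures, sales))
```

A smaller output means a better fit; the regression problem is to find the \(f\) that makes it as small as possible.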
So which function \(f(x)\) should we use to make predictions from our data? Simple: the one that minimizes the mean squared error of its predictions! Let’s make this more real in the next section.
A line that minimizes mean squared error #
Let’s work through a specific example. Suppose that our data set consists of three points:
\[\mathcal{D}: \;\;\; (1,2), \;\;\; (3,3), \;\;\; \text{and} \;\;\; (4,5).\]
We would like to find a line \(f(x) = mx +b \) that best fits this data. First, let’s use a generic line \(f(x) = mx +b \) to make predictions. We don’t know what \(m\) and \(b\) are quite yet, but we can still make predictions using these undetermined variables.
x-value | prediction \(f(x)\) | y-value | squared error |
---|---|---|---|
1 | \(m \cdot 1 + b\) | 2 | \((m + b -2)^2\) |
3 | \(m \cdot 3 + b\) | 3 | \((3m + b -3)^2\) |
4 | \(m \cdot 4 + b\) | 5 | \((4m + b -5)^2\) |
The mean squared error is the average of the terms in the right-most column. In fact, it is a function of the unknown parameters \(m\) and \(b\). We write this as a function \(\text{E}(m,b)\):
\[\text{MSE} = \text{E}(m,b) = \tfrac{1}{3} \Big((m + b -2)^2+(3m + b -3)^2+ (4m + b -5)^2 \Big) \]
You are ready to put on your optimization hat and take over. Complete the following exercise: find the values of \(m\) and \(b\) that minimize \(E(m,b)\), and explain why the critical point you find is a global minimum.
Solution
If we expand \(E(m,b)\), we get:
\[E(m,b)=\tfrac{1}{3} \Big( 3b^2 - 20b +38 -62m + 16bm + 26m^2 \Big)\]
To find its critical points, we set both partial derivatives equal to zero:
\[\tfrac{\partial E}{\partial b} = \tfrac{1}{3}(16m+6b-20) = 0 \] \[\tfrac{\partial E}{\partial m} = \tfrac{1}{3}(52m+16b-62) = 0 \]
and solve to obtain \(m = \tfrac{13}{14} \text{ and } b = \tfrac{6}{7}.\) This critical point is indeed a global minimum for \(E(m,b)\). One way to see this is to recognize the graph of this function as an upward-opening paraboloid.
Another is to compute its Hessian and use the second derivative test. The determinant of the Hessian is positive, so this critical point is not a saddle point, and both pure second partial derivatives are positive, so it is a local minimum. Finally, since it is the only critical point and there is no boundary to complicate things, this point is in fact a global minimum.
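If you would like to check this computation by machine, here is a minimal sketch using sympy (assuming it is available in your notebook environment) that recovers the same critical point and Hessian:

```python
import sympy as sp

m, b = sp.symbols('m b')
E = sp.Rational(1, 3) * ((m + b - 2)**2 + (3*m + b - 3)**2 + (4*m + b - 5)**2)

# Set both partial derivatives to zero and solve for the critical point.
critical = sp.solve([sp.diff(E, m), sp.diff(E, b)], [m, b])
print(critical)  # {m: 13/14, b: 6/7}

# The Hessian has positive determinant and positive diagonal entries,
# confirming that the critical point is a minimum.
H = sp.hessian(E, (m, b))
print(H, H.det())
```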
Polynomials that minimize mean squared error #
Sometimes it does not make sense to find a line of best fit. For instance, when the points in the data set appear to lie on a parabola, we may be better off looking for a quadratic polynomial of the form \(f(x) = ax^2 +bx+c\) as a model for our data.
You already have the mathematical tools to tackle this problem, but you will now have to deal with an error function \(E(a,b,c)\) with three variables. Complete the following exercise:
Find a quadratic polynomial \(f(x) = a x^2 + bx + c\) that minimizes the mean squared prediction error on the data set \( \mathcal{D}: \; \; \; (1,1), \; \; \; (2,1), \; \; \; (3,2), \; \; \; \text{and} \; \; \; (5,1). \)
Use tools from optimization to explain why the answer you found is indeed a global minimum of the error function. Repeat the exercise, this time with polynomials of degree one (that is, a line), and degree three. Feel free to use this notebook to solve systems of equations.
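If you would rather not solve the resulting \(3 \times 3\) linear system by hand, here is a minimal sketch of solving a linear system numerically with numpy; the matrix and right-hand side below are placeholders, to be replaced with the coefficients you obtain by setting the partial derivatives of \(E(a,b,c)\) to zero.

```python
import numpy as np

# Placeholder system M @ [a, b, c] = v; substitute the coefficients
# from your own critical-point equations.
M = np.array([[2.0, 1.0, 1.0],
              [1.0, 3.0, 1.0],
              [1.0, 1.0, 4.0]])
v = np.array([1.0, 2.0, 3.0])

a, b, c = np.linalg.solve(M, v)
print(a, b, c)
```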
If you think about it, we can repeat the same exercise and try to find a polynomial model of even greater degree that minimizes the mean squared error of its predictions. There will be more variables and the computations will be more involved, but the idea is exactly the same. So which degree should we use to model a data set? This is an important question; we will address it in the next section.
A general formula for linear regression #
It turns out that if we are only interested in finding lines \(f(x) = mx+b\) that minimize mean squared error, it is possible to derive a general formula for the slope \(m\) and \(y\)-intercept \(b\) and never have to do the entire computation again. You will do so in the following exercises. One advantage of a general formula is that once you know it, you can avoid a lot of computations in the future. One disadvantage is that you have to spend the time to derive it in the first place!
Suppose that we are trying to fit a line \(f(x)= mx+b\) to a data set consisting of \(n\) input-output pairs: \[\mathcal{D}: \; \; \; (x_1,y_1), \; \; \; (x_2,y_2), \; \; \; \ldots \; \; \;(x_n,y_n).\]
As we saw above, to do this we will have to find values of \(m\) and \(b\) that minimize the mean squared error function.
Note on summation notation
These exercises make extensive use of summation notation. If you have not seen it before or need a refresher, the notation means:
\[ \sum_{k = 1}^n y_k = y_1 + y_2 + \ldots +y_n \]
which is a succinct way to write long sums without a trailing row of dots. It takes a while to get used to working with summations directly, so feel free to expand everything into the more familiar dot-dot-dot form as you complete these exercises.
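If it helps, a summation corresponds directly to a short loop, or to a single call to Python's built-in `sum`; here is a tiny sketch with made-up numbers:

```python
ys = [1.0, 4.0, 2.0, 5.0]   # y_1, ..., y_n with n = 4

# The sum written as a loop ...
total = 0.0
for y in ys:
    total += y

# ... agrees with the built-in sum() and with the dot-dot-dot expansion.
assert total == sum(ys) == 1.0 + 4.0 + 2.0 + 5.0
```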
Verify that \begin{align} \tfrac{\partial E}{\partial b} &= \tfrac{2}{n} \sum_{k=1}^n \big((mx_k+b)-y_k\big) \\ \tfrac{\partial E}{\partial m} &= \tfrac{2}{n} \sum_{k=1}^n \big((mx_k+b)-y_k\big) \cdot x_k \\ \end{align}
Keep in mind that \(m\) and \(b\) are variables while the \(x_k\) and \(y_k\) are actual constants that come from specific points in the data set.
Using the partial derivatives you computed in the previous question we can try to find the critical points of the error function by setting both equal to zero. Do this and explain why the critical points \((m,b)\) of \(E(m,b)\) satisfy the equations: \begin{align} Ab + Bm &= D \\ Bb + Cm &= E\\ \end{align}
where
\begin{align} A &= n\ \\ B &= \sum_{k=1}^n x_k\\ C &= \sum_{k=1}^n x_k^2\\ D&=\sum_{k=1}^n y_k\\ E&=\sum_{k=1}^n x_k y_k\\ \end{align}
Note that even though these five constants look quite complicated, they are just constants that can be easily computed from the data set, especially for a computer!
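As a sanity check, here is a sketch of computing these five constants for the three-point data set from the worked example above:

```python
xs = [1, 3, 4]
ys = [2, 3, 5]

n = len(xs)
A = n                                     # number of data points
B = sum(xs)                               # sum of the x_k
C = sum(x * x for x in xs)                # sum of the x_k squared
D = sum(ys)                               # sum of the y_k
E = sum(x * y for x, y in zip(xs, ys))    # sum of the products x_k * y_k
print(A, B, C, D, E)                      # 3 8 26 10 31
```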
Solve the equations in the previous exercise for \(m\) and \(b\). You should obtain:
\begin{align} b = \frac{BE-CD}{B^2-AC} \\ m = \frac{BD-AE}{B^2-AC} \\ \end{align}
Show your work. This is the general formula for the line minimizing mean squared error for the data set \(\mathcal{D}\).
Find the line of best fit for the data set:
\[\mathcal{D}: \;\;\; (1,2), \;\;\; (3,3), \;\;\; \text{and} \;\;\; (4,5).\]
This is the example we worked out above, so you already know the answer! This time, directly use the general formula we just derived.
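For reference, the general formula can be packaged as a small function; applied to this data set it should reproduce the slope and intercept found earlier (a sketch, assuming the constants \(A\) through \(E\) defined above).

```python
def best_fit_line(xs, ys):
    """Slope m and intercept b minimizing the mean squared error."""
    n = len(xs)
    A, B = n, sum(xs)
    C, D = sum(x * x for x in xs), sum(ys)
    E = sum(x * y for x, y in zip(xs, ys))
    denom = B * B - A * C
    b = (B * E - C * D) / denom
    m = (B * D - A * E) / denom
    return m, b

print(best_fit_line([1, 3, 4], [2, 3, 5]))  # (0.9285..., 0.8571...) = (13/14, 6/7)
```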