Regression: solutions #
Polynomials that minimize mean squared error #
Find a quadratic polynomial \(f(x) = a x^2 + bx + c\) that minimizes the mean squared prediction error on the data set \( \mathcal{D}: \; \; \; (1,1), \; \; \; (2,1), \; \; \; (3,2), \; \; \; \text{and} \; \; \; (5,1). \)
Use tools from optimization to explain why the answer you found is indeed a global minimum of the error function. Repeat the exercise, this time with polynomials of degree one (that is, a line), and degree three.
Solution
Let’s start with a line. Let \(f(x) = ax+b\). Then \[\text{E} = \frac{1}{4} \Big( (a \cdot 1 + b -1)^2 + (a \cdot 2 + b -1)^2+ (a \cdot 3 + b -2)^2+ (a \cdot 5 + b -1)^2\Big).\] Setting \(\frac{\partial E}{\partial a} = \frac{\partial E}{\partial b} =0\) gives two equations in two unknowns that can be solved by hand or with software. We get \(a = \tfrac{1}{35}\) and \(b = \tfrac{41}{35}\), so the line of best fit is \(f(x) = \frac{1}{35}x+\frac{41}{35}\).
Working in the same way, we find a quadratic model, \(f(x) = -\frac{2}{11}x^2+\frac{63}{55}x - \frac{7}{55}\), and a cubic model, \(f(x) = -\frac{1}{4}x^3+2x^2-\frac{17}{4}x + \frac{7}{2}\). (A cubic has four coefficients, so it can pass through all four data points exactly; its mean squared error is zero.)
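As a quick sanity check, all three fits can be reproduced numerically. The snippet below is a minimal sketch assuming numpy is available; note that `np.polyfit` lists coefficients from the highest degree down.

```python
import numpy as np

# The data set D and least-squares fits of degree 1, 2, and 3.
x = np.array([1, 2, 3, 5], dtype=float)
y = np.array([1, 1, 2, 1], dtype=float)

print(np.polyfit(x, y, 1))  # approximately [1/35, 41/35]
print(np.polyfit(x, y, 2))  # approximately [-2/11, 63/55, -7/55]
print(np.polyfit(x, y, 3))  # approximately [-1/4, 2, -17/4, 7/2]
```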
A question remains: are these fits best, worst, or neither? After all, all we did was find a critical point in each case. There are two ways to settle this. One way is to follow our usual procedure: compute the Hessian of \(E\) and check that it is positive definite at the critical point. For the line this means checking that the determinant of the Hessian is positive (so the point is not a saddle) and that the second derivative with respect to either variable is positive (so the critical point is a local minimum); the quadratic and cubic cases work out the same way. Finally, since there are no other restrictions on our variables (there is no boundary to check), the only local minimum is a global minimum of the error function.
An alternative is to inspect the error function \(E\) itself. For the line model it is a quadratic function of \(a\) and \(b\), so its graph is either a paraboloid or a saddle. Since the coefficients of both the \(a^2\) and \(b^2\) terms are positive, the graph is in fact an upward-facing paraboloid, and the sole critical point must be a global minimum.
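For the line model, this convexity argument can also be checked symbolically. The sketch below assumes sympy is available and simply computes the Hessian of \(E\) in the variables \(a\) and \(b\).

```python
import sympy as sp

a, b = sp.symbols('a b')
data = [(1, 1), (2, 1), (3, 2), (5, 1)]

# Mean squared error of the line f(x) = a*x + b on the data set above.
E = sp.Rational(1, 4) * sum((a * x + b - y) ** 2 for x, y in data)

H = sp.hessian(E, (a, b))
print(H)                  # a constant matrix, since E is quadratic in (a, b)
print(H.det(), H[0, 0])   # determinant and E_aa are both positive
```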
A general formula for linear regression #
Solution
We will use the function \(f(x)=mx+b\) to make predictions. If the input is \(x_k\), our model’s prediction will be \(f(x_k)=mx_k+b\). If the model were perfect, it would predict \(y_k\) exactly. But it probably isn’t perfect, and the square of the error in our prediction is
\[ ((mx_k+b) - y_k)^2. \]
Taking the average over all values of \(k\) gives us the formula for the MSE above.
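As a concrete illustration, here is a minimal sketch of the MSE as code; the function name and the sample data are made up for the example.

```python
# Mean squared error of the line f(x) = m*x + b on a data set of (x_k, y_k) pairs.
def mse(m, b, data):
    n = len(data)
    return sum(((m * xk + b) - yk) ** 2 for xk, yk in data) / n

# Example: the line f(x) = x + 1 misses only the middle point of this data set.
print(mse(1.0, 1.0, [(1, 2), (3, 3), (4, 5)]))  # (0 + 1 + 0) / 3 = 1/3
```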
Verify that \begin{align} \tfrac{\partial E}{\partial b} &= \tfrac{2}{n}\sum_{k=1}^n \big((mx_k+b)-y_k\big) \\ \tfrac{\partial E}{\partial m} &= \tfrac{2}{n} \sum_{k=1}^n \big((mx_k+b)-y_k\big) \cdot x_k \\ \end{align}
Keep in mind that \(m\) and \(b\) are variables while the \(x_k\) and \(y_k\) are actual constants that come from specific points in the data set.
Solution
Since \(E\) is a sum of terms of the form \(\tfrac{1}{n}((mx_k+b) - y_k)^2\), we can differentiate it term by term. Using the chain rule, the derivative of the \(k\)th term with respect to \(b\) is
\[ \tfrac{2}{n} \big((mx_k+b)-y_k\big) \]
and so \(\tfrac{\partial E}{\partial b}\) is the sum of all such terms over all values of \(k\).
Similarly, using the chain rule, the derivative of the \(k\)th term with respect to \(m\) is
\[ \tfrac{2}{n} \big((mx_k+b)-y_k\big) x_k \]
and so \(\tfrac{\partial E}{\partial m}\) is the sum of all such terms over all values of \(k\).
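If you would rather not trust the hand computation, a computer algebra system can confirm both partial derivatives. The snippet below is a sketch using sympy with a small made-up data set.

```python
import sympy as sp

m, b = sp.symbols('m b')
xs, ys = [1, 2, 3], [2, 1, 4]   # a made-up 3-point data set
n = len(xs)

# The mean squared error E(m, b) for this data set.
E = sp.Rational(1, n) * sum(((m * xk + b) - yk) ** 2 for xk, yk in zip(xs, ys))

# Differences between sympy's derivatives and the claimed formulas; both should be 0.
db = sp.simplify(sp.diff(E, b) - sp.Rational(2, n) * sum((m * xk + b) - yk for xk, yk in zip(xs, ys)))
dm = sp.simplify(sp.diff(E, m) - sp.Rational(2, n) * sum(((m * xk + b) - yk) * xk for xk, yk in zip(xs, ys)))
print(db, dm)  # 0 0
```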
Explain why the critical points \((m,b)\) of \(E(m,b)\) satisfy the equations: \begin{align} Ab + Bm &= D \\ Bb + Cm &= E\\ \end{align}
where
\begin{align} A &= n \\ B &= \sum_{k=1}^n x_k\\ C &= \sum_{k=1}^n x_k^2\\ D&=\sum_{k=1}^n y_k\\ E&=\sum_{k=1}^n x_k y_k\\ \end{align}
Note that even though these five constants look quite complicated, they are just constants that can be easily computed from the data set, especially for a computer!
Solution
To find the critical point(s) for the error function \(E\), we need to solve the two simultaneous equations
\[\tfrac{\partial E}{\partial b} = 0 \hspace{.2in} \tfrac{\partial E}{\partial m}=0.\]
The first equation follows by a bit of arithmetic from the partial derivative computed in the previous problem:
\begin{align*} 0 = \tfrac{\partial E}{\partial b} &= \tfrac{2}{n} \sum_{k=1}^n \big((mx_k+b)-y_k\big) \\ &= \tfrac{2}{n}\big( mx_1 + b - y_1 + mx_2 + b - y_2 + \ldots + mx_n + b -y_n \big) \\ &= \tfrac{2}{n}\big( m(x_1 + x_2+ \ldots +x_n) +nb - (y_1 + y_2 + \ldots + y_n) \big) \\ &= \tfrac{2}{n} \big( Ab + Bm - D\big) \\ \end{align*}
so that \(Ab+Bm = D\). A similar process starting with \(\tfrac{\partial E}{\partial m}=0\) gives the second equation.
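Once the constants are available, the two equations form a small linear system that a computer can solve directly. The sketch below assumes numpy and uses the data set from the first exercise; the constant \(E\) of the exercise is called `Ecst` in the code to avoid confusion with the error function.

```python
import numpy as np

data = [(1.0, 1.0), (2.0, 1.0), (3.0, 2.0), (5.0, 1.0)]
xs = np.array([p[0] for p in data])
ys = np.array([p[1] for p in data])

# The five constants appearing in the normal equations.
A, B, C = len(xs), xs.sum(), (xs ** 2).sum()
D, Ecst = ys.sum(), (xs * ys).sum()

# Solve  A*b + B*m = D  and  B*b + C*m = Ecst.
b, m = np.linalg.solve([[A, B], [B, C]], [D, Ecst])
print(m, b)  # approximately 1/35 and 41/35, as in the first exercise
```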
Solve the equations in the previous exercise for \(m\) and \(b\). You should obtain:
\begin{align} b &= \frac{BE-CD}{B^2-AC} \\ m &= \frac{BD-AE}{B^2-AC} \\ \end{align}
Show your work. This is the general formula for the line minimizing mean squared error for the data set \(\mathcal{D}\).
Solution
This is a little tedious, but here goes nothing! Start with
\begin{align} Ab + Bm &= D \\ Bb + Cm &= E\\ \end{align}
multiply the top equation by \(B\) and the bottom by \(A\):
\begin{align} ABb + B^2m &= BD \\ ABb + ACm &= AE\\ \end{align}
Now subtract the second equation from the first to obtain:
\[(B^2 -AC)m = BD -AE \]
so that \(m = (BD-AE)/(B^2-AC)\). A similar process can be used to solve for \(b\), this time multiplying the first equation by \(C\) and the second by \(B\).
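Packaged as a function, the closed-form formulas look like the sketch below. It assumes the \(x_k\) are not all equal; otherwise \(B^2 - AC = 0\) and there is no unique best-fit line.

```python
# Slope m and intercept b of the least-squares line for a data set of (x, y) pairs.
def best_fit_line(data):
    n = len(data)
    A = n
    B = sum(x for x, _ in data)
    C = sum(x * x for x, _ in data)
    D = sum(y for _, y in data)
    E = sum(x * y for x, y in data)
    denom = B * B - A * C   # nonzero as long as the x-values are not all equal
    b = (B * E - C * D) / denom
    m = (B * D - A * E) / denom
    return m, b
```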
Find the line of best fit for the data set:
\[\mathcal{D}: \;\;\; (1,2), \;\;\; (3,3), \;\;\; \text{and} \;\;\; (4,5).\]
This is the example we worked out above, so you already know the answer! This time, directly use the general formula we just derived.
Solution
Since we have the formula for \(m\) and \(b\) from the last problem, we just need to find the values of \(A,B,C,D,\) and \(E\).
\begin{align} A &= 3, \\ B & =1+3+4=8,\\ C & = 1+9+16=26, \\ D & = 2+3+5 =10, \text{ and } \\ E & = 1 \cdot 2+ 3 \cdot 3 + 4\cdot 5 = 31. \\ \end{align}
So \(b = \tfrac{BE-CD}{B^2-AC} = \tfrac{8\cdot 31-26\cdot 10}{8^2-3 \cdot 26} = \tfrac{6}{7}\) and \(m = \tfrac{BD-AE}{B^2-AC}= \tfrac{13}{14}\), giving the line of best fit \(f(x) = \tfrac{13}{14}x + \tfrac{6}{7}\), exactly as before.
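The same arithmetic can be double-checked with exact fractions; here is a short self-contained sketch.

```python
from fractions import Fraction

xs = [Fraction(1), Fraction(3), Fraction(4)]
ys = [Fraction(2), Fraction(3), Fraction(5)]

# The five constants for this data set.
A = len(xs)
B = sum(xs)
C = sum(x * x for x in xs)
D = sum(ys)
E = sum(x * y for x, y in zip(xs, ys))

b = (B * E - C * D) / (B * B - A * C)
m = (B * D - A * E) / (B * B - A * C)
print(m, b)  # 13/14 6/7
```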