Machine learning: Logistic regression

Logistic regression #

Background #

There is a simple way to adapt linear regression into a technique that can solve classification problems. In a binary classification task, the data set \[ \mathcal{D} = \{(\vec{x}_i, y_i)\}_{i=1}^n \] still consists of feature vectors \(\vec{x}_i\) and outputs \(y_i\), but each output takes the value \(1\) or \(0\), indicating whether or not the data point belongs to a given class. For example, if the feature vector \(\vec{x}_i\) encodes an image of a hand-written digit, \(y_i\) may indicate whether or not that image represents the digit 7. We need to adapt two things: the prediction function and the loss function.

The prediction function #

Predictions in a classification task should be real numbers in the range \([0,1]\); they can then be interpreted as probabilities. Mathematically, it is easy to adapt linear regression to do this. In linear regression, the prediction is made by the formula \[\hat{y}_i = h_{\vec{w}}(\vec{x}_i)=\vec{x}_i^t \vec{w}\] where the weight vector \(\vec{w}\) is learned from the data. The value of \(\hat{y}_i\) can be any real number. But now suppose that we have a function \(\sigma : \mathbb{R} \rightarrow [0,1] \) lying around. We can then compose it with the output of the usual regression to make our prediction. That is, \[ \hat{y}_i = h_{\vec{w}}(\vec{x}_i) = \sigma(\vec{x}_i^t \vec{w}).\]
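As a small sketch (assuming NumPy, with made-up data and weights), the composed prediction function looks like this:

```python
import numpy as np

def sigmoid(z):
    # the squashing function sigma: maps any real number into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def predict(X, w):
    # linear regression output X @ w, composed with sigma
    return sigmoid(X @ w)

# toy example: two data points with a bias column and two features
X = np.array([[1.0, 0.5, -1.2],
              [1.0, -0.3, 0.8]])
w = np.array([0.1, 2.0, -0.5])

print(predict(X, w))  # each entry lies strictly between 0 and 1
```

Each prediction can then be read as the estimated probability that the corresponding data point belongs to the class.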

Potentially, there is quite a bit of flexibility in choosing \(\sigma\), but by far the most common choice is the sigmoid function defined by \[ \sigma(x) = \frac{1}{1+e^{-x}}.\] It should be clear from the following graph that \(\sigma\) has the desired property, condensing any real-valued output into the interval \([0,1]\).

The sigmoid function.

Exercise: Show that the derivative of \(\sigma\) has a particularly appealing form: \[ \sigma'(x) = \sigma(x)(1-\sigma(x)),\] when \(\sigma\) is the sigmoid function.
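The exercise asks for an analytic derivation, but the identity can be checked numerically first; the following sketch (assuming NumPy) compares a central finite difference against \(\sigma(x)(1-\sigma(x))\):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# central-difference check of sigma'(x) = sigma(x)(1 - sigma(x))
h = 1e-6
for x in np.linspace(-4, 4, 9):
    numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)
    analytic = sigmoid(x) * (1 - sigmoid(x))
    assert abs(numeric - analytic) < 1e-8
```

A check like this is no substitute for the proof, but it is a quick way to catch an algebra mistake.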

The loss function #

The central question of logistic regression is to find a weight vector \(\vec{w}\) that minimizes the average loss for the points in our data set; that is, minimizes the function

\[ \mathcal{L}(\vec{w}, \mathcal{D}) = \frac{1}{n} \sum_{i=1}^n \ell(h_{\vec{w}}(\vec{x}_i), y_i) \]

where \(\ell\) is a function that measures the discrepancy between the prediction \(h_{\vec{w}}(\vec{x}_i)\) and the true label \(y_i\). We could use square loss as usual, but there is a more interesting choice. The formula itself comes from information theory, although there is quite a bit of background involved.

Definition: Let \(\hat{y}_i = h_{\vec{w}}(\vec{x}_i) \) be our prediction. The cross-entropy loss is defined by \[ \ell(\hat{y}_i, y_i) = - y_i \log (\hat{y}_i) - (1-y_i) \log (1-\hat{y}_i).\]
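A minimal implementation of this definition (assuming NumPy; the `eps` clipping is an implementation detail, added only to avoid evaluating \(\log 0\) when a prediction saturates):

```python
import numpy as np

def cross_entropy(y_hat, y, eps=1e-12):
    # clip predictions away from 0 and 1 so the logarithms stay finite
    y_hat = np.clip(y_hat, eps, 1 - eps)
    return -y * np.log(y_hat) - (1 - y) * np.log(1 - y_hat)

# a confident correct prediction costs little...
print(cross_entropy(0.9, 1))  # -log(0.9), about 0.105
# ...while the same confidence on the wrong label costs much more
print(cross_entropy(0.9, 0))  # -log(0.1), about 2.303
```

Note how the loss grows without bound as a confident prediction moves toward the wrong label; this asymmetry is at the heart of the homework exercise below.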

Our aim is to make some sense of this loss function and have at least an intuitive explanation why it is appropriate to use in logistic regression.

Homework exercise:

There are two parts:

  • First, explain the formula for \(\ell(\hat{y}_i, y_i)\). Graph the output of this loss as a function of \(\hat{y}_i \in [0,1]\). Do this separately for \(y_i=0\) and \(y_i=1\). Focus on the value of the loss function when the prediction is inaccurate. How does it differ from square loss?
  • Second, follow the outline of this notebook to compare the shape of the square-loss and cross-entropy loss functions on a simple data set. If we use gradient descent to minimize the loss, why would you expect cross-entropy loss to yield better results?