Machine learning: Logistic regression

Logistic regression #

Background #

There is a simple way to adapt linear regression into a technique that can solve classification problems. In a binary classification task, the data set \[ \mathcal{D} = \{(\vec{x}_i, y_i)\}_{i=1}^n \] still consists of feature vectors \(\vec{x}_i\) and outputs \(y_i\), but each output takes the value \(1\) or \(0\), indicating whether or not the data point belongs to a given class. For example, if the feature vector \(\vec{x}_i\) encodes an image of a hand-written digit, \(y_i\) may indicate whether or not that image represents the digit 7. We need to adapt two things: the prediction function and the loss function.

The prediction function #

Predictions in a classification task should be real numbers in the range \([0,1]\); they can then be interpreted as probabilities. Mathematically, it is easy to adapt linear regression to do this. In linear regression, the prediction is made by the formula \[\hat{y}_i = h_{\vec{w}}(\vec{x}_i)=\vec{x}_i^t \vec{w}\] where the weight vector \(\vec{w}\) is learned from the data. The value of \(\hat{y}_i\) can be any real number. But now suppose that we have a function \(\sigma : \mathbb{R} \rightarrow [0,1] \) lying around. We can then compose it with the output of the usual regression to make our prediction. That is, \[ \hat{y}_i = h_{\vec{w}}(\vec{x}_i) = \sigma(\vec{x}_i^t \vec{w}).\]
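As a small sketch (assuming NumPy, with made-up data and weights), the composed prediction function looks like this:

```python
import numpy as np

def sigmoid(z):
    # the squashing function sigma: maps any real number into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def predict(X, w):
    # linear regression output X @ w, composed with sigma
    return sigmoid(X @ w)

# toy example: two data points with a bias column and two features
X = np.array([[1.0, 0.5, -1.2],
              [1.0, -0.3, 0.8]])
w = np.array([0.1, 2.0, -0.5])

print(predict(X, w))  # each entry lies strictly between 0 and 1
```

Each prediction can then be read as the estimated probability that the corresponding data point belongs to the class.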

Potentially, there is quite a bit of flexibility in choosing \(\sigma\), but by far the most common choice is the sigmoid function defined by \[ \sigma(x) = \frac{1}{1+e^{-x}}.\] It should be clear from the following graph that \(\sigma\) has the desired property, condensing any real-valued output into the interval \([0,1]\).

The sigmoid function.

Exercise: Show that the derivative of \(\sigma\) has a particularly appealing form: \[ \sigma'(x) = \sigma(x)(1-\sigma(x)),\] when \(\sigma\) is the sigmoid function.
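The exercise asks for an analytic derivation, but the identity can be checked numerically first; the following sketch (assuming NumPy) compares a central finite difference against \(\sigma(x)(1-\sigma(x))\):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# central-difference check of sigma'(x) = sigma(x)(1 - sigma(x))
h = 1e-6
for x in np.linspace(-4, 4, 9):
    numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)
    analytic = sigmoid(x) * (1 - sigmoid(x))
    assert abs(numeric - analytic) < 1e-8
```

A check like this is no substitute for the proof, but it is a quick way to catch an algebra mistake.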

The loss function #

The central question of logistic regression is to find a weight vector \(\vec{w}\) that minimizes the average loss for the points in our data set; that is, minimizes the function

\[ \mathcal{L}(\vec{w}, \mathcal{D}) = \frac{1}{n} \sum_{i=1}^n \ell(h_{\vec{w}}(\vec{x}_i), y_i) \]

where \(\ell\) is a function that measures the discrepancy between the prediction \(h_{\vec{w}}(\vec{x}_i)\) and the true label \(y_i\). We could use square loss as usual, but there is a more interesting choice. The formula itself comes from information theory, although there is quite a bit of background involved.

Definition: Let \(\hat{y}_i = h_{\vec{w}}(\vec{x}_i) \) be our prediction. The cross-entropy loss is defined by \[ \ell(\hat{y}_i, y_i) = - y_i \log (\hat{y}_i) - (1-y_i) \log (1-\hat{y}_i).\]
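A minimal implementation of this definition (assuming NumPy; the `eps` clipping is an implementation detail, added only to avoid evaluating \(\log 0\) when a prediction saturates):

```python
import numpy as np

def cross_entropy(y_hat, y, eps=1e-12):
    # clip predictions away from 0 and 1 so the logarithms stay finite
    y_hat = np.clip(y_hat, eps, 1 - eps)
    return -y * np.log(y_hat) - (1 - y) * np.log(1 - y_hat)

# a confident correct prediction costs little...
print(cross_entropy(0.9, 1))  # -log(0.9), about 0.105
# ...while the same confidence on the wrong label costs much more
print(cross_entropy(0.9, 0))  # -log(0.1), about 2.303
```

Note how the loss grows without bound as a confident prediction moves toward the wrong label; this asymmetry is at the heart of the homework exercise below.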

Our aim is to make some sense of this loss function and have at least an intuitive explanation why it is appropriate to use in logistic regression.

Homework exercise:

There are two parts:

  • First, explain the formula for \(\ell(\hat{y}_i, y_i)\). Graph the output of this loss as a function of \(\hat{y}_i \in [0,1]\). Do this separately for \(y_i=0\) and \(y_i=1\). Focus on the value of the loss function when the prediction is inaccurate. How does it differ from square loss?
  • Second, follow the outline of this notebook to compare the shape of the square-loss and cross-entropy loss functions on a simple data set. If we use gradient descent to minimize the loss, why would you expect cross-entropy loss to yield better results?