Machine learning: Perceptron

The perceptron #

Background #

When first constructed by F. Rosenblatt, the perceptron algorithm caused a wave of excitement, nicely summarized in a 1958 New York Times article Electronic Brain Teaches Itself. And since things that burn brightly tend to flame out quickly, the deficiencies of the perceptron were responsible for the first so-called AI winter. While perceptrons cannot solve a large class of machine learning problems, they do provide very fast solutions for certain types of data.

There are two goals for this series of exercises:

  • provide an introduction to a simple classification algorithm, and
  • practice breaking down an algorithm, understanding what it seeks to do, and analyzing it mathematically.

Linearly separable data #

Perceptrons solve a particular type of classification problem. Here is the setup. Suppose we have two classes in a data set \[\mathcal{D} = \{(\vec{x}_i, y_i)\}_{i=1}^n. \] Each feature vector \(\vec{x}_i\) lies in \( \mathbb{R}^d\) and its class is given by \(y_i \in \{-1,+1\}\). A data set of this form is linearly separable if there is a hyperplane \(\mathcal{H}\) passing through the origin that separates all the feature vectors in class \(+1\) from the feature vectors in class \(-1\). Here is an example in two dimensions:

Exercise: There are simple examples of data sets that are not linearly separable. Find four non-collinear points in \(\mathbb{R}^2\), two in each class, that are not linearly separable.

Let us call vectors on the same side of \(\mathcal{H}\) as the \(+1\) class positive and vectors on the other side negative. We will study the following problem:

Problem: Given a linearly-separable data set \(\mathcal{D}\) :

  • find a hyperplane \(\mathcal{H}\) that separates its two classes, and
  • find a quick method to determine whether a feature vector \(\vec{x}\) is positive or negative.

Hyperplanes #

The second part of the problem above turns out to be the simpler one, so we address it first. The easiest way to describe a hyperplane \(\mathcal{H}\) through the origin in \(\mathbb{R}^d\) is by specifying a normal vector \(\vec{n}\), that is, a vector perpendicular to \(\mathcal{H}\). For a quick refresher on how this works, normal vectors are covered in a multivariate calculus lecture. Every nonzero multiple of a normal vector is another normal vector, so we can pick one that is positive, that is, one lying on the same side of \(\mathcal{H}\) as the \(+1\) class. Here is the crucial observation:

Observation: Suppose that \(\mathcal{D}\) is linearly separated by the hyperplane \(\mathcal{H}\). If a normal vector \(\vec{n}\) to \(\mathcal{H}\) is chosen to be positive, then we have:

  • \(\vec{n}^t \vec{x}_i > 0 \) whenever \(y_i = +1\), and
  • \(\vec{n}^t \vec{x}_i < 0 \) whenever \(y_i = -1\).

Thus the class of a point can be recovered simply by taking a dot product!

To see why this is true, recall that if \(\theta\) is the angle between \(\vec{v}\) and \(\vec{w}\), then \[ \vec{v}^t \vec{w} = \cos \theta \, ||\vec{v}|| \, ||\vec{w}||.\] The quantity \(\cos \theta \) is positive for acute angles and negative for obtuse ones. Since \(\vec{n}\) was chosen to be positive, the angle between \(\vec{n}\) and a feature vector is acute when the feature vector is positive and obtuse when it is negative.

So, once we have found the hyperplane \(\mathcal{H}\) and identified a positive normal vector \(\vec{n}\), determining the class of a (perhaps not yet classified) feature vector is straightforward: we simply examine the sign of its dot product with \(\vec{n}\).
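To make this concrete, here is a minimal Python sketch of this classification rule. The normal vector and feature vectors below are made-up values for illustration, not taken from any data set in these notes.

```python
import numpy as np

def classify(n, x):
    """Return +1 or -1 according to the sign of the dot product n^t x."""
    return 1 if np.dot(n, x) > 0 else -1

# A hypothetical positive normal vector and two hypothetical feature vectors.
n = np.array([1.0, 2.0])
print(classify(n, np.array([3.0, 1.0])))    # +1: acute angle with n
print(classify(n, np.array([-2.0, -0.5])))  # -1: obtuse angle with n
```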

The perceptron #

So now suppose we have a data set \(\mathcal{D}\) that we know is linearly separable, but do not know a specific hyperplane that separates the two classes. To determine \(\mathcal{H}\), we will have to learn the normal vector \(\vec{n}\) from the data set \(\mathcal{D}\). From the discussion above, we need a vector \(\vec{n}\) so that \[\text{sign}(\vec{n}^t \vec{x}_i) = y_i \] for all points in \(\mathcal{D}\). Put another way, we always want \(y_i(\vec{n}^t \vec{x}_i) > 0\). The perceptron algorithm is iterative:

The perceptron algorithm

  1. Start with \(\vec{n}_0 = \vec{0}\)
  2. Compute \(y_i(\vec{n}_t^t \vec{x}_i)\) for all points in \(\mathcal{D}\).
  3. Whenever \(y_i(\vec{n}_t^t \vec{x}_i) \leq 0\), let \[\vec{n}_{t+1} = \vec{n}_t + y_i\vec{x}_i \]
  4. Stop if \(y_i(\vec{n}_t^t \vec{x}_i) > 0\) for all points in \(\mathcal{D}\). Otherwise iterate through the data set again, updating \(\vec{n}_t\) as necessary.
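To see the four steps in action, here is a minimal Python sketch of the algorithm. The data set at the bottom is a made-up linearly separable example, and the max_epochs cap is an added safeguard rather than part of the algorithm as stated above.

```python
import numpy as np

def perceptron(X, y, max_epochs=1000):
    """Perceptron algorithm on feature rows X and labels y in {-1, +1}.

    Returns a normal vector n with sign(n^t x_i) == y_i for every i,
    provided the data are linearly separable by a hyperplane through the origin.
    """
    n = np.zeros(X.shape[1])           # step 1: start with the zero vector
    for _ in range(max_epochs):
        updated = False
        for x_i, y_i in zip(X, y):     # step 2: sweep through the data set
            if y_i * np.dot(n, x_i) <= 0:
                n = n + y_i * x_i      # step 3: update on a misclassified point
                updated = True
        if not updated:                # step 4: stop once every point is classified correctly
            return n
    return n                           # reached only if the cap is hit (e.g. non-separable data)

# A made-up linearly separable example.
X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-3.0, -1.0]])
y = np.array([1, 1, -1, -1])
n = perceptron(X, y)
print(n, np.sign(X @ n))               # the signs should match y
```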

Convergence #

The perceptron algorithm was perhaps the first machine learning algorithm with a formal convergence guarantee: if the underlying data set is linearly separable, the above algorithm will find a separating hyperplane in a finite number of steps. If not, it will loop forever. The goal of this exercise is to prove this formal guarantee.

Homework exercise: Complete the following four-step outline to show that the perceptron algorithm terminates on a linearly-separable data set.

Step 1: Margin of separation #

First let us define some notation. Suppose that \(\mathcal{H}\) is a separating hyperplane for \(\mathcal{D}\) and \(\vec{n}\) is a positive unit normal vector to \(\mathcal{H}\). We do not know what either one is, only that they exist. The margin \(\gamma\) for \(\mathcal{H}\) is the minimum distance from \(\mathcal{H}\) to any point in \(\mathcal{D}\), as in the figure below:

Show that \(\gamma = \min_i |\vec{n}^t \vec{x}_i | \).

Step 2: Minimum rate of increase of \(||\vec{n}_t||\) #

Show that after \(t\) updates, \(||\vec{n}_{t}|| \geq t\gamma\); the idea is that the component of \(\vec{n}_t\) in the direction of \(\vec{n}\) grows by at least \(\gamma\) with each update.

  • First show that \(\vec{n}_{t+1}^t \vec{n} \geq \vec{n}_{t}^t \vec{n} + \gamma \), and
  • then use induction and the Cauchy-Schwarz inequality.

Step 3: Maximum rate of increase of \(||\vec{n}_t||\) #

Our data set is finite and consequently bounded. So assume each feature vector lies in a ball of radius \(R\) around the origin, or in other words, \(||\vec{x}_i|| \leq R\) for all \(i\).

Show that \(||\vec{n}_{t}||^2 \leq t R^2 \) by showing that \(||\vec{n}_t||^2\) increases by at most \(R^2\) at each update.

Step 4: Maximum number of updates #

Using the inequalities you derived above, show that \[t \leq \frac{R^2}{\gamma^2}.\] Explain why this implies the perceptron algorithm must terminate, or speaking informally, converge. What influence does the size of \(\gamma\) have on the rate of convergence? Does the latter make intuitive sense?
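As a quick numerical sanity check (not a proof), the bound can be compared with the number of updates actually made on a small example. Everything below is made up for illustration: the data set, the unit normal u used to compute the margin, and the helper perceptron_update_count.

```python
import numpy as np

def perceptron_update_count(X, y, max_epochs=1000):
    """Run the perceptron and count how many updates it makes before stopping."""
    n = np.zeros(X.shape[1])
    updates = 0
    for _ in range(max_epochs):
        changed = False
        for x_i, y_i in zip(X, y):
            if y_i * np.dot(n, x_i) <= 0:
                n = n + y_i * x_i
                updates += 1
                changed = True
        if not changed:
            break
    return updates

# Made-up separable data; u is a unit normal to one separating hyperplane.
X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-3.0, -1.0]])
y = np.array([1, 1, -1, -1])
u = np.array([1.0, 1.0]) / np.sqrt(2.0)

gamma = np.min(np.abs(X @ u))             # margin of this particular hyperplane
R = np.max(np.linalg.norm(X, axis=1))     # radius of a ball containing the data
t = perceptron_update_count(X, y)
print(t, R**2 / gamma**2)                 # the update count should not exceed R^2 / gamma^2
```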

Adding features #

Sometimes a data set that is not linearly separable can be made so by adding additional features to the feature vectors.

Homework exercise:

The following set of points \(\mathcal{D} = \{(\vec{x}_i, y_i)\}\) is not linearly separable in \(\mathbb{R}^2\). In fact, feature vectors in class \(+1\) all lie below the parabola \(y = (x-1)^2-1\) and feature vectors in class \(-1\) all lie above it.

Show that if each vector \(\vec{x}_i = (x_{1i}, x_{2i})\) is replaced by \(\vec{z}_i = (x_{1i}, x_{2i}, x_{1i}^2)\), where a third coordinate is added, then the new data set \(\mathcal{D}' = \{(\vec{z}_i, y_i)\}\) is linearly separable in \(\mathbb{R}^3\).

Hint: Find the equation of the plane \(\mathcal{H}\).

Scholium: Often, a data set that is not linearly separable can be made so by adding features to its feature vectors. Thus data sets to which the perceptron does not apply can often be modified to permit its use. As an example, the MNIST data set of handwritten digits is (close to) linearly separable if features that include fourth-order polynomials are added.
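As a rough illustration of this idea (not tied to any data set from these notes), the sketch below generates made-up points on either side of a parabola-shaped boundary, augments each feature vector with a squared coordinate, and runs a plain perceptron on the augmented vectors.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up points on either side of the parabola x2 = (x1 - 1)^2 - 1,
# kept at vertical distance at least 1 from it so the two classes are well separated.
x1 = rng.uniform(-2.0, 4.0, size=400)
x2 = rng.uniform(-4.0, 8.0, size=400)
gap = x2 - ((x1 - 1.0) ** 2 - 1.0)
keep = np.abs(gap) > 1.0
x1, x2, gap = x1[keep], x2[keep], gap[keep]
y = np.where(gap < 0, 1, -1)              # +1 below the parabola, -1 above

# Augment each feature vector with the square of its first coordinate.
Z = np.column_stack([x1, x2, x1 ** 2])

def perceptron(X, y, max_epochs=10000):
    """Plain perceptron; returns a normal vector for a hyperplane through the origin."""
    n = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        changed = False
        for x_i, y_i in zip(X, y):
            if y_i * np.dot(n, x_i) <= 0:
                n = n + y_i * x_i
                changed = True
        if not changed:
            break
    return n

n = perceptron(Z, y)
print(np.mean(np.sign(Z @ n) == y))       # 1.0: every augmented point is classified correctly
```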

Why the fuss? #

If you think about it long enough, all the perceptron does is take a dot product. But it can be cleverly marketed as artificial intelligence by the following picture:

In this form, perceptrons were rudimentary precursors to today’s artificial neurons.