Clustering and classification #
One of the main problems in data science and machine learning takes the following simple form:
Problem: Divide a population of individuals or data points into groups.
What we mean by the words individuals and groups requires more explanation. Let’s do this by way of some examples of problems solved by machine learning:
- Marketing: Individuals represent Netflix subscribers and the specific problem seeks to group them by the types of movies they like to watch. Doing this well allows Netflix to make good movie recommendations.
- Medicine: Individuals represent tissue samples and the problem is to determine whether they are cancerous or benign.
- Language: Individuals represent text samples. One problem is to identify the language used to write each one. Another is to identify the topic of each document.
- Computer vision: Individuals represent images and the problem is to identify the class of the object each image represents.
When the groups fall into classes with known labels, such as the language of a text sample, this type of problem is known as classification. If the groups are not known ahead of time, like movie preference groups, the corresponding type of problem is known as clustering.
Below we describe one general approach to solving classification and clustering problems.
Step 1: Constructing feature vectors #
As we saw in the page about feature vectors, it is often possible to identify an individual with a vector that serves as a summary of relevant numerical statistics. Here are some more examples:
- Athletes: when the individuals are athletes, a relevant feature vector may summarize their performance statistics,
- Politicians: when the individuals are politicians, their voting record can form the basis of a feature vector,
- Languages: when the individuals are samples of text, their digraph frequency vectors can form a feature vector, and
- Images: when the individuals are images, their pixels values can be concatenated into a feature vector.
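To make the language example above concrete, here is a minimal sketch of how a digraph frequency vector might be computed from a text sample. The helper name and the normalization choice are illustrative, not taken from the lab:

```python
from collections import Counter

def digraph_frequencies(text):
    """Map each adjacent letter pair in `text` to its relative frequency.

    Non-letter characters are dropped and case is ignored; a real lab
    may use a different alphabet or normalization.
    """
    letters = [c for c in text.lower() if c.isalpha()]
    pairs = ["".join(p) for p in zip(letters, letters[1:])]
    counts = Counter(pairs)
    total = sum(counts.values())
    return {pair: n / total for pair, n in counts.items()}

freqs = digraph_frequencies("the theory")
# the pairs are th, he, et, th, he, eo, or, ry, so freqs["th"] == 0.25
```

Fixing an ordering of all possible letter pairs would turn this dictionary into a numerical vector, with one component per pair.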
Example: Often, feature vectors consist of thousands of values. But sometimes small feature vectors can be quite simple and effective. For instance, if we are interested in the position played by a football player from the National Football League, it is generally enough to know whether they play offense or defense, their height, and their weight.
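The football example can be encoded as a three-component feature vector. The function below is a toy illustration; the specific encoding (a 0/1 offense indicator plus height and weight) is an assumption, and the sample values are made up:

```python
def player_features(is_offense, height_in, weight_lb):
    """Encode a player as a small numeric feature vector:
    (offense indicator, height in inches, weight in pounds)."""
    return (1.0 if is_offense else 0.0, float(height_in), float(weight_lb))

qb = player_features(True, 75, 220)   # a hypothetical quarterback
dt = player_features(False, 77, 310)  # a hypothetical defensive tackle
# qb == (1.0, 75.0, 220.0)
```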
There are two approaches to constructing feature vectors. Traditionally, feature vectors were hand-crafted by including only those statistics that seemed relevant to the problem at hand. More recently, techniques like neural networks have been used to construct feature vectors automatically from vast data sets. Each approach has its advantages and flaws.
Step 2: Measuring similarity between feature vectors #
To answer the original problem, we could hope to say that two individuals belong to the same group if their feature vectors are sufficiently similar. But this creates another problem:
Problem: How should we measure the similarity of two feature vectors? And when are two vectors sufficiently similar to belong to the same class or cluster?
There are no simple answers and machine learning offers many approaches. Here are a few:
- Cosine similarity: in certain contexts, measuring the angle between two vectors is a useful measure of similarity. We will use this when we work through the language recognition lab.
- Euclidean distance: in certain contexts, measuring the distance between two vectors is appropriate.
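Both measures can be written in a few lines. The sketch below assumes the feature vectors are plain sequences of numbers of equal length; a library such as NumPy would normally be used instead:

```python
import math

def euclidean_distance(u, v):
    """Straight-line distance between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine_similarity(u, v):
    """Cosine of the angle between two feature vectors:
    1 means same direction, 0 means perpendicular."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

cosine_similarity((1, 0), (0, 1))   # 0.0: perpendicular vectors
euclidean_distance((0, 0), (3, 4))  # 5.0
```

Note the difference in what each measure ignores: cosine similarity depends only on direction, so scaling a vector does not change it, while Euclidean distance is sensitive to magnitude.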
Once a measure of similarity is established, the last step is to actually form the clusters or groups. We will not address this problem here yet.