UMAP | Mathematical principles of machine learning

An improvement to an improved UMAP #

Background #

Perhaps the most widespread modern dimension reduction technique is the Uniform Manifold Approximation and Projection, or simply UMAP, algorithm. I will not do a better job explaining how it works than the original authors, but as you can see from its effect on the MNIST dataset:

MNIST digit set reduced to two dimensions
image from our lab

it seems to work pretty well. If you want to interact with the above image, try the UMAP explorer. But the UMAP algorithm is not the end of the road, and a couple of authors have suggested ways to improve it. This project suggests a way of improving an improved version of UMAP.

The IsUMAP algorithm #

In this talk, Jürgen Jost describes a simple way to augment the original UMAP algorithm. Although his work is presented in terms of category theory and Riemannian geometry, the essential idea is just a tweak of part of the original UMAP algorithm. Let me describe the tweak.

Suppose \(\{x_i\}_{i=1}^m \subset \mathbb{R}^n\) represents our data set. Choose an integer \(k\). For a fixed point \(x\), one part of the UMAP algorithm modifies the usual Euclidian metric \(\rho\), but only for \(x\) and its \(k\)-nearest neighbors. We will write \(x_j\) for the j-th nearest neighbor of \(x\). To define this new metric \(\rho_x\), first let \(\rho_x(y,y)=0 \) for all \(y\), and let the distance between \(x\) and any of its \(k\)-nearest neighbors equal \[\rho_x(x,x_j) = \frac{\rho(x,x_j)-\tau_x}{\sigma_x}\] for some values of \(\tau_x\) and \(\sigma_x\). We still need to define the distance between any two neighbors of \(x\), and this is where there is some discretion.

UMAP simply defines \(\rho_x(x_j,x_l) = \infty\). This forces \(\rho_x\) to violate the triangle inequality, and while this may offend you as a mathematician, since the UMAP algorithm works fairly well perhaps we should not judge.
IsUMAP on the other hand defines \(\rho_x(x_j,x_l) = \rho_x(x,x_j) + \rho_x(x,x_l) \), which simply imposes the British Rail metric on \(x\) and its neighbors. Geometrically, to visualize this metric, you can pretend that all of the \(x_i\) lie directly opposite from each other with respect to \(x\).

According to the authors, this modification excels for manifolds with lower dimension because it distorts distances less than UMAP, while being able to uniformize the data distribution better than Isomap. However, the decreased amount of distortion also results in lower clustering capability for high-dimensional datasets. This is easily seen from their examples:

IsMAP on a variety of data sets
image from IsUMAP article

If the reason you are doing dimension reduction is to be able visualize the underlying data manifold, IsUMAP’s lack of distortion seems to lead to more faithful representations.

Improving the IsUMAP algorithm #

There is a simple way to modify IsUMAP that is more faithful to the geometry of data in high dimensions. There are two useful observations:

If \(k\) is relatively small, the \(k\)-nearest neighbors of a fixed point \(x\) in \(\mathbb{R}^n\) when \(n»0\) all lie very close to a sphere of some radius \(\sigma_x\). That is, they are essentially all equidistant from \(x\).
The vectors between \(x\) and each of the \(x_i\) all tend to be very close to being perpendicular to each other.

While both results seem strange to three-dimensional beings like us, they are essential properties of high dimensional space when the data points have been sampled somewhat randomly. They can be more formally proven by integration by computing the volume and surface areas high-dimensional spheres.

So let’s look at how we could modify \(\rho_x(x_j,x_l) \) with this geometry in mind. If the vectors \(x - x_j\) and \(x-x_l\) are nearly perpendicular and nearly of the same length, what should we expect the distance between \(x\) and any of the \(x_i\) to be? There are two natural ways to modify \(\rho_x\):

\(\rho_x(x_j,x_l) = \sqrt{2} \cdot \rho_x(x,x_j)\), or
\(\rho_x(x_j,x_l) = \frac{\sqrt{2}}{2} \rho_x(x,x_j) + \frac{\sqrt{2}}{2} \rho_x(x,x_l) \)

The code for the IsUMAP project is available. There is potential for this to work better.