Machine learning: Transformers

Transformers and natural language processing #

The transformer is a neural network architecture whose success has enabled projects such as ChatGPT.

[Figure: Transformers (image from a sketchy eBay site)]

The goal of this project is to formally understand the transformer encoder as a mathematical construct and use it to study the notion of a contextual embedding for English words.

Background #

The full transformer consists of an encoder-decoder pair of neural networks. The role of the encoder is to take a sequence of words and encode them as a sequence of vectors. The decoder takes this encoding and uses it to suggest additional text. This project will be focused on the encoder part of the architecture.

[Figure: The original transformer architecture, from 'Attention is all you need']

There are multiple details, each with its own bit of beautiful mathematics that has to be understood. As you refine the details of your project, feel free to focus on one, some, or all of them. I have suggested a particular project at the end, but many others are possible.

  • Begin by learning about tokenizers. The basic objective is to split the input text into tokens. For instance, here are some random English sentences split into tokens using the OpenAI tokenizer:

    [Figure: Tokenizer output from the OpenAI tokenizer]
    Hugging Face has a very readable and watchable introduction to the different ways one can tokenize text. How to best tokenize is itself an interesting and self-contained topic if the rest of the project looks scary; a short tokenization sketch appears after this list. But if you are not intimidated, read on.

  • Once tokenized, the input text is one-hot-encoded, with the dimension of the encoding vector equal to the number of tokens in the vocabulary. That is, if the tokenizer uses \(k\) tokens, each token becomes a vector in \(\mathbb{R}^k\) with a single 1 and 0s elsewhere. Most tokenizers use a vocabulary of a few thousand to a few tens of thousands of tokens; ChatGPT's tokenizer is on the order of 50,000 to 100,000 tokens, so most English words do not need to be split up and are represented by their own tokens. (A small one-hot encoding sketch also appears after this list.) And now the fun begins.

  • The input text then passes through a series of so-called self-attention and embedding layers. The ultimate goal is to produce a vector in \(\mathbb{R}^e\) for each token that captures its meaning and relationship with other tokens in the input text.

  • The key is the notion of self-attention. My favorite article that breaks down what is going on is The Illustrated Transformer. You can also take a look at the original paper Attention is all you need, but it is a little terse. At the moment, it has been cited more than 70,000 times. As you read either one, focus on the story of what is going on, as well as how it works mathematically. Could you write the attention layer as a sequence of matrix operations? (One possible answer is sketched after this list.)

  • After the attention layer, there is a dense neural network, at the end of which each token in the text is embedded as a point in \(\mathbb{R}^e\). The whole process then repeats some number of times to refine this embedding; a simplified version of this stacking is also sketched after the list. The number of repetitions is often proprietary; for instance, OpenAI has not released the details of the current ChatGPT encoding process.
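
Here is the promised tokenization sketch. It is a minimal example using the Hugging Face transformers library; the choice of bert-base-uncased is an arbitrary assumption on my part (any pretrained tokenizer would work), and the sentence is made up.

```python
# Minimal tokenization sketch using Hugging Face's transformers library.
# Assumes `pip install transformers`; "bert-base-uncased" is one convenient
# pretrained tokenizer, not the OpenAI tokenizer shown in the figure above.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

sentence = "Transformers are surprisingly mathematical."
tokens = tokenizer.tokenize(sentence)            # split text into subword tokens
ids = tokenizer.convert_tokens_to_ids(tokens)    # map each token to its vocabulary index

print(tokens)  # e.g. ['transformers', 'are', 'surprisingly', 'mathematical', '.']
print(ids)     # the corresponding integer ids in the tokenizer's vocabulary
```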
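
The one-hot encoding step fits in a few lines of NumPy. The vocabulary size and token ids below are made up purely for illustration.

```python
import numpy as np

# One-hot encoding sketch: each token id becomes a standard basis vector in R^k,
# where k = vocab_size is the number of tokens the tokenizer knows.
vocab_size = 8          # tiny made-up vocabulary for illustration
token_ids = [3, 1, 6]   # a made-up tokenized sentence of three tokens

one_hot = np.zeros((len(token_ids), vocab_size))
one_hot[np.arange(len(token_ids)), token_ids] = 1.0

print(one_hot)          # each row has a single 1 in the column given by the token id
```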
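
For the question about writing attention as matrix operations, here is one possible answer sketched in NumPy: single-head scaled dot-product self-attention. The random matrices stand in for learned parameters, and multi-head attention is omitted to keep the linear algebra visible.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention.

    X       : (n, d)   one row per token in the input text
    W_q/k/v : (d, d_k) projection matrices (random placeholders for learned weights)
    returns : (n, d_k) one context-aware vector per token
    """
    Q = X @ W_q                                 # queries
    K = X @ W_k                                 # keys
    V = X @ W_v                                 # values
    scores = Q @ K.T / np.sqrt(K.shape[1])      # (n, n) pairwise token scores
    weights = softmax(scores, axis=-1)          # each row sums to 1
    return weights @ V                          # weighted mixture of value vectors

# Toy example: 3 tokens, input dimension 4, key/query dimension 2.
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 4))
W_q, W_k, W_v = (rng.normal(size=(4, 2)) for _ in range(3))
print(self_attention(X, W_q, W_k, W_v).shape)   # (3, 2)
```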
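
Continuing the sketch above (it reuses np and self_attention), here is a simplified version of the dense layer and the repeated stacking from the last bullet. Residual connections, layer normalization, and multi-head attention are deliberately left out; the weight shapes and layer count are made-up placeholders.

```python
def feed_forward(H, W1, b1, W2, b2):
    """Position-wise dense network applied to each token's vector independently."""
    return np.maximum(H @ W1 + b1, 0.0) @ W2 + b2        # ReLU, then a linear map

def encoder_layer(X, p):
    """One simplified encoder layer: self-attention, then the dense network.
    (The real architecture adds residual connections and layer normalization.)"""
    H = self_attention(X, p["W_q"], p["W_k"], p["W_v"])
    return feed_forward(H, p["W1"], p["b1"], p["W2"], p["b2"])

def random_params(rng, e=4, hidden=8):
    """Placeholder parameters; in a trained model these would be learned."""
    return {
        "W_q": rng.normal(size=(e, e)), "W_k": rng.normal(size=(e, e)),
        "W_v": rng.normal(size=(e, e)),
        "W1": rng.normal(size=(e, hidden)), "b1": np.zeros(hidden),
        "W2": rng.normal(size=(hidden, e)), "b2": np.zeros(e),
    }

# Stack a few layers: each pass refines the embedding of every token.
rng = np.random.default_rng(1)
X = rng.normal(size=(3, 4))                   # 3 tokens embedded in R^4
for p in [random_params(rng) for _ in range(2)]:
    X = encoder_layer(X, p)
print(X.shape)                                # (3, 4): one point in R^e per token
```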

You are now ready for me to state the main point of this project. The central question is the following: to what extent do word and token embeddings capture meaning? For instance, when the word bank is used in a sentence to refer to a financial institution, is the embedding substantially different than when the word is used to refer to a river bank? To some extent this question has been looked at already, for instance, in this tutorial.
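
As one way to start probing this question, here is a rough sketch that compares the contextual embeddings of the word bank in two sentences. It assumes the transformers and torch packages are installed (and that the model weights can be downloaded), and uses bert-base-uncased simply as a convenient, publicly available encoder.

```python
# Sketch of the central question: compare the contextual embedding of "bank"
# in two sentences using a pretrained encoder.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def bank_embedding(sentence):
    """Return the encoder's output vector for the token 'bank' in the sentence."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    hidden = outputs.last_hidden_state[0]    # (num_tokens, hidden_size)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    return hidden[tokens.index("bank")]      # vector for the (first) occurrence of "bank"

v_money = bank_embedding("She deposited the check at the bank on Monday.")
v_river = bank_embedding("They had a picnic on the bank of the river.")

cos = torch.nn.functional.cosine_similarity(v_money, v_river, dim=0)
print(float(cos))   # typically noticeably below 1: the two embeddings of "bank" differ
```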