Multivariable calculus: Language identification

Automated language identification #

written and developed by Thomas Pietraho

A common and fairly important machine learning task is to identify the language of a document or a snippet of text. There are many solutions to this problem, but the goal of this lab exercise is to introduce you to a particularly efficient one. Our algorithm has one important limitation: it works only for languages written in the Latin script used for English.

Image by Conor O'Sullivan

The vector of digraphs #

A common approach in machine learning is to represent the objects of study as numerical vectors called feature vectors. So we are faced with the following question:

Question: Suppose you are given a sample of text written in a language that uses the Latin alphabet. Find a way of representing it as a numerical vector.

As machine learning has evolved, many ways to do this have been developed. We will study a particularly simple and efficient one. It begins with the following notion:

Definition: A digraph is a pair of consecutive letters that appear in a sample of text. Since the Latin alphabet has 26 letters, there are 676 possible digraphs. The digraph vector \(\vec{v}\) of a text sample counts how many times each digraph appears therein. It is a vector in \(\mathbb{R}^{676}\).

For example, consider the sentence I like homework. The first digraph is li, the second is ik, and the others are ke, ho, om, me, ew, wo, or, and rk. To construct the digraph vector, we must count the number of times each digraph appears in the text. Its first coordinate counts the occurrences of the digraph aa, the second those of ab, and so on; the last coordinate counts occurrences of the digraph zz.
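The construction above can be sketched in a few lines of Python. (The lab itself uses a Mathematica notebook; this sketch, including the helper name `digraph_vector`, is only an illustration of the idea, with digraphs counted within words so that they never span a space.)

```python
from collections import Counter
import string

def digraph_vector(text):
    """Count occurrences of each of the 676 digraphs (pairs of
    consecutive letters within a word) in a text sample."""
    counts = Counter()
    word = ""
    # Lowercase everything and split on non-letters so digraphs
    # never span word boundaries.
    for ch in text.lower():
        if ch in string.ascii_lowercase:
            word += ch
        else:
            counts.update(word[i:i + 2] for i in range(len(word) - 1))
            word = ""
    counts.update(word[i:i + 2] for i in range(len(word) - 1))
    # Lay the counts out as a vector indexed aa, ab, ..., zz.
    return [counts[a + b] for a in string.ascii_lowercase
                          for b in string.ascii_lowercase]

vec = digraph_vector("I like homework")
```

For the sentence above, the ten digraphs li, ik, ke, ho, om, me, ew, wo, or, and rk each appear once, and the remaining 666 coordinates are zero.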

As a more interesting example, the following is the digraph vector of the text of the Declaration of Independence:

The digraph vector of the Declaration of Independence

Apparently the founding fathers of the United States mentioned neither aardvarks nor pizza in this document. For languages that use slight modifications of the Latin alphabet, say by incorporating letters such as Ç, É, or Ñ, the digraph vector is formed by using the closest Latin letter for each; for instance, N instead of Ñ.
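One way to carry out this replacement automatically is to decompose each accented letter into a base letter plus a combining accent mark and then discard the accents. A minimal sketch in Python, using the standard `unicodedata` module (the helper name `fold_to_latin` is ours):

```python
import unicodedata

def fold_to_latin(text):
    """Replace accented letters with the closest plain Latin letter
    by stripping Unicode combining marks (e.g. Ñ -> N, É -> E)."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed
                   if not unicodedata.combining(ch))

folded = fold_to_latin("Señor Çelik était né")
```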

Comparing digraph vectors #

Now suppose that we computed the digraph vectors for two different documents, say \(\vec{v}\) and \(\vec{w}\). We would like to decide whether or not the two text samples were written in the same language. This raises the following question:

Question: How should we measure similarity between digraph vectors so that text samples written in the same language are close to each other, and text samples written in different languages are far from each other?

Again, there are numerous approaches to this question. We will study one method that is especially useful in a variety of other machine learning tasks.

Definition: The cosine similarity score of two feature vectors \(\vec{v}\) and \(\vec{w}\) is the angle \(\theta\) between the two vectors. We can compute it using the dot product: \(\cos\theta = \dfrac{\vec{v} \cdot \vec{w}}{\lVert\vec{v}\rVert \, \lVert\vec{w}\rVert}\).

Before working with real data, let us decide whether the cosine similarity score has a chance of working for our task. Think about the following questions before proceeding.

Exercises:
  1. What is the cosine similarity score for two identical copies of the same text sample? What if the second text sample is formed by pasting together three copies of the first?
  2. What would you expect the cosine similarity score to be if the two text samples were written in the same language? Why?
  3. What would you expect the cosine similarity score to be if the two text samples were written in two very different languages? Why?

Based on your answers to the above questions, is it reasonable to expect that the cosine similarity score might work for the language recognition task?

Language recognition using digraph vectors #

We are ready for the main part of the lab. Our first goal is to get a feel for what kind of values one should expect as cosine similarity scores for text samples written in the same language, which angles occur for documents written in related languages, and what scores occur between unrelated languages. For our data set, we will use the texts of Wikipedia articles; they are available in a variety of languages.

You will use a Mathematica notebook. It automates the process of finding articles written in a variety of languages as well as the process of computing digraph vectors. You will have to use our work from class to figure out how to compute the cosine similarity score.

Complete the following outline as you work through the lab. Take notes on your results. You will need them to answer the homework question about the lab.

Lab outline:

  1. Consider four text samples written in English. Compute their digraph vectors and find cosine similarity scores between all possible pairs. What was the average score? The largest? The smallest? Compute the angles in degrees. Report your results in this Google doc.
  2. Do the same, but with four documents written in some other language. You can find the available languages on the following list of Wikipedias, but make sure that you choose a language written in the Latin script. Report your results.
  3. Based on your work so far, what range of scores do you expect for text samples both of which are written in the same language?
  4. Now consider text samples from languages closely related to English, perhaps French, Spanish, or German. What is the range of the cosine similarity scores between these text samples and English text samples? Report your results.
  5. Now consider text samples from languages that are not closely related to English. Some examples are Finnish, Hungarian, and Vietnamese, but there are many others. What is the range of cosine similarity scores between these text samples and those from English? Report your results.
  6. What are your conclusions? Is the digraph vector effective at the language recognition task?
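The bookkeeping in steps 1 and 2, tallying the average, largest, and smallest pairwise angle among a batch of digraph vectors, can be sketched as follows (an illustrative Python sketch; the lab's Mathematica notebook does this for you, and the helper names are ours):

```python
import math
from itertools import combinations

def angle_deg(v, w):
    # Cosine similarity score as an angle in degrees.
    dot = sum(a * b for a, b in zip(v, w))
    cos_t = dot / (math.sqrt(sum(a * a for a in v))
                   * math.sqrt(sum(b * b for b in w)))
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_t))))

def pairwise_angle_stats(vectors):
    """Average, largest, and smallest angle over all pairs of vectors."""
    angles = [angle_deg(v, w) for v, w in combinations(vectors, 2)]
    return sum(angles) / len(angles), max(angles), min(angles)

avg, largest, smallest = pairwise_angle_stats([[1, 0], [0, 1], [1, 1]])
```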

The final problem related to this lab asks you to extend your observations to formulate a more complete language recognition algorithm.

Homework exercise: Using observations from your work above, describe an algorithm that a computer could use to detect the language of a text sample. To be more precise:

  • assume that you are given a text document written in some language, your job is to identify it, and
  • you are allowed to pre-compute anything you would like from a large database of text samples written in known languages.

In your write-up, outline the algorithm you propose and, using evidence from this lab, explain why you expect that this algorithm should work. Two or three detailed paragraphs should be sufficient.

Two optional projects #

There are many extensions of this work. If you have time, choose one of the following two projects to complete the lab.

Project 1: A Better Language Recognition Algorithm #

Quite a lot of research is being done on fast automatic recognition of document language, and a number of the successful algorithms are variations of the procedure we looked at today. One fault of the digraph vector is that it is very long. Often it is possible to improve both the accuracy and the speed of the algorithm by looking only at certain important digraphs. Suppose we were interested in determining the language of a document, which we knew was written in one of two languages, say either English or Language X.

  1. How would you choose the important digraphs? Remember, we would like to keep the angle small for documents of the same language, but we’d like a large angle between documents written in different languages. Hint: To maximize the angle, we’d like the dot product to be relatively small.

  2. Pick a specific Language X. What digraphs would be useful in distinguishing it from English? Why?

  3. When constrained to only these digraphs, is the algorithm improved? More specifically, what angles occur between the new digraph vectors for documents written in the same language? Between documents written in Language X and English?

  4. How can you modify this idea to deal with more than just two languages?
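One plausible heuristic for step 1, and by no means the only one, is to rank digraphs by how much their relative frequency differs between English training text and Language X training text, and keep only the top few. A hedged Python sketch (the function name and the scoring rule are our assumptions, not the lab's prescribed method):

```python
def top_discriminative_digraphs(counts_en, counts_x, k=20):
    """Indices of the k digraphs whose relative frequency differs
    most between English and Language X training counts.  Keeping
    only these coordinates shortens the vectors while helping keep
    the dot product between the two languages relatively small."""
    total_en = sum(counts_en) or 1
    total_x = sum(counts_x) or 1
    diffs = [abs(a / total_en - b / total_x)
             for a, b in zip(counts_en, counts_x)]
    return sorted(range(len(diffs)), key=diffs.__getitem__,
                  reverse=True)[:k]

chosen = top_discriminative_digraphs([10, 0, 5], [0, 10, 5], k=2)
```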

Project 2: Document Characteristic Differentiation #

It is tempting to apply the digraph algorithm to differentiate between documents that are written in a common language but either address different subjects or differ in some other fundamental way. For instance, can the digraph algorithm differentiate between articles written about sports and religion, physics and cartoons, or fiction and nonfiction?

  1. Choose a number of articles that differ in some fundamental way and implement the digraph algorithm. Be creative in your choices.
  2. Does the algorithm work in this case? Why or why not? If it works, how could it be improved? If it does not, can it be altered so that it does?