Datasets #
Below are four standard data sets used in machine learning. In each, our goal is to represent the data about which we are asked to make a prediction as a numerical vector. In some instances, this is simpler than in others.
Boston housing prices #
Housing prices in the Boston area are easy to predict. Today, they are essentially infinite. But it wasn’t always like this. The house price data set from the 1970s collected in Boston and its suburbs. The data set covers roughly 500 neighborhoods; we are given twelve features for each:
1. CRIM per capita crime rate by town
2. ZN proportion of residential land zoned for lots over
25,000 sq.ft.
3. INDUS proportion of non-retail business acres per town
4. CHAS Charles River dummy variable (= 1 if tract bounds
river; 0 otherwise)
5. NOX nitric oxides concentration (parts per 10 million)
6. RM average number of rooms per dwelling
7. AGE proportion of owner-occupied units built prior to 1940
8. DIS weighted distances to five Boston employment centres
9. RAD index of accessibility to radial highways
10. TAX full-value property-tax rate per $10,000
11. PTRATIO pupil-teacher ratio by town
12. LSTAT % lower status of the population
Problem: Given this collection of twelve statistics, estimate the median house price in each of Boston’s neighborhoods.
In this example, we can easily create a numerical vector to summarize each neighborhood. Simply use all the numbers for a neighborhood as features. Here is the feature vector for neighborhood “zero”:
\[(0.00632, 18.0 , 2.31 , 0 ,0.538 , 6.575, 65.2 ,4.0900 , 1 , 296.0, 15.3, 4.98)\]
The median value of homes in this neighborhood at the time, that is, the number we are trying to estimate, was \($24,000\).
Iris flower data set #
Ronald Fisher almost single-handedly created modern statistics. His Iris flower data set quantifies the morphologic variation of three different species of irises.
The data set contains four measurements from fifty individual flowers:
1. SEPAL LENGTH
2. SEPAL WIDTH
3. PETAL LENGTH
4. PETAL WIDTH
Problem: Given these four measurements for an individual flower, identify its species.
Again, we can easily create a numerical vector to summarize each individual. Simply use all the measurements as features. Here are the measurements for the first flower in the data set:
\[(5.1, 3.5, 1.4, 0.2)\]
It turns out to be from an Iris setosa.
Language recognition #
There are a couple of common data sets used when studying language, but perhaps the most popular approach is to use Wikipedia articles.
Problem: Given a text sample, decide the language in which it is written.
Here it is a little harder to condense the richness of a text sample into a numerical vector. But the following is fairly effective:
For example, consider the sentence I like homework. The first digraph is li, the second is ik, and the others are ke, ho, om, me, ew, wo, or, and rk. To construct the digraph vector, we must count the number of times each digraph appears in the text. Its first coordinate is the the occurrences of the digraph aa, the second of ab, and so on; the last coordinate counts occurrences of the digraph zz.
As a more interesting example, the following is the digraph vector of the text of the Declaration of Independence:
The identification algorithm should identify the language of this text as English.
Image classification #
One of the oldest and most-studied problems in computer vision is the automated reading of human handwriting, and in particular, automated recognition of hand-written digits. The first implementations of machine ZIP code reading systems occurred in the 1980s, but the results were generally poor and suffered from high error rates.
Problem: Given the image of a hand written digit, identify it.
How does one form a feature vector for each image? The images are are in fact 28-by-28 arrays of pixels. Each pixel represents a shade of gray which we think of as a number between 1 (black) and 0 (white). The 28-by-28 array can be written as one 784-dimensional vector. Here is one:
The task of machine learning is to identify this as representing the digit 7.