Multivariable calculus: Anscombe's quartet

Anscombe’s quartet

Regression is a powerful tool. Heed the following warning:

Caution: Just because we now know how to find the line of best fit to any data set, it does not mean we always should.

In a seminal article, F.J. Anscombe developed four data sets as a caution for budding statisticians. They are summarized in the following four graphs, along with the line of best fit computed by minimizing the mean squared error made by its predictions, just as we did in the previous sections.

   
   
Four synthetic data sets, dubbed Anscombe's quartet.

In each, the line of best fit is the same to two decimal places, \(y=f(x)\approx 0.5x+3\), and the mean squared error is nearly identical. But each data set tells a different story:

  1. The first data set suggests a linear relationship between \(x\) and \(y\), perhaps with slight noise.

  2. The second data set suggests a strong relationship between \(x\) and \(y\), although it is clear that the function that best fits this data is not a straight line!

  3. The third data set also suggests a strong linear relationship between \(x\) and \(y\), although the line that describes it should probably be a different one. The regression is confused by one outlier in the data.

  4. The last data set suggests no real relationship between \(x\) and \(y\), but the presence of one point far from the rest of the data confuses the issue.
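The claim that all four data sets share essentially the same line of best fit can be checked numerically. The following is a minimal sketch, assuming the data values as they are commonly reproduced from Anscombe's article; the helper name `fit_line` and the `quartet` dictionary are illustrative choices, not part of the original text. It uses the closed-form least-squares solution for a line rather than any particular library.

```python
# Anscombe's quartet, with values as commonly reproduced from Anscombe's
# article (an assumption: check them against your own copy of the data).
# Data sets 1-3 share the same x values.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    1: (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    2: (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    3: (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    4: ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
        [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def fit_line(xs, ys):
    """Return (slope, intercept, mse) for the least-squares line.

    Hypothetical helper: minimizes the mean squared error of the
    predictions slope * x + intercept using the closed-form solution.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of squared and cross deviations from the means.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Mean squared error of the fitted line on the same data.
    mse = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys)) / n
    return slope, intercept, mse


# Fit each data set and report the line together with its error.
results = {k: fit_line(xs, ys) for k, (xs, ys) in quartet.items()}
for k, (slope, intercept, mse) in results.items():
    print(f"data set {k}: y = {slope:.3f}x + {intercept:.3f}, MSE = {mse:.3f}")
```

With these values, the four slopes, intercepts, and errors should agree closely with one another, which is exactly why the summary numbers alone cannot distinguish the four stories in the list above.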

This quartet illustrates the importance of visually inspecting a data set before analyzing it computationally. Anscombe writes:

Most kinds of statistical calculation rest on assumptions about the behavior of the data. Those assumptions may be false, and then the calculations may be misleading. We ought always to try to check whether the assumptions are reasonably correct; and if they are wrong we ought to be able to perceive in what ways they are wrong. Graphs are very valuable for these purposes.