Anscombe’s quartet
Regression is a powerful tool. Heed the following warning:
In a seminal article, F.J. Anscombe developed four data sets as a caution for budding statisticians. They are summarized in the following four graphs, along with the line of best fit, computed by minimizing the mean squared error of its predictions, just as we did in the previous sections.
In each case, the line of best fit is essentially the same, \(y=f(x)=0.5x+3\) once the coefficients are rounded to two decimal places, and it has nearly the same mean squared error. But each data set tells a different story:
- The first data set suggests a linear relationship between \(x\) and \(y\), perhaps with a little noise.
- The second data set suggests a strong relationship between \(x\) and \(y\), although it is clear that the function that best fits this data is not a straight line!
- The third data set also suggests a strong linear relationship between \(x\) and \(y\), although the line should probably be a different one: the regression is pulled off course by a single outlier in the data.
- The last data set suggests no real relationship between \(x\) and \(y\), but the presence of one point far from the rest of the data confuses the issue.
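The claim that all four data sets share nearly the same regression line can be checked numerically. The following is a minimal sketch using only the Python standard library. It assumes the data values as they are commonly reproduced from Anscombe's article, and the names `quartet`, `fit_line`, and `fits` are chosen here for illustration. The fit uses the closed-form least-squares solution for a straight line, which minimizes the mean squared error.

```python
# Anscombe's four data sets, as commonly reproduced from his article
# (assumption: values transcribed here, not loaded from a library).
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]  # shared by sets I, II, III
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96,
                 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10,
                  6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84,
                   6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04,
            5.25, 12.50, 5.56, 7.91, 6.89]),
}


def fit_line(xs, ys):
    """Return (slope, intercept, mse) for the least-squares line y = slope*x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Closed-form solution of the least-squares problem for a line.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Mean squared error of the fitted line's predictions.
    mse = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys)) / n
    return slope, intercept, mse


fits = {name: fit_line(xs, ys) for name, (xs, ys) in quartet.items()}
for name, (slope, intercept, mse) in fits.items():
    print(f"{name}: y = {slope:.3f}x + {intercept:.3f}, MSE = {mse:.3f}")
```

The four printed lines should agree closely in slope, intercept, and mean squared error, even though the graphs look nothing alike. The summary numbers alone cannot distinguish the four stories.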
This quartet illustrates the importance of visually inspecting a data set before analyzing it computationally. Anscombe writes: