In her final lecture, my statistics professor described the “7 deadly sins” of statistics in cartoon form. Enjoy!

## 1. Correlation ≠ Causation

xkcd: Correlation

Dilbert: Correlation

## 2. Displaying Data Badly

xkcd: Convincing

#### Further reading on displaying data badly

The American Statistician: How to Display Data Badly by Howard Wainer

Johns Hopkins Bloomberg School of Public Health: How to Display Data Badly by Karl Broman

## 3. Failing to Assess Model Assumptions

DavidMLane.com: Statistics Cartoons by Ben Shabad

## 4. Over-Reliance on Hypothesis Testing

xkcd: Null Hypothesis

While we’re on the topic of hypothesis testing, don’t forget…

#### We can *fail to reject* the null hypothesis.

#### But we never *accept* the null hypothesis.

## 5. Drawing Inference from Biased Samples

Dilbert: Inferences

## 6. Data Dredging

If you try hard enough, eventually you can build a model that fits your data set.

Steve Moore: Got one

The key is to test the model on a new set of data, called a validation set. This can be done by splitting your data before building the model. Build the model using 80% of your original data, called a training set. Validate the model on the last 20% that you set aside at the beginning. Compare how the model performs on each of the two sets.

For example, let’s say you built a regression model on your training set (80% of the original data). Maybe it produces an R-squared value of 0.50, suggesting that your model explains 50% of the variation observed in the training set. In other words, the R-squared value is a way to assess how “good” the model is at describing the data, and at 0.50 it’s not that great.

Then, let’s say you try the model on the validation set (20% of the original data), and it produces an R-squared value of 0.25, suggesting your model explains only 25% of the variation observed in the validation set. The predictive ability of the model seems to depend on which data set is used: it is better on the training set (R-squared 0.50) than on the validation set (R-squared 0.25). A gap like this is a sign of **overfitting** of the model to the training set. Overfitting gives the impression that the model is more accurate than it really is. The true ability of the model can only be assessed once it has been validated on new data.
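Here is a minimal sketch of that 80/20 workflow in plain Python, in case it helps to see the steps spelled out. Everything in it is an assumption made for illustration: the data are randomly generated, the model is a simple one-predictor linear regression, and the helper names (`fit_line`, `r_squared`, `train_validation_split`) are made up for this example.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def r_squared(xs, ys, intercept, slope):
    """R-squared = 1 - SS_residual / SS_total on the data being scored."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


def train_validation_split(rows, train_fraction=0.8, seed=0):
    """Shuffle a copy of the rows, then cut it into training and validation sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Made-up data (an assumption for this sketch): a noisy linear relationship.
rng = random.Random(42)
data = []
for _ in range(100):
    x = rng.uniform(0, 10)
    data.append((x, 2.0 * x + rng.gauss(0, 10)))

# Split BEFORE building the model, then fit on the training set only.
train, validation = train_validation_split(data)
train_x, train_y = zip(*train)
valid_x, valid_y = zip(*validation)
intercept, slope = fit_line(train_x, train_y)

# Score the same fitted model on both sets and compare.
r2_train = r_squared(train_x, train_y, intercept, slope)
r2_validation = r_squared(valid_x, valid_y, intercept, slope)
print(f"R-squared on training set:   {r2_train:.2f}")
print(f"R-squared on validation set: {r2_validation:.2f}")
```

With a model this simple the two numbers will often be close. The more flexible the model and the more you tune it to the training set, the larger the gap tends to become. Note that R-squared on a validation set can even be negative, which means the model predicts worse than simply guessing the validation mean.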

## 7. Extrapolating Beyond Range of Data

xkcd: Extrapolating

## Similar Ideas Elsewhere

Columbia: “Lies, damned lies, and statistics”: the seven deadly sins

Child Neuropsychology: Statistical practices: the seven deadly sins

Annals of Plastic Surgery: The seven deadly sins of statistical analysis

#### Sources

xkcd: Correlation

Dilbert: Correlation

xkcd: Convincing

The American Statistician: How to Display Data Badly by Howard Wainer

Johns Hopkins Bloomberg School of Public Health: How to Display Data Badly by Karl Broman

DavidMLane.com: Statistics Cartoons by Ben Shabad

xkcd: Null Hypothesis

Dilbert: Inferences

Steve Moore: Got one

Wiki: Overfitting

xkcd: Extrapolating

Columbia: “Lies, damned lies, and statistics”: the seven deadly sins

Child Neuropsychology: Statistical practices: the seven deadly sins

Annals of Plastic Surgery: The seven deadly sins of statistical analysis
