Simpson’s Paradox

From Wikipedia

“Simpson’s paradox, or the Yule–Simpson effect, is a phenomenon in probability and statistics, in which a trend appears in several different groups of data but disappears or reverses when these groups are combined. It is sometimes given the descriptive title reversal paradox or amalgamation paradox.”

This seems counterintuitive, but the five-minute video below explains the concept well.
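As a quick illustration, here is a minimal Python sketch of the reversal using made-up success counts for two hypothetical treatments (A and B) in two patient groups; all numbers are invented for demonstration.

```python
# Hypothetical (successes, total) counts for two treatments in two groups
data = {
    "group X": {"A": (8, 10), "B": (70, 100)},
    "group Y": {"A": (30, 100), "B": (2, 10)},
}

def rate(successes, total):
    return successes / total

# Within each group, treatment A has the higher success rate...
for group, treatments in data.items():
    ra = rate(*treatments["A"])
    rb = rate(*treatments["B"])
    print(f"{group}: A={ra:.0%}, B={rb:.0%}")

# ...but pooled across groups, treatment B looks better: the trend reverses.
tot_a = [sum(x) for x in zip(*(g["A"] for g in data.values()))]
tot_b = [sum(x) for x in zip(*(g["B"] for g in data.values()))]
print(f"combined: A={rate(*tot_a):.0%}, B={rate(*tot_b):.0%}")
```

The reversal happens because treatment A was tried mostly in the harder group while B was tried mostly in the easier one, so the pooled rates are dominated by group composition rather than treatment effect.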


Source

Wikipedia: Simpson’s paradox

Minute Physics: Simpson’s Paradox

Best Data Science Courses Online

The Best Free Data Science Courses on the Internet

Data science is blossoming as a field. Jargon from traditional statistics and newer machine learning techniques alike is used colloquially in both online articles and day-to-day exchanges. One of the excellent things about data science, as David Venturi notes, is that the field is by nature computer-based, so why not learn it all for free online? Venturi has written several articles listing massive open online courses (MOOCs), useful both for someone interested in a single highly ranked data science class and for the more dedicated individual pursuing a complete master's degree in data science. One benefit of these courses is that they are focused, covering only the knowledge relevant to applying data science skills. Another perk is the nonexistent price tag, as opposed to the tens or hundreds of thousands of dollars of student loans one could take on while pursuing a data science master's at a formal institution. Venturi explains why he left grad school to learn data science before finishing his first semester. If nothing else, some of these courses may be a useful supplement to a graduate school education.


Sources

FreeCodeCamp.org: David Venturi

FreeCodeCamp.org: The best Data Science courses on the internet, ranked by your reviews

FreeCodeCamp.org: If you want to learn Data Science, take a few of these statistics classes

Medium.com: I Dropped Out of School to Create My Own Data Science Master’s — Here’s My Curriculum

The 7 Deadly Sins of Data Analysis

In her final lecture, my statistics professor described the "7 deadly sins" of statistics in cartoon form. Enjoy!


1. Correlation ≠ Causation

Correlation

xkcd: Correlation


Dilbert: Correlation


2. Displaying Data Badly

Convincing

xkcd: Convincing

Further reading on displaying data badly

The American Statistician: How to Display Data Badly by Howard Wainer

Johns Hopkins Bloomberg School of Public Health: How to Display Data Badly by Karl Broman


3. Failing to Assess Model Assumptions


DavidMLane.com: Statistics Cartoons by Ben Shabad


4. Over-Reliance on Hypothesis Testing

Null Hypothesis

xkcd: Null Hypothesis

While we’re on the topic of hypothesis testing, don’t forget…

We can fail to reject the null hypothesis.

But we never accept the null hypothesis.


5. Drawing Inference from Biased Samples


Dilbert: Inferences


6. Data Dredging

If you try hard enough, eventually you can build a model that fits your data set.


Steve Moore: Got one

The key is to test the model on a new set of data, called a validation set. This can be done by splitting your data before building the model: build the model using 80% of your original data (the training set), then validate it on the 20% you set aside at the beginning, and compare how the model performs on each of the two sets.

For example, let's say you built a regression model on your training set (80% of the original data). Maybe it produces an R-squared value of 0.50, meaning your model explains 50% of the variation observed in the training set. In other words, the R-squared value is one way to assess how "good" the model is at describing the data, and at 0.50 it's not that great.

Then, let's say you try the model on the validation set (20% of the original data), and it produces an R-squared value of 0.25, meaning your model explains only 25% of the variation observed in the validation set. The model's apparent performance depends on which data set is used: it does better on the training set (R-squared 0.50) than on the validation set (R-squared 0.25). This is called overfitting the model to the training set. Overfitting gives the impression that the model is more accurate than it really is; the true ability of the model can only be assessed once it has been validated on new data.
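The train/validation workflow above can be sketched in a few lines of Python. This is a minimal example on synthetic data (a noisy linear relationship I made up for illustration), using NumPy to fit a simple regression on 80% of the points and check R-squared on the held-out 20%.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: a noisy linear relationship (invented for illustration)
x = rng.uniform(0, 10, 100)
y = 2.0 * x + rng.normal(0, 5, 100)

# Shuffle the indices, then split 80/20 into training and validation sets
idx = rng.permutation(100)
train, valid = idx[:80], idx[80:]

# Fit a simple linear regression on the training set only
slope, intercept = np.polyfit(x[train], y[train], 1)

def r_squared(x_sub, y_sub):
    """Fraction of the variation in y_sub explained by the fitted line."""
    pred = slope * x_sub + intercept
    ss_res = np.sum((y_sub - pred) ** 2)
    ss_tot = np.sum((y_sub - np.mean(y_sub)) ** 2)
    return 1 - ss_res / ss_tot

r2_train = r_squared(x[train], y[train])
r2_valid = r_squared(x[valid], y[valid])
print(f"training R-squared:   {r2_train:.2f}")
print(f"validation R-squared: {r2_valid:.2f}")
```

A large gap between the two R-squared values is the warning sign of overfitting; here the two should be broadly similar, because the "model" (a straight line) is too simple to memorize noise.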


7. Extrapolating Beyond Range of Data

Extrapolating

xkcd: Extrapolating


Similar Ideas Elsewhere

Columbia: “Lies, damned lies, and statistics”: the seven deadly sins

Child Neuropsychology: Statistical practices: the seven deadly sins

Annals of Plastic Surgery: The seven deadly sins of statistical analysis

Statistics Done Wrong


Sources

xkcd: Correlation

Dilbert: Correlation

xkcd: Convincing

The American Statistician: How to Display Data Badly by Howard Wainer

Johns Hopkins Bloomberg School of Public Health: How to Display Data Badly by Karl Broman

DavidMLane.com: Statistics Cartoons by Ben Shabad

xkcd: Null Hypothesis

Dilbert: Inferences

Steve Moore: Got one

Wiki: Overfitting

xkcd: Extrapolating

Columbia: “Lies, damned lies, and statistics”: the seven deadly sins

Child Neuropsychology: Statistical practices: the seven deadly sins

Annals of Plastic Surgery: The seven deadly sins of statistical analysis

Statistics Done Wrong

ANOVA: Analysis of Variance

Conceptual Introduction to ANOVA

Brandon Foltz has an excellent YouTube course on introductory statistics. He explains each concept first for motivation, then covers the techniques in subsequent, more detailed videos. After introducing ANOVA in general, he provides videos on the underlying math:

One-way ANOVA with an example in Excel

Two-way ANOVA without replication with an example in Excel

Brandon has other videos in this course on ANOVA, but these cover the basic concepts.


 

One-Way ANOVA Table Quick Math

A clear, concise explanation of where the numbers come from in a one-way ANOVA table.
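To make the table concrete, here is a minimal Python sketch that computes the standard one-way ANOVA quantities by hand (sums of squares, degrees of freedom, mean squares, and the F statistic) on three small hypothetical groups I made up for illustration.

```python
import numpy as np

# Three hypothetical groups (e.g., measurements under three treatments)
groups = [
    np.array([4.0, 5.0, 6.0, 5.0]),
    np.array([6.0, 7.0, 8.0, 7.0]),
    np.array([9.0, 10.0, 11.0, 10.0]),
]

all_data = np.concatenate(groups)
grand_mean = all_data.mean()
k = len(groups)    # number of groups
n = len(all_data)  # total observations

# Between-group sum of squares: how far group means sit from the grand mean
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
# Within-group sum of squares: how far observations sit from their own group mean
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

df_between = k - 1
df_within = n - k
ms_between = ss_between / df_between
ms_within = ss_within / df_within
f_stat = ms_between / ms_within  # large F => group means differ more than noise explains

print(f"SS_between={ss_between:.2f} (df={df_between}), "
      f"SS_within={ss_within:.2f} (df={df_within}), F={f_stat:.2f}")
```

Each row of a one-way ANOVA table is just one of these quantities: a source of variation, its sum of squares, its degrees of freedom, the mean square (SS/df), and finally the F ratio of the two mean squares.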


 

Interactive ANOVA


Finally, this website made by students at Brown University lets you tinker with a graph designed to illustrate ANOVA concepts. It has interactive graphs for other statistical concepts as well.


 

Sources

Statistics 101: ANOVA, A Visual Introduction by Brandon Foltz

Statistics 101: One-way ANOVA, A Visual Tutorial by Brandon Foltz

Statistics 101: One-way ANOVA, Understanding the Calculation by Brandon Foltz

Statistics 101: Two-way ANOVA w/o Replication, A Visual Guide by Brandon Foltz

Statistics 101: Two-way ANOVA w/o Replication, The Calculation by Brandon Foltz

ArmstrongPSYC2190: One Way ANOVA

Seeing Theory: Analysis of Variance

Fixed Effects vs Random Effects Models

What is a fixed effects model? What is a random effects model? What is the difference between them? Many people around me have been using these terms over and over in the past few weeks, so I compiled several 5-10 minute videos of people answering these questions well online.


If I had to answer the question of what fixed and random effects models are in one image, I would choose this one from the Indian Journal of Dermatology. Watch the videos and come back to this image for a quick reminder of these concepts.


Motivating Example: Meta-Analysis of Bieber Fever

This silly example is a simple demonstration of when fixed and random effects models should be used in designing a meta-analysis. This video is aimed at the medical student.


Summary of Fixed and Random Effects Models

This summary video is a bit more technical and is aimed at a student of epidemiology or biostatistics.


What is Heterogeneity?

The concept of heterogeneity kept coming up in these videos. How is it different from random chance? This video clearly explains the difference and defines concepts alluded to in the previous videos.
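The ideas from these videos can be sketched numerically. Below is a minimal Python example of the standard inverse-variance fixed-effect pooling, Cochran's Q and the I-squared heterogeneity statistic, and the DerSimonian-Laird random-effects adjustment; the study effect sizes and variances are hypothetical numbers invented for illustration.

```python
import numpy as np

# Hypothetical study effect sizes (e.g., log odds ratios) and their variances
effects = np.array([0.10, 0.30, 0.35, 0.60, 0.90])
variances = np.array([0.04, 0.02, 0.05, 0.03, 0.06])
k = len(effects)

# Fixed-effect model: weight each study by the inverse of its variance
w = 1 / variances
fixed = np.sum(w * effects) / np.sum(w)

# Cochran's Q measures variation among studies beyond chance;
# I^2 expresses the excess as a percentage of total variation
q = np.sum(w * (effects - fixed) ** 2)
i_squared = max(0.0, (q - (k - 1)) / q) * 100

# DerSimonian-Laird estimate of the between-study variance (tau^2)
tau2 = max(0.0, (q - (k - 1)) / (np.sum(w) - np.sum(w**2) / np.sum(w)))

# Random-effects model: widen each study's variance by tau^2 before weighting
w_star = 1 / (variances + tau2)
random_ = np.sum(w_star * effects) / np.sum(w_star)

print(f"fixed-effect estimate:   {fixed:.3f}")
print(f"random-effects estimate: {random_:.3f}")
print(f"Q={q:.2f}, I^2={i_squared:.0f}%, tau^2={tau2:.3f}")
```

When the studies agree (Q near its degrees of freedom, I-squared near 0%), tau-squared shrinks toward zero and the two models give the same answer; when heterogeneity is present, the random-effects model spreads weight more evenly across studies.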


Sources

Indian Journal of Dermatology: Understanding and evaluating systematic reviews and meta-analyses

Brian Cohn: Fixed and Random Effects Models and Bieber Fever

Terry Shaneyfelt: Fixed Effects and Random Effects Models

Terry Shaneyfelt: What is Heterogeneity?
