Why Correlation Does Not Equal Causation in Statistical Analysis

Bishwajit Ghose


In the world of statistics, correlation does not equal causation. This phrase means that just because two things tend to occur together, it doesn’t mean that one causes the other. For example, you might notice that ice cream sales and shark attacks both go up in the summertime. This correlation does not imply that eating ice cream causes people to be attacked by sharks, despite what the headlines might say! A more plausible explanation is that warm weather drives both, sending more people to buy ice cream and more people into the water. Although correlation does not imply causation, it can be used to investigate causation when other factors are controlled or accounted for.


It is relatively easy to find correlations

When two variables are related, we say they are correlated. This simply means that as one variable changes, the other tends to change along with it. For example, you might find that as the temperature outside increases, so does the number of ice cream sales.

However, just because two variables change together does not mean that one causes the other. The sign of a correlation only describes the direction in which the variables move together. If there is a positive correlation between two things, it means that when one goes up, the other tends to go up. If there is a negative correlation, it means that when one goes up, the other tends to go down.
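As a minimal sketch of what positive and negative correlation look like numerically, the Python snippet below computes the Pearson correlation coefficient r for two small sets of made-up numbers. The figures and the variable names (temperature_c, ice_cream_sales, hot_drink_sales) are illustrative assumptions, not data from any study.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up illustrative numbers, not real measurements.
temperature_c = [18, 21, 24, 27, 30, 33]
ice_cream_sales = [120, 150, 170, 210, 240, 260]  # rises with temperature
hot_drink_sales = [200, 180, 150, 130, 100, 90]   # falls with temperature

r_pos = pearson_r(temperature_c, ice_cream_sales)
r_neg = pearson_r(temperature_c, hot_drink_sales)

print(f"temperature vs ice cream:  r = {r_pos:.2f}")
print(f"temperature vs hot drinks: r = {r_neg:.2f}")
```

With these numbers the first coefficient comes out close to +1 and the second close to -1. Neither value says anything about why the variables move together.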


When you look at two variables, you are almost always going to find some sort of correlation between them

A correlation on its own, however, does not show that one variable causes the other. The most direct way to establish causation is a controlled experiment, ideally with random assignment, in which you change one variable and see how the other responds. Even then, there might be confounding factors that you haven’t accounted for. So, while correlation can be a helpful tool in statistical analysis, it’s important to remember that it doesn’t necessarily equal causation.

For example, if you look at students’ grades and their class attendance, it is very likely that they will show some sort of positive correlation—students who attend class regularly tend to have higher grades than those who skip class frequently.

Is poor classroom attendance actually what’s causing lower grades? Maybe, but there could also be another factor at play. Say, for instance, that a student has dyslexia and struggles with reading comprehension. They may feel embarrassed about being different from their classmates or think that others won’t understand them. Consequently, this student may miss more classes than their peers in an attempt to avoid reading aloud in front of the whole class or participating in group discussions where everyone else seems smarter than they are. The student may also concentrate better alone, and so skips class during the day and studies at home at night. In this scenario the reading difficulty affects both variables: it keeps the student away from class, and it makes the coursework itself harder. The missed teacher instruction and social interaction may well make things worse, but attendance is not the whole story.

Of course, many factors contribute to a student’s performance in school; the point here is that something other than attendance alone may be driving low grades. If you observe that poor attendance is statistically associated with lower grades and conclude that one causes the other without accounting for any other possible influences, you are committing the error this article is about, sometimes called the false-cause fallacy (cum hoc ergo propter hoc).
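The difference between observing attendance and experimentally assigning it can be sketched with simulated data. Everything in the Python snippet below is an assumption made for illustration: the hidden motivation variable, the grade formula, and the TRUE_EFFECT of 2 grade points are invented, not estimates from real students.

```python
import random
from statistics import mean

random.seed(1)
TRUE_EFFECT = 2.0  # assumed gain in grade points from attending (hypothetical)

def grade(attends, motivation):
    # Hypothetical model: grades depend strongly on motivation, weakly on attendance.
    return 60 + 10 * motivation + TRUE_EFFECT * attends + random.gauss(0, 3)

def mean_gap(attend, grades):
    """Average grade of attenders minus average grade of non-attenders."""
    yes = [g for a, g in zip(attend, grades) if a == 1]
    no = [g for a, g in zip(attend, grades) if a == 0]
    return mean(yes) - mean(no)

n = 20000
motivation = [random.gauss(0, 1) for _ in range(n)]

# Observational data: more motivated students are more likely to attend.
obs_attend = [1 if m + random.gauss(0, 1) > 0 else 0 for m in motivation]
obs_grades = [grade(a, m) for a, m in zip(obs_attend, motivation)]

# Experiment: attendance is assigned by coin flip, independent of motivation.
exp_attend = [random.randint(0, 1) for _ in range(n)]
exp_grades = [grade(a, m) for a, m in zip(exp_attend, motivation)]

obs_gap = mean_gap(obs_attend, obs_grades)
exp_gap = mean_gap(exp_attend, exp_grades)

print(f"observed gap:     {obs_gap:.1f} grade points")
print(f"experimental gap: {exp_gap:.1f} grade points")
```

In this simulation the observed gap is several times larger than the assumed true effect, because motivation raises both attendance and grades. Random assignment breaks that link, so the experimental gap lands near the assumed effect.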


Beware that when you find correlated data, it may be because one variable causes the other (a causal relationship) or simply because both depend on something else (a common cause)

In other words, correlation tells us that two things are related, but it doesn’t tell us why they’re related. This is important to keep in mind because it’s easy to assume that if two things are correlated, then one must be causing the other. However, this is not always the case. There may be a third factor (a common cause) that is affecting both variables. Consider cigarette smoking and lung cancer. If we plotted their data on a graph and found a strong correlation, we might conclude that smoking causes cancer, but the graph alone cannot rule out a common cause. Settling the question took further investigation by epidemiologists. One finding sometimes mentioned here is that heavy smokers have higher levels of radioactive substances in their lungs than light smokers do. That is not a separate common cause, though: to the extent those substances come from the tobacco smoke itself, they are part of how smoking can lead to lung cancer over time, which supports the causal reading rather than replacing it.
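A common cause is easy to reproduce in a simulation. In the Python sketch below, a temperature variable drives both ice cream sales and shark attacks, while neither of the two appears in the other’s formula. The coefficients, noise levels, and variable names are assumptions chosen for illustration only.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

random.seed(0)

# The common cause: daily temperature (hypothetical values in degrees Celsius).
temperature = [random.uniform(10, 35) for _ in range(2000)]

# Both outcomes depend on temperature plus independent noise.
# Ice cream sales never enter the shark formula, and vice versa.
ice_cream = [5.0 * t + random.gauss(0, 15) for t in temperature]
shark_attacks = [0.2 * t + random.gauss(0, 1) for t in temperature]

r = pearson_r(ice_cream, shark_attacks)
print(f"ice cream vs shark attacks: r = {r:.2f}")
```

The two simulated series end up clearly positively correlated even though, by construction, neither has any effect on the other.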


Examples of how being aware of this distinction can affect your research

When researching a potential causal relationship between two variables, X and Y, be sure to investigate whether a third variable, Z, is influencing both X and Y. A variety of factors could be at play. For example, let’s say you want to research the effect of studying on test scores. You might find that students who study more tend to get higher grades on their tests. But what if they also happen to have higher IQs? If the correlation between studying and grades disappeared once you controlled for IQ, it would be wrong to conclude that studying improves test scores. In that case we would have a spurious correlation: the apparent link between studying and scores would reflect the third variable rather than an effect of studying itself. If the correlation survived the control, studying would remain a plausible cause, though still not a proven one.

Similarly, just because people who eat chocolate weigh less doesn’t mean that eating chocolate makes them lose weight. It may be a result of exercise or any number of other confounding factors.

Finally, the correlation between smoking and lung cancer did not, on its own, prove that smoking causes lung cancer. Many factors are associated with an individual’s risk of developing lung cancer, including age, genetics, environment, occupation and race. What makes the causal conclusion credible is evidence beyond the correlation: smoking exposes the lungs to carcinogens, which increases the risk of developing lung cancer. Cigarette smoke contains over 7,000 chemicals, many of them toxic, including hydrogen cyanide, carbon monoxide, benzene, ammonia and formaldehyde. Smoking is therefore an established cause of lung cancer, even though not every smoker develops the disease and other factors also contribute.
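One simple way to check for a third variable Z is a partial correlation: remove the part of X and the part of Y that a straight-line fit on Z can explain, then correlate what is left. The Python sketch below does this on simulated data in which Z drives both X and Y. The data-generating formulas and the helper name residuals are assumptions for illustration, and this approach only removes the linear influence of Z.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def residuals(ys, zs):
    """What is left of ys after a least-squares straight-line fit on zs."""
    n = len(ys)
    mean_y = sum(ys) / n
    mean_z = sum(zs) / n
    slope = (sum((z - mean_z) * (y - mean_y) for z, y in zip(zs, ys))
             / sum((z - mean_z) ** 2 for z in zs))
    intercept = mean_y - slope * mean_z
    return [y - (intercept + slope * z) for y, z in zip(ys, zs)]

random.seed(42)

# Hypothetical data: Z influences both X and Y; X has no effect on Y.
z = [random.gauss(0, 1) for _ in range(2000)]
x = [zi + random.gauss(0, 1) for zi in z]
y = [zi + random.gauss(0, 1) for zi in z]

raw_r = pearson_r(x, y)
partial_r = pearson_r(residuals(x, z), residuals(y, z))

print(f"correlation of X and Y:           {raw_r:.2f}")
print(f"after controlling for Z (linear): {partial_r:.2f}")
```

Here the raw correlation is clearly positive, while the partial correlation is close to zero, which is the pattern you would expect from a spurious correlation. In real research Z has to be measured, and an unmeasured confounder cannot be controlled for this way.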