The Dangers of Deceptive Data Part 2–Base Proportions and Bad Statistics

An accessible dive into correlation, base proportions, summary statistics, and uncertainty.

This is a follow-up to my earlier article: The Dangers of Deceptive Data–Confusing Charts and Misleading Headlines. My first article focused on how visualizations can be used to mislead, diving into a form of data presentation widely used in public discourse.

In this article, I go a bit deeper, looking at how misunderstanding core statistical ideas creates a breeding ground for being deceived by data. Specifically, I’ll walk through how correlation, base proportions, summary statistics, and misinterpretation of uncertainty can lead people astray.

Let’s get right into it.

Correlation ≠ Causation

Let’s start with a classic to get in the right frame of mind for some more complex ideas. From the earliest statistics classes in grade school, we are all told that correlation is not equal to causation.

If you do a bit of Googling or reading, you can find “statistics” that show a high correlation between cigarette consumption and average life expectancy [1]. Interesting. Well, does that mean we should all start smoking to live longer?

Of course not. We’re missing a confounding factor: buying cigarettes requires money, and wealthier countries understandably have higher life expectancies. There is no causal link between cigarettes and longevity. I like this example because it is so blatantly misleading and highlights the point well. In general, it’s important to be wary of any data that only shows a correlational link.

From a scientific standpoint, a correlation can be identified via observation, but the only way to claim causation is to actually conduct a randomized trial controlling for potential confounding factors—a fairly involved process.
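To make the role of the confounder concrete, here is a minimal simulation sketch in Python. Everything in it is an illustrative assumption (the variable names, the coefficients, the noise levels); the point is simply that when wealth drives both cigarette purchases and life expectancy, the two end up strongly correlated even though neither causes the other.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 1_000  # number of simulated countries

# Confounder: national wealth (standardized units).
wealth = rng.normal(loc=0.0, scale=1.0, size=n)

# Cigarette consumption depends on wealth (cigarettes cost money), plus noise.
cigarettes = 2.0 * wealth + rng.normal(scale=1.0, size=n)

# Life expectancy depends on wealth (healthcare, nutrition), NOT on cigarettes.
life_expectancy = 70.0 + 5.0 * wealth + rng.normal(scale=2.0, size=n)

# A strong correlation appears despite the absence of any causal link.
r = np.corrcoef(cigarettes, life_expectancy)[0, 1]
print(f"Correlation between cigarettes and life expectancy: {r:.2f}")
```

With these made-up numbers, the printed correlation lands around 0.8, and the only way to discover that it is spurious is to account for wealth.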

I chose to start here because while being introductory, this concept also highlights a key idea that underpins understanding data effectively: The data only shows what it shows, and nothing else.

Keep that in mind as we move forward.

Remember Base Proportions

In 1978, Dr. Ward Casscells and his team famously asked a group of 60 physicians, residents, and students at Harvard Medical School the following question:

“If a test to detect a disease whose prevalence is 1 in 1,000 has a false positive rate of 5%, what is the chance that a person found to have a positive result actually has the disease, assuming you know nothing about the person’s symptoms or signs?”

Though presented in medical terms, this question is really about statistics. Accordingly, it also has connections to data science. Take a second to think about your own answer to this question before reading further.


The answer is (approximately) 2%. Now, if you read through the question quickly (and aren’t up to speed on your statistics), you may have guessed significantly higher.

This was certainly the case with the medical school folks. Only 11 of the 60 respondents answered correctly, with 27 of the 60 going as high as 95% in their response (presumably by just subtracting the 5% false positive rate from 100%).

It is easy to assume that the actual value should be high due to the positive test result, but this assumption contains a crucial reasoning error: It fails to account for the extremely low prevalence of the disease in the population.

Said another way, if only 1 in every 1,000 people has the disease, this needs to be taken into account when calculating the probability of a random person having the disease. The probability does not rely only on the positive test result. As soon as the test accuracy falls below 100%, the influence of the base rate comes into play quite significantly.

Formally, this reasoning error is known as the base rate fallacy.

To see this more clearly, imagine that only 1 in every 1,000,000 people has the disease, but the test still has a false positive rate of 5%. Would you still assume that a positive test result immediately indicates a 95% chance of having the disease? What if it were 1 in a billion?
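If you want to check the arithmetic behind that 2%, Bayes’ theorem does it in a few lines. Here is a minimal sketch, assuming the test has perfect sensitivity (it never misses a true case), which the original question implies but never states:

```python
def p_disease_given_positive(prevalence, false_positive_rate, sensitivity=1.0):
    """Bayes' theorem: P(disease | positive test result)."""
    p_positive = (prevalence * sensitivity
                  + (1 - prevalence) * false_positive_rate)
    return prevalence * sensitivity / p_positive

# The original question: prevalence of 1 in 1,000, false positive rate of 5%.
print(p_disease_given_positive(1 / 1_000, 0.05))      # ~0.0196, i.e. about 2%

# The thought experiment: prevalence of 1 in 1,000,000.
print(p_disease_given_positive(1 / 1_000_000, 0.05))  # ~0.00002, i.e. about 0.002%
```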

Base rates are extremely important. Remember that.

Statistical Measures Are NOT Equivalent to the Data

Let’s take a look at the following quantitative data sets (13 of them, to be precise), each of which is visualized as a scatter plot. One is even in the shape of a dinosaur.

Image By Author. Generated using code available under MIT license at https://jumpingrivers.github.io/datasauRus/

Do you see anything interesting about these data sets?

I’ll point you in the right direction. Here is a set of summary statistics for the data:

  • X mean: 54.26
  • Y mean: 47.83
  • X standard deviation (SD): 16.76
  • Y standard deviation (SD): 26.93
  • Correlation between X and Y: -0.06

If you’re wondering why there is only one set of statistics, it’s because they’re all the same. Every single one of the 13 charts above has the same means, standard deviations, and correlation between variables (to two decimal places).

This famous collection of 13 data sets is known as the Datasaurus Dozen [5], published in 2017 as a stark example of why summary statistics cannot always be trusted. It also highlights the value of visualization as a tool for data exploration. In the words of renowned statistician John Tukey,

The greatest value of a picture is when it forces us to notice what we never expected to see.
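If you’d like to verify the identical statistics yourself, a few lines of pandas will do it. The sketch below assumes the Datasaurus Dozen has been exported to a local file named datasaurus.csv (a hypothetical path) with columns dataset, x, and y:

```python
import pandas as pd

# Hypothetical local export of the Datasaurus Dozen: columns dataset, x, y.
df = pd.read_csv("datasaurus.csv")

# Per-data-set means and standard deviations.
summary = df.groupby("dataset").agg(
    x_mean=("x", "mean"),
    y_mean=("y", "mean"),
    x_sd=("x", "std"),
    y_sd=("y", "std"),
)

# Per-data-set correlation between x and y.
summary["xy_corr"] = df.groupby("dataset").apply(lambda g: g["x"].corr(g["y"]))

print(summary.round(2))  # every row comes out (nearly) identical
```

Every row prints the same values to two decimal places, even though the scatter plots look nothing alike.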

Understanding Uncertainty

To conclude, I want to talk about a slight variation on the theme of deceptive data, but one that is equally important: mistrusting data that is actually correct. In other words, seeing deception where there is none.

The following chart is taken from a study analyzing the sentiments of headlines taken from left-leaning, right-leaning, and centrist news outlets [6]:

“Average yearly sentiment of headlines grouped by the ideological leanings of news outlets” by Authors of the study: David Rozado, Ruth Hughes, Jamin Halberstadt is licensed under CC BY 4.0. To view a copy of this license, visit https://creativecommons.org/licenses/by/4.0/?ref=openverse.

There is quite a bit going on in the chart above, but there is one particular aspect I want to draw your attention to: the vertical lines extending from each plotted point. You may have seen these before. Formally, these are called error bars, and they are one way that scientists often depict uncertainty in the data.

Let me say that again. In statistics and data science, “error” is synonymous with “uncertainty.” Crucially, it does not mean something is wrong or incorrect about what is being shown. When a chart depicts uncertainty, it shows a carefully calculated range of plausible values, along with how confident we can be that the true value lies within that range. Unfortunately, many people take it to mean that whoever made the chart is essentially guessing.
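To see what an error bar typically encodes, here is a minimal sketch using simulated headline sentiment scores rather than the study’s actual data. Each bar marks an approximate 95% confidence interval around a yearly mean, computed from the standard error of that mean:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
years = np.arange(2015, 2021)

means, half_widths = [], []
for _ in years:
    # Simulated sentiment scores for one year's worth of headlines.
    sample = rng.normal(loc=-0.1, scale=0.5, size=200)
    # Standard error of the mean; +/- 1.96 SE approximates a 95% confidence interval.
    se = sample.std(ddof=1) / np.sqrt(len(sample))
    means.append(sample.mean())
    half_widths.append(1.96 * se)

# The vertical bars are not mistakes; they show the plausible range for each mean.
plt.errorbar(years, means, yerr=half_widths, fmt="o", capsize=4)
plt.xlabel("Year")
plt.ylabel("Average headline sentiment")
plt.show()
```

The narrower the bar, the more tightly the data pins down the value; a wide bar is an honest statement that the estimate is less certain, not a confession that something went wrong.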

This is a serious error in reasoning, and the damage is twofold: Not only does the data at hand get misinterpreted, but the presence of this misconception also contributes to the dangerous societal belief that science is not to be trusted. Being upfront about the limitations of knowledge should actually increase our confidence in a claim’s reliability, but mistaking that honesty for an admission of foul play has the opposite effect.

Learning how to interpret uncertainty is challenging but incredibly important. At a minimum, a good place to start is understanding what the so-called “error” is actually trying to convey.

Recap and Final Thoughts

Here’s a cheat sheet for being wary of deceptive data:

  • Correlation ≠ causation. Look for the confounding factor.
  • Remember base proportions. The probability of a phenomenon is highly influenced by its prevalence in the population, no matter how accurate your test is (with the exception of 100% accuracy, which is rare).
  • Beware summary statistics. Means and medians will only take you so far; you need to explore your data.
  • Don’t misunderstand uncertainty. It isn’t an error; it’s a carefully considered description of confidence levels.

Remember these, and you’ll be well positioned to tackle the next data science problem that makes its way to you.

Until next time.

References

[1] How Charts Lie, Alberto Cairo

[2] https://pmc.ncbi.nlm.nih.gov/articles/PMC4955674

[3] https://data88s.org/textbook/content/Chapter_02/04_Use_and_Interpretation.html

[4] https://visualizing.jp/the-datasaurus-dozen

[5] https://dl.acm.org/doi/abs/10.1145/3025453.3025912

[6] https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0276367
