{"id":1706,"date":"2020-09-18T21:02:15","date_gmt":"2020-09-18T21:02:15","guid":{"rendered":"https:\/\/data-science.gotoauthority.com\/2020\/09\/18\/what-is-simpsons-paradox-and-how-to-automatically-detect-it\/"},"modified":"2020-09-18T21:02:15","modified_gmt":"2020-09-18T21:02:15","slug":"what-is-simpsons-paradox-and-how-to-automatically-detect-it","status":"publish","type":"post","link":"https:\/\/wealthrevelation.com\/data-science\/2020\/09\/18\/what-is-simpsons-paradox-and-how-to-automatically-detect-it\/","title":{"rendered":"What is Simpson\u2019s Paradox and How to Automatically Detect it"},"content":{"rendered":"<div id=\"post-\">\n<p><b>By <a href=\"https:\/\/www.linkedin.com\/in\/erichart07\" target=\"_blank\" rel=\"noopener noreferrer\">Eric Hart, Ph.D.<\/a> and Mariam Walaa, Altair<\/b>.<\/p>\n<p>When we want to study relationships in data, we can plot, cross-tabulate, or model that data. When we do this, we might come across cases where the relationships we see from two different views of a single dataset lead us to opposing conclusions. These are cases of Simpson\u2019s Paradox.<\/p>\n<p>Finding these cases can help us understand our data better and discover interesting relationships. This article gives some examples of where these cases happen, discusses how and why they happen, and suggests ways to automatically detect these situations in your own data.<\/p>\n<p>\u00a0<\/p>\n<h3>What is Simpson\u2019s Paradox?<\/h3>\n<p>\u00a0<\/p>\n<p>Simpson\u2019s Paradox refers to a situation where you believe you understand the direction of a relationship between two variables, but when you consider an additional variable, that direction appears to reverse.<\/p>\n<p>\u00a0<\/p>\n<h3>Why does Simpson\u2019s Paradox happen?<\/h3>\n<p>\u00a0<\/p>\n<p>Simpson\u2019s Paradox happens because disaggregation of the data (e.g., splitting it into subgroups) can cause certain subgroups to have an imbalanced representation compared to other subgroups. 
This might be due to the relationship between the variables, or simply due to the way that the data has been partitioned into subgroups.<\/p>\n<p><strong>Example #1: Admissions<\/strong><\/p>\n<p>A famous example of Simpson\u2019s Paradox appears in the admissions data for graduate school at UC Berkeley in 1973 [<a href=\"https:\/\/en.wikipedia.org\/wiki\/Simpson's_paradox#UC_Berkeley_gender_bias\" target=\"_blank\" rel=\"noopener noreferrer\" name=\"_ftnref1\">source<\/a>]. In this example, when looking at the graduate admissions data <em>overall,<\/em> it appeared that men were more likely to be admitted than women (gender discrimination!), but when looking at the data <em>for each department individually,<\/em> men were less likely to be admitted than women in most of the departments.<\/p>\n<p><img class=\"aligncenter size-full wp-image-116444\" src=\"https:\/\/www.kdnuggets.com\/wp-content\/uploads\/Fig1-Walaa-simpsons-paradox.jpg\" alt=\"\" width=\"90%\"><\/p>\n<p>Here is an explanation of why this happens:<\/p>\n<ol>\n<li>Different departments had very different acceptance rates (some were much \u201charder\u201d to get into than others)<\/li>\n<li>More females applied to the \u201charder&#8221; departments<\/li>\n<li>Therefore, females had a lower acceptance rate in aggregate<\/li>\n<\/ol>\n<p>This leads us to ask: which view is the correct view? Do men or women have a higher acceptance rate? 
Is there a gender bias in admissions at this university?<\/p>\n<p>In this case, it seems most reasonable to conclude that looking at the admissions rates by department makes more sense, and the disaggregated view is correct.<\/p>\n<p><strong>Example #2: Baseball<\/strong><\/p>\n<p>Another example of Simpson\u2019s Paradox can be found in the batting averages of two famous baseball players, Derek Jeter and David Justice, from 1995 and 1996 [<a href=\"https:\/\/en.wikipedia.org\/wiki\/Simpson's_paradox#Batting_averages\" target=\"_blank\" rel=\"noopener noreferrer\" name=\"_ftnref2\">source<\/a>]. David Justice had a higher batting average in both 1995 and 1996 individually, but Derek Jeter had a higher batting average over the two years combined.<\/p>\n<p><img class=\"aligncenter size-full wp-image-116445\" src=\"https:\/\/www.kdnuggets.com\/wp-content\/uploads\/Fig2-Walaa-simpsons-paradox.jpg\" alt=\"\" width=\"90%\"><\/p>\n<p>Here is an explanation of why this happens:<\/p>\n<ol>\n<li>Both players had significantly higher batting averages in 1996 than in 1995<\/li>\n<li>Derek Jeter had significantly more at-bats in 1996; David Justice had significantly more in 1995<\/li>\n<li>Therefore, Derek Jeter had a higher batting average in aggregate<\/li>\n<\/ol>\n<p><img class=\"aligncenter size-full wp-image-116446\" src=\"https:\/\/www.kdnuggets.com\/wp-content\/uploads\/Fig3-Walaa-simpsons-paradox.jpg\" alt=\"\" width=\"90%\"><\/p>\n<p><em>Figure 1: Knowledge Studio Decision Tree displaying the imbalanced number of at-bats by each player in 1995 and 1996.<\/em><\/p>\n<p>Again, we can ask: which view is the correct view? Was Derek Jeter or David Justice the better hitter? In this case, it seems most reasonable to conclude that the aggregated view is the correct view, and Derek Jeter was the better hitter over the two years.<\/p>\n<p>With these two examples, it\u2019s clear why Simpson\u2019s Paradox can be an issue. 
It\u2019s hard to draw conclusions from data when the data tells us two opposing stories at the same time. One might be tempted to think that the disaggregated view is always better since it contains more information, but it\u2019s possible that disaggregating on an additional variable provides an unnecessary or confusing perspective.<\/p>\n<p>As we see in the examples above, both cases are possible: sometimes the aggregated view is correct, and sometimes the disaggregated view is correct.<\/p>\n<p>\u00a0<\/p>\n<h3>What to do about Simpson\u2019s Paradox<\/h3>\n<p>\u00a0<\/p>\n<p>Without enough domain knowledge, it\u2019s hard to know which view of the relationship between two variables makes more sense \u2013 the one with or without the third variable.<\/p>\n<p>But before we think about how to deal with Simpson\u2019s Paradox, we need to find a way to efficiently detect it in a dataset. As mentioned earlier, it\u2019s possible to find an instance of Simpson\u2019s Paradox (a \u201cSimpson\u2019s Pair\u201d) simply by disaggregating a contingency table or a plot of data points and studying the results. However, there are other ways we can find Simpson\u2019s Pairs using models, e.g.:<\/p>\n<ol>\n<li>By building decision trees and comparing the distributions, or<\/li>\n<li>By building regression models and comparing the signs of the coefficients<\/li>\n<\/ol>\n<p>Both approaches have benefits, but checking for Simpson\u2019s Pairs by hand gets difficult very quickly, especially with big datasets. It\u2019s hard to know which variables in the dataset may reverse the relationship between two other variables, and it can be hard to check all possible pairs of variables manually. Imagine we have a dataset with only 20 variables: we\u2019d need to check almost 400 pairs to be sure to find all cases of Simpson\u2019s Paradox.<\/p>\n<p>There are also further challenges to consider, even if we have searched for (and found) all possible Simpson&#8217;s Pairs. 
These challenges relate to interpretation. For example:<\/p>\n<ul>\n<li>Does the trend need to reverse in every subgroup for something to count as a Simpson\u2019s Pair? Or is a majority of the subgroups enough?<\/li>\n<li>Does the size of the subgroups matter? What if the trend reverses in a lot of small subgroups, but not the largest subgroup?<\/li>\n<\/ul>\n<p>These last challenges don\u2019t disappear when attempting to automatically detect Simpson\u2019s Paradox, but by being forced to make decisions up front, we can at least handle them in a systematic and consistent way.<\/p>\n<p>\u00a0<\/p>\n<h3>Existing Tools for Automatically Detecting Simpson\u2019s Paradox<\/h3>\n<p>\u00a0<\/p>\n<p>Luckily, some tools have already been developed to deal with Simpson\u2019s Paradox in datasets:<\/p>\n<ol>\n<li>An R package, <a href=\"https:\/\/rdrr.io\/cran\/Simpsons\/man\/Simpsons.html\" target=\"_blank\" rel=\"noopener noreferrer\">Simpsons<\/a>, can detect Simpson\u2019s Paradox for continuous data by having the user specify the independent variable, the dependent variable, and the variable by which to disaggregate the data. However, this only works on continuous data and doesn\u2019t check for Simpson\u2019s Paradox in the whole dataset (i.e., you must know where to look in advance, which can be the hard part).<\/li>\n<li>The paper, <a href=\"https:\/\/arxiv.org\/abs\/1801.04385\" target=\"_blank\" rel=\"noopener noreferrer\">Can you Trust the Trend: Discovering Simpson&#8217;s Paradoxes in Social Data<\/a>, discusses an algorithm to identify \u201cSimpson\u2019s Pairs,\u201d and the authors helpfully include code on GitHub. This code only works for datasets with binary dependent variables.<\/li>\n<\/ol>\n<p>\u00a0<\/p>\n<h3>How We Automatically Detect Simpson\u2019s Paradox<\/h3>\n<p>\u00a0<\/p>\n<p>We wrote our own function to automatically find Simpson\u2019s Pairs in a dataset. 
There are two versions: one using Decision Trees (which can currently only be used inside of Altair\u2019s Knowledge Studio software), and one using Regression models, which works in Python and is available for <a href=\"https:\/\/github.com\/ehart-altair\/SimpsonsParadox\" target=\"_blank\" rel=\"noopener noreferrer\">download<\/a>.<\/p>\n<p>\u00a0<\/p>\n<h3>Conclusion<\/h3>\n<p>\u00a0<\/p>\n<p>Simpson&#8217;s Paradox is a tricky issue, but a good analyst or data scientist can handle it with the right tools and knowledge. We hope our new work can help others deal with this issue in an easier and more efficient manner.<\/p>\n<p>\u00a0<\/p>\n<p><strong>Bio: <\/strong><a href=\"https:\/\/www.linkedin.com\/in\/erichart07\" target=\"_blank\" rel=\"noopener noreferrer\">Dr. Eric Hart<\/a> is a Senior Data Scientist on the services team at Altair, and <a href=\"https:\/\/www.linkedin.com\/in\/mariamwalaa\" target=\"_blank\" rel=\"noopener noreferrer\">Mariam Walaa<\/a> is an intern on the services team at Altair and an undergraduate student at the University of Toronto. 
This blog post was written as part of a summer research project at Altair.<\/p>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>https:\/\/www.kdnuggets.com\/2020\/09\/simpsons-paradox.html<\/p>\n","protected":false},"author":0,"featured_media":1707,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[2],"tags":[],"_links":{"self":[{"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/posts\/1706"}],"collection":[{"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/types\/post"}],"replies":[{"embeddable":true,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/comments?post=1706"}],"version-history":[{"count":0,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/posts\/1706\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/media\/1707"}],"wp:attachment":[{"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/media?parent=1706"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/categories?post=1706"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/tags?post=1706"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}