{"id":327,"date":"2020-08-12T15:35:41","date_gmt":"2020-08-12T15:35:41","guid":{"rendered":"https:\/\/data-science.gotoauthority.com\/2020\/08\/12\/introduction-to-statistics-for-data-science\/"},"modified":"2020-08-12T15:35:41","modified_gmt":"2020-08-12T15:35:41","slug":"introduction-to-statistics-for-data-science","status":"publish","type":"post","link":"https:\/\/wealthrevelation.com\/data-science\/2020\/08\/12\/introduction-to-statistics-for-data-science\/","title":{"rendered":"Introduction to Statistics for Data Science"},"content":{"rendered":"<div id=\"post-\">\n<p><b>By <a href=\"https:\/\/www.linkedin.com\/in\/diogomenezesborges\/\" target=\"_blank\" rel=\"noopener noreferrer\">Diogo Menezes Borges<\/a>, Data Engineer<\/b>.<\/p>\n<p>In Statistics, to infer the value of an unknown parameter, we use estimators.\u00a0Estimation is the process used to make inferences, from a sample, about an unknown population parameter.<\/p>\n<p data-selectable-paragraph=\"\">Based on a random sample of a population, a point estimate is the best estimate, although it is not absolutely accurate. 
Furthermore, if you repeatedly draw random samples from the same population, the point estimate is expected to vary from sample to sample.<\/p>\n<p data-selectable-paragraph=\"\">On the other hand, a confidence interval is a range of values constructed so that it captures the true parameter in a specified proportion of repeated samples.<\/p>\n<blockquote>\n<p data-selectable-paragraph=\"\"><em>An estimator is a rule for approximating a population parameter that depends solely on sample information, while a\u00a0<strong>specific value<\/strong>\u00a0it produces is called an estimate.<\/em><\/p>\n<\/blockquote>\n<p data-selectable-paragraph=\"\">As mentioned, there are two types of estimates:<\/p>\n<ul>\n<li>\n<strong>Point Estimates<\/strong>\u2014 a single number.<\/li>\n<li>\n<strong>Confidence Interval Estimates\u00a0<\/strong>\u2014 a range of values around the point estimate.<\/li>\n<\/ul>\n<p data-selectable-paragraph=\"\"><img class=\"aligncenter size-large\" src=\"https:\/\/miro.medium.com\/max\/875\/1*QYZX8IUnuehc_wrVD5iJcA.png\" width=\"90%\"><\/p>\n<p data-selectable-paragraph=\"\">The two are related since the point estimate is in the middle of the confidence interval estimate. However, confidence intervals provide much more information and are preferred when making inferences.<\/p>\n<p>\u00a0<\/p>\n<h3>Point Estimates<\/h3>\n<p>\u00a0<\/p>\n<p data-selectable-paragraph=\"\">We\u2019ve already seen point estimators in earlier posts. For example, the sample mean (<strong>x\u0305<\/strong><strong>)<\/strong><strong>\u00a0<\/strong>is a point estimate of the population\u2019s mean (<strong>\u03bc)<\/strong>. 
The same goes for the sample variance, which is an estimate of the population\u2019s variance.<\/p>\n<p data-selectable-paragraph=\"\"><img class=\"aligncenter size-large\" src=\"https:\/\/miro.medium.com\/max\/760\/1*bDxWkuQ1C9kirxdQAocBdQ.png\" width=\"90%\"><\/p>\n<p data-selectable-paragraph=\"\">All estimators have two properties,\u00a0<strong>efficiency\u00a0<\/strong>and\u00a0<strong>bias<\/strong>:<\/p>\n<ul>\n<li>\n<strong>Bias<\/strong>\u2014 an unbiased estimator has an expected value equal to the population parameter.<\/li>\n<li>\n<strong>Efficiency\u00a0<\/strong>\u2014 the most efficient estimators are the ones with the least variability of outcomes. The most efficient estimator is the unbiased estimator with the smallest variance.<\/li>\n<\/ul>\n<blockquote>\n<p data-selectable-paragraph=\"\"><em>We\u2019re always looking for the most efficient and unbiased estimators.<\/em><\/p>\n<\/blockquote>\n<p>\u00a0<\/p>\n<h3>Confidence Interval Estimates<\/h3>\n<p>\u00a0<\/p>\n<p data-selectable-paragraph=\"\"><strong>Point estimators are not very reliable!<\/strong>\u00a0A confidence interval is a much more accurate representation of reality. However, there are still some uncertainties left. We can never be 100% confident unless we go through the whole population.<\/p>\n<p data-selectable-paragraph=\"\">Imagine you decide to randomly measure 40 men in your city, and you get a sample average height of\u00a0<strong>x\u0305<\/strong>\u00a0=175 cm. You might get close to the population\u2019s real height (<strong>\u03bc)<\/strong>, but the chances are that the true value is somewhere between 170 cm and 180 cm. 
It is more accurate to say that the average height of men in your city lies within a specific interval, [170 cm, 180 cm].<\/p>\n<p data-selectable-paragraph=\"\"><em>Nevertheless, there is still some uncertainty left, which we measure in levels of confidence.<\/em><\/p>\n<p data-selectable-paragraph=\"\">For example, we can say that we\u2019re 95% confident that the average height of men in our city falls somewhere between 170 cm and 180 cm. Keep in mind that you can never say you are 100% confident since for that you would have to go through the entire population (i.e., all men in the city). Therefore, there is still a 5% chance that the population parameter is outside the expected range.<\/p>\n<blockquote>\n<p data-selectable-paragraph=\"\"><em>A\u00a0confidence interval is a range within which you expect the population parameter to be.<\/em><\/p>\n<\/blockquote>\n<blockquote>\n<p data-selectable-paragraph=\"\"><em>The confidence level of the interval is denoted by 1-\u03b1.<\/em><\/p>\n<\/blockquote>\n<p data-selectable-paragraph=\"\">\u03b1 is a value between 0 and 1. Let\u2019s go back to the previous example. If you wish to be 95% confident that the population parameter is inside that interval, \u03b1 must be 5%. 
Hence, if we want a higher level of confidence, for example 99%, then \u03b1 will be 1%.<\/p>\n<p data-selectable-paragraph=\"\"><img class=\"aligncenter size-large\" src=\"https:\/\/miro.medium.com\/max\/848\/1*aEhdnWT221VWJ8dwdzJlQg.png\" width=\"90%\"><\/p>\n<ul>\n<li><em><strong>How to calculate the Confidence Interval?<\/strong><\/em><\/li>\n<\/ul>\n<p data-selectable-paragraph=\"\">There are two situations in which it is possible to calculate the Confidence Interval Estimates:<\/p>\n<p><em><strong>a) When the population variance is known.<\/strong><\/em><\/p>\n<p><strong>b) <\/strong><em><strong>When the population variance is unknown.<\/strong><\/em><\/p>\n<p><em>a) Known Population Variance<\/em><\/p>\n<p data-selectable-paragraph=\"\">An important factor for this calculation is the assumption that we\u2019re dealing with a population that is\u00a0<a href=\"https:\/\/medium.com\/diogo-menezes-borges\/introduction-to-statistics-for-data-science-6c246ed2468d\" target=\"_blank\" rel=\"noopener noreferrer\">Normally Distributed<\/a>. Even if we\u2019re not dealing with a normal distribution, as long as we\u2019re working with a sample that is large enough, we can rely on the\u00a0<a href=\"https:\/\/medium.com\/diogo-menezes-borges\/introduction-to-statistics-for-data-science-a67a3199dcd4\" target=\"_blank\" rel=\"noopener noreferrer\">Central Limit Theorem<\/a>\u00a0to help us out.<\/p>\n<p data-selectable-paragraph=\"\"><img loading=\"lazy\" class=\"aligncenter size-large\" src=\"https:\/\/miro.medium.com\/max\/268\/1*HmLWX8nk1MFuj73g3OO9Ug.png\" width=\"214\" height=\"54\"><\/p>\n<p data-selectable-paragraph=\"\">It is always simpler to understand these concepts with real-life examples.<\/p>\n<p data-selectable-paragraph=\"\">Imagine you wish to become a Data Scientist, and you want to learn how much a Data Scientist earns on average. You go to Glassdoor and start retrieving salary information from several testimonies. 
You become aware that the standard deviation (\u03c3) for a Data Scientist salary is around $15,000. Furthermore, by making use of the CLT, you can assume that the mean of your sample of 30 salaries (n = 30) is approximately normally distributed.<\/p>\n<p data-selectable-paragraph=\"\">Therefore, assuming a normal distribution, we are able to calculate the confidence interval\u00a0<strong>with a known variance\u00a0<\/strong>using the following formula:<\/p>\n<p data-selectable-paragraph=\"\"><img class=\"aligncenter size-large\" src=\"https:\/\/miro.medium.com\/max\/833\/1*688IfTB6T3qSLcZa_dS0BA.png\" width=\"90%\"><\/p>\n<p data-selectable-paragraph=\"\"><strong class=\"jo kw\">x\u0305<\/strong>\u2014 Sample mean, our point estimate, of $100,200<\/p>\n<p data-selectable-paragraph=\"\"><strong>Z \u03b1\/2 <\/strong>\u2014 Reliability factor; we assume a confidence level of 95%, thus \u03b1=5%<\/p>\n<p data-selectable-paragraph=\"\"><strong>\u03c3\/sqrt(n)<\/strong>\u00a0\u2014 Standard Error, 15,000\/sqrt(30) = $2,739<\/p>\n<p data-selectable-paragraph=\"\">To get our reliability factor (<strong>Z \u03b1\/2), <\/strong>we have to make use of the\u00a0<a href=\"https:\/\/www.math.arizona.edu\/~rsims\/ma464\/standardnormaltable.pdf\" target=\"_blank\" rel=\"noopener noreferrer\">z-table<\/a>\u00a0for the standard Normal Distribution.<\/p>\n<p data-selectable-paragraph=\"\">We\u2019ve stated that we\u2019re confident that in 95% of the cases the true population parameter would fall into the specified interval. Hence we must retrieve the reliability factor value of\u00a0<strong>Z 0.05\/2 =&gt; Z 0.025.<\/strong><\/p>\n<p data-selectable-paragraph=\"\">In the table, we look for the cell whose value is 1\u20130.025=0.975. 
The corresponding Z is the sum of the row and column headers associated with that cell.<\/p>\n<p data-selectable-paragraph=\"\"><img class=\"aligncenter size-large\" src=\"https:\/\/miro.medium.com\/max\/854\/1*uciUz3XXUYOLWl8nIvURqQ.png\" width=\"90%\"><\/p>\n<p data-selectable-paragraph=\"\">Therefore, for our practical case, the\u00a0<strong>critical value<\/strong>\u00a0(commonly used term for Z) for this confidence interval is\u00a0<strong>Z 0.025 =\u00a0<\/strong>1.9+0.06 = 1.96.<\/p>\n<p data-selectable-paragraph=\"\">By substituting each component into the formula, we get the following confidence interval.<\/p>\n<p data-selectable-paragraph=\"\"><img class=\"aligncenter size-large\" src=\"https:\/\/miro.medium.com\/max\/830\/1*_eTQp560vmMDl2kanpkFGQ.png\" width=\"90%\"><\/p>\n<blockquote>\n<p data-selectable-paragraph=\"\"><em>Hence, we\u2019re able to state that we\u2019re 95% sure that the average Data Scientist salary will fall into the specified interval of [$94 832, $105 568], that is, $100 200 \u00b1 1.96 \u00d7 $2 739.<\/em><\/p>\n<\/blockquote>\n<p data-selectable-paragraph=\"\">Now, try it out for a confidence interval of 99%. With a critical value of about 2.576, the final result is approximately [$93 145, $107 255].<\/p>\n<p><em>b) Unknown Population Variance<\/em><\/p>\n<p data-selectable-paragraph=\"\">Until now, we\u2019ve seen how to calculate the confidence interval if the population variance is known. What if it\u2019s not? How should we proceed then?\u00a0<strong>The Student\u2019s Distribution is the answer!<\/strong><\/p>\n<ul>\n<li>Student\u2019s T Distribution<\/li>\n<\/ul>\n<p data-selectable-paragraph=\"\">The Student\u2019s T Distribution allows inference from small samples when the population\u2019s variance is unknown. 
It has a shape similar to the normal distribution but with fatter tails, which allow for a higher dispersion of values as there is more uncertainty.<\/p>\n<p data-selectable-paragraph=\"\"><img class=\"aligncenter size-large\" src=\"https:\/\/miro.medium.com\/max\/875\/1*RtDsly8BB1R_1VKmqs_Geg.png\" width=\"90%\"><\/p>\n<blockquote>\n<p data-selectable-paragraph=\"\"><em>Confidence intervals based on small samples, from normally distributed populations, are calculated with the t-statistic.<\/em><\/p>\n<\/blockquote>\n<p data-selectable-paragraph=\"\">Let\u2019s revisit the example we saw before.<\/p>\n<p data-selectable-paragraph=\"\">Again, we look it up in Glassdoor, but this time we only find 9 compensations. We know the sample standard deviation is $13 932 and the sample mean is $92 533, and therefore we can calculate a standard error of $4 644 (13 932\/sqrt(9)). Nevertheless, we are unaware of the population\u2019s variance. Therefore, we\u2019ll use the Student\u2019s T Distribution!<\/p>\n<p data-selectable-paragraph=\"\"><img class=\"aligncenter size-large\" src=\"https:\/\/miro.medium.com\/max\/810\/1*UFuO6rFpV1EhQN4pnFFbTQ.png\" width=\"90%\"><\/p>\n<p data-selectable-paragraph=\"\">Instead of\u00a0<strong>z statistics,<\/strong>\u00a0we have\u00a0<strong>t statistics<\/strong>. Moreover, instead of the population standard deviation, we have the sample standard deviation (s).<\/p>\n<p data-selectable-paragraph=\"\">For the\u00a0<a href=\"http:\/\/math.mit.edu\/~vebrunel\/Additional%20lecture%20notes\/t%20(Student%27s)%20table.pdf\" target=\"_blank\" rel=\"noopener noreferrer\">Student\u2019s T Distribution<\/a>,\u00a0there are n-1 degrees of freedom. Since our sample has n=9 observations, we have 8 degrees of freedom. 
For this example, we\u2019ll maintain our desired confidence level of 95% and therefore\u00a0<strong>\u03b1 = 5%.<\/strong><\/p>\n<p data-selectable-paragraph=\"\"><img class=\"aligncenter size-large\" src=\"https:\/\/miro.medium.com\/max\/731\/1*t2Oyc0iLkCnwjm_ApFqsDw.png\" width=\"90%\"><\/p>\n<p data-selectable-paragraph=\"\">We can conclude that our associated t-statistic is 2.31. Finally, we have all the information needed, so we just need to insert it into the corresponding equation and calculate our confidence interval.<\/p>\n<p data-selectable-paragraph=\"\"><img loading=\"lazy\" class=\"aligncenter size-large\" src=\"https:\/\/miro.medium.com\/max\/401\/1*nOd0n5j1YFt584gVcJMAoQ.png\" width=\"321\" height=\"47\"><\/p>\n<p data-selectable-paragraph=\"\">Therefore, our confidence interval is [$81 806, $103 261]. Comparing our two results, we observe that when we know the population variance, we get a narrower confidence interval. In contrast, when we don\u2019t know the population\u2019s variance, there is a higher uncertainty reflected by wider boundaries for our interval.<\/p>\n<p data-selectable-paragraph=\"\"><img class=\"aligncenter size-large\" src=\"https:\/\/miro.medium.com\/max\/875\/1*-GPC4Vsh-b_mPrQZSR1nSw.png\" width=\"90%\"><\/p>\n<p data-selectable-paragraph=\"\">In conclusion, what we\u2019ve learned is that when we do not know the population\u2019s variance, we can still make predictions, but they will be less accurate.<\/p>\n<p><a href=\"https:\/\/medium.com\/diogo-menezes-borges\/introduction-to-statistics-for-data-science-16a188a400ca\" target=\"_blank\" rel=\"noopener noreferrer\">Original<\/a>. 
Reposted with permission.<\/p>\n<p>\u00a0<\/p>\n<p><b>Related:<\/b><\/p>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>https:\/\/www.kdnuggets.com\/2020\/08\/introduction-statistics-data-science.html<\/p>\n","protected":false},"author":0,"featured_media":328,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[2],"tags":[],"_links":{"self":[{"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/posts\/327"}],"collection":[{"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/types\/post"}],"replies":[{"embeddable":true,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/comments?post=327"}],"version-history":[{"count":0,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/posts\/327\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/media\/328"}],"wp:attachment":[{"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/media?parent=327"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/categories?post=327"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/tags?post=327"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}