{"id":41,"date":"2020-08-04T12:06:00","date_gmt":"2020-08-04T12:06:00","guid":{"rendered":"https:\/\/data-science.gotoauthority.com\/2020\/08\/04\/top-10-ways-your-machine-learning-models-may-have-leakage\/"},"modified":"2020-08-04T12:06:00","modified_gmt":"2020-08-04T12:06:00","slug":"top-10-ways-your-machine-learning-models-may-have-leakage","status":"publish","type":"post","link":"https:\/\/wealthrevelation.com\/data-science\/2020\/08\/04\/top-10-ways-your-machine-learning-models-may-have-leakage\/","title":{"rendered":"Top 10 ways your Machine Learning models may have leakage"},"content":{"rendered":"<div>\n<h3><strong>Rayid Ghani, Joe Walsh, Joan Wang<\/strong><\/h3>\n<p>If you\u2019ve ever worked on a real-world machine learning problem, you\u2019ve probably introduced (and hopefully discovered and fixed) leakage into your system at some point. Leakage is when your model has access to data at training\/building time that it wouldn\u2019t have at test\/deployment\/prediction time. The result is an overoptimistic model that performs much worse when deployed.<\/p>\n<p>The most common forms of leakage come from temporal issues \u2013 including data from the future in your model because it is available when you\u2019re doing model selection \u2013 but there are many other ways leakage gets introduced. Here are the most common ones we\u2019ve found while working on different real-world problems over the last few years. 
Hopefully, people will find this useful, add to it, and more importantly, start creating the equivalent of \u201cunit tests\u201d that can detect them before these systems get deployed (see <a href=\"https:\/\/github.com\/dssg\/randomize_your_data\">initial work<\/a> by Joe Walsh and Joan Wang).<\/p>\n<p><a href=\"http:\/\/www.dssgfellowship.org\/wp-content\/uploads\/2020\/01\/back-to-the-future-trilogy-1122951-1280x0-1.jpg\"><img loading=\"lazy\" class=\"alignnone size-full wp-image-18443\" src=\"http:\/\/www.dssgfellowship.org\/wp-content\/uploads\/2020\/01\/back-to-the-future-trilogy-1122951-1280x0-1.jpg\" alt=\"\" width=\"1280\" height=\"718\"><\/a><\/p>\n<p><strong>The Big (and obvious) One<\/strong><\/p>\n<p>1. Using a <strong>proxy for the outcome<\/strong> variable (label) as a feature. This one is often easy to detect because you get perfect performance, but it is harder to catch when the proxy only approximates the label\/outcome variable and the performance increase is more subtle.<\/p>\n<p><strong>Doing any transformation or inference using the entire dataset<\/strong><\/p>\n<p>2. Using the entire data set for <strong>imputations<\/strong>. Always do imputation based on your training set only, for each training set. Including the test set allows information to leak into your models, especially in cases where the world changes in the future (when does it not?!)<\/p>\n<p>3. Using the entire data set for <strong>discretizations<\/strong> or <strong>normalizations\/scaling<\/strong> or many other data-based transformations. Same reason as #2. The range of a variable (age for example) can change in the future and knowing that will make your models do\/look better than they actually are.<\/p>\n<p>4. Using the entire data set for <strong>feature selection<\/strong>. Same reasons as #2 and #3. 
To play it safe, first split into train and test sets, and then do everything you need to do using only the training set.<\/p>\n<p><strong>Using information from the future (that will not be available at training or prediction time)<\/strong><\/p>\n<p>5. Using (proxies\/transformations of) <strong>future outcomes as features<\/strong>: Similar to #1<\/p>\n<p>6. Doing standard <strong>k-fold cross-validation when you have temporal data<\/strong>. If you have temporal data (that is non-stationary \u2013 again, when is it not?!), k-fold cross-validation will shuffle the data and a training set will (probably) contain data from the future and a test set will (probably) contain data from the past.<\/p>\n<p>7. <strong>Using data (as features) that<\/strong> <strong>happened before model training time but is not available until later<\/strong>. This is fairly common in cases where there is lag\/delay in data collection or access. An event may happen today but not appear in the database until a week, a month, or a year later, and while it will be available in the data set you\u2019re using to build and select ML models, it will not be available at prediction time in deployment.<\/p>\n<p>8. <strong>Using data (as rows) in the training set based on information from the future.<\/strong> Including rows in the training set because they match certain criteria in the future (such as everyone who got a social service in the next 3 months) leaks information to your model via a biased training set.<\/p>\n<p><strong>Humans using knowledge from the future<\/strong><\/p>\n<p>9. Selecting certain models, features, and other <strong>design choices that are based on<\/strong> <strong>humans (ML developers, domain experts) knowing what happened in the future<\/strong>. 
This is a gray area \u2013 we do want to use all of our domain knowledge to build more effective systems, but sometimes that knowledge may not generalize into the future, resulting in overfitted\/overoptimistic models at training time and disappointment once they\u2019re deployed.<\/p>\n<p><strong>10. That\u2019s where you come in. What are your favorite leakage stories or examples?<\/strong><\/p>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>http:\/\/www.dssgfellowship.org\/2020\/01\/23\/top-10-ways-your-machine-learning-models-may-have-leakage\/<\/p>\n","protected":false},"author":0,"featured_media":42,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[2],"tags":[],"_links":{"self":[{"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/posts\/41"}],"collection":[{"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/types\/post"}],"replies":[{"embeddable":true,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/comments?post=41"}],"version-history":[{"count":0,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/posts\/41\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/media\/42"}],"wp:attachment":[{"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/media?parent=41"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/categories?post=41"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/tags?post=41"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}