{"id":3660,"date":"2020-10-12T14:04:46","date_gmt":"2020-10-12T14:04:46","guid":{"rendered":"https:\/\/data-science.gotoauthority.com\/2020\/10\/12\/how-to-be-a-10x-data-scientist\/"},"modified":"2020-10-12T14:04:46","modified_gmt":"2020-10-12T14:04:46","slug":"how-to-be-a-10x-data-scientist","status":"publish","type":"post","link":"https:\/\/wealthrevelation.com\/data-science\/2020\/10\/12\/how-to-be-a-10x-data-scientist\/","title":{"rendered":"How to be a 10x data scientist"},"content":{"rendered":"<div id=\"post-\">\n<p><b>By <a href=\"https:\/\/twitter.com\/daarkecloud\" target=\"_blank\" rel=\"noopener noreferrer\">Daoud Clarke<\/a>, Co-founder of DataPastry<\/b>.<\/p>\n<p><img class=\"aligncenter size-large\" src=\"https:\/\/miro.medium.com\/max\/875\/0*PK6KoNS2RaACnxO-.jpg\" width=\"90%\"><\/p>\n<p>I\u2019m going to tell you what it takes to be a 10x data scientist. What is a 10x data scientist? Someone who runs ten times as many experiments as the average data scientist.<\/p>\n<p data-selectable-paragraph=\"\">Why experiments? Data scientists do other things, too: data munging, analysis, and writing implementations of machine learning algorithms for production. But experiments are what defines a data scientist. That\u2019s where the\u00a0<em>science<\/em>\u00a0is, and it is what distinguishes them from a data analyst or a machine learning engineer.<\/p>\n<p data-selectable-paragraph=\"\">So to be a great data scientist, you have to be great at doing experiments.<\/p>\n<p data-selectable-paragraph=\"\">Why 10 times more experiments? You can never guarantee you\u2019ll get ten times better results by being cleverer or faster. But you can run more experiments. And the more experiments you run, the more likely you are to get better results, the faster you\u2019ll iterate and the faster you\u2019ll learn.<\/p>\n<p data-selectable-paragraph=\"\">Why do you want to be a 10x data scientist? I don\u2019t know. Maybe because it sounds cool. 
Maybe because it\u2019s fun. Maybe because you\u2019ll have more time to eat pastries. That\u2019s up to you.<\/p>\n<p data-selectable-paragraph=\"\">I\u2019m going to assume that you can run experiments correctly. You\u2019re a data scientist, right? Nevertheless, there\u2019s one thing I\u2019ve seen many data scientists get wrong. It\u2019s this:<\/p>\n<p>\u00a0<\/p>\n<h3>1. Measure your uncertainty<\/h3>\n<p>\u00a0<\/p>\n<p data-selectable-paragraph=\"\">What\u2019s the point in improving 5% over the baseline if you don\u2019t know whether the result is statistically significant? Data scientists know (or\u00a0<em>should<\/em>\u00a0know) statistics, but they are often too lazy to apply it to their own work.<\/p>\n<p data-selectable-paragraph=\"\">There is no shortage of options for this. My favourite method is one I learned in my physics degree: estimate the uncertainty as the\u00a0<a href=\"https:\/\/en.wikipedia.org\/wiki\/Standard_error\" target=\"_blank\" rel=\"noopener noreferrer\">standard error in the mean<\/a>. Of course, this means that the value you report has to be the mean of something, whether it\u2019s the mean F1 score on five folds in cross-validation, or whether it\u2019s the mean precision at 10 over rankings for 1,000 different queries.<\/p>\n<p data-selectable-paragraph=\"\">You don\u2019t need to do statistical significance tests between all your results. But you need to have a handle on how uncertain your results are. The standard error in the mean gives you that \u2014 if your results are separated by more than three times the standard error, chances are the difference is significant.<\/p>\n<p data-selectable-paragraph=\"\">You probably also want to consider what\u00a0<a href=\"https:\/\/en.wikipedia.org\/wiki\/Effect_size\" target=\"_blank\" rel=\"noopener noreferrer\">effect size<\/a>\u00a0you\u2019re looking for. 
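<\/p>\n<p>For example, the standard error in the mean can be computed directly from per-fold scores. Here is a minimal sketch in Python, using five made-up cross-validation F1 scores:<\/p>

```python
import math
import statistics

# Hypothetical F1 scores from five cross-validation folds.
fold_scores = [0.81, 0.79, 0.84, 0.80, 0.82]

mean_f1 = statistics.mean(fold_scores)
# Standard error in the mean: sample standard deviation divided by sqrt(n).
sem = statistics.stdev(fold_scores) / math.sqrt(len(fold_scores))

print(f'mean F1 = {mean_f1:.3f} +/- {sem:.3f}')  # mean F1 = 0.812 +/- 0.009
```

<p>The same calculation applies to any per-item metric, such as the precision at 10 over rankings for 1,000 different queries.<\/p>\n<p data-selectable-paragraph=\"\">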
If a 0.1% improvement isn\u2019t useful to you, then there\u2019s no point in running experiments that can detect this sort of change.<\/p>\n<p>\u00a0<\/p>\n<h3>2. Big data is not cool<\/h3>\n<p>\u00a0<\/p>\n<p data-selectable-paragraph=\"\">Big data is slow. You don\u2019t want to be slow. So use small data. Most of the time, you don\u2019t\u00a0need\u00a0big data. If you think you need it, spend a bit of time rethinking to make sure you really do.<\/p>\n<p data-selectable-paragraph=\"\">You want your dataset to be big enough that the uncertainty in your result is\u00a0<em>small<\/em>\u00a0enough to distinguish between differences that you care about. You don\u2019t want it to be any bigger: that\u2019s just a waste of time.<\/p>\n<p data-selectable-paragraph=\"\">You don\u2019t have to use all the data you have available. Depending on your experiment, you may be able to estimate how much data you need. Otherwise, look at how the metric you care about varies with training set size. If it levels off fairly quickly, then you\u2019ll know you can get away with discarding a lot of data. Do more experiments to figure out how much data you need to make the uncertainty low enough for the insights you\u2019re looking for.<\/p>\n<p data-selectable-paragraph=\"\">The number one cause of slow experiments is using too much data. Just don\u2019t do it.<\/p>\n<p>\u00a0<\/p>\n<h3>3. Don\u2019t use big data tools<\/h3>\n<p>\u00a0<\/p>\n<p data-selectable-paragraph=\"\">If you have small data, you don\u2019t need big data tools. Don\u2019t use Spark: it will be horribly slow, and the results will be poor compared to something like Pandas and Scikit-learn. Use those instead.<\/p>\n<p>\u00a0<\/p>\n<h3>4. 
Use a good IDE<\/h3>\n<p>\u00a0<\/p>\n<p data-selectable-paragraph=\"\">Use a decent integrated development environment like\u00a0<a href=\"https:\/\/www.jetbrains.com\/pycharm\/\" target=\"_blank\" rel=\"noopener noreferrer\">PyCharm<\/a>\u00a0\u2014 actually, just use PyCharm, as nothing really compares. Learn how to use it properly.<\/p>\n<p data-selectable-paragraph=\"\">These are the things that I find most useful:<\/p>\n<ul>\n<li>Autocompletion, especially in combination with typed code.<\/li>\n<li>Viewing parameters and documentation for a function or class.<\/li>\n<li>Searching the whole codebase quickly for a file, class, or function.<\/li>\n<li>Refactoring to extract a variable, function, or method, and to inline variables.<\/li>\n<\/ul>\n<p data-selectable-paragraph=\"\">I can\u2019t bear watching people struggle with a text editor for this kind of thing. Please stop.<\/p>\n<p data-selectable-paragraph=\"\">Jupyter notebooks are OK for exploratory work, but if you want to be a 10x data scientist, you need to use an IDE for experiments.<\/p>\n<p>\u00a0<\/p>\n<h3>5. Cache intermediate steps<\/h3>\n<p>\u00a0<\/p>\n<p data-selectable-paragraph=\"\">An experiment can include preprocessing the data, extracting features, selecting features, and so on. Each of these steps takes time to run. The chances are, once you\u2019ve found a good set of features, you will keep them more or less fixed while you experiment with models. If the preprocessing step takes a long time, it makes sense to cache the intermediate steps so that you perform these costly computations just once. This can make a huge difference in how long it takes to run experiments.<\/p>\n<p data-selectable-paragraph=\"\">I will typically do this with one or more preprocessing scripts that generate files to be used by later stages. 
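<\/p>\n<p>Here is a minimal sketch of the cache-or-compute pattern; the file name and the preprocessing function are hypothetical stand-ins:<\/p>

```python
import pickle
from pathlib import Path

# Hypothetical cache file for the expensive preprocessing output.
CACHE = Path('features.pkl')

def expensive_preprocessing():
    # Stand-in for a slow feature-extraction step.
    return {'n_rows': 1000, 'features': ['tfidf', 'length']}

def load_features():
    # Reuse the cached result if the costly step has already run.
    if CACHE.exists():
        return pickle.loads(CACHE.read_bytes())
    features = expensive_preprocessing()
    CACHE.write_bytes(pickle.dumps(features))
    return features
```

<p>Later stages call load_features() and pay the preprocessing cost only on the first run.<\/p>\n<p data-selectable-paragraph=\"\">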
Make sure you record how these files relate to the source data, so that you can trace your experiment results back to the original data, either through file naming conventions or a tool designed for the job such as\u00a0<a href=\"https:\/\/www.pachyderm.io\/\" target=\"_blank\" rel=\"noopener noreferrer\">Pachyderm<\/a>.<\/p>\n<p>\u00a0<\/p>\n<h3>6. Optimise your code<\/h3>\n<p>\u00a0<\/p>\n<p data-selectable-paragraph=\"\">If your experiment is still slow when you\u2019ve reduced your dataset size, then you may benefit from optimising your code. Balance the two by optimising your code while your experiments are running.<\/p>\n<p data-selectable-paragraph=\"\">You should know the basics of how to optimise code. Here they are: use a profiler. The profiler will tell you which bits are slow. Change those bits until they aren\u2019t slow any more. Then run the profiler and find other bits that are slow. Repeat.<\/p>\n<p data-selectable-paragraph=\"\">Run the profiler on a small sample so that you can quickly find out which bits are slow. (You need to optimise the optimising too.)<\/p>\n<p>\u00a0<\/p>\n<h3>7. Keep track of your results<\/h3>\n<p>\u00a0<\/p>\n<p data-selectable-paragraph=\"\">If you lose the results of your experiments, then the effort that went into them is wasted. So keep careful track. Use a tool designed for the job like\u00a0<a href=\"https:\/\/mlflow.org\/\" target=\"_blank\" rel=\"noopener noreferrer\">MLflow<\/a>,\u00a0<a href=\"https:\/\/github.com\/IDSIA\/sacred\" target=\"_blank\" rel=\"noopener noreferrer\">Sacred<\/a>, or my own pet project,\u00a0<a href=\"https:\/\/github.com\/datapastry\/pypastry\" target=\"_blank\" rel=\"noopener noreferrer\">PyPastry<\/a>. If you\u2019re copying results around, then you\u2019re wasting time and likely to make errors. Don\u2019t.<\/p>\n<p data-selectable-paragraph=\"\">If you do all the above things, running an experiment will likely take less than five minutes, ideally less than two. 
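<\/p>\n<p>To illustrate the kind of record such a tool keeps, here is a minimal hand-rolled sketch that appends the parameters and metrics of each run to a JSON-lines file; a dedicated tool does far more, and the file name here is hypothetical:<\/p>

```python
import json
import time
from pathlib import Path

# Hypothetical log file: one JSON object per experiment run.
RESULTS_LOG = Path('results.jsonl')

def record_run(params, metrics):
    # Append parameters, metrics, and a timestamp so no result is lost.
    entry = {'time': time.time(), 'params': params, 'metrics': metrics}
    with RESULTS_LOG.open('a') as f:
        f.write(json.dumps(entry) + '\n')

record_run({'model': 'logistic_regression', 'C': 1.0}, {'f1': 0.81})
```

<p>However you record your runs, keep every one; even a two-minute experiment is worth a line in the log.<\/p>\n<p data-selectable-paragraph=\"\">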
That\u2019s long enough to think about what the next experiment will be.<\/p>\n<p data-selectable-paragraph=\"\">This means you can potentially run hundreds of experiments in a day. When you\u2019re running that many experiments, you need a good way to keep track.<\/p>\n<p>\u00a0<\/p>\n<h3>7a. Eat lots of pastries<\/h3>\n<p>\u00a0<\/p>\n<p data-selectable-paragraph=\"\">This one isn\u2019t actually good advice. Your brain needs up to 400 calories\u2019 worth of glucose per day, but eating pastries may not be the healthiest way to get it. It would be tasty, though.<\/p>\n<p data-selectable-paragraph=\"\">Instead, you could consider contacting\u00a0<a href=\"https:\/\/datapastry.com\/\" target=\"_blank\" rel=\"noopener noreferrer\">DataPastry<\/a>, the data science consultancy I run with my co-founder Iskander. If you\u2019d like any advice or need help with a data science project,\u00a0<a href=\"mailto:hello@datapastry.com\" target=\"_blank\" rel=\"noopener noreferrer\">we\u2019d love to hear from you<\/a>, and we don\u2019t bite (except for pastries).<\/p>\n<p>\u00a0<\/p>\n<h3>Conclusion<\/h3>\n<p>\u00a0<\/p>\n<p data-selectable-paragraph=\"\">If you do all the above things, then I\u2019m pretty sure you will run at least ten times as many experiments as the data scientist sitting next to you (unless you\u2019re sitting next to me). Most data scientists don\u2019t do any of them. So if you do them all, you\u2019ll probably run fifty times more experiments. Does this make you fifty times more valuable? I don\u2019t know, but it can\u2019t hurt. And you\u2019ll have more time to eat pastries.<\/p>\n<p><a href=\"https:\/\/towardsdatascience.com\/how-to-be-a-10x-data-scientist-4718accf7d3f\" target=\"_blank\" rel=\"noopener noreferrer\">Original<\/a>. 
Reposted with permission.<\/p>\n<p>\u00a0<\/p>\n<p><b>Related:<\/b><\/p>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>https:\/\/www.kdnuggets.com\/2020\/10\/10x-data-scientist.html<\/p>\n","protected":false},"author":0,"featured_media":3661,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[2],"tags":[],"_links":{"self":[{"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/posts\/3660"}],"collection":[{"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/types\/post"}],"replies":[{"embeddable":true,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/comments?post=3660"}],"version-history":[{"count":0,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/posts\/3660\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/media\/3661"}],"wp:attachment":[{"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/media?parent=3660"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/categories?post=3660"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/tags?post=3660"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}