{"id":3104,"date":"2020-10-08T15:56:09","date_gmt":"2020-10-08T15:56:09","guid":{"rendered":"https:\/\/data-science.gotoauthority.com\/2020\/10\/08\/6-lessons-learned-in-6-months-as-a-data-scientist\/"},"modified":"2020-10-08T15:56:09","modified_gmt":"2020-10-08T15:56:09","slug":"6-lessons-learned-in-6-months-as-a-data-scientist","status":"publish","type":"post","link":"https:\/\/wealthrevelation.com\/data-science\/2020\/10\/08\/6-lessons-learned-in-6-months-as-a-data-scientist\/","title":{"rendered":"6 Lessons Learned in 6 Months as a Data Scientist"},"content":{"rendered":"<div id=\"post-\">\n<p><b>By <a href=\"https:\/\/twitter.com\/Nicole_Janeway\" target=\"_blank\" rel=\"noopener noreferrer\">Nicole Janeway Bills<\/a>, Data Scientist at Atlas Research<\/b>.<\/p>\n<p><img class=\"aligncenter size-large\" src=\"https:\/\/miro.medium.com\/max\/1250\/1*hIQySvx1TQUxXFabIJ4vgQ.jpeg\" width=\"90%\"><\/p>\n<p><em>Photo by\u00a0<a href=\"https:\/\/unsplash.com\/photos\/8AsKha7aIvk\" target=\"_blank\" rel=\"noopener noreferrer\">Artem Beliaikin<\/a>\u00a0on\u00a0Unsplash.<\/em><\/p>\n<p data-selectable-paragraph=\"\">Since my title flipped from consultant to data scientist six months ago, I\u2019ve experienced a higher level of job satisfaction than I would have thought possible. To celebrate my first half year in this engaging field, here are six lessons I\u2019ve collected along the way.<\/p>\n<p>\u00a0<\/p>\n<h3>#1 \u2014 Read the arXiv paper<\/h3>\n<p>\u00a0<\/p>\n<p data-selectable-paragraph=\"\">Probably you\u2019re aware that reviewing arXiv is a good idea. It\u2019s a wellspring of remarkable ideas and state-of-the-art advancements.<\/p>\n<p data-selectable-paragraph=\"\">I\u2019ve been pleasantly surprised, though, by the amount of actionable advice I come across on the platform. 
For example, I might not have access to\u00a0<a href=\"https:\/\/syncedreview.com\/2019\/06\/27\/the-staggering-cost-of-training-sota-ai-models\/\" target=\"_blank\" rel=\"noopener noreferrer\">16 TPUs and $7k to train BERT from scratch<\/a>, but the hyperparameter settings recommended by the paper\u2019s authors at Google AI Language are a great place to start fine-tuning (<a href=\"https:\/\/arxiv.org\/abs\/1810.04805\" target=\"_blank\" rel=\"noopener noreferrer\">check Appendix A.3<\/a>).<\/p>\n<p data-selectable-paragraph=\"\">With luck, your favorite new package will have an enlightening read on arXiv to add color to its documentation. For example, I learned to deploy BERT using the supremely readable and abundantly useful\u00a0<a href=\"https:\/\/arxiv.org\/abs\/2004.10703\" target=\"_blank\" rel=\"noopener noreferrer\">write-up on ktrain<\/a>, a library that sits atop Keras and provides a streamlined machine learning interface for text, image, and graph applications.<\/p>\n<p>\u00a0<\/p>\n<h3>#2 \u2014 Listen to podcasts for tremendous situational awareness<\/h3>\n<p>\u00a0<\/p>\n<p data-selectable-paragraph=\"\">Podcasts won\u2019t improve your coding skills, but they will improve your understanding of recent developments in machine learning, popular packages and tools, unanswered questions in the field, new approaches to old problems, and the underlying psychological insecurities common across the profession.<\/p>\n<p data-selectable-paragraph=\"\">The podcasts I listen to day to day have helped me feel engaged and up to date on fast-moving developments in data science.<\/p>\n<p data-selectable-paragraph=\"\">Here are my favorite podcasts right now:\u00a0<a href=\"https:\/\/towardsdatascience.com\/supercharge-data-science-5bb7376d8572\" target=\"_blank\" rel=\"noopener noreferrer\">Resources to Supercharge your Data Science Learning in 2020<\/a><\/p>\n<p data-selectable-paragraph=\"\">Recently I\u2019ve been particularly excited to learn about\u00a0<a 
href=\"https:\/\/dataskeptic.com\/blog\/journalclub\/2020\/dark-secrets-of-bert-radioactive-data-and-vanishing-gradients\" target=\"_blank\" rel=\"noopener noreferrer\">advancements in NLP<\/a>, follow the\u00a0<a href=\"https:\/\/soundcloud.com\/theaipodcast\/ai-jonah-alben\" target=\"_blank\" rel=\"noopener noreferrer\">latest developments in GPUs<\/a>\u00a0and\u00a0<a href=\"https:\/\/www.thecloudcast.net\/2020\/07\/2020-in-review-midyear-edition.html\" target=\"_blank\" rel=\"noopener noreferrer\">cloud computing<\/a>, and question the\u00a0<a href=\"https:\/\/braininspired.co\/podcast\/79\/\" target=\"_blank\" rel=\"noopener noreferrer\">potential symbiosis<\/a>\u00a0between advancements in artificial neural nets and neurobiology.<\/p>\n<p>\u00a0<\/p>\n<h3>#3 \u2014 Read GitHub Issues<\/h3>\n<p>\u00a0<\/p>\n<p data-selectable-paragraph=\"\">Based on my experience trawling this ocean of complaints for giant tuna of wisdom, here are three potential wins:<\/p>\n<ol>\n<li>I often get ideas from the ways others are using and\/or misusing a package.<\/li>\n<li>It\u2019s also useful to understand in what kinds of situations a package will tend to break in order to develop your sense of potential failure points in your own work.<\/li>\n<li>As you\u2019re in your pre-work phase of setting up your environment and conducting\u00a0<a href=\"https:\/\/medium.com\/atlas-research\/model-selection-d190fb8bbdda\" target=\"_blank\" rel=\"noopener noreferrer\">model selection<\/a>, you\u2019d do well to take the responsiveness of developers and the community into account before adding an open source tool into your pipeline.<\/li>\n<\/ol>\n<p>\u00a0<\/p>\n<h3>#4 \u2014 Understand the algorithm-hardware link<\/h3>\n<p>\u00a0<\/p>\n<p data-selectable-paragraph=\"\">I\u2019ve done a lot of NLP in the last six months, so let\u2019s talk about BERT again.<\/p>\n<p data-selectable-paragraph=\"\">In October 2018,\u00a0<a href=\"https:\/\/arxiv.org\/abs\/1810.04805\" target=\"_blank\" 
rel=\"noopener noreferrer\">BERT<\/a>\u00a0emerged and shook the world, kind of like Superman after leaping a tall building in a single bound (<em>crazy to think Superman couldn\u2019t fly when originally introduced!<\/em>).<\/p>\n<p data-selectable-paragraph=\"\">BERT represented a step change in the capacity of machine learning to tackle text processing tasks. Its state-of-the-art results are rooted in the parallelism of its\u00a0<a href=\"http:\/\/jalammar.github.io\/illustrated-transformer\/\" target=\"_blank\" rel=\"noopener noreferrer\">transformer architecture<\/a>\u00a0running on\u00a0<a href=\"https:\/\/www.youtube.com\/watch?v=MXxN4fv01c8\" target=\"_blank\" rel=\"noopener noreferrer\">Google\u2019s TPU chips<\/a>.<\/p>\n<p><img class=\"aligncenter size-large\" src=\"https:\/\/media.giphy.com\/media\/10bKPDUM5H7m7u\/giphy.gif\" width=\"50%\"><\/p>\n<p><em>The feeling of training on GPUs for the first time. Via\u00a0<a href=\"https:\/\/giphy.com\/gifs\/superman-vintage-cartoon-10bKPDUM5H7m7u\/links\" target=\"_blank\" rel=\"noopener noreferrer\">GIPHY<\/a>.<\/em><\/p>\n<p>Understanding the implications of TPU- and\u00a0<a href=\"https:\/\/www.youtube.com\/watch?v=6eBpjEdgSm0\" target=\"_blank\" rel=\"noopener noreferrer\">GPU-based machine learning<\/a>\u00a0is important for advancing your own capabilities as a data scientist. 
It is also a critical step toward sharpening your intuition about the inextricable link between\u00a0<a href=\"https:\/\/medium.com\/@karpathy\/software-2-0-a64152b37c35\" target=\"_blank\" rel=\"noopener noreferrer\">machine learning software<\/a>\u00a0and the physical constraints of the hardware on which it runs.<\/p>\n<p data-selectable-paragraph=\"\">With Moore\u2019s law petering out around 2010, increasingly creative approaches will be needed to overcome the limitations in the data science field and continue to make progress toward truly intelligent systems.<\/p>\n<p><img class=\"aligncenter size-large\" src=\"https:\/\/miro.medium.com\/max\/875\/1*0C432a1dvX9uXa1SUsW3rQ.png\" width=\"90%\"><\/p>\n<p><em>Chart from\u00a0<a href=\"https:\/\/youtu.be\/EBCtwWbbamw\" target=\"_blank\" rel=\"noopener noreferrer\">Nvidia presentation<\/a>\u00a0showing transistors per square millimeter by year. This highlights the stagnation in transistor count around 2010 and the rise of GPU-based computing.<\/em><\/p>\n<p>I\u2019m bullish on the rise of\u00a0<a href=\"https:\/\/twimlai.com\/twiml-talk-391-the-case-for-hardware-ml-model-co-designwith-diana-marculescu\/\" target=\"_blank\" rel=\"noopener noreferrer\">ML model-computing hardware co-design<\/a>, increased reliance on\u00a0<a href=\"https:\/\/arxiv.org\/abs\/2002.00585\" target=\"_blank\" rel=\"noopener noreferrer\">sparsity and pruning<\/a>, and even\u00a0<a href=\"https:\/\/www.youtube.com\/watch?v=zmbCZhlN1xk\" target=\"_blank\" rel=\"noopener noreferrer\">\u201cno-specialized hardware\u201d machine learning<\/a>\u00a0that looks to disrupt the dominance of the current GPU-centric paradigm.<\/p>\n<p>\u00a0<\/p>\n<h3>#5 \u2014 Learn from the Social Sciences<\/h3>\n<p>\u00a0<\/p>\n<p data-selectable-paragraph=\"\">There\u2019s a lot our young field can learn from the reproducibility crisis in the Social Sciences that took place in the mid-2010s (and which, to some extent, is still taking place):<\/p>\n<p><img 
class=\"aligncenter size-large\" src=\"https:\/\/miro.medium.com\/max\/625\/1*gejeBAG1aRDFWOL8Uxz3dg.png\" width=\"90%\"><\/p>\n<p><em>\u201cp-value hacking\u201d for data scientists.\u00a0<a href=\"https:\/\/xkcd.com\/1838\/\" target=\"_blank\" rel=\"noopener noreferrer\">Comic by Randall Munroe of xkcd<\/a>.<\/em><\/p>\n<p>Starting in 2011, a\u00a0<a href=\"https:\/\/osf.io\/ezcuj\/wiki\/home\/\" target=\"_blank\" rel=\"noopener noreferrer\">crowdsourced academic collaboration<\/a>\u00a0set out to reproduce 100 published experimental and correlational psychology studies. Most of the replications failed: just 36% reported statistically significant results, compared to 97% of the originals.<\/p>\n<p data-selectable-paragraph=\"\">Psychology\u2019s reproducibility crisis reveals the danger, and the responsibility, that come with attaching the word \u201cscience\u201d to shaky methodology.<\/p>\n<p data-selectable-paragraph=\"\">Data science needs testable, reproducible approaches to its problems. To eliminate p-hacking, data scientists need to set limits on how they investigate their data for predictive features and on the number of tests they run to evaluate metrics.<\/p>\n<p data-selectable-paragraph=\"\">There are many tools that can help with experiment management. 
I have experience with\u00a0<a href=\"https:\/\/mlflow.org\/\" target=\"_blank\" rel=\"noopener noreferrer\">MLflow<\/a>;\u00a0<a href=\"https:\/\/towardsdatascience.com\/the-most-useful-ml-tools-2020-e41b54061c58\" target=\"_blank\" rel=\"noopener noreferrer\">this excellent article<\/a>\u00a0by\u00a0<a href=\"https:\/\/medium.com\/u\/a0eb4622a0ca?source=post_page-----e875e69aab0a--------------------------------\" target=\"_blank\" rel=\"noopener noreferrer\">Ian Xiao<\/a>\u00a0mentions six others, as well as suggestions across four other areas of the machine learning workflow.<\/p>\n<p data-selectable-paragraph=\"\">We can also draw many lessons from the data science field\u2019s own fair share of missteps and algorithmic malpractice in recent years.<\/p>\n<p data-selectable-paragraph=\"\">For examples, look no further than recommendation engines that double as social engineering, discriminatory credit algorithms, and criminal justice systems that entrench the status quo.\u00a0<a href=\"https:\/\/medium.com\/atlas-research\/model-selection-d190fb8bbdda\" target=\"_blank\" rel=\"noopener noreferrer\">I\u2019ve written a bit about these social ills and how to avoid them with effective human-centered design<\/a>.<\/p>\n<p data-selectable-paragraph=\"\">The good news is that there are many intelligent and driven practitioners working to address these challenges and prevent future breaches of public trust. Check out\u00a0<a href=\"https:\/\/ai.google\/responsibilities\/responsible-ai-practices\/\" target=\"_blank\" rel=\"noopener noreferrer\">Google\u2019s PAIR<\/a>,\u00a0<a href=\"https:\/\/github.com\/columbia\/fairtest\" target=\"_blank\" rel=\"noopener noreferrer\">Columbia\u2019s FairTest<\/a>, and\u00a0<a href=\"https:\/\/www.ibm.com\/blogs\/research\/2019\/08\/ai-explainability-360\/\" target=\"_blank\" rel=\"noopener noreferrer\">IBM\u2019s AI Explainability 360<\/a>. 
Collaborations with social science researchers can yield fruitful results, such as this project on\u00a0<a href=\"https:\/\/www.pnas.org\/content\/early\/2020\/07\/27\/1912790117\" target=\"_blank\" rel=\"noopener noreferrer\">algorithms to audit for discrimination<\/a>.<\/p>\n<p data-selectable-paragraph=\"\">Of course, there are many other things we can learn from the social sciences, such as how to give an effective presentation.<\/p>\n<p data-selectable-paragraph=\"\">It\u2019s crucial to study the social sciences to understand where human intuition about data inference is likely to fail. Humans are very good at drawing conclusions from data in certain situations, but the ways our reasoning breaks down are highly systematic and predictable.<\/p>\n<p data-selectable-paragraph=\"\">Much of what we understand about this aspect of human psychology is captured in Daniel Kahneman\u2019s excellent\u00a0<em>Thinking, Fast and Slow<\/em>. This book should be required reading for anyone interested in decision sciences.<\/p>\n<p data-selectable-paragraph=\"\">One element of Kahneman\u2019s research that\u2019s likely to be immediately relevant to your work is his treatment of the anchoring effect, which \u201coccurs when people consider a particular value for an unknown quantity.\u201d<\/p>\n<p data-selectable-paragraph=\"\">When communicating results from modeling (e.g., numbers representing accuracy, precision, recall, or F1), data scientists need to take special care to manage expectations. It can be useful to signal a degree of hand-waviness on a scale from \u201cwe are still hacking away at this problem, and these metrics are likely to change\u201d to \u201cthis is the final product, and this is about how we expect our ML solution to perform in the wild.\u201d<\/p>\n<p data-selectable-paragraph=\"\">If you\u2019re presenting intermediate results, Kahneman would recommend providing a range of values for each metric, rather than specific digits. 
For example, \u201cThe F1 score, which is the harmonic mean of the precision and recall shown in this table, falls roughly between 80% and 85%. This indicates some room for improvement.\u201d This \u201chand-wavy\u201d communication strategy decreases the risk that the audience will\u00a0<em>anchor<\/em>\u00a0on the specific value you\u2019re sharing and helps them take away a directionally correct message about the results.<\/p>\n<p>\u00a0<\/p>\n<h3>#6 \u2014 Connect data to business outcomes<\/h3>\n<p>\u00a0<\/p>\n<p data-selectable-paragraph=\"\">Before you start work, make sure that the problem you\u2019re solving is worth solving.<\/p>\n<p data-selectable-paragraph=\"\">Your organization isn\u2019t paying you to build a model with 90% accuracy, write a report, piddle around in Jupyter Notebook, or even enlighten yourself and others on the\u00a0<a href=\"https:\/\/towardsdatascience.com\/you-should-really-learn-about-graph-databases-heres-why-d03c9d706a3\" target=\"_blank\" rel=\"noopener noreferrer\">quasi-magical properties of graph databases<\/a>.<\/p>\n<p data-selectable-paragraph=\"\">You\u2019re there to connect data to business outcomes.<\/p>\n<p><a href=\"https:\/\/towardsdatascience.com\/6-months-data-science-e875e69aab0a\" target=\"_blank\" rel=\"noopener noreferrer\">Original<\/a>. 
Reposted with permission.<\/p>\n<p>\u00a0<\/p>\n<p><strong>Bio:<\/strong>\u00a0<a href=\"https:\/\/www.linkedin.com\/in\/nicole-janeway-bills\/\" target=\"_blank\" rel=\"noopener noreferrer\">Nicole Janeway Bills<\/a>\u00a0is a m<span class=\"lt-line-clamp__line\">achine learning engineer with experience in commercial consulting with proficiency in Python, SQL, and Tableau, as well as business<\/span>\u00a0<span class=\"lt-line-clamp__line\">experience in natural language processing (NLP), cloud computing, statistical testing, pricing analysis, and ETL processes.<\/span>\u00a0Nicole<span class=\"lt-line-clamp__line lt-line-clamp__line--last\">\u00a0focuses on connecting data with business outcomes and continues to develop personal technical skillsets.<\/span><\/p>\n<p><b>Related:<\/b><\/p>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>https:\/\/www.kdnuggets.com\/2020\/10\/6-lessons-6-months-data-scientist.html<\/p>\n","protected":false},"author":0,"featured_media":3105,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[2],"tags":[],"_links":{"self":[{"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/posts\/3104"}],"collection":[{"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/types\/post"}],"replies":[{"embeddable":true,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/comments?post=3104"}],"version-history":[{"count":0,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/posts\/3104\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/media\/3105"}],"wp:attachment":[{"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/media?parent=3104"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/wealthrevelation.com\/
data-science\/wp-json\/wp\/v2\/categories?post=3104"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/tags?post=3104"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}