{"id":8255,"date":"2021-05-10T03:16:59","date_gmt":"2021-05-10T03:16:59","guid":{"rendered":"https:\/\/wealthrevelation.com\/data-science\/2021\/05\/10\/we-dont-need-data-engineers-we-need-better-tools-for-data-scientists\/"},"modified":"2021-05-10T03:16:59","modified_gmt":"2021-05-10T03:16:59","slug":"we-dont-need-data-engineers-we-need-better-tools-for-data-scientists","status":"publish","type":"post","link":"https:\/\/wealthrevelation.com\/data-science\/2021\/05\/10\/we-dont-need-data-engineers-we-need-better-tools-for-data-scientists\/","title":{"rendered":"We Don\u2019t Need Data Engineers, We Need Better Tools for Data Scientists"},"content":{"rendered":"<div id=\"post-\">\n   <!-- post_author Devin Petersohn -->  <\/p>\n<p><b>By <a href=\"https:\/\/www.linkedin.com\/in\/devinpetersohn\/\" target=\"_blank\" rel=\"noopener\">Devin Petersohn<\/a>, PhD Student at UC Berkeley<\/b>.<\/p>\n<p><img class=\"aligncenter size-full wp-image-126809\" src=\"https:\/\/www.kdnuggets.com\/wp-content\/uploads\/dont-need-data-engineers-need-better-tools-data-scientists.jpeg\" alt=\"Tools\" width=\"90%\"><\/p>\n<p>In most companies, Data Engineers support Data Scientists in various ways. 
Often this means translating or productionizing the notebooks and scripts that a Data Scientist has written.\u00a0<strong>A large portion of the Data Engineer\u2019s role could be replaced with better tooling for Data Scientists, freeing Data Engineers to do more impactful (and scalable) work.<\/strong><\/p>\n<p>\u00a0<\/p>\n<h3>Why does this matter?<\/h3>\n<p>\u00a0<\/p>\n<p data-selectable-paragraph=\"\">There\u2019s a sentiment making its way around the internet (again): We don\u2019t need Data Scientists, we need Data Engineers.<\/p>\n<p><img class=\"aligncenter size-large\" src=\"https:\/\/miro.medium.com\/max\/678\/1*5pPpe67NMRH_DcY3os6aJQ.png\" width=\"90%\"><\/p>\n<p><em>(<a href=\"https:\/\/www.mihaileric.com\/posts\/we-need-data-engineers-not-data-scientists\/\" target=\"_blank\" rel=\"noopener\">source<\/a>)<\/em><\/p>\n<p><img class=\"aligncenter size-large\" src=\"https:\/\/miro.medium.com\/max\/574\/1*Si2VrDpzovUx0QXWAJ_hOQ.png\" width=\"90%\"><\/p>\n<p><em>(<a href=\"https:\/\/www.forbes.com\/sites\/forbestechcouncil\/2019\/02\/04\/why-there-will-be-no-data-science-job-titles-by-2029\/?sh=31ad8ac93a8f\" target=\"_blank\" rel=\"noopener\">source<\/a>)<\/em><\/p>\n<p data-selectable-paragraph=\"\">These articles focus on the\u00a0number of available job positions for the title of \u201cData Engineer\u201d vs. \u201cData Scientist.\u201d Let\u2019s put aside the fact that the hiring managers who post these positions often don\u2019t know the difference between the two jobs and use them interchangeably (or use whatever is in style at the moment). For the sake of this article, we can take the existence of the positions at face value. 
The question then becomes:\u00a0<strong>Is the surplus of available Data Engineer positions solely a personnel problem?<\/strong><\/p>\n<p>\u00a0<\/p>\n<h3>Data Science is messy because it reflects the real world<\/h3>\n<p>\u00a0<\/p>\n<p data-selectable-paragraph=\"\">Data Scientists are domain experts (on top of knowing statistics), and they don\u2019t often have a strong background in programming. I\u2019ve seen this expertise discounted in multiple Twitter and forum threads, with software engineers and other \u201ctechnical people\u201d asking questions like \u201cWhy don\u2019t they just learn Spark?\u201d This type of mentality completely misses the fact that Data Scientists can already do what they want to do at smaller scales with their existing tools. Data Scientists want to gain insights, not worry about building elegant pipelines. Companies want something actionable, not beautiful.<\/p>\n<blockquote>\n<p data-selectable-paragraph=\"\"><em>Insights are more important than elegant pipelines.<\/em><\/p>\n<\/blockquote>\n<p data-selectable-paragraph=\"\">Popular Data Science tools are also criticized by more technical people and academics: \u201cWhy would anyone use pandas?\u201d\u00a0<strong>pandas must be the most popular tool to hate by people who have no use for it<\/strong>. It is loved (or at least appreciated) by the Data Scientists who use it daily, however.<\/p>\n<p data-selectable-paragraph=\"\">pandas, among other tools, was built to handle the\u00a0<strong>messiness<\/strong>\u00a0of the real world. 
Just look at how many parameters\u00a0<em>read_csv<\/em>\u00a0has:<\/p>\n<p><img class=\"aligncenter size-large\" src=\"https:\/\/miro.medium.com\/max\/875\/1*2THguNqJ4jKuvCTe0aCDaQ.png\" width=\"90%\"><\/p>\n<p><em><a href=\"https:\/\/pandas.pydata.org\/docs\/reference\/api\/pandas.read_csv.html\" target=\"_blank\" rel=\"noopener\">(read_csv reference)<\/a><\/em><\/p>\n<p>If pandas is so bad, why has nothing unseated it as the standard dataframe for Python Data Science? Why does it continue to grow in adoption year after year? It\u2019s not the fastest, it\u2019s not the most robust, so why?<\/p>\n<p>\u00a0<\/p>\n<h3>Data Engineers have to handle the messiness that scalable tools can\u2019t<\/h3>\n<p>\u00a0<\/p>\n<p data-selectable-paragraph=\"\">The scalable systems (e.g., Apache Spark) that are robust enough for production use can\u2019t handle the messiness of the real world as-is. It\u2019s difficult to scale without clean and simple assumptions, and the messier the problem, the harder it is to scale. Data Engineers handle the messiness because scalable tools can\u2019t.<\/p>\n<p data-selectable-paragraph=\"\">Messiness, in this case, can mean:<\/p>\n<ul>\n<li>Group\/Join Key Skew<\/li>\n<li>Partitioning<\/li>\n<li>Debugging Distributed Systems<\/li>\n<li>Cluster configuration and resource provisioning<\/li>\n<\/ul>\n<p data-selectable-paragraph=\"\">None of these are things that you have to worry about with smaller-scale systems. Outside of the Bay Area, most Data Engineers spend time debugging and translating to a distributed system, usually Spark.<\/p>\n<p data-selectable-paragraph=\"\"><img class=\"aligncenter size-large\" src=\"https:\/\/miro.medium.com\/max\/1250\/0*bkp0lTAz2UAmYpJ9.png\" width=\"90%\"><\/p>\n<p><em>Multiple rewrites are necessary to turn one-time insights into production jobs.<\/em><\/p>\n<p data-selectable-paragraph=\"\">We can\u2019t really fault anyone here. 
The people who built the scalable tools in use today were building for highly technical users like themselves. Highly technical people don\u2019t need their tooling to handle messiness for them, and often they want knobs to tune.\u00a0<a href=\"https:\/\/en.wikipedia.org\/wiki\/Eating_your_own_dog_food\" target=\"_blank\" rel=\"noopener\">Dogfooding<\/a>\u00a0is a popular concept in system engineering: \u201cthose that built it also use it.\u201d I think worrying so heavily about dogfooding can in part cause the landscape we are seeing in data science today: \u201conly people as technical as those that built the system<strong>\u00a0can<\/strong>\u00a0use it.\u201d<\/p>\n<p>\u00a0<\/p>\n<h3>What, then, should Data Engineers do?<\/h3>\n<p>\u00a0<\/p>\n<p data-selectable-paragraph=\"\">The Data Science ecosystem needs systems that don\u2019t\u00a0<strong>only<\/strong>\u00a0focus on the problems of those building it. Data Scientists have been mostly stuck using the same or similar tools for the last 10+ years. The explanation for this is twofold: (1) Data Scientists love using their existing tools because they understand them, and (2) those who are capable of building large-scale systems have largely (unintentionally) overlooked the problems of those less technical than they.<\/p>\n<p data-selectable-paragraph=\"\">We need Data Engineers to help build scalable tools that empower Data Scientists, not translate pandas to Spark. Who better to help build the next generation of Data Science tools than today\u2019s Data Engineers?<\/p>\n<p><a href=\"https:\/\/towardsdatascience.com\/we-dont-need-data-engineers-we-need-better-tools-for-data-scientists-84a06e6f3f7f\" target=\"_blank\" rel=\"noopener\">Original<\/a>. 
Reposted with permission.<\/p>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>https:\/\/www.kdnuggets.com\/2021\/05\/dont-need-data-engineers-need-better-tools-data-scientists.html<\/p>\n","protected":false},"author":0,"featured_media":8256,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[2],"tags":[],"_links":{"self":[{"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/posts\/8255"}],"collection":[{"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/types\/post"}],"replies":[{"embeddable":true,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/comments?post=8255"}],"version-history":[{"count":0,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/posts\/8255\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/media\/8256"}],"wp:attachment":[{"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/media?parent=8255"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/categories?post=8255"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/tags?post=8255"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}