{"id":1671,"date":"2020-09-17T14:53:57","date_gmt":"2020-09-17T14:53:57","guid":{"rendered":"https:\/\/data-science.gotoauthority.com\/2020\/09\/17\/unpopular-opinion-data-scientists-should-be-more-end-to-end\/"},"modified":"2020-09-17T14:53:57","modified_gmt":"2020-09-17T14:53:57","slug":"unpopular-opinion-data-scientists-should-be-more-end-to-end","status":"publish","type":"post","link":"https:\/\/wealthrevelation.com\/data-science\/2020\/09\/17\/unpopular-opinion-data-scientists-should-be-more-end-to-end\/","title":{"rendered":"Unpopular Opinion \u2013 Data Scientists Should Be More End-to-End"},"content":{"rendered":"<div id=\"post-\">\n<p><b>By <a href=\"https:\/\/www.linkedin.com\/in\/eugeneyan\/\" target=\"_blank\" rel=\"noopener noreferrer\">Eugene Yan<\/a>, Applied Science at Amazon, Writer &amp; Speaker<\/b>.<\/p>\n<p><img class=\"aligncenter size-full wp-image-55220\" src=\"https:\/\/www.kdnuggets.com\/wp-content\/uploads\/data-science-field.jpg\" alt=\"\" width=\"90%\"><\/p>\n<p>Recently, I came across a\u00a0<a href=\"https:\/\/www.reddit.com\/r\/datascience\/comments\/i48b5q\/for_those_that_work_for_a_team_that_has_both_data\/\" target=\"_blank\" rel=\"noopener noreferrer\">Reddit thread<\/a>\u00a0on the different roles in data science and machine learning: data scientist, decision scientist, product data scientist, data engineer, machine learning engineer, machine learning tooling engineer, AI architect, etc.<\/p>\n<p>I found this\u00a0<em>worrying<\/em>. It\u2019s difficult to be effective when the data science process (problem framing, data engineering, ML, deployment\/maintenance) is split across different people. It leads to coordination overhead, diffusion of responsibility, and lack of a big picture view.<\/p>\n<p>IMHO,\u00a0<strong>I believe data scientists can be more effective by being end-to-end<\/strong>. 
Here, I\u2019ll discuss the\u00a0<a href=\"https:\/\/eugeneyan.com\/writing\/end-to-end-data-science\/#from-start-identify-the-problem-to-finish-solve-it\" target=\"_blank\" rel=\"noopener noreferrer\">benefits<\/a>\u00a0and\u00a0<a href=\"https:\/\/eugeneyan.com\/writing\/end-to-end-data-science\/#but-we-need-specialist-experts-too\" target=\"_blank\" rel=\"noopener noreferrer\">counter-arguments<\/a>,\u00a0<a href=\"https:\/\/eugeneyan.com\/writing\/end-to-end-data-science\/#the-best-way-to-pick-it-up-is-via-learning-by-doing\">how to<\/a>\u00a0become end-to-end, and the experiences of\u00a0<a href=\"https:\/\/eugeneyan.com\/writing\/end-to-end-data-science\/#end-to-end-in-stitch-fix-and-netflix\" target=\"_blank\" rel=\"noopener noreferrer\">Stitch Fix and Netflix<\/a>.<\/p>\n<p>\u00a0<\/p>\n<h3>From start (identify the problem) to finish (solve it)<\/h3>\n<p>\u00a0<\/p>\n<p>You may have come across similar\u00a0<em>labels<\/em>\u00a0and definitions, such as:<\/p>\n<ul>\n<li>\n<a href=\"https:\/\/towardsdatascience.com\/why-you-shouldnt-be-a-data-science-generalist-f69ea37cdd2c\" target=\"_blank\" rel=\"noopener noreferrer\">Generalist<\/a>: Focused on roles (<a href=\"https:\/\/en.wikipedia.org\/wiki\/Product_manager\" target=\"_blank\" rel=\"noopener noreferrer\">PM<\/a>,\u00a0<a href=\"https:\/\/en.wikipedia.org\/wiki\/Business_analyst\" target=\"_blank\" rel=\"noopener noreferrer\">BA<\/a>,\u00a0<a href=\"https:\/\/www.oreilly.com\/content\/data-engineering-a-quick-and-simple-definition\/\" target=\"_blank\" rel=\"noopener noreferrer\">DE<\/a>,\u00a0<a href=\"https:\/\/en.wikipedia.org\/wiki\/Category:Data_scientists\" target=\"_blank\" rel=\"noopener noreferrer\">DS<\/a>,\u00a0<a href=\"https:\/\/www.quora.com\/What-exactly-does-a-machine-learning-engineer-do\" target=\"_blank\" rel=\"noopener noreferrer\">MLE<\/a>); some negative connotation<\/li>\n<li>\n<a href=\"https:\/\/skillcrush.com\/blog\/front-end-back-end-full-stack\/\" target=\"_blank\" 
rel=\"noopener noreferrer\">Full-stack<\/a>: Focused on tech (Spark, Torch, Docker); popularized by full-stack devs<\/li>\n<li>\n<a href=\"https:\/\/www.infoworld.com\/article\/3429185\/stop-searching-for-that-data-science-unicorn.html\" target=\"_blank\" rel=\"noopener noreferrer\">Unicorn<\/a>: Focused on mythology; believed not to exist<\/li>\n<\/ul>\n<p>I find these definitions to be more prescriptive than I prefer. Instead, I have a simple (and pragmatic) definition: An end-to-end data scientist can\u00a0<strong>identify and solve problems with data to deliver value<\/strong>. To achieve the goal, they\u2019ll wear as many (or as few) hats as required. They\u2019ll also learn and apply whatever tech, methodology, and process works. Throughout the process, they ask questions such as:<\/p>\n<ul>\n<li>What is the problem? Why is it important?<\/li>\n<li>Can we solve it? How should we solve it?<\/li>\n<li>What is the estimated value? What was the actual value?<\/li>\n<\/ul>\n<blockquote>\n<p><em><strong>Data Science Processes<\/strong><\/em><\/p>\n<p>Another way of defining end-to-end data science is via processes. These processes are usually complex and I\u2019ve left them out of the main discussion. 
Nonetheless, here are a few in case you\u2019re curious:<\/p>\n<ul>\n<li>\n<a href=\"https:\/\/en.wikipedia.org\/wiki\/Cross-industry_standard_process_for_data_mining\" target=\"_blank\" rel=\"noopener noreferrer\">CRISP-DM<\/a>: Cross-Industry Standard Process for Data Mining (1997).<\/li>\n<li>\n<a href=\"https:\/\/en.wikipedia.org\/wiki\/Data_mining#Process\" target=\"_blank\" rel=\"noopener noreferrer\">KDD<\/a>: Knowledge Discovery in Databases.<\/li>\n<li>\n<a href=\"https:\/\/docs.microsoft.com\/en-us\/azure\/machine-learning\/team-data-science-process\/overview\" target=\"_blank\" rel=\"noopener noreferrer\">TDSP<\/a>: Team Data Science Process, proposed by Microsoft in 2018.<\/li>\n<li>\n<a href=\"https:\/\/github.com\/dslp\/dslp\" target=\"_blank\" rel=\"noopener noreferrer\">DSLP<\/a>: Data Science Lifecycle Process.<\/li>\n<\/ul>\n<p>Don\u2019t worry if these processes seem heavy and overwhelming. You don\u2019t have to adopt them wholesale\u2014start bit by bit, keep what works and adapt the rest.<\/p>\n<\/blockquote>\n<p>\u00a0<\/p>\n<h3>More context, faster iteration, greater satisfaction<\/h3>\n<p>\u00a0<\/p>\n<p>For most data science roles, being more end-to-end improves your ability to make a meaningful impact. (Nonetheless, there are\u00a0<a href=\"https:\/\/nvidia.wd5.myworkdayjobs.com\/en-US\/NVIDIAExternalCareerSite\/job\/US-CA-Santa-Clara\/Senior-Deep-Learning-Data-Scientist--RAPIDS---AI_JR1929838\" target=\"_blank\" rel=\"noopener noreferrer\">roles<\/a>\u00a0that focus on machine learning.)<\/p>\n<p><strong>Working end-to-end provides increased context.<\/strong>\u00a0While specialized roles can increase efficiency, they reduce context (for the data scientist) and lead to suboptimal solutions.<\/p>\n<blockquote>\n<p>The trick to forgetting the big picture is to look at everything close-up. \u2013 Chuck Palahniuk<\/p>\n<\/blockquote>\n<p>It\u2019s hard to design a holistic solution without the full context of the upstream problem. 
Let\u2019s say conversion has decreased, and a PM raises a request to improve our search algorithm. However, what\u2019s causing the decrease in the first place? There could be various causes:<\/p>\n<ul>\n<li>Product: Are fraudulent\/poor-quality products reducing customer trust?<\/li>\n<li>Data pipelines: Has data quality been compromised, or are there delays\/outages?<\/li>\n<li>Model refresh: Is the model not refreshing regularly\/correctly?<\/li>\n<\/ul>\n<p>More often than not, the problem\u2014and solution\u2014lies\u00a0<em>outside<\/em>\u00a0of machine learning. A solution to\u00a0<em>improve the algorithm<\/em>\u00a0would miss the root cause.<\/p>\n<p>Similarly, it\u2019s risky to develop a solution without awareness of downstream engineering and product constraints. There\u2019s no point:<\/p>\n<ul>\n<li>Building a near-real-time recommender if infra and engineers cannot support it<\/li>\n<li>Building an infinite scroll recommender if it doesn\u2019t fit in our product and app<\/li>\n<\/ul>\n<p>By working end-to-end, data scientists will have the full context to identify the right problems and develop usable solutions. It can also lead to innovative ideas that specialists, with their narrow context, might miss. Overall, it increases the ability to deliver value.<\/p>\n<p><strong>Communication and coordination overhead is reduced.<\/strong>\u00a0With multiple roles comes additional overhead. Let\u2019s look at an example of a data engineer (DE) cleaning the data and creating features, a data scientist (DS) analysing the data and training the model, and a machine learning engineer (MLE) deploying and maintaining it.<\/p>\n<blockquote>\n<p>What one programmer can do in one month, two programmers can do in two months. \u2013 Frederick P. Brooks<\/p>\n<\/blockquote>\n<p>The DE and DS need to\u00a0<em>communicate<\/em>\u00a0on what data is (and is not) available, how it should be cleaned (e.g., outliers, normalisation), and which features should be created. 
Similarly, the DS and MLE have to discuss how to deploy, monitor, and maintain the model, as well as how often it should be refreshed. When issues occur, we\u2019ll need three people in the room (likely with a PM) to triage the root cause and next steps to fix it.<\/p>\n<p>It also leads to additional coordination, where schedules need to be aligned as work is executed and passed along sequentially. If the DS wants to experiment with additional data and features, we\u2019ll need to wait for the DE to ingest the data and create the features. If a new model is ready for A\/B testing, we\u2019ll need to wait for the MLE to convert it to production code and deploy it.<\/p>\n<p>While the actual development work may take days, the communication back-and-forth and coordination can take weeks, if not longer. With end-to-end data scientists, we can minimize this overhead as well as prevent technical details from being lost in translation.<\/p>\n<p>(But can an end-to-end DS really do all that? I think so. While the DS might not be as proficient in some tasks as a DE or MLE, they will be able to perform most tasks effectively. If they need help with scaling or hardening, they can always get help from specialist DEs and MLEs.)<\/p>\n<blockquote>\n<p><em><strong>The Cost of Communication and Coordination<\/strong><\/em><\/p>\n<p>Richard Hackman, a Harvard psychologist, showed that the number of relationships in a team is <em>N(N-1) \/ 2,<\/em>\u00a0where\u00a0<em>N<\/em>\u00a0is the number of people. This leads to quadratic growth in links, where:<\/p>\n<ul>\n<li>A start-up team of 7 has 21 links to maintain<\/li>\n<li>A group of 21 (i.e., three start-up teams) has 210 links<\/li>\n<li>A group of 63 has almost 2,000 links.<\/li>\n<\/ul>\n<p>In our simple example, we only had three roles (i.e., three links). But as a PM, BA, and additional members are included, this leads to greater than linear growth in communication and coordination costs. 
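The link counts quoted here follow directly from the N(N-1)/2 formula, which can be checked in a few lines of Python (a minimal sketch; the helper name `team_links` is mine, for illustration):

```python
def team_links(n: int) -> int:
    """Number of pairwise relationships in a team of n people: n(n-1)/2."""
    return n * (n - 1) // 2

print(team_links(7))   # a start-up team of 7 -> 21 links
print(team_links(21))  # three start-up teams -> 210 links
print(team_links(63))  # -> 1953, i.e., almost 2,000 links
```

Note the growth is quadratic rather than linear: tripling the team from 7 to 21 people multiplies the links to maintain by ten.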
Thus, while each additional member increases total team productivity, the increased overhead means productivity grows at a decreasing rate. (Amazon\u2019s\u00a0<a href=\"https:\/\/buffer.com\/resources\/small-teams-why-startups-often-win-against-google-and-facebook-the-science-behind-why-smaller-teams-get-more-done\/\" target=\"_blank\" rel=\"noopener noreferrer\">two-pizza teams<\/a>\u00a0are a possible solution to this.)<\/p>\n<\/blockquote>\n<p><strong>Iteration and learning rates are increased.<\/strong>\u00a0With greater context and less overhead, we can now iterate, fail (read: learn), and deliver value faster.<\/p>\n<p>This is especially important for developing data and algorithmic products. Unlike software engineering (a far more mature craft), we can\u2019t do all the learning and design before we start building\u2014our blueprints, architectures, and design patterns are not as developed. Thus, rapid iteration is essential for the design-build-learn cycle.<\/p>\n<p><strong>There\u2019s greater ownership and accountability.<\/strong>\u00a0Having the data science process split across multiple people can lead to diffusion of responsibility, and worse, social loafing.<\/p>\n<p>A common anti-pattern observed is \u201c<a href=\"https:\/\/wiki.c2.com\/?ThrownOverTheWall\" target=\"_blank\" rel=\"noopener noreferrer\">throw over the wall<\/a>.\u201d For example, the DE creates features and throws a database table to the DS, the DS trains a model and throws\u00a0R\u00a0code over to the MLE, and the MLE translates it to\u00a0Java\u00a0for production.<\/p>\n<p>If things get lost in translation or if results are unexpected, who is responsible? With a strong culture of ownership, everyone steps up to contribute in their respective roles. 
But without it, work can degenerate into ass-covering and finger-pointing while the issue persists and customers and the business suffer.<\/p>\n<p>Having the end-to-end data scientist take ownership and responsibility for the entire process can mitigate this. They should be empowered to take action from start to finish, from the customer problem and input (i.e., raw data) to the output (i.e., deployed model) and measurable outcomes.<\/p>\n<blockquote>\n<p><em><strong>Diffusion of Responsibility &amp; Social Loafing<\/strong><\/em><\/p>\n<p><a href=\"https:\/\/en.wikipedia.org\/wiki\/Diffusion_of_responsibility\" target=\"_blank\" rel=\"noopener noreferrer\">Diffusion of responsibility<\/a>: We are less likely to take responsibility and act when there are others present. Individuals feel less responsibility and urgency to help when they know that others are also watching the situation.<\/p>\n<p>One form of this is the\u00a0<a href=\"https:\/\/en.wikipedia.org\/wiki\/Diffusion_of_responsibility#Bystander_effect\" target=\"_blank\" rel=\"noopener noreferrer\">Bystander effect<\/a>, where\u00a0<a href=\"https:\/\/en.wikipedia.org\/wiki\/Murder_of_Kitty_Genovese\" target=\"_blank\" rel=\"noopener noreferrer\">Kitty Genovese<\/a>\u00a0was stabbed outside the apartment building across the street from where she lived. While there were 38 witnesses who saw or heard the attack, none called the police or helped her.<\/p>\n<p><a href=\"https:\/\/en.wikipedia.org\/wiki\/Social_loafing\" target=\"_blank\" rel=\"noopener noreferrer\">Social loafing<\/a>: We exert less effort when we work in a group vs. working alone. In the 1890s, Ringelmann made people pull on ropes both separately and in groups. 
He measured how hard they pulled and found that members of a group tended to exert less effort in pulling a rope than did individuals alone.<\/p>\n<\/blockquote>\n<p><strong>For (some) data scientists, it can lead to increased motivation and job satisfaction<\/strong>, which is\u00a0<a href=\"https:\/\/www.clearpointstrategy.com\/how-employees-are-motivated-autonomy-mastery-purpose\/\" target=\"_blank\" rel=\"noopener noreferrer\">closely tied<\/a>\u00a0to autonomy, mastery, and purpose.<\/p>\n<ul>\n<li>\n<strong>Autonomy:<\/strong>\u00a0By being able to solve problems independently. Instead of waiting and depending on others, end-to-end data scientists are able to identify and define the problem, build their own data pipelines, and deploy and validate a solution.<\/li>\n<li>\n<strong>Mastery:<\/strong>\u00a0Of the problem, solution, and outcome, end-to-end. They can also pick up the domain and tech as required.<\/li>\n<li>\n<strong>Purpose<\/strong>: By being deeply involved in the entire process, they have a more direct connection with the work and outcomes, leading to an increased sense of\u00a0<em>purpose<\/em>.<\/li>\n<\/ul>\n<p>\u00a0<\/p>\n<h3>But, we need specialist experts too<\/h3>\n<p>\u00a0<\/p>\n<p>Being end-to-end is not for everyone (and every team) though, for reasons such as:<\/p>\n<p><strong>Wanting to specialize<\/strong>\u00a0in machine learning, or perhaps a specific niche in machine learning such as neural text generation (read:\u00a0<a href=\"https:\/\/mc.ai\/the-subtle-art-of-priming-gpt-3\/\" target=\"_blank\" rel=\"noopener noreferrer\">GPT-3 primer<\/a>). While being end-to-end is valuable, we also need such world-class experts in research and industry who push the envelope. Much of what we have in ML came from academia and pure research efforts.<\/p>\n<blockquote>\n<p>No one achieves greatness by becoming a generalist. You don\u2019t hone a skill by diluting your attention to its development. 
The only way to get to the next level is focus. \u2013 John C. Maxwell<\/p>\n<\/blockquote>\n<p><strong>Lack of interest.<\/strong>\u00a0Not everyone is keen to engage with customers and businesses to define the problem, gather requirements, and write design documents. Likewise, not everyone is interested in software engineering, production code, unit tests, and CI\/CD pipelines.<\/p>\n<p><strong>Working on large, high-leverage systems where a 0.01% improvement has giant impact.<\/strong>\u00a0For example, algorithmic trading and advertising. In such situations, hyper-specialization is required to eke out those improvements.<\/p>\n<p>Others have also made arguments for why data scientists should specialize (and not be end-to-end); their articles provide balance and counter-arguments.<\/p>\n<p>\u00a0<\/p>\n<h3>The best way to pick it up is via learning by doing<\/h3>\n<p>\u00a0<\/p>\n<p>If you\u2019re still keen on becoming more end-to-end, we\u2019ll now discuss how to do so. Before that, without going into specific technologies, here are the buckets of skills that end-to-end data scientists commonly use:<\/p>\n<ul>\n<li>Product: Understand customer problems, define and prioritize requirements<\/li>\n<li>Communication: Facilitate across teams, get buy-in, write docs, share results<\/li>\n<li>Data engineering: Move and transform data from point A to B<\/li>\n<li>Data analysis: Understand and visualize data, A\/B testing &amp; inference<\/li>\n<li>Machine learning: The usual plus experimentation, implementation, and metrics<\/li>\n<li>Software engineering: Production code practices including unit tests, docs, logging<\/li>\n<li>DevOps: Basic containerization and cloud proficiency, build and automation tools<\/li>\n<\/ul>\n<p>(This list is neither mandatory nor exhaustive. 
Most projects don\u2019t require all of them.)<\/p>\n<p>Here are four ways you can move closer to being an end-to-end data scientist:<\/p>\n<p><strong>Study the right books and courses.<\/strong>\u00a0(Okay, this is\u00a0<em>not<\/em>\u00a0learning by doing, but we all need to start somewhere). I would focus on courses that cover tacit knowledge rather than specific tools. While I\u2019ve not come across such materials, I\u2019ve heard good reviews about\u00a0<a href=\"https:\/\/course.fullstackdeeplearning.com\/\" target=\"_blank\" rel=\"noopener noreferrer\">Full Stack Deep Learning<\/a>.<\/p>\n<p><strong>Do your own projects end-to-end<\/strong>\u00a0to get first-hand experience of the entire process. At the risk of oversimplifying it, here are some steps I would take with their associated skills.<\/p>\n<blockquote>\n<p>I hear and I forget. I see and I remember. I do and I understand. \u2013 Confucius<\/p>\n<\/blockquote>\n<p>Start with identifying a problem to solve and determining the success metric (<em>product<\/em>). Then, find some\u00a0<a href=\"https:\/\/datasetsearch.research.google.com\/\" target=\"_blank\" rel=\"noopener noreferrer\">raw data<\/a>\u00a0(i.e., not Kaggle competition data); this lets you clean and prepare the data and create features (<em>data engineering<\/em>). Next, try various ML models, examining learning curves, error distributions, and evaluation metrics (<em>data science<\/em>).<\/p>\n<p>Assess each model\u2019s performance (e.g., query latency, memory footprint) before picking one and writing a basic\u00a0<a href=\"https:\/\/eugeneyan.com\/writing\/product-categorization-api-part-3-creating-an-api\/#creating-a-titlecategorize-class\" target=\"_blank\" rel=\"noopener noreferrer\">inference class<\/a>\u00a0around it for production (<em>software engineering<\/em>). (You might also want to build a simple user interface). 
Then, containerise and deploy it online for others to use via your preferred cloud provider (<em>dev ops<\/em>).<\/p>\n<p>Once that\u2019s done, go the extra mile and share what you\u2019ve done. You could write an article for your site or speak about it at a meetup (<em>communication<\/em>). Show what you found in the data via meaningful visuals and tables (<em>data analysis<\/em>). Share your work on GitHub.\u00a0<a href=\"https:\/\/www.swyx.io\/writing\/learn-in-public\/\" target=\"_blank\" rel=\"noopener noreferrer\">Learning<\/a>\u00a0and working in public is a great way to get feedback and find potential collaborators.<\/p>\n<p><strong>Volunteer through groups like\u00a0<a href=\"https:\/\/www.datakind.org\/\" target=\"_blank\" rel=\"noopener noreferrer\">DataKind<\/a>.<\/strong>\u00a0DataKind works with social organizations (e.g., NGOs) and data professionals to address humanitarian issues. By collaborating with these NGOs, you get the opportunity to work as part of a team to tackle real problems with real(ly messy) data.<\/p>\n<p>While volunteers may be assigned specific roles (e.g., PM, DS), you\u2019re always welcome to tag along and observe. You\u2019ll see (and learn) how PMs engage with NGOs to frame the problem, define solutions, and organize the team around it. You\u2019ll learn from fellow volunteers how to work with data to develop working solutions. Volunteering in hackathon-like\u00a0<a href=\"https:\/\/www.datakind.org\/datadives\" target=\"_blank\" rel=\"noopener noreferrer\">DataDives<\/a>\u00a0and longer-term\u00a0<a href=\"https:\/\/www.datakind.org\/datacorps\" target=\"_blank\" rel=\"noopener noreferrer\">DataCorps<\/a>\u00a0is a great way to contribute to the data science process end-to-end.<\/p>\n<p><strong>Join a startup-like team.<\/strong>\u00a0Note: A startup-like team is not synonymous with a startup. There are big organizations that run teams in a startup-like manner (e.g., two-pizza teams) and startups made up of specialists. 
Find a lean team where you\u2019re encouraged, and have the opportunity, to work end-to-end.<\/p>\n<p>\u00a0<\/p>\n<h3>End-to-end in Stitch Fix and Netflix<\/h3>\n<p>\u00a0<\/p>\n<p>Eric Colson of\u00a0<strong>Stitch Fix<\/strong>\u00a0was initially \u201clured to a function-based division of labour by the attraction of process efficiencies\u201d (i.e., the\u00a0<a href=\"https:\/\/multithreaded.stitchfix.com\/blog\/2019\/03\/11\/FullStackDS-Generalists\/\" target=\"_blank\" rel=\"noopener noreferrer\">data science pin factory<\/a>). But through trial and error, he found end-to-end data scientists to be more effective. Now, instead of organizing data teams for specialization and productivity, Stitch Fix organizes them for\u00a0<strong>learning and developing new data and algorithmic products<\/strong>.<\/p>\n<blockquote>\n<p>The goal of data science is not to execute. Rather, the goal is to learn and develop new business capabilities. \u2026 There are no blueprints; these are new capabilities with inherent uncertainty. \u2026 All the elements you\u2019ll need must be learned through experimentation, trial and error, and iteration. \u2013 Eric Colson<\/p>\n<\/blockquote>\n<p>He suggests that data science roles should be made more general, with broad responsibilities agnostic to technical function and optimized for learning. Thus, his team hires and grows generalists who can conceptualize, model, implement, and measure. Of course, this is dependent on a solid data platform that abstracts away the complexities of infra setup, distributed processing, monitoring, automated failover, etc.<\/p>\n<p>Having end-to-end data scientists improved Stitch Fix\u2019s learning and innovation capabilities, enabling them to discover and build more business capabilities (relative to a specialist team).<\/p>\n<p><strong>Netflix<\/strong>\u00a0Edge Engineering initially had specialized roles. However, this created inefficiencies across the product life cycle. 
Code releases took more time (weeks instead of days), deployment problems took longer to detect and resolve, and production issues required multiple back-and-forth communications.<\/p>\n<p><img class=\"aligncenter size-large\" src=\"https:\/\/eugeneyan.com\/assets\/sldc-specialists.jpg\" width=\"90%\"><\/p>\n<p><em>At the extreme, each function area\/product is owned by 7 people (<a href=\"https:\/\/netflixtechblog.com\/full-cycle-developers-at-netflix-a08c31f83249\" target=\"_blank\" rel=\"noopener noreferrer\">source<\/a>).<\/em><\/p>\n<p>To address this, they experimented with\u00a0<a href=\"https:\/\/netflixtechblog.com\/full-cycle-developers-at-netflix-a08c31f83249\">Full Cycle Developers<\/a>\u00a0who were empowered to work across the entire software life cycle. This required a mindset shift\u2014instead of just considering design and development, devs also had to consider deployment and reliability.<\/p>\n<p><img class=\"aligncenter size-large\" src=\"https:\/\/eugeneyan.com\/assets\/full-cycle-dev.jpg\" width=\"90%\"><\/p>\n<p><em>Instead of multiple roles and people, we now have the full cycle dev (<a href=\"https:\/\/netflixtechblog.com\/full-cycle-developers-at-netflix-a08c31f83249\" target=\"_blank\" rel=\"noopener noreferrer\">source<\/a>).<\/em><\/p>\n<p>To support full cycle devs, centralized teams built tooling to automate and simplify common development processes (e.g., build and deploy pipelines, monitoring, managed rollbacks). Such tooling is reusable across multiple teams, acts as a force multiplier, and helps devs be effective across the entire cycle.<\/p>\n<p>With the full cycle developer approach, Edge Engineering was able to iterate more quickly (instead of coordinating across teams), with faster and more routine deployments.<\/p>\n<p>\u00a0<\/p>\n<h3>Did it work for me? Here are a few examples<\/h3>\n<p>\u00a0<\/p>\n<p>At IBM, I was on a team that created job recommendations for staff. Running the entire pipeline took a very long time. 
I thought we could halve the time by moving the data prep and feature engineering pipelines into the database. But the database guy didn\u2019t have time to test this. Being impatient, I benchmarked and implemented it myself, reducing overall run time by 90%. This allowed us to experiment 10x faster and save on compute costs in production.<\/p>\n<p>While building Lazada\u2019s ranking system, I found\u00a0Spark\u00a0necessary for data pipelines (due to the large data volume). However, our cluster only supported the\u00a0Scala\u00a0API, which I was unfamiliar with. Not wanting to wait (for data engineering support), I chose the faster\u2014but painful\u2014route of figuring out Scala Spark and writing the pipelines myself. This likely halved dev time and gave me a better understanding of the data to build a better model.<\/p>\n<p>After a successful A\/B test, we found that business stakeholders didn\u2019t trust the model. As a result, they were manually picking top products to display, decreasing online metrics (e.g., CTR, conversion). To understand more, I made trips to our marketplaces (e.g., Indonesia, Vietnam). Through mutual education, we were able to address their concerns, reduce the amount of manual overriding, and reap the gains.<\/p>\n<p>In the examples above,\u00a0<strong>going out of the regular DS &amp; ML job scope helped with delivering more value, faster<\/strong>. In the last example, it was necessary to unblock our data science efforts.<\/p>\n<p>\u00a0<\/p>\n<h3>Try it out<\/h3>\n<p>\u00a0<\/p>\n<p>You may not be end-to-end now. That\u2019s okay\u2014few people are. Nonetheless, consider its benefits and try stretching closer towards it.<\/p>\n<p>Which aspects would disproportionately improve your ability to deliver as a data scientist? Increased engagement with customers and stakeholders to design more holistic, innovative solutions? Building and orchestrating your own data pipelines? 
Greater awareness of engineering and product constraints for faster integration and deployments?<\/p>\n<blockquote class=\"twitter-tweet\">\n<p dir=\"ltr\" lang=\"en\">Unpopular view: Data scientists should be more end-to-end.<\/p>\n<p>While this is frowned upon (too generalist!), I&#8217;ve seen it lead to more context, faster iteration, greater innovation\u2014more value, faster.<\/p>\n<p>More details and Stitch Fix &amp; Netflix&#8217;s experience \ud83d\udc47 <a href=\"https:\/\/t.co\/aOBjuBSsSz\">https:\/\/t.co\/aOBjuBSsSz<\/a><\/p>\n<p>\u2014 Eugene Yan (@eugeneyan) <a href=\"https:\/\/twitter.com\/eugeneyan\/status\/1293360153916407808?ref_src=twsrc%5Etfw\">August 12, 2020<\/a><\/p>\n<\/blockquote>\n<p><a href=\"https:\/\/eugeneyan.com\/writing\/end-to-end-data-science\/\" target=\"_blank\" rel=\"noopener noreferrer\">Original<\/a>. Reposted with permission.<\/p>\n<p>\u00a0<\/p>\n<p><strong>Bio: <\/strong>Eugene Yan (<a href=\"https:\/\/twitter.com\/eugeneyan\" target=\"_blank\" rel=\"noopener noreferrer\">@eugeneyan<\/a>)\u00a0works at the intersection of consumer data and tech to build machine learning systems that help customers.\u00a0Eugene also writes about how to be effective in data science, learning, and career. 
Currently, Eugene is an Applied Scientist at Amazon helping users read more, and get more out of reading.<\/p>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>https:\/\/www.kdnuggets.com\/2020\/09\/data-scientists-should-be-more-end-to-end.html<\/p>\n","protected":false},"author":0,"featured_media":1672,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[2],"tags":[],"_links":{"self":[{"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/posts\/1671"}],"collection":[{"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/types\/post"}],"replies":[{"embeddable":true,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/comments?post=1671"}],"version-history":[{"count":0,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/posts\/1671\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/media\/1672"}],"wp:attachment":[{"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/media?parent=1671"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/categories?post=1671"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/tags?post=1671"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}