{"id":2945,"date":"2020-10-07T14:39:43","date_gmt":"2020-10-07T14:39:43","guid":{"rendered":"https:\/\/data-science.gotoauthority.com\/2020\/10\/07\/a-step-by-step-guide-for-creating-an-authentic-data-science-portfolio-project\/"},"modified":"2020-10-07T14:39:43","modified_gmt":"2020-10-07T14:39:43","slug":"a-step-by-step-guide-for-creating-an-authentic-data-science-portfolio-project","status":"publish","type":"post","link":"https:\/\/wealthrevelation.com\/data-science\/2020\/10\/07\/a-step-by-step-guide-for-creating-an-authentic-data-science-portfolio-project\/","title":{"rendered":"A step-by-step guide for creating an authentic data science portfolio project"},"content":{"rendered":"<div id=\"post-\">\n<p><b>By <a href=\"https:\/\/www.linkedin.com\/in\/felix-vemmer\/\" target=\"_blank\" rel=\"noopener noreferrer\">Felix Vemmer<\/a>, Operational Intelligence Data Analyst at N26<\/b>.<\/p>\n<p><img class=\"aligncenter size-large\" src=\"https:\/\/miro.medium.com\/max\/875\/1*9HyyS57vSdBbfY7pGsClIw.png\" width=\"90%\"><\/p>\n<p>As an aspiring data scientist,<strong>\u00a0building interesting portfolio projects is key to showcasing your skills.<\/strong> When I\u00a0<a href=\"https:\/\/medium.com\/@felix.vemmer\/how-what-and-why-you-should-learn-python-as-a-business-student-ea539df88698\" target=\"_blank\" rel=\"noopener noreferrer\">learned coding and data science as a business student<\/a>\u00a0through online courses, I disliked that the datasets were either made up of\u00a0fake data or had been solved before,\u00a0like\u00a0<a href=\"https:\/\/www.kaggle.com\/vikrishnan\/boston-house-prices\" target=\"_blank\" rel=\"noopener noreferrer\">Boston House Prices<\/a>\u00a0or the\u00a0<a href=\"https:\/\/www.kaggle.com\/c\/titanic\" target=\"_blank\" rel=\"noopener noreferrer\">Titanic dataset<\/a>\u00a0on Kaggle.<\/p>\n<p>In this blog post, I want to\u00a0<strong>show you how I develop interesting data science project ideas and implement them step by step<\/strong>, such as exploring 
Germany\u2019s biggest frequent flyer forum Vielfliegertreff. If you are\u00a0short on time, feel free to skip to the conclusion TLDR.<\/p>\n<p>\u00a0<\/p>\n<h3>Step 1: Choose your passion topic that is relevant<\/h3>\n<p>\u00a0<\/p>\n<p>As a first step, I think about a potential project that fulfills the following three requirements to make it the most interesting and enjoyable:<\/p>\n<ol>\n<li>Solving\u00a0<strong>my own problem or burning question.<\/strong>\n<\/li>\n<li>Connected to some <strong>recent event to be relevant<\/strong>\u00a0or especially interesting.<\/li>\n<li>Has not\u00a0<strong>been solved or covered before.<\/strong>\n<\/li>\n<\/ol>\n<p>As these ideas are still quite abstract, let me give you a rundown of how my three projects fulfilled the requirements:<\/p>\n<p><em>Overview of my own data science portfolio projects fulfilling the three outlined requirements.<\/em><\/p>\n<p>As a beginner,\u00a0do not strive for perfection, but choose something you are\u00a0genuinely curious about\u00a0and write down all the questions you want to explore in your topic.<\/p>\n<p>\u00a0<\/p>\n<h3>Step 2: Start scraping together your own dataset<\/h3>\n<p>\u00a0<\/p>\n<p>Given that you followed my third requirement, there will be no dataset publicly available, and you will have to scrape data together yourself. Having scraped a couple of websites, I now use\u00a0<strong>3 major frameworks<\/strong> for different scenarios:<\/p>\n<p><em>Overview of the 3 major frameworks I use for scraping.<\/em><\/p>\n<p>For Vielfliegertreff, I used\u00a0<a href=\"https:\/\/scrapy.org\/\" target=\"_blank\" rel=\"noopener noreferrer\">scrapy<\/a>\u00a0as a framework for the following reasons:<\/p>\n<ol>\n<li>There were\u00a0<strong>no JavaScript<\/strong>-enabled elements that were hiding data.<\/li>\n<li>The website\u00a0<strong>structure was complex, <\/strong>requiring me to go from each forum subject to all the threads, and from all the threads to all of their post pages. 
With\u00a0<strong>scrapy you can easily implement complex logic\u00a0<\/strong>by yielding requests that lead to new callback functions in an organized way.<\/li>\n<li>There were quite a lot of posts, so crawling the entire forum would definitely take some time. Scrapy allows you to\u00a0<strong>asynchronously scrape websites at an incredible speed<\/strong>.<\/li>\n<\/ol>\n<p>To give you an idea of how powerful scrapy is, I ran a quick benchmark on my MacBook Pro (13-inch, 2018, Four Thunderbolt 3 Ports) with a 2.3 GHz Quad-Core Intel Core i5, which was able to\u00a0scrape around 3,000 pages per minute:<\/p>\n<p><img class=\"aligncenter size-large\" src=\"https:\/\/miro.medium.com\/max\/1250\/1*sv9mKLoe7Wx1_vN0h2ez9A.png\" width=\"90%\"><\/p>\n<p><em>Scrapy scraping benchmark. (Image by Author)<\/em><\/p>\n<p>To be nice and avoid getting blocked, it is important to scrape gently, for example by enabling scrapy\u2019s\u00a0<a href=\"https:\/\/docs.scrapy.org\/en\/latest\/topics\/autothrottle.html\" target=\"_blank\" rel=\"noopener noreferrer\">auto-throttle feature<\/a>. 
I also saved all data to an SQLite database via an item pipeline\u00a0to avoid duplicates, and turned on logging of each URL request so that I do not put more load on the server if I stop and restart the scraping process.<\/p>\n<p>Knowing how to scrape gives you the\u00a0freedom to collect datasets\u00a0by yourself and\u00a0teaches you important concepts about how the internet works, what a request is, and the structure of HTML\/XPath.<\/p>\n<p>For my project, I ended up with\u00a01.47 GB of data, which was close to 1 million posts in the forum.<\/p>\n<p>\u00a0<\/p>\n<h3>Step 3: Cleaning your dataset<\/h3>\n<p>\u00a0<\/p>\n<p>With your own messy scraped dataset comes the most challenging part of the portfolio project, where\u00a0<strong>data scientists spend on average 60% of their time<\/strong>:<\/p>\n<p><img class=\"aligncenter size-large\" src=\"https:\/\/miro.medium.com\/max\/875\/1*UO5rbRe0qCn42CayJIyRUQ.jpeg\" width=\"90%\"><\/p>\n<p><em>Image by\u00a0<a href=\"https:\/\/visit.figure-eight.com\/rs\/416-ZBE-142\/images\/CrowdFlower_DataScienceReport_2016.pdf\" target=\"_blank\" rel=\"noopener noreferrer\">CrowdFlower<\/a>\u00a02016.<\/em><\/p>\n<p>Unlike clean Kaggle datasets, your own dataset allows you to\u00a0build skills in data cleaning\u00a0and show a future employer that you are\u00a0ready to deal with messy real-life datasets. Additionally, you can explore and take advantage of the Python ecosystem by\u00a0leveraging libraries that solve common data cleaning tasks\u00a0others have already tackled.<\/p>\n<p>For my dataset from Vielfliegertreff, there were a couple of common tasks like turning the\u00a0dates into pandas timestamps, converting numbers from strings into actual numeric data types, and cleaning very messy HTML post text\u00a0into something readable and usable for NLP tasks. 
While some tasks are a bit more complicated, I would like to\u00a0<strong>share my top 3 favourite libraries<\/strong> that solved some of my common data cleaning problems:<\/p>\n<ol>\n<li>\n<a href=\"https:\/\/github.com\/scrapinghub\/dateparser\" target=\"_blank\" rel=\"noopener noreferrer\">dateparser<\/a>:\u00a0Easily parse localized dates in almost any string format commonly found on web pages.<\/li>\n<li>\n<a href=\"https:\/\/github.com\/jfilter\/clean-text\" target=\"_blank\" rel=\"noopener noreferrer\">clean-text<\/a>:\u00a0Preprocess your scraped data with clean-text to create a normalized text representation. It is also great for removing personally identifiable information, such as emails or phone numbers.<\/li>\n<li>\n<a href=\"https:\/\/github.com\/seatgeek\/fuzzywuzzy\" target=\"_blank\" rel=\"noopener noreferrer\">fuzzywuzzy<\/a>:\u00a0Fuzzy string matching like a boss.<\/li>\n<\/ol>\n<p>\u00a0<\/p>\n<h3>Step 4: Data Exploration and Analysis<\/h3>\n<p>\u00a0<\/p>\n<p>When completing the Data Science Nanodegree on Udacity, I came across the\u00a0<strong>Cross-Industry Standard Process for Data Mining (CRISP-DM)<\/strong>, which I thought was quite an interesting framework for structuring your work in a systematic way.<\/p>\n<p>With our current flow, we implicitly followed CRISP-DM for our project:<\/p>\n<p>We expressed\u00a0<strong>business understanding<\/strong>\u00a0by coming up with the following questions in step 1:<\/p>\n<ol>\n<li>How is COVID-19 impacting online frequent flyer forums like Vielfliegertreff?<\/li>\n<li>What are some of the best posts in the forums?<\/li>\n<li>Who are the experts that I should follow as a new joiner?<\/li>\n<li>What are some of the worst or best things people say about airlines or airports?<\/li>\n<\/ol>\n<p>And with the scraped data, we are now able to\u00a0translate our initial business questions from above into specific exploratory data questions:<\/p>\n<ol>\n<li>How many posts are posted on a monthly 
basis? Did the number of posts decrease at the beginning of 2020 with COVID-19? Is there also some indication that fewer people joined the platform because they were not able to travel?<\/li>\n<li>What are the top 10 posts by number of likes?<\/li>\n<li>Who is posting the most and also receives, on average, the most likes per post? These are the users I should regularly follow to see the best content.<\/li>\n<li>Could sentiment analysis on every post, combined with named entity recognition to identify cities\/airports\/airlines, surface interesting positive or negative comments?<\/li>\n<\/ol>\n<p>For the Vielfliegertreff project, one can definitely say that there has been\u00a0a trend of declining posts over the years. With\u00a0COVID-19, we can clearly see a rapid decrease in posts from January 2020\u00a0onwards when Europe was shutting down and closing borders, which also heavily affected travelling:<\/p>\n<p><em>Posts created by month. (Chart by Author)<\/em><\/p>\n<p>Also,\u00a0user sign-ups have gone down over the years, and the forum seems to have seen less and less of the rapid growth it had after its start in January 2009:<\/p>\n<p><em>Sign-up numbers of users over the months. (Chart by Author)<\/em><\/p>\n<p>Last but not least, I wanted to check what the most liked post was about. Unfortunately, it is in German, but it was indeed a very interesting post, where a German guy was allowed to\u00a0spend some time on a US aircraft carrier and experienced a catapult takeoff in a C2 airplane. The post has some very nice pictures and interesting details. 
Feel free to check it out\u00a0<a href=\"https:\/\/www.vielfliegertreff.de\/reiseberichte\/105553-retro-tripreport-flugzeugtraeger.html\">here<\/a>\u00a0if you can understand some German:<\/p>\n<p><img class=\"aligncenter size-large\" src=\"https:\/\/miro.medium.com\/max\/875\/1*DGRi75N1O3S4iI8KfGI8OA.jpeg\" width=\"90%\"><\/p>\n<p><em>Sample picture from the most liked post on Vielfliegertreff (Image by\u00a0<a href=\"https:\/\/www.vielfliegertreff.de\/reiseberichte\/105553-retro-tripreport-flugzeugtraeger.html\" target=\"_blank\" rel=\"noopener noreferrer\">fleckenmann<\/a>).<\/em><\/p>\n<p>\u00a0<\/p>\n<h3>Step 5: Share your work via a blog post or web app<\/h3>\n<p>\u00a0<\/p>\n<p>Once you are done with those steps, you can go one step further and create a model that classifies or predicts certain data points. For this project, I did not go on to apply machine learning, although I had some interesting ideas about\u00a0classifying the sentiment of posts in connection with certain airlines.<\/p>\n<p>In another project, however,\u00a0I built a price prediction model that allows a user to get a price estimate for any type of tractor. The model was then deployed with the awesome\u00a0<a href=\"https:\/\/www.streamlit.io\/\" target=\"_blank\" rel=\"noopener noreferrer\">streamlit framework<\/a>, and can be found\u00a0<a href=\"https:\/\/traktorpreis.herokuapp.com\/\" target=\"_blank\" rel=\"noopener noreferrer\">here<\/a>\u00a0(be patient, as it might load a bit slowly).<\/p>\n<p>Another way to share your work is, as I do, through blog posts on Medium,\u00a0<a href=\"https:\/\/hackernoon.com\/tagged\/python\" target=\"_blank\" rel=\"noopener noreferrer\">Hackernoon<\/a>,\u00a0<a href=\"https:\/\/www.kdnuggets.com\/\">KDNuggets<\/a>, or other popular websites. 
When writing blog posts about portfolio projects or other topics, such as\u00a0<a href=\"https:\/\/medium.com\/@felix.vemmer\/5-awesome-interactive-ai-apps-for-you-to-try-77bac8ab7554\" target=\"_blank\" rel=\"noopener noreferrer\">awesome interactive AI applications<\/a>, I always try to make them as<strong>\u00a0fun, visual, and interactive\u00a0<\/strong>as possible. Here are some of my top tips:<\/p>\n<ul>\n<li>Include nice pictures for\u00a0<strong>easy understanding and to break up some of the long text.<\/strong>\n<\/li>\n<li>Include interactive elements, like\u00a0tweets or videos that<strong> let the user interact.<\/strong>\n<\/li>\n<li>Swap boring tables or charts for\u00a0<strong>interactive ones<\/strong> through tools and frameworks like Airtable or Plotly<strong>.<\/strong>\n<\/li>\n<\/ul>\n<p>\u00a0<\/p>\n<h3>Conclusion &amp; TLDR<\/h3>\n<p>\u00a0<\/p>\n<p>Come up with a blog post idea that answers a\u00a0burning question\u00a0you had or\u00a0solves your own problem. Ideally, the topic is\u00a0timely and has not been analysed by anyone else before. Based on your experience and on the structure and complexity of the website,\u00a0choose a framework that matches the scraping job best. During data cleaning,\u00a0leverage existing libraries\u00a0to solve painful data cleaning tasks like parsing timestamps or cleaning text. Finally, choose how you can best share your work. Either an interactive deployed model\/dashboard or a well-written Medium blog post\u00a0can differentiate you from other applicants on the journey to becoming a data scientist.<\/p>\n<p><a href=\"https:\/\/towardsdatascience.com\/a-step-by-step-guide-for-creating-an-authentic-data-science-portfolio-project-aa641c2f2403\" target=\"_blank\" rel=\"noopener noreferrer\">Original<\/a>. 
Reposted with permission.<\/p>\n<p>\u00a0<\/p>\n<p><strong>Bio:<\/strong>\u00a0<a href=\"https:\/\/www.linkedin.com\/in\/felix-vemmer\/\" target=\"_blank\" rel=\"noopener noreferrer\">Felix Vemmer<\/a>\u00a0is a Data Analyst at N26 focusing on creating interesting data sets and\u00a0projects through web scraping and machine learning.<\/p>\n<p><b>Related:<\/b><\/p>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>https:\/\/www.kdnuggets.com\/2020\/10\/guide-authentic-data-science-portfolio-project.html<\/p>\n","protected":false},"author":0,"featured_media":2946,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[2],"tags":[],"_links":{"self":[{"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/posts\/2945"}],"collection":[{"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/types\/post"}],"replies":[{"embeddable":true,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/comments?post=2945"}],"version-history":[{"count":0,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/posts\/2945\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/media\/2946"}],"wp:attachment":[{"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/media?parent=2945"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/categories?post=2945"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/tags?post=2945"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}