{"id":8007,"date":"2020-12-29T00:26:16","date_gmt":"2020-12-29T00:26:16","guid":{"rendered":"https:\/\/healinglifespan.com\/data-science\/2020\/12\/29\/web-scraping-using-r\/"},"modified":"2020-12-29T00:26:16","modified_gmt":"2020-12-29T00:26:16","slug":"web-scraping-using-r","status":"publish","type":"post","link":"https:\/\/wealthrevelation.com\/data-science\/2020\/12\/29\/web-scraping-using-r\/","title":{"rendered":"Web Scraping Using R..!"},"content":{"rendered":"<div>\n<p class=\"graf\"><span lang=\"EN-US\">In this post, I\u2019ll show you how to <strong>scrape the web using R<\/strong>.<\/span><\/p>\n<p class=\"graf\"><strong><span lang=\"EN-US\">What is R..?<\/span><\/strong><\/p>\n<p class=\"graf\"><span lang=\"EN-US\">R is a programming language and environment built for statistical analysis, graphical representation &amp; reporting. It is mostly preferred by statisticians, data miners, and software programmers who want to develop statistical software.<\/span><\/p>\n<p class=\"graf\"><strong><span lang=\"EN-US\">R<\/span><\/strong><span lang=\"EN-US\"> is also available as Free Software under the terms of the Free Software Foundation\u2019s GNU General Public License in source code form.<\/span><\/p>\n<div id=\"attachment_5306\" class=\"wp-caption aligncenter\"><a href=\"https:\/\/data-science-blog.com\/en\/wp-content\/uploads\/sites\/4\/2020\/11\/reasons-to-choose-r-programming.png\"><img aria-describedby=\"caption-attachment-5306\" loading=\"lazy\" class=\"wp-image-5306 size-full\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/uploads\/sites\/4\/2020\/11\/reasons-to-choose-r-programming.png\" alt=\"Reasons to choose R\" width=\"639\" height=\"370\"><\/a><\/p>\n<p id=\"caption-attachment-5306\" class=\"wp-caption-text\"><strong>Reasons to choose R<\/strong><\/p>\n<\/div>\n<p class=\"graf\"><strong><span lang=\"EN-US\">Let\u2019s begin our topic of Web Scraping using R.<\/span><\/strong><\/p>\n<p class=\"graf\"><strong><span 
lang=\"EN-US\">Step 1- Select the website &amp; the data you want to scrape.<\/span><\/strong><\/p>\n<p class=\"graf\"><span lang=\"EN-US\">I picked the website \u201c<strong>https:\/\/www.alexa.com\/topsites\/countries\/IN<\/strong>\u201d and want to scrape the data for the top 50 sites in India.<\/span><\/p>\n<div id=\"attachment_5305\" class=\"wp-caption aligncenter\"><a href=\"https:\/\/data-science-blog.com\/en\/wp-content\/uploads\/sites\/4\/2020\/11\/Data-we-want-to-scrape-using-r.png\"><img aria-describedby=\"caption-attachment-5305\" loading=\"lazy\" class=\"size-medium wp-image-5305\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/uploads\/sites\/4\/2020\/11\/Data-we-want-to-scrape-using-r-300x103.png\" alt=\"Data we want to scrape\" width=\"300\" height=\"103\"><\/a><\/p>\n<p id=\"caption-attachment-5305\" class=\"wp-caption-text\"><strong>Data we want to scrape<\/strong><\/p>\n<\/div>\n<p class=\"graf\"><strong><span lang=\"EN-US\">Step 2- Get to know the HTML tags using SelectorGadget.<\/span><\/strong><\/p>\n<p class=\"graf\"><span lang=\"EN-US\">In my previous post, I discussed how to inspect a page &amp; find the proper HTML tags. Now I\u2019ll explain an easier way to get them.<\/span><\/p>\n<p class=\"graf\"><span lang=\"EN-US\">Go to the Google Chrome extensions page (chrome:\/\/extensions), open the Chrome Web Store from there &amp; search for <strong>SelectorGadget<\/strong>. 
Add it to your browser; it\u2019s quite a good CSS selector tool.<\/span><\/p>\n<p><a href=\"https:\/\/data-science-blog.com\/en\/wp-content\/uploads\/sites\/4\/2020\/11\/SelectorGadget.png\"><img loading=\"lazy\" class=\"aligncenter size-full wp-image-5304\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/uploads\/sites\/4\/2020\/11\/SelectorGadget.png\" alt=\"\" width=\"184\" height=\"184\"><\/a><\/p>\n<p class=\"graf\"><strong><span lang=\"EN-US\">Step 3- R Code<\/span><\/strong><\/p>\n<p class=\"graf\"><strong><span lang=\"EN-US\">Loading Important Libraries or Packages<\/span><\/strong><\/p>\n<p class=\"graf\"><span lang=\"EN-US\">I\u2019m using the rvest package to scrape the data from the webpage; it is inspired by libraries like <strong>Beautiful Soup<\/strong>. If you haven\u2019t installed the package yet, follow the code in the snippet below.<\/span><\/p>\n<p><a href=\"https:\/\/data-science-blog.com\/en\/wp-content\/uploads\/sites\/4\/2020\/11\/r-packages-import.png\"><img loading=\"lazy\" class=\"aligncenter wp-image-5303\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/uploads\/sites\/4\/2020\/11\/r-packages-import-300x68.png\" alt=\"\" width=\"190\" height=\"43\"><\/a><\/p>\n<p class=\"graf\"><strong><span lang=\"EN-US\">Step 4- Set the URL of the website<\/span><\/strong><\/p>\n<p><a href=\"https:\/\/data-science-blog.com\/en\/wp-content\/uploads\/sites\/4\/2020\/11\/webscraping-r-readline.png\"><img loading=\"lazy\" class=\"aligncenter size-medium wp-image-5302\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/uploads\/sites\/4\/2020\/11\/webscraping-r-readline-300x11.png\" alt=\"\" width=\"300\" height=\"11\"><\/a><\/p>\n<p class=\"graf\"><strong><span lang=\"EN-US\">Step 5- Find the HTML tags using SelectorGadget<\/span><\/strong><\/p>\n<p class=\"graf\"><span lang=\"EN-US\">It\u2019s quite easy to find the HTML tags that contain your data.<\/span><\/p>\n<p class=\"graf\"><span lang=\"EN-US\">First, I have to click 
on the data I want to scrape using SelectorGadget; it automatically selects the data that sits in similar HTML tags. Before going forward, cross-check the selected values: are they correct, or has some junk data been selected too..? Notice that our page has only 50 values, but 156 values are selected.<\/span><\/p>\n<div id=\"attachment_5301\" class=\"wp-caption aligncenter\"><a href=\"https:\/\/data-science-blog.com\/en\/wp-content\/uploads\/sites\/4\/2020\/11\/webscraping-selection-by-selectorgadget.png\"><img aria-describedby=\"caption-attachment-5301\" loading=\"lazy\" class=\"size-medium wp-image-5301\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/uploads\/sites\/4\/2020\/11\/webscraping-selection-by-selectorgadget-300x142.png\" alt=\"Selection by SelectorGadget\" width=\"300\" height=\"142\"><\/a><\/p>\n<p id=\"caption-attachment-5301\" class=\"wp-caption-text\"><strong>Selection by SelectorGadget<\/strong><\/p>\n<\/div>\n<p class=\"graf\"><span lang=\"EN-US\">So I need to remove the unwanted values that got selected. Once you click on one of them to deselect it, it turns red and the others turn yellow, except our primary selection, which turns green. Now you can see only 50 values are selected, as per our primary requirement, but it\u2019s not enough. 
I have to cross-check again that <strong>no required values have been swapped for junk values<\/strong>.<\/span><\/p>\n<p class=\"graf\"><span lang=\"EN-US\">If we are satisfied with our selection, then copy the HTML tag &amp; include it in the code; otherwise, repeat this exercise.<\/span><\/p>\n<div id=\"attachment_5300\" class=\"wp-caption aligncenter\"><a href=\"https:\/\/data-science-blog.com\/en\/wp-content\/uploads\/sites\/4\/2020\/11\/webscraping-modified-selection-by-selectorgadget.png\"><img aria-describedby=\"caption-attachment-5300\" loading=\"lazy\" class=\"size-medium wp-image-5300\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/uploads\/sites\/4\/2020\/11\/webscraping-modified-selection-by-selectorgadget-300x142.png\" alt=\"\" width=\"300\" height=\"142\"><\/a><\/p>\n<p id=\"caption-attachment-5300\" class=\"wp-caption-text\"><strong>Modified Selection by SelectorGadget<\/strong><\/p>\n<\/div>\n<p class=\"graf\"><strong><span lang=\"EN-US\">Step 6- Include the tag in our Code<\/span><\/strong><\/p>\n<p class=\"graf\"><span lang=\"EN-US\">After including the tags, our code looks like this.<\/span><\/p>\n<div id=\"attachment_5299\" class=\"wp-caption aligncenter\"><a href=\"https:\/\/data-science-blog.com\/en\/wp-content\/uploads\/sites\/4\/2020\/11\/webscraping-code-snippet.png\"><img aria-describedby=\"caption-attachment-5299\" loading=\"lazy\" class=\"size-medium wp-image-5299\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/uploads\/sites\/4\/2020\/11\/webscraping-code-snippet-300x44.png\" alt=\"\" width=\"300\" height=\"44\"><\/a><\/p>\n<p id=\"caption-attachment-5299\" class=\"wp-caption-text\"><strong>Code Snippet<\/strong><\/p>\n<\/div>\n<p class=\"graf\"><span lang=\"EN-US\">If I run the code, each list object will hold 50 values.<\/span><\/p>\n<div id=\"attachment_5298\" class=\"wp-caption aligncenter\"><a 
href=\"https:\/\/data-science-blog.com\/en\/wp-content\/uploads\/sites\/4\/2020\/11\/webscraping-data-stored-in-list-objects.png\"><img aria-describedby=\"caption-attachment-5298\" loading=\"lazy\" class=\"size-medium wp-image-5298\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/uploads\/sites\/4\/2020\/11\/webscraping-data-stored-in-list-objects-300x55.png\" alt=\"\" width=\"300\" height=\"55\"><\/a><\/p>\n<p id=\"caption-attachment-5298\" class=\"wp-caption-text\">Data Stored in List Objects<\/p>\n<\/div>\n<p class=\"graf\"><strong><span lang=\"EN-US\">Step 7- Creating DataFrame<\/span><\/strong><\/p>\n<p class=\"graf\"><span lang=\"EN-US\">Now we create a dataframe from our list objects. When creating a dataframe, always remember one rule of thumb: the number of rows (the length of each list) must be equal, else we get an error.<\/span><\/p>\n<div id=\"attachment_5297\" class=\"wp-caption aligncenter\"><a href=\"https:\/\/data-science-blog.com\/en\/wp-content\/uploads\/sites\/4\/2020\/11\/webscraping-error-appears-when-number-of-rows-differs.png\"><img aria-describedby=\"caption-attachment-5297\" loading=\"lazy\" class=\"size-medium wp-image-5297\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/uploads\/sites\/4\/2020\/11\/webscraping-error-appears-when-number-of-rows-differs-300x18.png\" alt=\"\" width=\"300\" height=\"18\"><\/a><\/p>\n<p id=\"caption-attachment-5297\" class=\"wp-caption-text\"><strong>Error appears when number of rows differs<\/strong><\/p>\n<\/div>\n<p><span lang=\"EN-US\">Finally, our DataFrame will look like this<\/span>:<\/p>\n<div id=\"attachment_5296\" class=\"wp-caption aligncenter\"><a href=\"https:\/\/data-science-blog.com\/en\/wp-content\/uploads\/sites\/4\/2020\/11\/webscraping-final-r-dataframe.png\"><img aria-describedby=\"caption-attachment-5296\" loading=\"lazy\" class=\"size-medium wp-image-5296\" 
src=\"https:\/\/data-science-blog.com\/en\/wp-content\/uploads\/sites\/4\/2020\/11\/webscraping-final-r-dataframe-300x230.png\" alt=\"\" width=\"300\" height=\"230\"><\/a><\/p>\n<p id=\"caption-attachment-5296\" class=\"wp-caption-text\"><strong>Our Final Data<\/strong><\/p>\n<\/div>\n<p class=\"graf\"><strong><span lang=\"EN-US\">Step 8- Writing our DataFrame to a CSV file<\/span><\/strong><\/p>\n<p class=\"graf\"><span lang=\"EN-US\">We need our scraped data to be available locally for further analysis &amp; model building or other purposes.<\/span><\/p>\n<p class=\"graf\"><span lang=\"EN-US\">Our final piece of code, which writes it to a CSV file, is<\/span>:<\/p>\n<div id=\"attachment_5295\" class=\"wp-caption aligncenter\"><a href=\"https:\/\/data-science-blog.com\/en\/wp-content\/uploads\/sites\/4\/2020\/11\/webscraping-writing-to-csv-file-r.png\"><img aria-describedby=\"caption-attachment-5295\" loading=\"lazy\" class=\"wp-image-5295 size-medium\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/uploads\/sites\/4\/2020\/11\/webscraping-writing-to-csv-file-r-300x34.png\" alt=\"\" width=\"300\" height=\"34\"><\/a><\/p>\n<p id=\"caption-attachment-5295\" class=\"wp-caption-text\"><strong>Writing to CSV file<\/strong><\/p>\n<\/div>\n<p class=\"graf\"><strong><span lang=\"EN-US\">Step 9- Check the CSV file<\/span><\/strong><\/p>\n<div id=\"attachment_5294\" class=\"wp-caption aligncenter\"><a href=\"https:\/\/data-science-blog.com\/en\/wp-content\/uploads\/sites\/4\/2020\/11\/webscraping-data-written-into-csv-file.png\"><img aria-describedby=\"caption-attachment-5294\" loading=\"lazy\" class=\"size-medium wp-image-5294\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/uploads\/sites\/4\/2020\/11\/webscraping-data-written-into-csv-file-262x300.png\" alt=\"\" width=\"262\" height=\"300\"><\/a><\/p>\n<p id=\"caption-attachment-5294\" class=\"wp-caption-text\"><strong>Data written in CSV file<\/strong><\/p>\n<\/div>\n<p class=\"graf\"><strong><span 
lang=\"EN-US\">Conclusion-<\/span><\/strong><\/p>\n<p class=\"graf\"><span lang=\"EN-US\">I tried to explain Web Scraping using R in a simple way. I hope this helps you understand it better.<\/span><\/p>\n<p class=\"graf\"><strong><span lang=\"EN-US\">Find the full code on<\/span><\/strong><\/p>\n<p class=\"graf\"><span lang=\"EN-US\"><a href=\"https:\/\/medium.com\/r\/?url=https%3A%2F%2Fgithub.com%2Fvgyaan%2FAlexa%2Fblob%2Fmaster%2Fwebscrap.R\" target=\"_blank\" rel=\"noopener noreferrer\"><span>https:\/\/github.com\/vgyaan\/Alexa\/blob\/master\/webscrap.R<\/span><\/a><\/span><\/p>\n<p class=\"graf\"><strong><span lang=\"EN-US\">If you have any questions about the code or web scraping in general, reach out to me on LinkedIn!<br \/><\/span><\/strong><\/p>\n<p class=\"graf\"><span lang=\"EN-US\">Okay, we will meet again with a new topic.<\/span><\/p>\n<p class=\"graf\"><span lang=\"EN-US\">Till then,<\/span><\/p>\n<p class=\"graf\"><em><b><span lang=\"EN-US\">Happy Coding..!<\/span><\/b><\/em><\/p>\n<p><a href=\"https:\/\/data-science-blog.com\/en\/wp-content\/uploads\/sites\/4\/2020\/11\/happy-coder.gif\"><img loading=\"lazy\" class=\"aligncenter size-full wp-image-5293\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/uploads\/sites\/4\/2020\/11\/happy-coder-gap.jpg\" data-gif=\"https:\/\/data-science-blog.com\/en\/wp-content\/uploads\/sites\/4\/2020\/11\/happy-coder.gif\" alt=\"\" width=\"680\" height=\"488\"><\/a><\/p>\n<div id=\"author-bio-box\">\n<h3><a href=\"https:\/\/data-science-blog.com\/en\/blog\/author\/gyanvardhan\/\" title=\"All posts by Gyan Vardhan\" rel=\"author\">Gyan Vardhan<\/a><\/h3>\n<div class=\"bio-gravatar\"><img loading=\"lazy\" src=\"https:\/\/data-science-blog.com\/en\/wp-content\/uploads\/sites\/4\/2020\/11\/gyan-vardhan-80x80.jpeg\" width=\"70\" height=\"70\" alt=\"Gyan Vardhan\" class=\"avatar avatar-70 wp-user-avatar wp-user-avatar-70 alignnone photo\"><\/div>\n<p><a target=\"_blank\" rel=\"nofollow noopener 
noreferrer\" href=\"http:\/\/linkedin.com\/in\/gyan-vardhan-347570163\" class=\"bio-icon bio-icon-linkedin\"><\/a><\/p>\n<p class=\"bio-description\">Gyan Vardhan is a free and independent data scientist based in India.<\/p>\n<\/div>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>https:\/\/data-science-blog.com\/en\/blog\/2020\/11\/18\/web-scraping-using-r\/<\/p>\n","protected":false},"author":0,"featured_media":8008,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[2],"tags":[],"_links":{"self":[{"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/posts\/8007"}],"collection":[{"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/types\/post"}],"replies":[{"embeddable":true,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/comments?post=8007"}],"version-history":[{"count":0,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/posts\/8007\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/media\/8008"}],"wp:attachment":[{"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/media?parent=8007"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/categories?post=8007"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/tags?post=8007"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}