{"id":386,"date":"2020-08-13T15:02:04","date_gmt":"2020-08-13T15:02:04","guid":{"rendered":"https:\/\/data-science.gotoauthority.com\/2020\/08\/13\/5-different-ways-to-load-data-in-python\/"},"modified":"2020-08-13T15:02:04","modified_gmt":"2020-08-13T15:02:04","slug":"5-different-ways-to-load-data-in-python","status":"publish","type":"post","link":"https:\/\/wealthrevelation.com\/data-science\/2020\/08\/13\/5-different-ways-to-load-data-in-python\/","title":{"rendered":"5 Different Ways to Load Data in Python"},"content":{"rendered":"<div id=\"post-\">\n<div class=\"author-link\"><b>By <a href=\"https:\/\/www.kdnuggets.com\/author\/ahmad-anis\" title=\"Posts by Ahmad Anis\" rel=\"author\">Ahmad Anis<\/a>, Machine learning and Data Science Student.<\/b><\/div>\n<p><img class=\"aligncenter size-large\" src=\"https:\/\/miro.medium.com\/max\/875\/1*_PWjBWF0UuKNCRuCreyXTA.png\" width=\"90%\"><\/p>\n<p>As a beginner, you might only know a single way to load data (normally CSV), which is to read it with the\u00a0<em>pandas.read_csv<\/em>\u00a0function. It is one of the most mature and robust functions, but the other approaches are also very helpful and will definitely come in handy at times.<\/p>\n<p>The ways that I am going to discuss are:<\/p>\n<ul>\n<li>\n<strong>Manual\u00a0<\/strong>function<\/li>\n<li>\n<strong>loadtxt\u00a0<\/strong>function<\/li>\n<li>\n<strong>genfromtxt\u00a0<\/strong>function<\/li>\n<li>\n<strong>read_csv\u00a0<\/strong>function<\/li>\n<li><strong>Pickle<\/strong><\/li>\n<\/ul>\n<p>The dataset that we are going to use can be found\u00a0<a href=\"http:\/\/eforexcel.com\/wp\/downloads-18-sample-csv-files-data-sets-for-testing-sales\/\" target=\"_blank\" rel=\"noopener noreferrer\"><strong>here<\/strong><\/a>. 
It is named 100-Sales-Records.<\/p>\n<p><strong>Imports<\/strong><\/p>\n<p>We will use the NumPy, Pandas, and Pickle packages, so let\u2019s import them.<\/p>\n<div>\n<pre>import <strong>numpy<\/strong> as np\r\nimport <strong>pandas<\/strong> as pd\r\nimport <strong>pickle<\/strong>\r\n\r\n<\/pre>\n<\/div>\n<p>\u00a0<\/p>\n<h3>1. Manual Function<\/h3>\n<p>\u00a0<\/p>\n<p>This is the most difficult approach, as you have to design a custom function that can load the data for you. You have to work with Python\u2019s normal file-handling concepts and use them to read a\u00a0<em>.csv<\/em>\u00a0file.<\/p>\n<p>Let\u2019s do that on the 100 Sales Records file.<\/p>\n<div>\n<pre>def <strong>load_csv<\/strong>(filepath):\r\n    data = []\r\n    col = []\r\n    checkcol = <strong>False<\/strong>\r\n    with open(filepath) as f:\r\n        for val in f.readlines():\r\n            val = val.replace(\"\\n\", \"\")  # strip the line terminator\r\n            val = val.split(',')\r\n            if checkcol is False:\r\n                <strong>col<\/strong> = val\r\n                checkcol = True\r\n            else:\r\n                data.append(val)\r\n    df = pd.DataFrame(data=data, columns=col)\r\n    return df\r\n\r\n<\/pre>\n<\/div>\n<p>\u00a0<\/p>\n<p>Hmmm, what is this? It seems like a complex bit of code! Let\u2019s break it down step by step so you know what is happening and can apply similar logic to read a\u00a0<em>.csv<\/em>\u00a0file of your own.<\/p>\n<p>Here, I have created a\u00a0<em>load_csv<\/em>\u00a0function that takes as an argument the path of the file you want to read.<\/p>\n<p>I have a list named\u00a0<em>data<\/em>, which is going to hold the data of the CSV file, and another list,\u00a0<em>col<\/em>, which is going to hold the column names. 
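<\/p>
<p>To see the walk-through in action, here is a self-contained sketch of the same manual logic run on a tiny invented CSV (the file name and sample values below are made up for illustration, not part of the real dataset):<\/p>

```python
import pandas as pd

def load_csv(filepath):
    # Same manual logic as above: first line -> column names, rest -> rows.
    data = []
    col = []
    checkcol = False
    with open(filepath) as f:
        for val in f.readlines():
            val = val.replace("\n", "")  # strip the line terminator
            val = val.split(',')
            if checkcol is False:
                col = val
                checkcol = True
            else:
                data.append(val)
    return pd.DataFrame(data=data, columns=col)

# Tiny made-up sample standing in for the real sales file
with open('tiny_sales.csv', 'w') as f:
    f.write("Region,Units\nAsia,10\nEurope,20\n")

df = load_csv('tiny_sales.csv')
print(df.shape)          # (2, 2)
print(list(df.columns))  # ['Region', 'Units']
```

<p>Note that every value comes back as a string; the manual approach does no type conversion for you.<\/p>
<p>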
Now, after inspecting the CSV manually, I know that the column names are in the first row, so in the first iteration, I have to store the data of the first row in\u00a0<em>col<\/em>\u00a0and the remaining rows in\u00a0<em>data<\/em>.<\/p>\n<p>To detect the first iteration, I have used a Boolean variable named\u00a0<em>checkcol<\/em>, which starts as False. While it is False (i.e., in the first iteration), the first line is stored in\u00a0<em>col<\/em>\u00a0and\u00a0<em>checkcol<\/em>\u00a0is set to True, so all subsequent values are stored in the\u00a0<em>data<\/em>\u00a0list.<\/p>\n<p><strong><em>Logic<\/em><\/strong><\/p>\n<p>The main logic here is that I have iterated over the file using the\u00a0<em>readlines()<\/em>\u00a0function in Python. This function returns a list that contains all the lines inside a file.<\/p>\n<p>When reading the lines, each one ends with a\u00a0<em>\\n<\/em>\u00a0character, which is the line-terminating character, so in order to remove it, I have used the\u00a0<em>str.replace<\/em>\u00a0function.<\/p>\n<p>As it is a\u00a0<em>.csv<\/em>\u00a0file, I have to separate values based on\u00a0<em>commas<\/em>, so I split each string on a\u00a0<em>,<\/em>\u00a0using\u00a0<em>str.split(&#8216;,&#8217;)<\/em>. In the first iteration, I store the first row, which contains the column names, in the list\u00a0<em>col<\/em>. 
Then I append all the remaining rows to my list named\u00a0<em>data<\/em>.<\/p>\n<p>To present the data more readably, I have returned it in dataframe format, because a dataframe is easier to read than a numpy array or a Python list.<\/p>\n<p><strong><em>Output<\/em><\/strong><\/p>\n<div>\n<pre>myData = load_csv('100 Sales Records.csv')\r\nprint(myData.head())\r\n\r\n<\/pre>\n<\/div>\n<p>\u00a0<\/p>\n<p><img class=\"aligncenter size-large\" src=\"https:\/\/miro.medium.com\/max\/875\/1*NWPJGsz7z847MkkBSKPciw.png\" width=\"90%\"><\/p>\n<p><em>Data from Custom Function.<\/em><\/p>\n<p><strong>Pros and Cons<\/strong><\/p>\n<p>The important benefit is that you have full flexibility and control over the file structure: you can read the file in whatever format and way you want and store it however you like.<\/p>\n<p>You can also read files that do not have a standard structure, using your own logic.<\/p>\n<p>The important drawbacks are that it is complex to write, especially for standard file types that libraries can already read easily, and that you have to hard-code the logic, which requires trial and error.<\/p>\n<p>You should only use it when the file is not in a standard format, or when you want the flexibility to read the file in a way that is not available through libraries.<\/p>\n<p>\u00a0<\/p>\n<h3>2. Numpy.loadtxt function<\/h3>\n<p>\u00a0<\/p>\n<p>This is a built-in function in NumPy, a famous numerical library in Python. It is a really simple function for loading data, and it is very useful for reading data in which every value has the same datatype.<\/p>\n<p>When the data is more complex, it is hard to read using this function, but when the file is easy and simple, this function is really powerful.<\/p>\n<p>To get data of a single type, you can download\u00a0<a href=\"https:\/\/docs.google.com\/spreadsheets\/d\/16mgiYbNz-XaW_r6GXUy2cJ0hy2E-lxwFLaVXAYIAOj0\/edit?usp=sharing\" target=\"_blank\" rel=\"noopener noreferrer\">this<\/a>\u00a0dummy dataset. 
Let\u2019s jump to the code.<\/p>\n<div>\n<pre>df = np.<strong>loadtxt<\/strong>('convertcsv.csv', delimiter=',')\r\n\r\n<\/pre>\n<\/div>\n<p>\u00a0<\/p>\n<p>Here we simply used the\u00a0<em>loadtxt<\/em>\u00a0function and passed in\u00a0<em>delimiter<\/em>\u00a0as\u00a0<em>&#8216;,&#8217;<\/em>\u00a0because this is a CSV file.<\/p>\n<p>Now if we print\u00a0<em>df<\/em>, we will see our data in pretty decent numpy arrays that are ready to use.<\/p>\n<p>\u00a0<\/p>\n<p><img class=\"aligncenter size-large\" src=\"https:\/\/miro.medium.com\/max\/541\/1*Ntr88bt3qHXBeb9rSSySWQ.png\" width=\"90%\"><\/p>\n<p>We have printed only the first 5 rows due to the size of the data.<\/p>\n<p><strong>Pros and Cons<\/strong><\/p>\n<p>An important benefit of this function is that you can quickly load data from a file into numpy arrays.<\/p>\n<p>Its drawbacks are that it cannot handle columns with different data types or missing values in your data.<\/p>\n<p>\u00a0<\/p>\n<h3>3. Numpy.genfromtxt()<\/h3>\n<p>\u00a0<\/p>\n<p>We will use the \u2018100 Sales Records.csv\u2019 dataset from our first example to demonstrate that this function can handle multiple data types.<\/p>\n<p>Let\u2019s jump to the code.<\/p>\n<div>\n<pre>data = np.<strong>genfromtxt<\/strong>('100 Sales Records.csv', delimiter=',')\r\n\r\n<\/pre>\n<\/div>\n<p>\u00a0<\/p>\n<p>And to see it more clearly, we can view it in dataframe format, i.e.,<\/p>\n<p>\u00a0<\/p>\n<p><img class=\"aligncenter size-large\" src=\"https:\/\/miro.medium.com\/max\/875\/1*fZX3GhcpuRLYSviLGVGPkQ.png\" width=\"90%\"><\/p>\n<p>Wait, what is this? Oh, it has skipped all the columns with string data types. How do we deal with that?<\/p>\n<p>Just add another parameter,\u00a0<em>dtype<\/em>, and set it to None, which means that it has to take care of the datatype of each column itself. 
It will not convert the whole dataset to a single dtype.<\/p>\n<div>\n<pre>data = np.<strong>genfromtxt<\/strong>('100 Sales Records.csv', delimiter=',', <strong><em>dtype=None<\/em><\/strong>)\r\n\r\n<\/pre>\n<\/div>\n<p>\u00a0<\/p>\n<p>And then for the output,<\/p>\n<div>\n<pre>&gt;&gt;&gt; pd.DataFrame(data).head()\r\n\r\n<\/pre>\n<\/div>\n<p>\u00a0<\/p>\n<p><img class=\"aligncenter size-large\" src=\"https:\/\/miro.medium.com\/max\/875\/1*vqYljwx6dTcQwv3weXrprQ.png\" width=\"90%\"><\/p>\n<p>Much better than the first attempt, but here our column titles have become a row. To make them column titles, we have to add another parameter,\u00a0<em>names<\/em>, and set it to\u00a0<em>True<\/em>, so it will take the first row as the column titles.<\/p>\n<p>i.e.,<\/p>\n<div>\n<pre>data = np.<strong>genfromtxt<\/strong>('100 Sales Records.csv', delimiter=',', dtype=None, <em><strong>names=True<\/strong><\/em>)\r\n\r\n<\/pre>\n<\/div>\n<p>\u00a0<\/p>\n<p>and we can print it as<\/p>\n<div>\n<pre>&gt;&gt;&gt; pd.DataFrame(data).head()\r\n\r\n<\/pre>\n<\/div>\n<p>\u00a0<\/p>\n<p><img class=\"aligncenter size-large\" src=\"https:\/\/miro.medium.com\/max\/875\/1*e6Goc2skLT8GpbnBqDYfrQ.png\" width=\"90%\"><\/p>\n<p>And here we can see that it has successfully added the column names to the dataframe.<\/p>\n<p>Now the last problem is that the string columns are not actual strings; they are in\u00a0<em>bytes\u00a0<\/em>format. 
You can see a\u00a0<em>b&#8217;<\/em>\u00a0prefix before every string, so to handle them, we have to decode them using the utf-8 encoding.<\/p>\n<div>\n<pre>df3 = np.genfromtxt('100 Sales Records.csv', delimiter=',', dtype=None, names=True, <em><strong>encoding='utf-8'<\/strong><\/em>)\r\n\r\n<\/pre>\n<\/div>\n<p>\u00a0<\/p>\n<p>This will return our dataframe in the desired form.<\/p>\n<p>\u00a0<\/p>\n<p><img class=\"aligncenter size-large\" src=\"https:\/\/miro.medium.com\/max\/875\/1*NbNP7JLb6s5RDGxLoftxQw.png\" width=\"90%\"><\/p>\n<p>\u00a0<\/p>\n<h3>4. Pandas.read_csv()<\/h3>\n<p>\u00a0<\/p>\n<p>Pandas is a very popular data manipulation library, and it is very commonly used. One of its most important and\u00a0<strong>mature\u00a0<\/strong>functions is\u00a0<em>read_csv()<\/em>, which can read any\u00a0<strong>.csv\u00a0<\/strong>file very easily and help us manipulate it. Let\u2019s use it on our 100-Sales-Records dataset.<\/p>\n<p>This function is very popular due to its ease of use; you can compare the code below with our previous examples to see how much simpler it is.<\/p>\n<div>\n<pre>&gt;&gt;&gt; pdDf = pd.read_csv('100 Sales Records.csv')\r\n&gt;&gt;&gt; pdDf.head()\r\n\r\n<\/pre>\n<\/div>\n<p>\u00a0<\/p>\n<p><img class=\"aligncenter size-large\" src=\"https:\/\/miro.medium.com\/max\/875\/1*diKjsLA3hYVxl5wizlK0Rw.png\" width=\"90%\"><\/p>\n<p>And guess what? We are done. This was actually so simple and easy to use. 
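<\/p>
<p>As a quick self-contained illustration of how little code this takes, here is a sketch that runs\u00a0<em>read_csv<\/em>\u00a0on a small made-up CSV held in memory (the column names and values are invented; io.StringIO simply stands in for a file on disk, since read_csv accepts any file-like object):<\/p>

```python
import io
import pandas as pd

# Made-up CSV text standing in for a file such as '100 Sales Records.csv'
csv_text = "Region,Country,Units Sold\nAsia,Japan,100\nEurope,France,250\n"

# io.StringIO behaves like an open text file, so read_csv parses it directly,
# inferring the header row and the numeric dtype of 'Units Sold' automatically
pdDf = pd.read_csv(io.StringIO(csv_text))
print(pdDf.head())
print(pdDf['Units Sold'].sum())  # 350
```

<p>Notice that, unlike the manual function, the numeric column is already a real number type, so arithmetic works out of the box.<\/p>
<p>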
<em>pandas.read_csv<\/em>\u00a0definitely offers a lot of other parameters to tune how our data set is read; for example, our\u00a0<em>convertcsv.csv<\/em>\u00a0file had no column names, so we can read it as<\/p>\n<div>\n<pre>&gt;&gt;&gt; newdf = pd.read_csv('convertcsv.csv', <strong>header=None<\/strong>)\r\n&gt;&gt;&gt; newdf.head()\r\n\r\n<\/pre>\n<\/div>\n<p>\u00a0<\/p>\n<p><img loading=\"lazy\" class=\"aligncenter size-large\" src=\"https:\/\/miro.medium.com\/max\/304\/1*bP8GdiRUkFJLFy2M4BuL8w.png\" width=\"243\" height=\"195\"><\/p>\n<p>And we can see that it has read the\u00a0<em>csv\u00a0<\/em>file without treating the first row as a header. You can explore all the other parameters in the official docs\u00a0<a href=\"https:\/\/pandas.pydata.org\/docs\/reference\/api\/pandas.read_csv.html#pandas.read_csv\" target=\"_blank\" rel=\"noopener noreferrer\">here<\/a>.<\/p>\n<p>\u00a0<\/p>\n<h3>5. Pickle<\/h3>\n<p>\u00a0<\/p>\n<p>When your data is not in a good, human-readable format, you can use pickle to save it in a binary format. 
Then you can easily reload it using the pickle library.<\/p>\n<p>We will take our 100-Sales-Records CSV file, first save it in pickle format, and then read it back.<\/p>\n<div>\n<pre><strong>with<\/strong> open('test.pkl', 'wb') <strong>as<\/strong> f:\r\n    pickle.dump(pdDf, f)\r\n\r\n<\/pre>\n<\/div>\n<p>\u00a0<\/p>\n<p>This will create a new file,\u00a0<em>test.pkl<\/em>, which contains our\u00a0<em>pdDf<\/em>\u00a0DataFrame from the\u00a0<strong>Pandas<\/strong>\u00a0section.<\/p>\n<p>Now, to open it using pickle, we just have to use the\u00a0<em>pickle.load<\/em>\u00a0function.<\/p>\n<div>\n<pre><strong>with<\/strong> open(\"test.pkl\", \"rb\") <strong>as<\/strong> f:\r\n    d4 = pickle.load(f)\r\n\r\n&gt;&gt;&gt; d4.head()\r\n\r\n<\/pre>\n<\/div>\n<p>\u00a0<\/p>\n<p><img class=\"aligncenter size-large\" src=\"https:\/\/miro.medium.com\/max\/875\/1*cPto18xcdRPusS24Y9LDug.png\" width=\"90%\"><\/p>\n<p>And here we have successfully loaded the data from a pickle file in\u00a0<em>pandas.DataFrame\u00a0<\/em>format.<\/p>\n<p>\u00a0<\/p>\n<h3>Learning Outcomes<\/h3>\n<p>\u00a0<\/p>\n<p>You are now aware of 5 different ways to load data files in Python, which can help you load a data set in your day-to-day 
projects.<\/p>\n<p>\u00a0<\/p>\n<p><b>Related:<\/b><\/p>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>https:\/\/www.kdnuggets.com\/2020\/08\/5-different-ways-load-data-python.html<\/p>\n","protected":false},"author":0,"featured_media":387,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[2],"tags":[],"_links":{"self":[{"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/posts\/386"}],"collection":[{"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/types\/post"}],"replies":[{"embeddable":true,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/comments?post=386"}],"version-history":[{"count":0,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/posts\/386\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/media\/387"}],"wp:attachment":[{"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/media?parent=386"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/categories?post=386"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/tags?post=386"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}