{"id":1881,"date":"2020-09-24T16:20:47","date_gmt":"2020-09-24T16:20:47","guid":{"rendered":"https:\/\/data-science.gotoauthority.com\/2020\/09\/24\/introduction-to-time-series-analysis-in-python\/"},"modified":"2020-09-24T16:20:47","modified_gmt":"2020-09-24T16:20:47","slug":"introduction-to-time-series-analysis-in-python","status":"publish","type":"post","link":"https:\/\/wealthrevelation.com\/data-science\/2020\/09\/24\/introduction-to-time-series-analysis-in-python\/","title":{"rendered":"Introduction to Time Series Analysis in Python"},"content":{"rendered":"<div id=\"post-\">\n<div class=\"author-link\"><b>By <a href=\"https:\/\/www.kdnuggets.com\/author\/ahmad-anis\" title=\"Posts by Ahmad Anis\" rel=\"author\">Ahmad Anis<\/a>, Machine learning and Data Science Student.<\/b><\/div>\n<p>According to Wikipedia:<\/p>\n<blockquote>\n<p data-selectable-paragraph=\"\"><em>A\u00a0<strong>time series<\/strong>\u00a0is a series of data points indexed (or listed or graphed) in time order. Most commonly, a time series is a sequence taken at successive equally spaced points in time. Thus it is a sequence of discrete-time data. Examples of time series are heights of ocean tides, counts of sunspots, and the daily closing value of the Dow Jones Industrial Average.<\/em><\/p>\n<\/blockquote>\n<p data-selectable-paragraph=\"\">So, in practice, a time series is any dataset whose observations are taken at successive, usually equally spaced, points in time. For example,\u00a0<a href=\"https:\/\/fred.stlouisfed.org\/series\/UMTMVS\" target=\"_blank\" rel=\"noopener noreferrer\">this<\/a>\u00a0dataset is the\u00a0<strong>Value of Manufacturers\u2019 Shipments for All Manufacturing Industries.<\/strong><\/p>\n<p data-selectable-paragraph=\"\">We will see some important points that can help us in analyzing any time-series dataset. 
These are:<\/p>\n<ul>\n<li><strong>Loading time series dataset correctly in Pandas<\/strong><\/li>\n<li><strong>Indexing in Time-Series Data<\/strong><\/li>\n<li><strong>Time-Resampling using Pandas<\/strong><\/li>\n<li><strong>Rolling Time Series<\/strong><\/li>\n<li><strong>Plotting Time-series Data using Pandas<\/strong><\/li>\n<\/ul>\n<p>\u00a0<\/p>\n<h3>Loading time series dataset correctly in Pandas<\/h3>\n<p>\u00a0<\/p>\n<p data-selectable-paragraph=\"\">Let\u2019s load the dataset mentioned above in pandas.<\/p>\n<div>\n<pre>import pandas as pd\r\n\r\ndf = pd.read_csv('Data\/UMTMVS.csv')\r\ndf.head()\r\n\r\n<\/pre>\n<\/div>\n<p>\u00a0<\/p>\n<p><img loading=\"lazy\" class=\"aligncenter size-large\" src=\"https:\/\/miro.medium.com\/max\/323\/1*9rmQ4-WQA1InSbEqxrtyRw.png\" width=\"258\" height=\"165\"><\/p>\n<p>We want the \u201cDATE\u201d column to be our index, but simply reading the file does not do that, so we have to pass an extra parameter.<\/p>\n<div>\n<pre>df = pd.read_csv('Data\/UMTMVS.csv', index_col='DATE')\r\ndf.head()\r\n\r\n<\/pre>\n<\/div>\n<p>\u00a0<\/p>\n<p data-selectable-paragraph=\"\"><img loading=\"lazy\" class=\"aligncenter size-large\" src=\"https:\/\/miro.medium.com\/max\/281\/1*SInfiOITQ7HTYQPmVE8RMw.png\" width=\"225\" height=\"197\"><\/p>\n<p data-selectable-paragraph=\"\">Great, now we have set our DATE column as the index, but let\u2019s check its data type to see whether pandas is treating the index as plain objects or as its built-in DateTime datatype.<\/p>\n<p>\u00a0<\/p>\n<p><img class=\"aligncenter size-large\" src=\"https:\/\/miro.medium.com\/max\/860\/1*VmneMlnOdI0I0ejI50eZMQ.png\" width=\"90%\"><\/p>\n<p>Here we can see that Pandas is treating our index column as a simple object, so let\u2019s convert it into DateTime. 
We can do it as follows:<\/p>\n<div>\n<pre>df.index = pd.to_datetime(df.index)\r\ndf.index\r\n\r\n<\/pre>\n<\/div>\n<p>\u00a0<\/p>\n<p><img class=\"aligncenter size-large\" src=\"https:\/\/miro.medium.com\/max\/835\/1*A9mBSjyq1KGyycHEOYBVdQ.png\" width=\"90%\"><\/p>\n<p>Now we can see that the\u00a0<em>dtype<\/em>\u00a0of our index is\u00a0<em>datetime64[ns]<\/em>. The \u201c[ns]\u201d shows that it is precise to the nanosecond. If we only need day or month granularity, we can convert the index to periods, for example with\u00a0<em>df.index.to_period('M')<\/em>.<\/p>\n<p data-selectable-paragraph=\"\">Alternatively, to avoid all this fuss, we can load the data in a single line of code using Pandas as follows.<\/p>\n<div>\n<pre>df = pd.read_csv('Data\/UMTMVS.csv', index_col='DATE', parse_dates=True)\r\ndf.index\r\n\r\n<\/pre>\n<\/div>\n<p>\u00a0<\/p>\n<p><img class=\"aligncenter size-large\" src=\"https:\/\/miro.medium.com\/max\/840\/1*Xgt5z8mH2cHJND1x3-8woA.png\" width=\"90%\"><\/p>\n<p>Here we have added\u00a0<em>parse_dates=True<\/em>, so pandas will automatically parse our\u00a0<em>index\u00a0<\/em>as dates.<\/p>\n<p>\u00a0<\/p>\n<h3>Indexing in Time-Series Data<\/h3>\n<p>\u00a0<\/p>\n<p data-selectable-paragraph=\"\">Let\u2019s say I want to get all the data from\u00a0<em>2000-01-01<\/em>\u00a0till\u00a0<em>2015-01-01<\/em>. In order to do this, we can simply use indexing in Pandas like this.<\/p>\n<div>\n<pre>df.loc['2000-01-01':'2015-01-01']\r\n\r\n<\/pre>\n<\/div>\n<p>\u00a0<\/p>\n<p><img loading=\"lazy\" class=\"aligncenter size-large\" src=\"https:\/\/miro.medium.com\/max\/306\/1*63uaub8dXo-uRcymxDVV2w.png\" width=\"245\" height=\"398\"><\/p>\n<p>Here we have data for all the months from\u00a0<em>2000-01-01<\/em>\u00a0till\u00a0<em>2015-01-01<\/em>.<\/p>\n<p data-selectable-paragraph=\"\">Let\u2019s say we want the data for only the first month of each year from\u00a0<em>1992-01-01<\/em>\u00a0to\u00a0<em>2000-01-01<\/em>. 
We can do this by adding a step argument at the end, just as we do when we slice a list in Python.<\/p>\n<p data-selectable-paragraph=\"\">The syntax for this in Pandas is\u00a0<em>[&#8216;starting date&#8217;:&#8217;ending date&#8217;:step].<\/em> Our dataset is monthly, so we want every 12th row from 1992 till 2000. Note that the step counts rows, not calendar time, so this relies on the data having exactly one row per month. We can do it as follows.<\/p>\n<div>\n<pre>df.loc['1992-01-01':'2000-01-01':12]\r\n\r\n<\/pre>\n<\/div>\n<p>\u00a0<\/p>\n<p><img loading=\"lazy\" class=\"aligncenter size-large\" src=\"https:\/\/miro.medium.com\/max\/190\/1*aR5EL3ALt7PQreY7mLMa-g.png\" width=\"152\" height=\"247\"><\/p>\n<p>And here we get the values of the first month of every year.<\/p>\n<p>\u00a0<\/p>\n<h3>Time-Resampling using Pandas<\/h3>\n<p>\u00a0<\/p>\n<p data-selectable-paragraph=\"\">Think of resampling as a\u00a0<em>groupby()<\/em>, where we group by a column and then apply an aggregate function to each group. With a time-series index, we instead group by a\u00a0<em>rule\u00a0<\/em>that specifies whether we want to resample by \u201cYears\u201d, \u201cMonths\u201d, \u201cDays\u201d, or anything else.<\/p>\n<p data-selectable-paragraph=\"\">Some important rules with which we can resample our time series index are:<\/p>\n<ul>\n<li>M = Month End<\/li>\n<li>A = Year-End<\/li>\n<li>MS = Month Start<\/li>\n<li>AS = Year Start<\/li>\n<\/ul>\n<p data-selectable-paragraph=\"\">and so on. You can check the detailed aliases in the\u00a0<a href=\"https:\/\/pandas.pydata.org\/pandas-docs\/stable\/user_guide\/timeseries.html#offset-aliases\" target=\"_blank\" rel=\"noopener noreferrer\">official documentation<\/a>. Recent pandas versions deprecate some of these spellings in favor of \u201cME\u201d, \u201cYE\u201d, and \u201cYS\u201d, so check the list for the version you have installed.<\/p>\n<p data-selectable-paragraph=\"\">Let\u2019s apply this to our dataset.<\/p>\n<p data-selectable-paragraph=\"\">Let\u2019s say we want to calculate the mean value of shipments for every year, labeled by the start of the year. 
We can do this by calling resample with\u00a0<em>rule=&#8217;AS&#8217;<\/em>\u00a0for Year Start and then calling the aggregate function\u00a0<em>mean\u00a0<\/em>on it.<\/p>\n<p data-selectable-paragraph=\"\">We can see the\u00a0<em>head\u00a0<\/em>of it as follows.<\/p>\n<div>\n<pre>df.resample(rule='AS').mean().head()\r\n\r\n<\/pre>\n<\/div>\n<p>\u00a0<\/p>\n<p><img loading=\"lazy\" class=\"aligncenter size-large\" src=\"https:\/\/miro.medium.com\/max\/221\/1*2tcuZT2gcvUwBAaYOkFVcw.png\" width=\"177\" height=\"154\"><\/p>\n<p>Here we have grouped the rows by year (remember that \u201cAS\u201d labels each group with the start of its year), then applied the\u00a0<em>mean\u00a0<\/em>function, so now we have the mean shipment value of each year.<\/p>\n<p data-selectable-paragraph=\"\">We can even use our own custom functions with\u00a0<em>resample<\/em>. Let\u2019s say we want to calculate the sum of every year with a custom function. We can do that as follows.<\/p>\n<div>\n<pre>def sum_of_year(year_val):\r\n    return year_val.sum()\r\n\r\n<\/pre>\n<\/div>\n<p>\u00a0<\/p>\n<p data-selectable-paragraph=\"\">And then we can apply it via resampling as follows.<\/p>\n<div>\n<pre>df.resample(rule='AS').apply(sum_of_year)\r\n\r\n<\/pre>\n<\/div>\n<p>\u00a0<\/p>\n<p data-selectable-paragraph=\"\">We can confirm that it is working correctly by comparing it to the built-in\u00a0<em>sum<\/em>:<\/p>\n<div>\n<pre>df.resample(rule='AS').apply(sum_of_year) == df.resample(rule='AS').sum()\r\n\r\n<\/pre>\n<\/div>\n<p>\u00a0<\/p>\n<p><img loading=\"lazy\" class=\"aligncenter size-large\" src=\"https:\/\/miro.medium.com\/max\/188\/1*6Gswct057c2HK7GTETLBoQ.png\" width=\"150\" height=\"418\"><\/p>\n<p>And both are equal.<\/p>\n<p>\u00a0<\/p>\n<h3>Rolling Time Series<\/h3>\n<p>\u00a0<\/p>\n<p data-selectable-paragraph=\"\">Rolling is similar to time resampling, but in rolling we take a window of any size and apply a function to it. 
In simple words, we can say that a rolling window of size\u00a0<em>k<\/em>\u00a0means\u00a0<em>k<\/em> consecutive values.<\/p>\n<p data-selectable-paragraph=\"\">Let\u2019s see an example. If we want to calculate the rolling average over 10 consecutive rows (10 months in our monthly dataset), we can do it as follows.<\/p>\n<div>\n<pre>df.rolling(window=10).mean().head(20) # head to see first 20 values\r\n\r\n<\/pre>\n<\/div>\n<p>\u00a0<\/p>\n<p><img loading=\"lazy\" class=\"aligncenter size-large\" src=\"https:\/\/miro.medium.com\/max\/198\/1*gvfjAiyR-Ear1ry8fjhbBQ.png\" width=\"158\" height=\"493\"><\/p>\n<p>Now here, we can see that the first 9 values are\u00a0<em>NaN\u00a0<\/em>because there are not yet 10 values available to fill the window. The rolling mean starts at the 10th value and goes on from there.<\/p>\n<p data-selectable-paragraph=\"\">Similarly, we can check out the maximum value over a window of 30 rows as follows.<\/p>\n<div>\n<pre>df.rolling(window=30).max()[30:].head(20) # head is just to check top 20 values\r\n\r\n<\/pre>\n<\/div>\n<p>\u00a0<\/p>\n<p><img loading=\"lazy\" class=\"aligncenter size-large\" src=\"https:\/\/miro.medium.com\/max\/183\/1*dTYa3jpWT7MgR4Q0cRD3fQ.png\" width=\"146\" height=\"495\"><\/p>\n<p>Note that here I have added<em>\u00a0[30:]\u00a0<\/em>only because the first 29 entries do not have a full window over which to calculate the\u00a0<em>max\u00a0<\/em>function, so they are\u00a0<em>NaN<\/em>. To show 20 computed values in the screenshot, I skipped the first 30 rows, but you do not need to do this in practice.<\/p>\n<p data-selectable-paragraph=\"\">And here, we can see that we have the maximum values over a rolling window of 30 rows.<\/p>\n<p>\u00a0<\/p>\n<h3>Plotting Time-series Data using Pandas<\/h3>\n<p>\u00a0<\/p>\n<p data-selectable-paragraph=\"\">Interestingly, Pandas offers a good set of built-in visualization tools and tricks which can help you in visualizing any kind of data.<\/p>\n<p 
data-selectable-paragraph=\"\">A basic line plot can be obtained just by calling the\u00a0<em>.plot<\/em>\u00a0function on the dataframe.<\/p>\n<p>\u00a0<\/p>\n<p><img class=\"aligncenter size-large\" src=\"https:\/\/miro.medium.com\/max\/611\/1*EOKZ2l_FPm4oD44ayT3JvQ.png\" width=\"90%\"><\/p>\n<p>And here, we can see the value of Manufacturers\u2019 Shipments over time. Notice how nicely Pandas has handled our x-axis, which is our Time Series Index.<\/p>\n<p data-selectable-paragraph=\"\">We can further modify it by adding a title and a y-label using\u00a0<em>.set<\/em>\u00a0on our plot.<\/p>\n<div>\n<pre>ax = df.plot()\r\nax.set(title='Value of Manufacturers Shipments', ylabel='Value')\r\n\r\n<\/pre>\n<\/div>\n<p>\u00a0<\/p>\n<p><img class=\"aligncenter size-large\" src=\"https:\/\/miro.medium.com\/max\/643\/1*YFoMmVwdcunEsWoWZ_di8g.png\" width=\"90%\"><\/p>\n<p>Similarly, we can change the plot size via the\u00a0<em>figsize\u00a0<\/em>parameter in\u00a0<em>.plot<\/em>.<\/p>\n<div>\n<pre>ax = df.plot(figsize=(12,6))\r\nax.set(title='Value of Manufacturers Shipments', ylabel='Value')\r\n\r\n<\/pre>\n<\/div>\n<p>\u00a0<\/p>\n<p><img class=\"aligncenter size-large\" src=\"https:\/\/miro.medium.com\/max\/875\/1*lN7Gw9axMGHMGp7N4x-Lmg.png\" width=\"90%\"><\/p>\n<p>Let\u2019s now plot the mean value of every year. 
We can do it by calling\u00a0<em>.plot<\/em>\u00a0after resampling with the rule \u2018AS\u2019, which is the rule for the start of the year.<\/p>\n<div>\n<pre>ax = df.resample(rule='AS').mean().plot(figsize=(12,6))\r\nax.set(title='Average of Manufacturers Shipments', ylabel='Value of Mean of Starting of Year')\r\n\r\n<\/pre>\n<\/div>\n<p>\u00a0<\/p>\n<p><img class=\"aligncenter size-large\" src=\"https:\/\/miro.medium.com\/max\/863\/1*1VMmZFx2pAepl0DTgAoX9Q.png\" width=\"90%\"><\/p>\n<p>We can also draw a bar plot of the yearly mean by calling\u00a0<em>.bar<\/em>\u00a0on top of\u00a0<em>.plot<\/em>.<\/p>\n<div>\n<pre>ax = df.resample(rule='AS').mean().plot.bar(figsize=(12,6))\r\nax.set(title='Average of Manufacturers Shipments', ylabel='Value of Mean of Starting of Year');\r\n\r\n<\/pre>\n<\/div>\n<p>\u00a0<\/p>\n<p><img class=\"aligncenter size-large\" src=\"https:\/\/miro.medium.com\/max\/855\/1*VH4zF42AAxJYmY9TaJA0zg.png\" width=\"90%\"><\/p>\n<p>Similarly, we can plot the rolling mean alongside the monthly mean as follows.<\/p>\n<div>\n<pre>ax = df['UMTMVS'].resample(rule='MS').mean().plot(figsize=(15,8), label='Resample MS')\r\nax.autoscale(tight=True)\r\ndf.rolling(window=30).mean()['UMTMVS'].plot(label='Rolling window=30')\r\n\r\nax.set(ylabel='Value of Mean of Starting of Month',title='Average of Manufacturers Shipments')\r\nax.legend()\r\n\r\n<\/pre>\n<\/div>\n<p>\u00a0<\/p>\n<p data-selectable-paragraph=\"\">Here, first, we have plotted the monthly mean via resampling with rule = \u201cMS\u201d (Month Start). Then we have set\u00a0<em>autoscale(tight=True)<\/em>, which removes the empty margins of the plot. Then we have plotted the rolling mean over a window of 30 rows. Remember that the first 29 values are null, and you will observe this in the plot. 
Then we have set the label, title, and legend.<\/p>\n<p data-selectable-paragraph=\"\">The output of this plot is<\/p>\n<p data-selectable-paragraph=\"\"><img class=\"aligncenter size-large\" src=\"https:\/\/miro.medium.com\/max\/875\/1*CPOUxtmQOLM-2YhbfP3ZNw.png\" width=\"90%\"><\/p>\n<p>Notice how the first values are missing from the rolling average and, because it is a rolling average, how smooth it is compared with the resampled one.<\/p>\n<p data-selectable-paragraph=\"\">Similarly, you can plot for specific dates as per your choice. Let\u2019s say I want to plot the yearly maximum values from 1999 till 2014. I can do it as follows.<\/p>\n<div>\n<pre>ax = df['UMTMVS'].resample(rule='AS').max().plot(xlim=[\"1999-01-01\",\"2014-01-01\"],ylim=[280000,540000], figsize=(12,7))\r\nax.yaxis.grid(True)\r\nax.xaxis.grid(True)\r\n<\/pre>\n<\/div>\n<p>\u00a0<\/p>\n<p data-selectable-paragraph=\"\">Here, we have specified the\u00a0<em>xlim\u00a0<\/em>and\u00a0<em>ylim<\/em>. See how I have added the dates in <em>xlim<\/em>. The main pattern is\u00a0<em>xlim=[&#8216;starting date&#8217;, &#8216;ending date&#8217;]<\/em>.<\/p>\n<p><img class=\"aligncenter size-large\" src=\"https:\/\/miro.medium.com\/max\/875\/1*hnX75eO9p1Q-Eqp6OG_gFw.png\" width=\"90%\"><\/p>\n<p>And here, you can see the yearly maximum values from 1999 till 2014.<\/p>\n<p>\u00a0<\/p>\n<h3>Learning Outcomes<\/h3>\n<p>\u00a0<\/p>\n<p data-selectable-paragraph=\"\">This brings us to the end of this article. 
Hopefully, you are now aware of the basics of:<\/p>\n<ul>\n<li><strong>Loading time series dataset correctly in Pandas<\/strong><\/li>\n<li><strong>Indexing in Time-Series Data<\/strong><\/li>\n<li><strong>Time-Resampling using Pandas<\/strong><\/li>\n<li><strong>Rolling Time Series<\/strong><\/li>\n<li><strong>Plotting Time-series Data using Pandas<\/strong><\/li>\n<\/ul>\n<p data-selectable-paragraph=\"\">You should now be able to apply these topics to your own datasets too.<\/p>\n<p>\u00a0<\/p>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>https:\/\/www.kdnuggets.com\/2020\/09\/introduction-time-series-analysis-python.html<\/p>\n","protected":false},"author":0,"featured_media":1882,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[2],"tags":[],"_links":{"self":[{"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/posts\/1881"}],"collection":[{"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/types\/post"}],"replies":[{"embeddable":true,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/comments?post=1881"}],"version-history":[{"count":0,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/posts\/1881\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/media\/1882"}],"wp:attachment":[{"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/media?parent=1881"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/categories?post=1881"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/tags?post=1881"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}