# Getting Started with Feature Selection

**By [Kurtis Pykes](https://www.linkedin.com/in/kurtispykes/), AI Writer.**

You wouldn't use the number of press-ups you can do to predict a bus arrival time, would you? In the same way, in predictive modelling we prune away non-useful features in order to reduce the complexity of the final model. Simply put, feature selection reduces the number of input features used when developing a predictive model.

In this article, I discuss the three main categories that feature selection methods fall into: filter methods, wrapper methods, and embedded methods. Additionally, I use Python examples and leverage frameworks such as scikit-learn (see the [documentation](https://scikit-learn.org/stable/)) for machine learning, Pandas ([documentation](https://pandas.pydata.org/docs/)) for data manipulation, and Plotly ([documentation](https://plotly.com/python/)) for interactive data visualization. For access to the code used in this article, visit [my GitHub](https://github.com/kurtispykes/demo).

![Clothing rack](https://miro.medium.com/max/875/0*FSs3m_uwBdJCn8Vx)

*Figure 1: Clothing rack. Photo by [Zui Hoang](https://unsplash.com/@zuizuii?utm_source=medium&utm_medium=referral) on [Unsplash](https://unsplash.com/?utm_source=medium&utm_medium=referral).*

### Why do Feature Selection?

The message the first paragraph aimed to convey is that some features do not contribute enough useful information to predicting the final outcome, so by including them in our model we make it needlessly more complex. Discarding non-useful features results in a more parsimonious model, which in turn reduces scoring time. Additionally, feature selection makes models much easier to interpret, which is extremely important in most business cases.

> "In most real-world cases, applying feature selection is unlikely to provide large gains in performance.
> However, it is still a valuable tool in the toolbox of the feature engineer." (Müller, 2016, *Introduction to Machine Learning with Python*, O'Reilly Media)

**Methods**

There are various methods that can be used to perform feature selection, and they fall into one of three categories. Each has its own advantages and disadvantages. The categories are described in Guyon & Elisseeff (2003) as follows:

- **Filter methods**: select subsets of variables as a pre-processing step, independently of the chosen predictor.
- **Wrapper methods**: utilize the learning machine of interest as a black box to score subsets of variables according to their predictive power.
- **Embedded methods**: perform variable selection in the process of training and are usually specific to given learning machines.

### Filtering Methods

![Filtering a hot drink](https://miro.medium.com/max/625/0*fjJfgmQHgcT3NOc2)

*Figure 2: Filtering a hot drink. Photo by [Tyler Nix](https://unsplash.com/@jtylernix?utm_source=medium&utm_medium=referral) on [Unsplash](https://unsplash.com/?utm_source=medium&utm_medium=referral).*

Filter methods use univariate statistics to evaluate whether there is a statistically significant relationship between each input feature and the target feature (the target or dependent variable, i.e., what we are attempting to predict). The features that provide the strongest evidence of such a relationship are the ones we keep for the final model. This approach is therefore independent of the model we later choose for modelling.

> "Even when variable ranking is not optimal, it may be preferable to other variable subset selection methods because of its computational and statistical scalability" (Guyon and Elisseeff, 2003)

An example of a filter method is Pearson's correlation coefficient, which you may have come across in a high-school statistics class. It is a statistic that measures the amount of linear correlation between an input feature X and the output feature Y. It ranges from -1 to +1, where +1 means total positive linear correlation and -1 means total negative linear correlation. A value of 0 means there is no linear correlation.

To calculate the Pearson correlation coefficient, take the covariance of the input feature X and the output feature Y and divide it by the product of the two features' standard deviations. The formula is displayed in Figure 3.

![Pearson correlation coefficient formula](https://miro.medium.com/max/875/1*YURqs7JPHRZwMr1ISuzt1Q.png)

*Figure 3: Formula for Pearson's correlation coefficient, where Cov is the covariance, σX is the standard deviation of X, and σY is the standard deviation of Y.*
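Before moving on to the dataset, it may help to see the formula in Figure 3 written out in code. The sketch below is illustrative only; the function name, toy data, and the check against NumPy's `corrcoef` are my own additions rather than part of the original article.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson's r: Cov(X, Y) divided by the product of the standard deviations."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    cov_xy = np.mean((x - x.mean()) * (y - y.mean()))  # population covariance
    return cov_xy / (x.std() * y.std())                # population standard deviations

# quick sanity check on toy data against NumPy's built-in implementation
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2 * x + rng.normal(size=100)
print(pearson_r(x, y), np.corrcoef(x, y)[0, 1])  # the two values should agree
```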
For the coding examples that follow, I use the Boston housing prices dataset available in scikit-learn (see the [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_boston.html)), as well as Pandas for data manipulation (see the [documentation](https://pandas.pydata.org/docs/)).

```python
import pandas as pd
from sklearn.datasets import load_boston

# load data
boston_bunch = load_boston()
df = pd.DataFrame(data=boston_bunch.data,
                  columns=boston_bunch.feature_names)

# adding the target variable
df["target"] = boston_bunch.target
df.head()
```

![Preview of the Boston house prices dataset](https://miro.medium.com/max/875/1*j6V3TFZQuoMMtNWK7WF-vg.png)

*Figure 4: Output of the above code cell, displaying a preview of the Boston house prices dataset.*

The following code is an example of using the Pearson correlation coefficient for feature selection, implemented in Python.

```python
# Pearson correlation of each feature with the target
corr = df.corr()["target"].sort_values(ascending=False)[1:]

# absolute values so that strong negative correlations also count
abs_corr = abs(corr)

# arbitrary threshold for features to keep
relevant_features = abs_corr[abs_corr > 0.4]
relevant_features
```

```
RM         0.695360
NOX        0.427321
TAX        0.468536
INDUS      0.483725
PTRATIO    0.507787
LSTAT      0.737663
Name: target, dtype: float64
```

Then simply select the input features as follows:

```python
new_df = df[relevant_features.index]
```
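Thresholding the absolute correlations by hand, as above, is one way to apply a filter. If you prefer to keep the selection inside scikit-learn, univariate filters are also available as transformers. The following is a minimal sketch only; the choice of `f_regression` scoring and `k=6` is mine rather than the original article's.

```python
from sklearn.feature_selection import SelectKBest, f_regression

# univariate filter: score each feature against the target independently
# and keep the k highest-scoring ones
selector = SelectKBest(score_func=f_regression, k=6)
X_filtered = selector.fit_transform(df.drop("target", axis=1), df["target"])

# names of the features the filter kept
kept = df.drop("target", axis=1).columns[selector.get_support()]
print(list(kept))
```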
***Advantages***

- Robust against overfitting
- Much faster than wrapper methods

***Disadvantages***

- Does not consider interactions between features
- Does not consider the model that will ultimately be employed

### Wrapper Methods

![Wrapping a box](https://miro.medium.com/max/875/0*bbF5HeycNCV0qNmg)

*Figure 5: Wrapping a box. Photo by [Kira auf der Heide](https://unsplash.com/@kadh?utm_source=medium&utm_medium=referral) on [Unsplash](https://unsplash.com/?utm_source=medium&utm_medium=referral).*

Wikipedia describes wrapper methods as using a "predictive model to score feature subsets. Each new subset is used to train a model, which is tested on a hold-out set. Counting the number of mistakes made on that hold-out set (the error rate of the model) gives the score for that subset." ([Feature selection, Wikipedia](https://en.wikipedia.org/wiki/Feature_selection)). The algorithms employed by wrapper methods are often described as greedy because they add or remove features step by step in search of the combination that yields the best-performing model.

> "Wrapper feature selection methods create many models with various different subsets of the input features and select those features that result in the best performing model according to some performance metric." (Jason Brownlee)

One wrapper method is recursive feature elimination (RFE). As the name suggests, it works by recursively removing features, building a model on the features that remain, and measuring how well that model performs.

See the [scikit-learn documentation](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFE.html) for the RFE implementation used below.

```python
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# input and output features
X = df.drop("target", axis=1)
y = df["target"]

# define the model to build
lin_reg = LinearRegression()

# create the RFE model and select 6 attributes
rfe = RFE(lin_reg, n_features_to_select=6)
rfe.fit(X, y)

# summarize the selection of the attributes
selected = [feature for feature, rank in zip(X.columns.values, rfe.ranking_) if rank == 1]
print(f"Number of selected features: {rfe.n_features_}")
print(f"Mask: {rfe.support_}")
print(f"Selected features: {selected}")
print(f"Estimator: {rfe.estimator_}")
```

The print statements return:

```
Number of selected features: 6
Mask: [False False False  True  True  True False  True False False  True False  True]
Selected features: ['CHAS', 'NOX', 'RM', 'DIS', 'PTRATIO', 'LSTAT']
Estimator: LinearRegression()
```

***Advantages***

- Able to detect interactions between features
- Often results in better predictive accuracy than filter methods
- Searches for the best-performing feature subset

***Disadvantages***

- Computationally expensive
- Prone to overfitting
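Since a wrapper justifies its computational cost by the score of the subsets it finds, it is worth checking what the six features kept by RFE above actually buy. This is a rough sketch rather than part of the original article; the 5-fold cross-validation and mean R² score are my own illustrative choices.

```python
from sklearn.model_selection import cross_val_score

# features RFE kept (support_ is the boolean mask printed above)
selected = [f for f, keep in zip(X.columns.values, rfe.support_) if keep]

# mean cross-validated R^2 with all 13 features vs. the RFE subset
full_score = cross_val_score(LinearRegression(), X, y, cv=5).mean()
subset_score = cross_val_score(LinearRegression(), X[selected], y, cv=5).mean()

print(f"All features: {full_score:.3f}")
print(f"RFE subset:   {subset_score:.3f}")
```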
### Embedded Methods

![Embedded components](https://miro.medium.com/max/875/0*n71bwEfKVUDc3suv)

*Figure 6: Embedded components. Photo by [Chris Ried](https://unsplash.com/@cdr6934?utm_source=medium&utm_medium=referral) on [Unsplash](https://unsplash.com/?utm_source=medium&utm_medium=referral).*

Embedded methods are similar to wrapper methods in that they also optimize an objective function of a predictive model. What separates the two is that embedded methods use an intrinsic metric during learning to build the model. Embedded methods therefore require a supervised learning model, which will, as part of training, determine the importance of each feature for predicting the target.

*Note: The model used for feature selection does not have to be the model used as the final model (a short sketch of this idea follows the advantages and disadvantages below).*

LASSO (*Least Absolute Shrinkage and Selection Operator*) is a good example of an embedded method. Wikipedia describes LASSO as "a regression analysis method that performs both variable selection and regularization in order to enhance the prediction accuracy and interpretability of the statistical model it produces." Going into depth about how LASSO works is beyond the scope of this article, but a good introduction is [Aarshay Jain](https://medium.com/u/daf479ff8e76?source=post_page-----3ecfb4957fd4----------------------)'s Analytics Vidhya post, [A Complete Tutorial on Ridge and Lasso Regression in Python](https://www.analyticsvidhya.com/blog/2016/01/ridge-lasso-regression-python-complete-tutorial/).

```python
from sklearn.linear_model import Lasso

# train the model
lasso = Lasso()
lasso.fit(X, y)

# perform feature selection: keep features with non-zero coefficients
kept_cols = [feature for feature, weight in zip(X.columns.values, lasso.coef_) if weight != 0]

kept_cols
```

This returns the columns that the Lasso regression model considered relevant:

```
['CRIM', 'ZN', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT']
```

We can also use a waterfall chart to visualize the coefficients:

```python
import plotly.graph_objects as go

fig = go.Figure(
    go.Waterfall(name="Lasso Coefficients",
                 orientation="h",
                 y=X.columns.values,
                 x=lasso.coef_))

fig.update_layout(title="Coefficients of Lasso Regression Model")

fig.show()
```

![Waterfall chart of Lasso coefficients](https://miro.medium.com/max/875/1*t1yWfVVaC7oW78Xh7dvWVA.png)

*Figure 7: Output of the code above. The waterfall chart displays the coefficient of each feature; note that three features have been set to 0, meaning they were disregarded by the model.*

***Advantages***

- Computationally much faster than wrapper methods
- More accurate than filter methods
- Considers all the features at once
- Less prone to overfitting than wrapper methods

***Disadvantages***

- Selects features that are specific to the model used
- Not as powerful as wrapper methods
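To make the earlier note concrete, that the model used for selection need not be the final model, here is a minimal sketch using scikit-learn's `SelectFromModel`. The pipeline and the choice of a plain `LinearRegression` as the final estimator are my own, purely for illustration.

```python
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.pipeline import make_pipeline

# Lasso is used only to pick the features (those with sufficiently large
# coefficients); an ordinary linear regression is then fit on what remains.
pipe = make_pipeline(
    SelectFromModel(Lasso()),
    LinearRegression(),
)
pipe.fit(X, y)
print(pipe.score(X, y))  # R^2 of the final model on the selected features
```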
***Tip:*** *There is no best feature selection method. What works well for one business use case may not work for another, so it is down to you to conduct experiments and see what works best.*

### Conclusion

In this article, I introduced different methods for performing feature selection. Of course, there are other ways you could do feature selection, such as ANOVA, backward feature elimination, and using a decision tree. To learn more about those methods, I suggest reading [Madeline McCombe](https://medium.com/u/a4a48b9cfaca?source=post_page-----3ecfb4957fd4----------------------)'s article, [*Intro to Feature Selection methods for Data Science*](https://towardsdatascience.com/intro-to-feature-selection-methods-for-data-science-4cae2178a00a).

[Original](https://towardsdatascience.com/getting-started-with-feature-selection-3ecfb4957fd4). Reposted with permission.