{"id":8320,"date":"2021-06-09T12:14:12","date_gmt":"2021-06-09T12:14:12","guid":{"rendered":"https:\/\/wealthrevelation.com\/data-science\/2021\/06\/09\/this-data-visualization-is-the-first-step-for-effective-feature-selection\/"},"modified":"2021-06-09T12:14:12","modified_gmt":"2021-06-09T12:14:12","slug":"this-data-visualization-is-the-first-step-for-effective-feature-selection","status":"publish","type":"post","link":"https:\/\/wealthrevelation.com\/data-science\/2021\/06\/09\/this-data-visualization-is-the-first-step-for-effective-feature-selection\/","title":{"rendered":"This Data Visualization is the First Step for Effective Feature Selection"},"content":{"rendered":"<div id=\"post-\">\n   <!-- post_author Benjamin Obi Tayo -->  <\/p>\n<p><img class=\"aligncenter size-full wp-image-128343\" src=\"https:\/\/www.kdnuggets.com\/wp-content\/uploads\/Fig0-data-visualization-feature-selection.jpg\" alt=\"\" width=\"90%\"><\/p>\n<p><em>Image by Benjamin O. Tayo.<\/em><\/p>\n<p>The scatter pairplot is a visualization of pairwise relationships in a dataset and is the first step for effective feature selection. It provides a qualitative view of the pairwise correlations between features, which makes it a powerful tool for feature selection and dimensionality reduction. For an introduction to the pairplot in the seaborn package, see: <a href=\"https:\/\/seaborn.pydata.org\/generated\/seaborn.pairplot.html\">https:\/\/seaborn.pydata.org\/generated\/seaborn.pairplot.html<\/a><\/p>\n<p>In this article, we will analyze a portfolio of stocks to identify the ones that are strongly correlated with the overall market. 
The portfolio contains 22 stocks (see <strong>Table 1<\/strong>) from different sectors such as Healthcare, Real Estate, Consumer Discretionary, Energy, Industrials, Telecommunication Services, Information Technology, Consumer Staples, and Financials.<\/p>\n<p>\u00a0<\/p>\n<table border=\"1\" width=\"714\">\n<tbody>\n<tr>\n<td width=\"69\"><strong>Symbol<\/strong><\/td>\n<td width=\"159\"><strong>Name<\/strong><\/td>\n<td width=\"78\"><strong>Symbol<\/strong><\/td>\n<td width=\"186\"><strong>Name<\/strong><\/td>\n<td width=\"72\"><strong>Symbol<\/strong><\/td>\n<td width=\"150\"><strong>Name<\/strong><\/td>\n<\/tr>\n<tr>\n<td width=\"69\">AAL<\/td>\n<td width=\"159\">American Airlines<\/td>\n<td width=\"78\">EDIT<\/td>\n<td width=\"186\">Editas Medicine<\/td>\n<td width=\"72\">UAL<\/td>\n<td width=\"150\">United Airlines<\/td>\n<\/tr>\n<tr>\n<td width=\"69\">AAPL<\/td>\n<td width=\"159\">Apple<\/td>\n<td width=\"78\">HPP<\/td>\n<td width=\"186\">Hudson Pacific Properties<\/td>\n<td width=\"72\">WEN<\/td>\n<td width=\"150\">Wendy\u2019s<\/td>\n<\/tr>\n<tr>\n<td width=\"69\">ABT<\/td>\n<td width=\"159\">Abbott Laboratories<\/td>\n<td width=\"78\">JNJ<\/td>\n<td width=\"186\">Johnson &amp; Johnson<\/td>\n<td width=\"72\">WFC<\/td>\n<td width=\"150\">Wells Fargo<\/td>\n<\/tr>\n<tr>\n<td width=\"69\">BNTX<\/td>\n<td width=\"159\">BioNTech<\/td>\n<td width=\"78\">MRNA<\/td>\n<td width=\"186\">Moderna<\/td>\n<td width=\"72\">WMT<\/td>\n<td width=\"150\">Walmart<\/td>\n<\/tr>\n<tr>\n<td width=\"69\">BXP<\/td>\n<td width=\"159\">Boston Properties<\/td>\n<td width=\"78\">MRO<\/td>\n<td width=\"186\">Marathon Oil Corporation<\/td>\n<td width=\"72\">XOM<\/td>\n<td width=\"150\">Exxon Mobil<\/td>\n<\/tr>\n<tr>\n<td width=\"69\">CCL<\/td>\n<td width=\"159\">Carnival Corporation<\/td>\n<td width=\"78\">PFE<\/td>\n<td width=\"186\">Pfizer<\/td>\n<td width=\"72\">SP500<\/td>\n<td width=\"150\">Stock Market Index<\/td>\n<\/tr>\n<tr>\n<td width=\"69\">DAL<\/td>\n<td width=\"159\">Delta 
Airlines<\/td>\n<td width=\"78\">SLG<\/td>\n<td width=\"186\">SL Green Realty<\/td>\n<td width=\"72\"><\/td>\n<td width=\"150\"><\/td>\n<\/tr>\n<tr>\n<td width=\"69\">DVN<\/td>\n<td width=\"159\">Devon Energy<\/td>\n<td width=\"78\">TSLA<\/td>\n<td width=\"186\">Tesla<\/td>\n<td width=\"72\"><\/td>\n<td width=\"150\"><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p><strong>Table 1<\/strong>. Portfolio of 22 stocks from diverse sectors.<\/p>\n<p>Our goal is to answer the question: which stocks in the portfolio correlate strongly with the stock market? We will use the S&amp;P 500 index as a measure of the total stock market. We will use a threshold correlation coefficient of 70% for a stock to be considered strongly correlated with the S&amp;P 500.<\/p>\n<p>\u00a0<\/p>\n<h3>Data Collection and Processing<\/h3>\n<p>\u00a0<\/p>\n<p>Raw data were obtained from the Yahoo Finance website: <a href=\"https:\/\/finance.yahoo.com\/\" target=\"_blank\" rel=\"noopener\">https:\/\/finance.yahoo.com\/<\/a><\/p>\n<p>The historical data for each stock include the daily open, high, low, and closing prices. 
The CSV file for each stock was downloaded, and the \u201cclose\u201d column from each file was extracted and combined to create the dataset, which can be found here: <a href=\"https:\/\/github.com\/bot13956\/dataset\" target=\"_blank\" rel=\"noopener\">portfolio.csv<\/a><\/p>\n<p>\u00a0<\/p>\n<h3>Generate Scatter Pairplot<\/h3>\n<p>\u00a0<\/p>\n<div>\n<pre>import numpy as np\r\nimport pandas as pd\r\nimport matplotlib.pyplot as plt\r\nimport seaborn as sns\r\n\r\n# Load the combined closing-price dataset\r\nurl = 'https:\/\/raw.githubusercontent.com\/bot13956\/datasets\/master\/portfolio.csv'\r\ndata = pd.read_csv(url)\r\ndata.head()\r\n\r\n# Columns 1 to 23: the portfolio stocks and SP500 (column 0 is skipped)\r\ncols = data.columns[1:24]\r\nsns.pairplot(data[cols], height=2.0)\r\nplt.show()\r\n<\/pre>\n<\/div>\n<p>\u00a0<\/p>\n<h3>Calculate Covariance Matrix<\/h3>\n<p>\u00a0<\/p>\n<p>The scatter pairplot is the first step and provides only a qualitative analysis of pairwise correlations between features. To quantify the degree of correlation, the covariance matrix has to be computed. Because the features are standardized first, each entry of this covariance matrix equals the correlation coefficient between the corresponding pair of features.<\/p>\n<div>\n<pre>from sklearn.preprocessing import StandardScaler\r\n\r\n# Standardize each column to zero mean and unit variance\r\nstdsc = StandardScaler()\r\nX_std = stdsc.fit_transform(data[cols].values)\r\n\r\n# On standardized data, the covariance matrix equals the correlation matrix\r\ncov_mat = np.cov(X_std.T, bias=True)\r\n\r\n# Plot the matrix as an annotated heatmap\r\nplt.figure(figsize=(13,13))\r\nsns.set(font_scale=1.2)\r\nhm = sns.heatmap(cov_mat,\r\n                 cbar=True,\r\n                 annot=True,\r\n                 square=True,\r\n                 fmt='.2f',\r\n                 annot_kws={'size': 12},\r\n                 yticklabels=cols,\r\n                 xticklabels=cols)\r\nplt.title('Covariance matrix showing correlation coefficients')\r\nplt.tight_layout()\r\nplt.show()\r\n<\/pre>\n<\/div>\n<p>\u00a0<\/p>\n<p>\u00a0<\/p>\n<h3>Compressed Output Showing Pairplots and Correlation Coefficients<\/h3>\n<p>\u00a0<\/p>\n<p>Since we are only interested in the correlations between the 22 stocks in the portfolio and the S&amp;P 500, <strong>Figure 1<\/strong> below shows the final output from our 
analysis.<\/p>\n<p><img class=\"aligncenter size-full wp-image-128344\" src=\"https:\/\/www.kdnuggets.com\/wp-content\/uploads\/Fig1-data-visualization-feature-selection.jpg\" alt=\"\" width=\"90%\"><\/p>\n<p><em><strong>Figure 1<\/strong>. Scatter pairplots and correlation coefficients between portfolio stocks and the S&amp;P 500.<\/em><\/p>\n<p><strong>Figure 1<\/strong> shows that 8 of the 22 stocks have a correlation coefficient below 70% with the S&amp;P 500. Interestingly, every stock except WEN is positively correlated with the S&amp;P 500 index.<\/p>\n<p>The full covariance matrix is shown in <strong>Figure 2<\/strong>.<\/p>\n<p><img class=\"aligncenter size-full wp-image-128345\" src=\"https:\/\/www.kdnuggets.com\/wp-content\/uploads\/Fig2-data-visualization-feature-selection.jpg\" alt=\"\" width=\"90%\"><\/p>\n<p><em><strong>Figure 2<\/strong>. Covariance matrix visualization.<\/em><\/p>\n<p>In summary, we\u2019ve shown how the scatter pairplot can be used as a first step for feature selection. 
More advanced methods for feature selection and dimensionality reduction include PCA (<a href=\"https:\/\/pub.towardsai.net\/machine-learning-dimensionality-reduction-via-principal-component-analysis-1bdc77462831?source=post_stats_page-------------------------------------\" target=\"_blank\" rel=\"noopener\">Principal Component Analysis<\/a>) and LDA (<a href=\"https:\/\/pub.towardsai.net\/machine-learning-dimensionality-reduction-via-linear-discriminant-analysis-cc96b49d2757\" target=\"_blank\" rel=\"noopener\">Linear Discriminant Analysis<\/a>).<\/p>\n<p>\u00a0<\/p>\n<\/p><\/div>\n","protected":false},"excerpt":{"rendered":"<p>https:\/\/www.kdnuggets.com\/2021\/06\/data-visualization-feature-selection.html<\/p>\n","protected":false},"author":0,"featured_media":8321,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[2],"tags":[],"_links":{"self":[{"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/posts\/8320"}],"collection":[{"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/types\/post"}],"replies":[{"embeddable":true,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/comments?post=8320"}],"version-history":[{"count":0,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/posts\/8320\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/media\/8321"}],"wp:attachment":[{"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/media?parent=8320"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/categories?post=8320"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/tags?post=
8320"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}