{"id":1864,"date":"2020-09-23T23:09:31","date_gmt":"2020-09-23T23:09:31","guid":{"rendered":"https:\/\/data-science.gotoauthority.com\/2020\/09\/23\/predicting-house-prices-in-ames-iowa\/"},"modified":"2020-09-23T23:09:31","modified_gmt":"2020-09-23T23:09:31","slug":"predicting-house-prices-in-ames-iowa","status":"publish","type":"post","link":"https:\/\/wealthrevelation.com\/data-science\/2020\/09\/23\/predicting-house-prices-in-ames-iowa\/","title":{"rendered":"Predicting House Prices in Ames, Iowa"},"content":{"rendered":"<div>\n<p>The data for this project was taken from a\u00a0<a title=\"House Price Prediction Kaggle\" href=\"https:\/\/www.kaggle.com\/c\/house-prices-advanced-regression-techniques\">Kaggle competition<\/a> that involved predicting housing prices in Ames, Iowa. The data consists of 79 features for 1490 different houses.\u00a0<\/p>\n<h2>Imputing missingness<\/h2>\n<p>Data often contains some missing values, and most regression models cannot handle them directly. 
The graph below shows the percentage of missing values in the features.<\/p>\n<figure class=\"wp-block-image size-large\"><img data-srcset=\"https:\/\/nycdsa-blog-files.s3.us-east-2.amazonaws.com\/2020\/09\/tim-colussi\/perc-miss-140002-ZCcHNeLx-300x138.png 300w, https:\/\/nycdsa-blog-files.s3.us-east-2.amazonaws.com\/2020\/09\/tim-colussi\/perc-miss-140002-ZCcHNeLx-600x277.png 600w, https:\/\/nycdsa-blog-files.s3.us-east-2.amazonaws.com\/2020\/09\/tim-colussi\/perc-miss-140002-ZCcHNeLx.png 627w\" loading=\"lazy\" width=\"627\" height=\"289\" alt=\"\" data-src=\"https:\/\/nycdsa-blog-files.s3.us-east-2.amazonaws.com\/2020\/09\/tim-colussi\/perc-miss-140002-ZCcHNeLx.png\" data-sizes=\"(max-width: 627px) 100vw, 627px\" class=\"wp-image-67149 lazyload\" src=\"image\/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==\"><img loading=\"lazy\" width=\"627\" height=\"289\" src=\"https:\/\/nycdsa-blog-files.s3.us-east-2.amazonaws.com\/2020\/09\/tim-colussi\/perc-miss-140002-ZCcHNeLx.png\" alt=\"\" class=\"wp-image-67149\"><\/figure>\n<p>The imputation strategy depended on the type of data. Where a missing value meant that the feature was not present (e.g., garage, pool, fence), categorical features were imputed as \u201cNone\u201d and ordinal features as zero. Some special cases were imputed from their relationship to other features. For instance, LotFrontage was imputed with the mean after grouping by Neighborhood and Lot Configuration. A few features that added little value were dropped.\u00a0<\/p>\n<h2>Feature engineering<\/h2>\n<p>Some features were created by combining the values of other features, which allowed several related features to be merged into one (e.g., <span>TotSF, PercBsmtFin, TotPorchSF, TotFullBath, TotalHalfBath). 
<\/span><span>Categorical variables with infrequent values that may still be important were also engineered (<\/span><span>e.g., Condition1, whose values cover proximity to railroads, major roads, or positive places of interest).<\/span><\/p>\n<h2>EDA of the target variable<\/h2>\n<figure class=\"wp-block-image size-large\"><img data-srcset=\"https:\/\/nycdsa-blog-files.s3.us-east-2.amazonaws.com\/2020\/09\/tim-colussi\/freq-sale-price-038233-zzc81OP2-300x138.png 300w, https:\/\/nycdsa-blog-files.s3.us-east-2.amazonaws.com\/2020\/09\/tim-colussi\/freq-sale-price-038233-zzc81OP2-600x277.png 600w, https:\/\/nycdsa-blog-files.s3.us-east-2.amazonaws.com\/2020\/09\/tim-colussi\/freq-sale-price-038233-zzc81OP2.png 627w\" loading=\"lazy\" width=\"627\" height=\"289\" alt=\"\" data-src=\"https:\/\/nycdsa-blog-files.s3.us-east-2.amazonaws.com\/2020\/09\/tim-colussi\/freq-sale-price-038233-zzc81OP2.png\" data-sizes=\"(max-width: 627px) 100vw, 627px\" class=\"wp-image-67152 lazyload\" src=\"image\/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==\"><img loading=\"lazy\" width=\"627\" height=\"289\" src=\"https:\/\/nycdsa-blog-files.s3.us-east-2.amazonaws.com\/2020\/09\/tim-colussi\/freq-sale-price-038233-zzc81OP2.png\" alt=\"\" class=\"wp-image-67152\"><\/figure>\n<figure class=\"wp-block-image size-large\"><img data-srcset=\"https:\/\/nycdsa-blog-files.s3.us-east-2.amazonaws.com\/2020\/09\/tim-colussi\/freq-sale-price-log-394312-ze0DLStE-300x138.png 300w, https:\/\/nycdsa-blog-files.s3.us-east-2.amazonaws.com\/2020\/09\/tim-colussi\/freq-sale-price-log-394312-ze0DLStE-600x277.png 600w, https:\/\/nycdsa-blog-files.s3.us-east-2.amazonaws.com\/2020\/09\/tim-colussi\/freq-sale-price-log-394312-ze0DLStE.png 627w\" loading=\"lazy\" width=\"627\" height=\"289\" alt=\"\" data-src=\"https:\/\/nycdsa-blog-files.s3.us-east-2.amazonaws.com\/2020\/09\/tim-colussi\/freq-sale-price-log-394312-ze0DLStE.png\" data-sizes=\"(max-width: 627px) 100vw, 627px\" class=\"wp-image-67153 lazyload\" 
src=\"image\/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==\"><img loading=\"lazy\" width=\"627\" height=\"289\" src=\"https:\/\/nycdsa-blog-files.s3.us-east-2.amazonaws.com\/2020\/09\/tim-colussi\/freq-sale-price-log-394312-ze0DLStE.png\" alt=\"\" class=\"wp-image-67153\"><\/figure>\n<p>The raw sale prices are right-skewed. Taking the log of the sale price gives a more nearly normal distribution, which regression handles better.\u00a0<\/p>\n<figure class=\"wp-block-image size-large\"><img data-srcset=\"https:\/\/nycdsa-blog-files.s3.us-east-2.amazonaws.com\/2020\/09\/tim-colussi\/sale-price-boxplot-989577-W49nZbfz-300x138.png 300w, https:\/\/nycdsa-blog-files.s3.us-east-2.amazonaws.com\/2020\/09\/tim-colussi\/sale-price-boxplot-989577-W49nZbfz-600x277.png 600w, https:\/\/nycdsa-blog-files.s3.us-east-2.amazonaws.com\/2020\/09\/tim-colussi\/sale-price-boxplot-989577-W49nZbfz.png 627w\" loading=\"lazy\" width=\"627\" height=\"289\" alt=\"\" data-src=\"https:\/\/nycdsa-blog-files.s3.us-east-2.amazonaws.com\/2020\/09\/tim-colussi\/sale-price-boxplot-989577-W49nZbfz.png\" data-sizes=\"(max-width: 627px) 100vw, 627px\" class=\"wp-image-67154 lazyload\" src=\"image\/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==\"><img loading=\"lazy\" width=\"627\" height=\"289\" src=\"https:\/\/nycdsa-blog-files.s3.us-east-2.amazonaws.com\/2020\/09\/tim-colussi\/sale-price-boxplot-989577-W49nZbfz.png\" alt=\"\" class=\"wp-image-67154\"><\/figure>\n<p>The box plot of the house sale prices appears to show a couple of outliers, which I dropped from the data set.<\/p>\n<h2>Regressions<\/h2>\n<ul>\n<li>Multiple Linear Regression<\/li>\n<li>Ridge Regression<\/li>\n<li>LASSO Regression<\/li>\n<li>ElasticNet Regression<\/li>\n<li>Random Forest<\/li>\n<li><span>Generalized Boosted Regression Modeling (GBM)<\/span><\/li>\n<\/ul>\n<h2>R-squared and Kaggle scores<\/h2>\n<figure class=\"wp-block-table is-style-stripes\">\n<table>\n<tbody>\n<tr>\n<td 
class=\"has-text-align-center\" data-align=\"center\"><strong>Model<\/strong><\/td>\n<td class=\"has-text-align-center\" data-align=\"center\"><strong>R-squared train<\/strong><\/td>\n<td class=\"has-text-align-center\" data-align=\"center\"><strong>R-squared test<\/strong><\/td>\n<td class=\"has-text-align-center\" data-align=\"center\"><strong>Score<\/strong><\/td>\n<\/tr>\n<tr>\n<td class=\"has-text-align-center\" data-align=\"center\">MLR<\/td>\n<td class=\"has-text-align-center\" data-align=\"center\">0.9211<\/td>\n<td class=\"has-text-align-center\" data-align=\"center\">0.8647<\/td>\n<td class=\"has-text-align-center\" data-align=\"center\">0.137<\/td>\n<\/tr>\n<tr>\n<td class=\"has-text-align-center\" data-align=\"center\">Ridge<\/td>\n<td class=\"has-text-align-center\" data-align=\"center\">0.9081<\/td>\n<td class=\"has-text-align-center\" data-align=\"center\">0.8622<\/td>\n<td class=\"has-text-align-center\" data-align=\"center\">0.1433<\/td>\n<\/tr>\n<tr>\n<td class=\"has-text-align-center\" data-align=\"center\">LASSO<\/td>\n<td class=\"has-text-align-center\" data-align=\"center\">0.89<\/td>\n<td class=\"has-text-align-center\" data-align=\"center\">0.8532<\/td>\n<td class=\"has-text-align-center\" data-align=\"center\">0.144<\/td>\n<\/tr>\n<tr>\n<td class=\"has-text-align-center\" data-align=\"center\">ElasticNet<\/td>\n<td class=\"has-text-align-center\" data-align=\"center\">0.9081<\/td>\n<td class=\"has-text-align-center\" data-align=\"center\">0.8622<\/td>\n<td class=\"has-text-align-center\" data-align=\"center\">0.1371<\/td>\n<\/tr>\n<tr>\n<td class=\"has-text-align-center\" data-align=\"center\">Random Forest<\/td>\n<td class=\"has-text-align-center\" data-align=\"center\">0.8857<\/td>\n<td class=\"has-text-align-center\" data-align=\"center\">0.8708<\/td>\n<td class=\"has-text-align-center\" data-align=\"center\">0.1376<\/td>\n<\/tr>\n<tr>\n<td class=\"has-text-align-center\" data-align=\"center\">GBM<\/td>\n<td 
class=\"has-text-align-center\" data-align=\"center\">0.9998<\/td>\n<td class=\"has-text-align-center\" data-align=\"center\">0.8647<\/td>\n<td class=\"has-text-align-center\" data-align=\"center\">0.1293<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/figure>\n<p>All of the models appear to overfit to some degree: the training R-squared is higher than the test R-squared in every row, with the largest gap for GBM. Based on Kaggle score, the best model is the boosted regression tree model (GBM), which has the lowest score.<\/p>\n<h2>Feature importance<\/h2>\n<figure class=\"wp-block-image size-large\"><img data-srcset=\"https:\/\/nycdsa-blog-files.s3.us-east-2.amazonaws.com\/2020\/09\/tim-colussi\/feat-imp-lasso-714907-qcNOVrOp-300x138.png 300w, https:\/\/nycdsa-blog-files.s3.us-east-2.amazonaws.com\/2020\/09\/tim-colussi\/feat-imp-lasso-714907-qcNOVrOp-600x277.png 600w, https:\/\/nycdsa-blog-files.s3.us-east-2.amazonaws.com\/2020\/09\/tim-colussi\/feat-imp-lasso-714907-qcNOVrOp.png 627w\" loading=\"lazy\" width=\"627\" height=\"289\" alt=\"\" data-src=\"https:\/\/nycdsa-blog-files.s3.us-east-2.amazonaws.com\/2020\/09\/tim-colussi\/feat-imp-lasso-714907-qcNOVrOp.png\" data-sizes=\"(max-width: 627px) 100vw, 627px\" class=\"wp-image-67160 lazyload\" src=\"image\/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==\"><img loading=\"lazy\" width=\"627\" height=\"289\" src=\"https:\/\/nycdsa-blog-files.s3.us-east-2.amazonaws.com\/2020\/09\/tim-colussi\/feat-imp-lasso-714907-qcNOVrOp.png\" alt=\"\" class=\"wp-image-67160\"><\/figure>\n<figure class=\"wp-block-image size-large\"><img data-srcset=\"https:\/\/nycdsa-blog-files.s3.us-east-2.amazonaws.com\/2020\/09\/tim-colussi\/feat-imp-rf-093314-jLw7vCTD-300x138.png 300w, https:\/\/nycdsa-blog-files.s3.us-east-2.amazonaws.com\/2020\/09\/tim-colussi\/feat-imp-rf-093314-jLw7vCTD-600x277.png 600w, https:\/\/nycdsa-blog-files.s3.us-east-2.amazonaws.com\/2020\/09\/tim-colussi\/feat-imp-rf-093314-jLw7vCTD.png 627w\" loading=\"lazy\" width=\"627\" height=\"289\" alt=\"\" 
data-src=\"https:\/\/nycdsa-blog-files.s3.us-east-2.amazonaws.com\/2020\/09\/tim-colussi\/feat-imp-rf-093314-jLw7vCTD.png\" data-sizes=\"(max-width: 627px) 100vw, 627px\" class=\"wp-image-67161 lazyload\" src=\"image\/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==\"><img loading=\"lazy\" width=\"627\" height=\"289\" src=\"https:\/\/nycdsa-blog-files.s3.us-east-2.amazonaws.com\/2020\/09\/tim-colussi\/feat-imp-rf-093314-jLw7vCTD.png\" alt=\"\" class=\"wp-image-67161\"><\/figure>\n<p>The most important features differed from model to model, although neighborhood appears to be highly important in determining house prices.\u00a0<\/p>\n<p>The code is available on\u00a0<a title=\"ML Github\" href=\"https:\/\/github.com\/timcolussi\/ML_housing_price_proj\">GitHub<\/a>.<\/p>\n<p><span>Photo by <a href=\"https:\/\/unsplash.com\/@tierramallorca?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText\">Tierra Mallorca<\/a> on <a href=\"https:\/\/unsplash.com\/s\/photos\/buying-a-house?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText\">Unsplash<\/a><\/span><\/p>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>https:\/\/nycdatascience.com\/blog\/student-works\/predicting-house-prices-in-ames-iowa-2\/<\/p>\n","protected":false},"author":0,"featured_media":1865,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[2],"tags":[],"_links":{"self":[{"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/posts\/1864"}],"collection":[{"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/types\/post"}],"replies":[{"embeddable":true,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/comments?post=1864"}],"version-history":[{"count":0,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\
/posts\/1864\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/media\/1865"}],"wp:attachment":[{"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/media?parent=1864"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/categories?post=1864"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/tags?post=1864"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}