{"id":8175,"date":"2021-04-05T00:17:36","date_gmt":"2021-04-05T00:17:36","guid":{"rendered":"https:\/\/wealthrevelation.com\/data-science\/2021\/04\/05\/predicting-housing-prices-in-ames-iowa\/"},"modified":"2021-04-05T00:17:36","modified_gmt":"2021-04-05T00:17:36","slug":"predicting-housing-prices-in-ames-iowa","status":"publish","type":"post","link":"https:\/\/wealthrevelation.com\/data-science\/2021\/04\/05\/predicting-housing-prices-in-ames-iowa\/","title":{"rendered":"Predicting Housing Prices in Ames, Iowa"},"content":{"rendered":"<div>\n<p><span>Many qualities of a residential home determine its worth and price beyond just the number of rooms and square footage. Taking a closer look at 81 residential variables from Ames, Iowa, recorded from 2006 to 2010, we used several machine learning methods to determine which features of a home in this region are important for predicting sale prices. In this regression analysis, the R<\/span><sup><span>2<\/span><\/sup><span> value across select regression models helped determine which model best fit the dataset and predicted housing prices.<\/span><\/p>\n<p><span>The dataset includes several residential features, from numeric values like OverallQual and GrLivArea to various string variables. In all, after combining the test and training data, over 2900 observations were used for fitting various models.<\/span><\/p>\n<p><span>Applying a correlation heatmap to the data reveals which features are most correlated with the sale price (SalePrice) and which are least correlated. The heat map below shows the 10 features most correlated with SalePrice. There&#8217;s no great surprise here. The top 10 are the primary features people look for in a home they&#8217;re considering buying, so it&#8217;s no wonder that they would most likely affect sale prices. 
Converting the full heatmap to a bar chart, we can also see the features least correlated with the price, such as BsmtFinSF2, BsmtHalfBath, and MiscVal.<\/span><\/p>\n<p><img data-srcset=\"https:\/\/nycdsa-blog-files.s3.us-east-2.amazonaws.com\/2021\/03\/kristin-teves\/03-heatmap-548475-Ojtc4YuM-300x262.png 300w, https:\/\/nycdsa-blog-files.s3.us-east-2.amazonaws.com\/2021\/03\/kristin-teves\/03-heatmap-548475-Ojtc4YuM.png 390w\" loading=\"lazy\" alt=\"\" width=\"390\" height=\"341\" data-src=\"https:\/\/nycdsa-blog-files.s3.us-east-2.amazonaws.com\/2021\/03\/kristin-teves\/03-heatmap-548475-Ojtc4YuM.png\" data-sizes=\"(max-width: 390px) 100vw, 390px\" class=\"alignnone size-full wp-image-72941 lazyload\" src=\"image\/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==\"><\/p>\n<p><img loading=\"lazy\" class=\"alignnone size-full wp-image-72941\" src=\"https:\/\/nycdsa-blog-files.s3.us-east-2.amazonaws.com\/2021\/03\/kristin-teves\/03-heatmap-548475-Ojtc4YuM.png\" alt=\"\" width=\"390\" height=\"341\"> <img data-srcset=\"https:\/\/nycdsa-blog-files.s3.us-east-2.amazonaws.com\/2021\/03\/kristin-teves\/03-corr-756468-f1jEzFJB-276x300.png 276w, https:\/\/nycdsa-blog-files.s3.us-east-2.amazonaws.com\/2021\/03\/kristin-teves\/03-corr-756468-f1jEzFJB.png 387w\" loading=\"lazy\" alt=\"\" width=\"387\" height=\"421\" data-src=\"https:\/\/nycdsa-blog-files.s3.us-east-2.amazonaws.com\/2021\/03\/kristin-teves\/03-corr-756468-f1jEzFJB.png\" data-sizes=\"(max-width: 387px) 100vw, 387px\" class=\"alignnone size-full wp-image-72942 lazyload\" src=\"image\/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==\"><img loading=\"lazy\" class=\"alignnone size-full wp-image-72942\" src=\"https:\/\/nycdsa-blog-files.s3.us-east-2.amazonaws.com\/2021\/03\/kristin-teves\/03-corr-756468-f1jEzFJB.png\" alt=\"\" width=\"387\" height=\"421\"><\/p>\n<p><span>As noted, the overall quality (OverallQual) greatly affects the sale price. 
The higher the OverallQual, the higher the price, as shown on the box plot below.<\/span><\/p>\n<p><img data-srcset=\"https:\/\/nycdsa-blog-files.s3.us-east-2.amazonaws.com\/2021\/03\/kristin-teves\/03-databloxplot-261272-oPsbVMaa-300x187.png 300w, https:\/\/nycdsa-blog-files.s3.us-east-2.amazonaws.com\/2021\/03\/kristin-teves\/03-databloxplot-261272-oPsbVMaa.png 341w\" loading=\"lazy\" alt=\"\" width=\"341\" height=\"212\" data-src=\"https:\/\/nycdsa-blog-files.s3.us-east-2.amazonaws.com\/2021\/03\/kristin-teves\/03-databloxplot-261272-oPsbVMaa.png\" data-sizes=\"(max-width: 341px) 100vw, 341px\" class=\"alignnone size-full wp-image-72949 lazyload\" src=\"image\/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==\"><\/p>\n<p><img loading=\"lazy\" class=\"alignnone size-full wp-image-72949\" src=\"https:\/\/nycdsa-blog-files.s3.us-east-2.amazonaws.com\/2021\/03\/kristin-teves\/03-databloxplot-261272-oPsbVMaa.png\" alt=\"\" width=\"341\" height=\"212\"><\/p>\n<h2>Missing Data<\/h2>\n<p><img data-srcset=\"https:\/\/nycdsa-blog-files.s3.us-east-2.amazonaws.com\/2021\/03\/kristin-teves\/03-missing-550772-JxtkvWtv-300x167.png 300w, https:\/\/nycdsa-blog-files.s3.us-east-2.amazonaws.com\/2021\/03\/kristin-teves\/03-missing-550772-JxtkvWtv-600x333.png 600w, https:\/\/nycdsa-blog-files.s3.us-east-2.amazonaws.com\/2021\/03\/kristin-teves\/03-missing-550772-JxtkvWtv.png 742w\" loading=\"lazy\" alt=\"\" width=\"742\" height=\"412\" data-src=\"https:\/\/nycdsa-blog-files.s3.us-east-2.amazonaws.com\/2021\/03\/kristin-teves\/03-missing-550772-JxtkvWtv.png\" data-sizes=\"(max-width: 742px) 100vw, 742px\" class=\"alignnone size-full wp-image-72947 lazyload\" src=\"image\/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==\"><\/p>\n<p><img loading=\"lazy\" class=\"alignnone size-full wp-image-72947\" src=\"https:\/\/nycdsa-blog-files.s3.us-east-2.amazonaws.com\/2021\/03\/kristin-teves\/03-missing-550772-JxtkvWtv.png\" alt=\"\" width=\"742\" 
height=\"412\"><\/p>\n<p><span>A number of features had significant amounts of missing data. The row Id and features with more than 80% missing values, including PoolQC, MiscFeature (with its counterpart, MiscVal), Alley, and Fence, were deleted from the dataset.<\/span><\/p>\n<p><span>Other features with missing values were imputed based on their data type. Categorical features were imputed with \u201cNone.\u201d Numerical features were imputed with zeros. Features including LotFrontage, MSZoning, Utilities, Electrical, KitchenQual, SaleType, Functional, Exterior1st, and Exterior2nd were imputed with the mode.<\/span><\/p>\n<p><span>One difference between the simple and advanced datasets is whether GarageCars, GarageArea, and MasVnrArea received a \u201cNone\u201d or a zero imputation, since the two imputation methods yielded different R<\/span><sup><span>2<\/span><\/sup><span> values.<\/span><\/p>\n<p><span>To further improve the dataset, several features were transformed to yield meaningful insight. For example, several quality and condition string features&#8211;including ExterQual, ExterCond, BsmtQual, BsmtCond, HeatingQC, KitchenQual, GarageQual, GarageCond&#8211;had values that were converted to ordinal numerical values (e.g. {None: 0, Po: 1, Fa: 2, TA: 3, Gd: 4, Ex: 5}). 
Other categorical variables were converted to ordinal values as well, as seen below:<\/span><\/p>\n<ul>\n<li><span>LotShape &#8211; {IR3: 1, IR2: 2, IR1: 3, Reg: 4}<\/span><\/li>\n<li><span>BsmtExposure &#8211; {None: 0, No: 1, Mn: 2, Av: 3, Gd: 4}<\/span><\/li>\n<li><span>BsmtFinType1 and BsmtFinType2 &#8211; {None: 0, Unf: 1, LwQ: 2, Rec: 3, BLQ: 4, ALQ: 5, GLQ: 6}<\/span><\/li>\n<li><span>Functional &#8211; {None: 0, Sal: 1, Sev: 2, Maj2: 3, Maj1: 4, Mod: 5, Min2: 6, Min1: 7, Typ: 8}<\/span><\/li>\n<li><span>GarageFinish &#8211; {None: 0, Unf: 1, RFn: 2, Fin: 3}<\/span><\/li>\n<li><span>PavedDrive &#8211; {N: 0, P: 1, Y: 2}<\/span><\/li>\n<li><span>CentralAir &#8211; {N: 0, Y: 1}<\/span><\/li>\n<li><span>LandSlope &#8211; {Gtl: 1, Mod: 2, Sev: 3}<\/span><\/li>\n<\/ul>\n<p>For the linear regression model, an F-test revealed six coefficients that were statistically insignificant. We opted to drop these features from the linear regression model.<\/p>\n<p><img data-srcset=\"https:\/\/nycdsa-blog-files.s3.us-east-2.amazonaws.com\/2021\/03\/kristin-teves\/03-ftest-959251-J5ofxCz2-300x250.png 300w, https:\/\/nycdsa-blog-files.s3.us-east-2.amazonaws.com\/2021\/03\/kristin-teves\/03-ftest-959251-J5ofxCz2.png 372w\" loading=\"lazy\" alt=\"\" width=\"372\" height=\"310\" data-src=\"https:\/\/nycdsa-blog-files.s3.us-east-2.amazonaws.com\/2021\/03\/kristin-teves\/03-ftest-959251-J5ofxCz2.png\" data-sizes=\"(max-width: 372px) 100vw, 372px\" class=\"alignnone size-full wp-image-72946 lazyload\" src=\"image\/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==\"><\/p>\n<p><img loading=\"lazy\" class=\"alignnone size-full wp-image-72946\" src=\"https:\/\/nycdsa-blog-files.s3.us-east-2.amazonaws.com\/2021\/03\/kristin-teves\/03-ftest-959251-J5ofxCz2.png\" alt=\"\" width=\"372\" height=\"310\"><\/p>\n<p>Since the F-test also showed GrLivArea to be statistically significant, GrLivArea outliers were filtered out for all models.<\/p>\n<p><img 
data-srcset=\"https:\/\/nycdsa-blog-files.s3.us-east-2.amazonaws.com\/2021\/03\/kristin-teves\/03-outlier-959609-LPhuW3H2-300x119.png 300w, https:\/\/nycdsa-blog-files.s3.us-east-2.amazonaws.com\/2021\/03\/kristin-teves\/03-outlier-959609-LPhuW3H2-600x237.png 600w, https:\/\/nycdsa-blog-files.s3.us-east-2.amazonaws.com\/2021\/03\/kristin-teves\/03-outlier-959609-LPhuW3H2.png 766w\" loading=\"lazy\" alt=\"\" width=\"766\" height=\"303\" data-src=\"https:\/\/nycdsa-blog-files.s3.us-east-2.amazonaws.com\/2021\/03\/kristin-teves\/03-outlier-959609-LPhuW3H2.png\" data-sizes=\"(max-width: 766px) 100vw, 766px\" class=\"alignnone size-full wp-image-72945 lazyload\" src=\"image\/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==\"><\/p>\n<p><img loading=\"lazy\" class=\"alignnone size-full wp-image-72945\" src=\"https:\/\/nycdsa-blog-files.s3.us-east-2.amazonaws.com\/2021\/03\/kristin-teves\/03-outlier-959609-LPhuW3H2.png\" alt=\"\" width=\"766\" height=\"303\"><\/p>\n<h2>Linear Models<\/h2>\n<p><span>The SalePrice distribution is right-skewed, as shown below. To improve model fit, SalePrice was log-transformed so that its distribution is closer to normal. 
ElasticNet with alpha 0.001 and rho 0.6 performed the best of all the linear models.<\/span><\/p>\n<p><img data-srcset=\"https:\/\/nycdsa-blog-files.s3.us-east-2.amazonaws.com\/2021\/03\/kristin-teves\/03-linear-300219-94frhk1v-300x100.png 300w, https:\/\/nycdsa-blog-files.s3.us-east-2.amazonaws.com\/2021\/03\/kristin-teves\/03-linear-300219-94frhk1v-600x200.png 600w, https:\/\/nycdsa-blog-files.s3.us-east-2.amazonaws.com\/2021\/03\/kristin-teves\/03-linear-300219-94frhk1v.png 755w\" loading=\"lazy\" alt=\"\" width=\"755\" height=\"252\" data-src=\"https:\/\/nycdsa-blog-files.s3.us-east-2.amazonaws.com\/2021\/03\/kristin-teves\/03-linear-300219-94frhk1v.png\" data-sizes=\"(max-width: 755px) 100vw, 755px\" class=\"alignnone size-full wp-image-72952 lazyload\" src=\"image\/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==\"><\/p>\n<p><img loading=\"lazy\" class=\"alignnone size-full wp-image-72952\" src=\"https:\/\/nycdsa-blog-files.s3.us-east-2.amazonaws.com\/2021\/03\/kristin-teves\/03-linear-300219-94frhk1v.png\" alt=\"\" width=\"755\" height=\"252\"><\/p>\n<h2>Advanced Regression Models<\/h2>\n<p><span>The two advanced regression models tested were random forest and gradient boosting regression. Both are tree-based ensemble methods that can capture non-linear relationships that linear regression misses. While random forest uses bagging, averaging many trees that are each fit to a random bootstrap sample of the data, gradient boosting builds trees sequentially, combining weak learners into a stronger one as each new tree corrects the errors of the trees before it. 
After obtaining the best parameters of each model using grid search cross-validation, gradient boosting yielded the best R<\/span><sup><span>2<\/span><\/sup><span> score compared to all models tested.<\/span><\/p>\n<p><img data-srcset=\"https:\/\/nycdsa-blog-files.s3.us-east-2.amazonaws.com\/2021\/03\/kristin-teves\/03-results-689967-jXQn4BGX-300x103.png 300w, https:\/\/nycdsa-blog-files.s3.us-east-2.amazonaws.com\/2021\/03\/kristin-teves\/03-results-689967-jXQn4BGX-600x207.png 600w, https:\/\/nycdsa-blog-files.s3.us-east-2.amazonaws.com\/2021\/03\/kristin-teves\/03-results-689967-jXQn4BGX-768x265.png 768w, https:\/\/nycdsa-blog-files.s3.us-east-2.amazonaws.com\/2021\/03\/kristin-teves\/03-results-689967-jXQn4BGX.png 777w\" loading=\"lazy\" alt=\"\" width=\"777\" height=\"268\" data-src=\"https:\/\/nycdsa-blog-files.s3.us-east-2.amazonaws.com\/2021\/03\/kristin-teves\/03-results-689967-jXQn4BGX.png\" data-sizes=\"(max-width: 777px) 100vw, 777px\" class=\"alignnone size-full wp-image-72944 lazyload\" src=\"image\/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==\"><\/p>\n<p><img loading=\"lazy\" class=\"alignnone size-full wp-image-72944\" src=\"https:\/\/nycdsa-blog-files.s3.us-east-2.amazonaws.com\/2021\/03\/kristin-teves\/03-results-689967-jXQn4BGX.png\" alt=\"\" width=\"777\" height=\"268\"><\/p>\n<p><img data-srcset=\"https:\/\/nycdsa-blog-files.s3.us-east-2.amazonaws.com\/2021\/03\/kristin-teves\/03-gbrfeatimport-900483-dutkpTQc-300x269.png 300w, https:\/\/nycdsa-blog-files.s3.us-east-2.amazonaws.com\/2021\/03\/kristin-teves\/03-gbrfeatimport-900483-dutkpTQc.png 378w\" loading=\"lazy\" alt=\"\" width=\"378\" height=\"339\" data-src=\"https:\/\/nycdsa-blog-files.s3.us-east-2.amazonaws.com\/2021\/03\/kristin-teves\/03-gbrfeatimport-900483-dutkpTQc.png\" data-sizes=\"(max-width: 378px) 100vw, 378px\" class=\"alignnone size-full wp-image-72943 lazyload\" src=\"image\/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==\"><\/p>\n<p><img 
loading=\"lazy\" class=\"alignnone size-full wp-image-72943\" src=\"https:\/\/nycdsa-blog-files.s3.us-east-2.amazonaws.com\/2021\/03\/kristin-teves\/03-gbrfeatimport-900483-dutkpTQc.png\" alt=\"\" width=\"378\" height=\"339\"><\/p>\n<p><span>As mentioned previously, gradient boosting yielded the best R<\/span><sup><span>2<\/span><\/sup><span> score, compared to the linear regression and random forest models, with a test R<\/span><sup><span>2<\/span><\/sup><span> value of 0.924. Based on the gradient boosting model, the top housing variables that are important when predicting sale price are HeatingQC and Street. In all, feature engineering played a key role in improving the score for each model. Simpler regression models, like multiple linear regression, achieved lower R<\/span><sup><span>2<\/span><\/sup><span> scores than gradient boosting. Of the models tested, gradient boosting regression is therefore the best model to use when determining important features that predict house prices. <\/span><\/p>\n<p><span>To improve the model further, we would like to continue exploring methods that can minimize overfitting, including cross-validation and parameter tuning. Since there are still a number of data manipulations that can be done and explored, further feature engineering is another possibility, along with improving feature selection for each model. 
<\/span><\/p>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>https:\/\/nycdatascience.com\/blog\/student-works\/predicting-housing-prices-in-ames-iowa-7\/<\/p>\n","protected":false},"author":0,"featured_media":8176,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[2],"tags":[],"_links":{"self":[{"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/posts\/8175"}],"collection":[{"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/types\/post"}],"replies":[{"embeddable":true,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/comments?post=8175"}],"version-history":[{"count":0,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/posts\/8175\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/media\/8176"}],"wp:attachment":[{"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/media?parent=8175"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/categories?post=8175"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/tags?post=8175"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}