{"id":8442,"date":"2021-08-18T00:14:43","date_gmt":"2021-08-18T00:14:43","guid":{"rendered":"https:\/\/wealthrevelation.com\/data-science\/2021\/08\/18\/predicting-ames-ia-house-prices-and-identifying-critical-features\/"},"modified":"2021-08-18T00:14:43","modified_gmt":"2021-08-18T00:14:43","slug":"predicting-ames-ia-house-prices-and-identifying-critical-features","status":"publish","type":"post","link":"https:\/\/wealthrevelation.com\/data-science\/2021\/08\/18\/predicting-ames-ia-house-prices-and-identifying-critical-features\/","title":{"rendered":"Predicting Ames, IA House Prices and Identifying Critical Features"},"content":{"rendered":"<div>\n<h3><strong>Introduction<\/strong><\/h3>\n<p>The<a href=\"https:\/\/www.amstat.org\/publications\/jse\/v19n3\/decock.pdf\"> Ames Housing dataset<\/a> was compiled by Dean De Cock for use in data science education and consists of about 2500 house sale records during 2006\u22122010 (Ames is the college town of Iowa State University). It is used by data scientists as a modernized and expanded version of the often cited Boston Housing dataset. The <strong>Ames<\/strong> dataset is also hosted on<a href=\"https:\/\/www.kaggle.com\/c\/house-prices-advanced-regression-techniques\/data\"> <strong>Kaggle<\/strong><\/a> to be used for an entry-level competition.<\/p>\n<h3><strong>Objective<\/strong><\/h3>\n<p>The objective of the project is to use this dataset to<\/p>\n<ul>\n<li>perform descriptive data analysis to gain business insights<\/li>\n<li>build machine learning models to describe the local housing market and to use these models to predict house prices in that market<\/li>\n<\/ul>\n<p>Our team identified several business opportunities. 
They included:<\/p>\n<ol>\n<li>Provide the best prediction of house value to\n<ul>\n<li>Homeowners looking to sell their property \u2013 As is well known, if a property is priced too high, it can sit on the market for a long time, and the homeowner may then have to drop the price even lower than what they could have obtained had it been priced right in the first place.<\/li>\n<li>Homebuyers \u2013 Newcomers to a town usually have a budget in mind and look for features such as a good school district, low crime, and physical attributes of the house (bedrooms, bathrooms, garage, basement, sq ft area, etc.); their goal is to find the best combination of these within their budget. Providing a prediction of the value of different house attributes such as neighborhood, size, etc. would be quite helpful.<\/li>\n<\/ul>\n<\/li>\n<li>Provide the best analysis to investors (\u201cflippers\u201d) who buy properties, upgrade and sell them \u2013 a predictive model that helps identify undervalued properties, given current market prices, and identifies the most value-additive upgrades, based on the current costs of performing those upgrades.<\/li>\n<li>Provide the best analysis to homeowners on how much a particular remodeling project could increase house value \u2013 while homeowners remodel their property for the benefits of living in an upgraded home, they often consider the potential increase in house value when deciding how much to invest in the remodeling project. A model that predicts the increase in house value due to different types of upgrades would be as helpful to the homeowner as it is to the flipper mentioned above.<\/li>\n<\/ol>\n<p>The modeling approach used here is to treat a house as a bundle of attributes (e.g. neighborhood, sq footage, number of bedrooms) and to treat the price as reflecting the value of this bundle. 
The relationship between the prices observed on sold houses and their attributes is modeled and this model is then used both to describe how different attributes affect the sale price of a house and to predict the sale price of a house given its attributes.<\/p>\n<h3>Data<\/h3>\n<p>The dataset contains detailed information about the house attributes, along with sale prices. It<\/p>\n<ul>\n<li>covers the period from Jan 1, 2006 \u2013 July 31, 2010<\/li>\n<li>is a csv file with 2580 rows\u00a0 x 81 columns<\/li>\n<\/ul>\n<p>Documentation about the dataset can be found at<a href=\"http:\/\/jse.amstat.org\/v19n3\/decock\/DataDocumentation.txt\"> http:\/\/jse.amstat.org\/v19n3\/decock\/DataDocumentation.txt<\/a> .<\/p>\n<p>Of the 81 columns, <strong><em>SalePrice<\/em><\/strong> is the target, <strong><em>PID<\/em><\/strong> is the identifier and the remaining 79 columns are the features used in the modeling. Given the large number of features, it made sense to group them by the type of attributes captured by them. 
Accordingly, the features were grouped as follows (details in appendix):<\/p>\n<ul>\n<li><strong>Year of sale &#8211; <\/strong>1 column<\/li>\n<li><strong>Month of sale &#8211; <\/strong>1 column<\/li>\n<li><strong>Neighborhood characteristics &#8211; <\/strong>4 columns<\/li>\n<li><strong>External\/lot characteristics &#8211; <\/strong>9 columns<\/li>\n<li><strong>Building characteristics &#8211; <\/strong>62 columns\n<ul>\n<li><strong>Number of rooms<\/strong> (7 columns)<\/li>\n<li><strong>Area <\/strong>(15 columns)\n<ul>\n<li>Above ground (4), Basement (4), Garage (2), External (5)<\/li>\n<\/ul>\n<\/li>\n<li><strong>Quality <\/strong>(28 columns)\n<ul>\n<li>Overall building (7), Building (4), Basement (6), Garage (5), Roof and Exterior (6)<\/li>\n<\/ul>\n<\/li>\n<li><strong>Miscellaneous <\/strong>(12 columns)\n<\/li>\n<\/ul>\n<\/li>\n<li><strong>Sale Type\/Cond &#8211; <\/strong>2 columns<\/li>\n<\/ul>\n<h3><strong>Preprocessing and Exploratory Data Analysis<\/strong><\/h3>\n<p>To begin with, we inspected the dataset for missing values. We identified 4 features with more than 2000 of the 2580 observations missing: \u2018PoolQC\u2019 (pool quality) with 99.6% missing, \u2018MiscFeature\u2019 (miscellaneous feature not covered in other categories) with 96.3% missing, \u2018Alley\u2019 (type of alley access to property) with 93.4% missing, and \u2018Fence\u2019 (fence quality) with 80% missing. We excluded the columns \u2018PoolQC\u2019 and \u2018MiscFeature\u2019 from further analysis. A closer look at the \u2018Fence\u2019 and \u2018Alley\u2019 features, however, along with cross-referencing the data dictionary, suggested that null values in these 2 features have a special meaning. 
\u2018NA\u2019 here means \u2018No alley access\u2019 in the \u2018Alley\u2019 feature and \u2018No Fence\u2019 in the \u2018Fence\u2019 feature. Therefore, \u2018NA\u2019 values were replaced accordingly. We also found \u2018NA\u2019 values in other categorical features to hold a special meaning. They were thus replaced with a new class per the data dictionary\u2019s definition.<\/p>\n<p>Further, only 1 continuous feature, \u2018Lot Frontage\u2019 (linear feet of street connected to property), was found to contain 462 missing values. We imputed each of these using the mean Lot Frontage of the houses in the same neighborhood as the house with the missing value. Once missingness was taken care of, we split the features up into various groups, i.e. building characteristics, external\/lot characteristics, sale type, neighborhood, and time of sale (year and month sold). The building characteristics were further sub-classified as illustrated below in Table 1.<\/p>\n<div class=\"wp-block-group\">\n<div class=\"wp-block-group__inner-container\">\n<figure class=\"wp-block-image alignnone is-resized wp-image-77564 size-full\"><img loading=\"lazy\" class=\"wp-image-77564 alignleft\" title=\"Table 1. Sub-classification of Building Characteristics\" src=\"https:\/\/nycdsa-blog-files.s3.us-east-2.amazonaws.com\/2021\/08\/ireena-bagai\/buildingcharacteristics2-368741-ZYZX0qWD.png\" alt=\"\" width=\"580\" height=\"97\"><figcaption>\n<p>Table 1. 
Sub-classification of Building Characteristics<\/p>\n<\/figcaption><\/figure>\n<\/div>\n<\/div>\n<p>As part of EDA, we then studied how individual features correlated with response variable using several visualization tools such as scatter plots and histograms for continuous variables and box plots for categorical features (Figure 1).<\/p>\n<div class=\"wp-block-column\">\n<div class=\"wp-block-columns\">\n<div class=\"wp-block-column\">\n<figure class=\"wp-block-image size-large is-resized is-style-default\">\n<div id=\"attachment_77624\" class=\"wp-caption alignleft\"><img aria-describedby=\"caption-attachment-77624\" loading=\"lazy\" class=\"wp-image-77624\" src=\"https:\/\/nycdsa-blog-files.s3.us-east-2.amazonaws.com\/2021\/08\/ireena-bagai\/overallqual-boxplot-850771-hISUZy5V.png\" alt=\"\" width=\"307\" height=\"174\"><\/p>\n<p id=\"caption-attachment-77624\" class=\"wp-caption-text\">Figure 1. Univariate Feature Analysis<\/p>\n<\/div>\n<\/figure>\n<\/div>\n<div class=\"wp-block-column\">\n<figure class=\"wp-block-image size-large is-resized is-style-default\"><img class=\"wp-image-77625 alignleft\" src=\"https:\/\/nycdsa-blog-files.s3.us-east-2.amazonaws.com\/2021\/08\/ireena-bagai\/1stflsqftscatter-210300-kLezgSRZ.png\" alt=\"\" width=\"308\"><\/figure>\n<\/div>\n<\/div>\n<\/div>\n<h3>Modeling<\/h3>\n<p>Based on univariate data analysis and visualizations, we chose 40 features to fit linear and random forest models. First, we dummified all the qualitative features, which stretched the dimensionality of our feature space from 40 to 138 and also resulted in heightened multicollinearity.<\/p>\n<p>In order to obviate the encountered issues, we took a step back and started again by fitting a linear model with all 78 features. Using Variance Inflation Factor (VIF), we tested for multicollinearity and dropped 3 features that had VIF &gt;5. Then, we looked to see if continuous variables in our dataset satisfied the assumption of normality. 
Most of them, including the response variable, Sale Price, were not found to be normally distributed. Therefore, we applied \u201clogarithmic\u201d and \u201csquare root\u201d transformations to these as deemed necessary. Next, we performed a stepwise \u201cboth\u201d feature selection procedure using BIC as the criterion. Here, we started either with just an intercept term or with a model that included all transformed continuous variables as well as the qualitative features, and then sequentially added or removed predictor variables, depending on which ones had the greatest\/smallest impact on the model. Through this process, we shortlisted a set of 27 features, which we moved forward with for the linear models.<\/p>\n<h4><strong>Multiple Linear Regression<\/strong><\/h4>\n<p>In order to evaluate both the Multiple Linear Regression (MLR) and Random Forest models on the same test set, we first split the dataset into 80:20 train and test sets. For MLR, we developed three models. The first one consisted solely of the 9 continuous features. These primarily included the variables related to area, along with Year Built and Year Remodeled. Just these 9 continuous features were able to explain 82% of the variance in the response variable. For the second model, we added the qualitative features as well. Some of these variables were evidently ordinal, so we used the ordinal encoding from scikit-learn. The remaining variables were nominal and were dummified. We finally ended up with a dataset of 75 features. Regression on this set resulted in a training R<sup>2<\/sup> score of 92.93%. Further, to create a more robust model, we used Lasso to shrink the coefficients. To this end, after some experimentation, we settled on a range of 100 lambda values. For each of these lambdas, we conducted a 5-fold cross-validation to find an optimal lambda. In this process, Lasso pruned 25 features, pulling their coefficients to 0. 
As can be seen in Figure 2, there are empty spaces in between the bars. These empty spaces correspond to features whose coefficients were shrunk to 0.<\/p>\n<figure class=\"wp-block-image is-resized\"><img loading=\"lazy\" class=\"wp-image-77575 alignleft\" src=\"https:\/\/nycdsa-blog-files.s3.us-east-2.amazonaws.com\/2021\/08\/ireena-bagai\/laso-dropped-coefficients-123863-SP22PUNZ-1024x596.png\" alt=\"\" width=\"582\" height=\"338\"><\/figure>\n<figure class=\"wp-block-image is-resized\">Figure 2. Barplot showing the Lasso determined attributes&#8217; coefficients<\/figure>\n<p>Thus, we dropped the 25 variables suggested by Lasso and ran another regression. Upon testing our final model for the assumptions of linearity, homoscedasticity, and normality, we found them satisfied, as shown below in Figure 3. Although we noticed 5 outliers, for which our model predicted conservatively, a closer look at these properties didn\u2019t reveal anything markedly different compared to the others. 
We opine that buyers of these properties overpaid (perhaps their realtor should have consulted a data scientist).<\/p>\n<div class=\"wp-block-columns\">\n<div class=\"wp-block-column\">\n<figure class=\"wp-block-image size-large is-resized is-style-default\">\n<div id=\"attachment_77779\" class=\"wp-caption alignleft\"><img aria-describedby=\"caption-attachment-77779\" loading=\"lazy\" class=\"wp-image-77779\" src=\"https:\/\/nycdsa-blog-files.s3.us-east-2.amazonaws.com\/2021\/08\/ireena-bagai\/finalmodel-linearity-141660-iVnyd4dv.png\" alt=\"\" width=\"210\" height=\"201\"><\/p>\n<p id=\"caption-attachment-77779\" class=\"wp-caption-text\">Linear Relationship of Actual and Predicted<\/p>\n<\/div><figcaption>Figure 3. Assumptions tested on final model<\/figcaption><\/figure>\n<\/div>\n<div class=\"wp-block-column\">\n<figure class=\"wp-block-image size-large is-resized is-style-default\">\n<div id=\"attachment_77780\" class=\"wp-caption alignleft\"><img aria-describedby=\"caption-attachment-77780\" loading=\"lazy\" class=\"wp-image-77780\" src=\"https:\/\/nycdsa-blog-files.s3.us-east-2.amazonaws.com\/2021\/08\/ireena-bagai\/finalmodel-constantvariance-976496-rFski0qr.png\" alt=\"\" width=\"210\" height=\"208\"><\/p>\n<p id=\"caption-attachment-77780\" class=\"wp-caption-text\">Homoscedasticity of Residuals<\/p>\n<\/div>\n<\/figure>\n<\/div>\n<div class=\"wp-block-column\">\n<figure class=\"wp-block-image size-large is-resized is-style-default\">\n<div id=\"attachment_77781\" class=\"wp-caption alignleft\"><img aria-describedby=\"caption-attachment-77781\" loading=\"lazy\" class=\"wp-image-77781\" src=\"https:\/\/nycdsa-blog-files.s3.us-east-2.amazonaws.com\/2021\/08\/ireena-bagai\/finalmodel-normality-279273-nDCwORUP.png\" alt=\"\" width=\"210\" height=\"207\"><\/p>\n<p id=\"caption-attachment-77781\" class=\"wp-caption-text\">Normally distributed Residuals<\/p>\n<\/div>\n<\/figure>\n<\/div>\n<\/div>\n<p>Overall, our final MLR model is predicting with an average 
error of ~$23000. Figure 4 compares the train and test scores and Root Mean Squared Error (RMSE) for three models.<\/p>\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter is-resized\"><img loading=\"lazy\" class=\"wp-image-77581\" src=\"https:\/\/nycdsa-blog-files.s3.us-east-2.amazonaws.com\/2021\/08\/ireena-bagai\/leinear-metrics-499908-UDYxsTdj-768x102.png\" alt=\"\" width=\"579\" height=\"76\"><figcaption>Figure 4. MLR models scores comparison<\/figcaption><\/figure>\n<\/div>\n<p>Additionally, we had compared the p-values and coefficient of features in all models and had found Lasso to be in accord with p-value information relayed by Model 2 that was run prior to Lasso. Table 2 shows a snapshot of the comparison. Features that were found insignificant in model 2, as suggested by high p-values, had their coefficients turned to 0 in Lasso.<\/p>\n<figure class=\"wp-block-image\"><img loading=\"lazy\" class=\"wp-image-77594 alignleft\" src=\"https:\/\/nycdsa-blog-files.s3.us-east-2.amazonaws.com\/2021\/08\/ireena-bagai\/p-valuessnapshot-633100-qgB5c0wP-768x296.png\" alt=\"\" width=\"618\" height=\"238\"><\/figure>\n<figure><\/figure>\n<figure><\/figure>\n<figure><\/figure>\n<figure><\/figure>\n<figure><\/figure>\n<figure><\/figure>\n<figure class=\"wp-block-image\">Table 2. Lasso Coefficients consistent with Model-2 p-values<\/figure>\n<p>Next, we quantified the effect of each of these variables. Table 3A compares the property sale price in different neighborhoods with respect to North Ames. Neighborhoods highlighted in green represent those where a person would buy at a premium, while orange highlights present neighborhoods that would sell at a discounted price compared to North Ames. For example, if a person is looking at 2 identical houses to purchase, one in North Ames and one in GreenHills, he can expect to pay about a 60% premium for the house in GreenHills (top row in Table 3A). 
Similarly, an identical house in Edwards neighborhood (bottom row) would sell at a 3.4% discount. In terms of zoning, the houses located in commercial neighborhoods sell at a 17.3% discount compared to those located in residential low density.<\/p>\n<div class=\"wp-block-columns\">\n<div class=\"wp-block-column\">\n<figure class=\"wp-block-image size-large is-resized is-style-default\"><img loading=\"lazy\" class=\"wp-image-77626\" src=\"https:\/\/nycdsa-blog-files.s3.us-east-2.amazonaws.com\/2021\/08\/ireena-bagai\/neighborhoods-056474-FT0Tjnki-635x1024.png\" alt=\"\" width=\"275\" height=\"443\"><figcaption>Table 3A<\/figcaption><\/figure>\n<\/div>\n<div class=\"wp-block-column\">\n<figure class=\"wp-block-image size-large is-resized is-style-default\"><img loading=\"lazy\" class=\"wp-image-77627\" src=\"https:\/\/nycdsa-blog-files.s3.us-east-2.amazonaws.com\/2021\/08\/ireena-bagai\/mszoning-015079-fr9HiRHw.png\" alt=\"\" width=\"276\" height=\"126\"><figcaption>Table 3B<\/figcaption><\/figure>\n<div class=\"wp-block-columns\">\n<div class=\"wp-block-column\">\n<div class=\"wp-block-columns\">\n<div class=\"wp-block-column\">\n<p>Table 3B compares how different zoning classification can affect the sale price. Here, residential low density (RLD) was taken as a baseline. As can be seen, houses located in commercial neighborhoods were found to sell at 17.3% discount compared to RLD.<\/p>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<p>Furthermore, we looked at features of the house that added the most value. Here, we compared a house with standard features to a house with targeted features (Table 4A). Features highlighted in green represent the sale price premium one could expect if a house would be upgraded to the target state. For example, we found that if the overall condition and quality of a house is improved from average to excellent, the price goes up more than 50%. 
Likewise, if the kitchen quality is improved from average to excellent, the price can be increased by 6.4%. The orange highlights indicate houses that would sell at a discounted price. Most houses in this dataset have central air, and a house that doesn\u2019t have that feature would be expected to have a lower price.<\/p>\n<p>Here is how this could work to optimize a sale price: Assume that someone is looking to sell their house and comes to us for an evaluation. Upon investigation, we find that their heating quality is average relative to the standard. We would recommend that they upgrade it to excellent quality to fetch a premium of more than 5.5%. They should also ensure that the cost of the upgrade doesn\u2019t offset the gain.<\/p>\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" class=\"wp-image-77660 alignleft\" src=\"https:\/\/nycdsa-blog-files.s3.us-east-2.amazonaws.com\/2021\/08\/ireena-bagai\/qualityfeatures-237746-a6JNFEXh.png\" alt=\"\" width=\"614\" height=\"260\"><figcaption>\n<p>Table 4A<\/p>\n<\/figcaption><\/figure>\n<p>Table 4B shows the contribution of area-related features to the sale price. With each additional square foot, the price goes up. For instance, a house with 150 additional square feet over 1070 on the first floor can be expected to sell for approximately $10,000 more. Table 4C shows the impact the sale condition has on the sale price. 
For example, if we put our money down on a house that\u2019s still under construction, which is indicated by partial sale here, we would pay a premium of 3.9% compared to a house that was already built. Similarly, a distressed property would sell at about a 10% discount.<\/p>\n<div class=\"wp-block-columns\">\n<div class=\"wp-block-column\">\n<figure class=\"wp-block-image size-medium\"><img loading=\"lazy\" width=\"300\" height=\"57\" class=\"wp-image-77661\" src=\"https:\/\/nycdsa-blog-files.s3.us-east-2.amazonaws.com\/2021\/08\/ireena-bagai\/area-466174-qUjIF9Ao-300x57.png\" alt=\"\"><figcaption>Table 4B<\/figcaption><\/figure>\n<\/div>\n<div class=\"wp-block-column\">\n<div class=\"wp-block-group\">\n<div class=\"wp-block-group__inner-container\">\n<figure class=\"wp-block-image size-medium is-style-default\"><img loading=\"lazy\" width=\"300\" height=\"117\" class=\"wp-image-77632\" src=\"https:\/\/nycdsa-blog-files.s3.us-east-2.amazonaws.com\/2021\/08\/ireena-bagai\/salecondition-151041-vzRMFcOA-300x117.png\" alt=\"\"><figcaption>Table 4C<\/figcaption><\/figure>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<h4><strong>Random Forest<\/strong><\/h4>\n<p>We chose Random Forest as our non-linear model and trained it using the 40 features we had narrowed down from our EDA and the prior univariate and regression analysis. We split the data into train and test sets based on the indices obtained from the split used for multiple linear regression. We used the same indices so we could compare both types of models on the same set of data. We started by tuning the hyperparameters using grid search. The parameters tuned in this process were max depth, max features, number of estimators and minimum samples per split. As a result of tuning the hyperparameters using grid search, we were able to lower the high training score and improve the test score. 
Below are the final scores from pre-tuning and post-tuning of hyperparameters with the list of the best hyperparameters chosen by the model.<\/p>\n<p><img loading=\"lazy\" src=\"https:\/\/lh6.googleusercontent.com\/rr7Oq9oZstiS7-3OZAmok3afoDMJUpBEDawjAjqEuQXnuS1dVbYj1mL_vv3JX5tFJmBv4dzx9pUR11biQzt_Paao9cqsrE2tLqOBac2aozd8PmCrGF2iy6-AmbrwGeXCbYCGKJ-f\" width=\"359\" height=\"165\"><\/p>\n<p><img loading=\"lazy\" src=\"https:\/\/lh3.googleusercontent.com\/BnzNMMXk6FdiWT3m3gQZnoPTTUzXRi80D2DknXEi7eJ9vc7v0yZGi_XgPnR6Wfsp75I7D0Tzyr3-5n4zgjzhOxKebBiY4xah4U1NMzcsBTxRnxn-6nTqV1qz1bZtjU_G6kzVv36e\" width=\"220\" height=\"284\"><\/p>\n<p>Next, we extracted the most important features from Random Forest:<\/p>\n<p><b>Top 10 Explanative Features of house price:<\/b><\/p>\n<p><img loading=\"lazy\" src=\"https:\/\/lh6.googleusercontent.com\/FK8j6JeJlvfkRk_64yw0GTRQZoJwD48DTplXmWFjLvKSpild6s3b1YLq8MuKANj4b4n-si6P3mzCgoJpYRfd6zW1_Rr4Zpy1Mob3N5EEn_2JRhr6mNoBJFbdxUgFf9Sq-JHIhldl\" width=\"347\" height=\"390\">\u00a0 \u00a0 \u00a0 <img loading=\"lazy\" class=\"alignnone wp-image-77860 \" src=\"https:\/\/nycdsa-blog-files.s3.us-east-2.amazonaws.com\/2021\/08\/augi-bold\/highres-170525-vmpZ2tYG.jpg\" alt=\"\" width=\"270\" height=\"386\"><\/p>\n<p>Overall quality represents the overall material and finish of the house. This alone explains about 26% of the house prices. Above ground living area includes the first floor and the second floor. It is ranked second with 13.4% contribution.\u00a0 To our surprise, the garage area was found to be the top fourth most important feature. This could mean that bigger houses are most likely to be accompanied by a good sized garage. 
The following scatterplot illustrates that garage area has a positive relationship with the house sale price.\u00a0\u00a0<\/p>\n<p><img loading=\"lazy\" src=\"https:\/\/lh3.googleusercontent.com\/f1Gr51oCCgJgITCsFJ6KORFrS9JyceITxfsxTM6ocxWyye1tdXBe8oh6aVmd8QJ7-RAXF1ZFNohMprGDWTkVxhy1lHX77Vm2YZa8qsgq0xJ4jPtLFiI9ZaPmmZxLnvYeubQqNUhK\" width=\"408\" height=\"262\"><\/p>\n<p>The ranking of the important features indicates that area and quality of the house are the most important factors in explaining the house prices. To clarify further, the price of a house is a function of the areas of individual parts of the house and not just the overall area. One important finding is that different areas of the house each contribute differently to the house price based on their level of importance as shown by the model.\u00a0\u00a0\u00a0<\/p>\n<p>These Important features also corresponded to our findings from the EDA. Interestingly, the order of the Top Important features was consistent with the pearson correlation ranking of features to the house SalePrice:<\/p>\n<p><img loading=\"lazy\" class=\"\" src=\"https:\/\/lh6.googleusercontent.com\/nAy5L7gqBOhq0rU0YVAfqMW8BYNCHWz_4obZt10HvfiWBkGWk0np4d7hl7A09v6CyGcqOMI3yf4gzxrvMYT_QZLgIbgvPbzcZBz-fHQpp9rIKxMtRwwGvNQGAeFByFi3Q8yYsGdH\" width=\"278\" height=\"164\">\u00a0 \u00a0 \u00a0<img loading=\"lazy\" class=\"alignnone  wp-image-77861\" src=\"https:\/\/nycdsa-blog-files.s3.us-east-2.amazonaws.com\/2021\/08\/augi-bold\/qualitycorr-461901-N8UKOJXm.jpg\" alt=\"\" width=\"171\" height=\"182\"><\/p>\n<p>\u00a0In addition, as seen below, quality of the basement(height), kitchen and exterior also all have\u00a0 a positive impact on the house prices.<\/p>\n<p><b>\u00a0\u00a0\u00a0\u00a0<\/b> <b>Basement Quality(height) and house prices<\/b><\/p>\n<p><img loading=\"lazy\" class=\"alignnone wp-image-77864 \" src=\"https:\/\/nycdsa-blog-files.s3.us-east-2.amazonaws.com\/2021\/08\/augi-bold\/bsmtquality-399184-Ks4hFtXS.jpg\" alt=\"\" 
width=\"550\" height=\"399\"><\/p>\n<p><span>\u00a0\u00a0<\/span> <span> \u00a0 <\/span><b>Kitchen Quality and house prices<\/b><\/p>\n<p><img loading=\"lazy\" class=\"alignnone wp-image-77866 \" src=\"https:\/\/nycdsa-blog-files.s3.us-east-2.amazonaws.com\/2021\/08\/augi-bold\/kitchenquality-880194-AJW5Lj8l.jpg\" alt=\"\" width=\"552\" height=\"400\"><\/p>\n<p><b>\u00a0\u00a0\u00a0\u00a0<\/b> <b>Exterior Quality and house prices<\/b><\/p>\n<p><img loading=\"lazy\" class=\"alignnone wp-image-77867 \" src=\"https:\/\/nycdsa-blog-files.s3.us-east-2.amazonaws.com\/2021\/08\/augi-bold\/exteriorquality-324687-Ksh2ojZ4.jpg\" alt=\"\" width=\"554\" height=\"402\"><\/p>\n<p>With these results in mind, we bring our model to a business case scenario.\u00a0\u00a0<\/p>\n<p>For our business objective of providing house value prediction to a prospective homebuyer so they can target houses knowing what prices to pay for the attributes they are seeking, we chose the following scenario:\u00a0<\/p>\n<p>A professor who joined the Iowa State University faculty is looking for a house for his family and children in a good-school district near the campus where he would be working.\u00a0 We simulated his preferences and estimated a house price in North Ames neighborhood near the campus close to the 4th highest ranked school district in the state. 
We chose a 2-story remodeled house with above-average overall material quality, 3 bedrooms, an 80-89 inch basement ceiling, and an unfinished garage. Within this scenario, we can advise the professor to expect a house price of $161,711 according to our Random Forest model, which has a 16% RMSE, meaning the forecast is within 16% of the actual value 67% of the time.<\/p>\n<p>To put it in perspective, our model is quite comparable to a real market evaluation like that of Zillow, as seen in the figure below:<\/p>\n<p><img loading=\"lazy\" class=\"alignnone\" src=\"https:\/\/lh4.googleusercontent.com\/sp1ZgAIMe9XDdmj6flbHRDAFCGR-xils77VndbFhNEYyxCPUcO-e5KnWEdac7vjYCnGqtmYAzgqmdHBKP1dBZUI3MPzBzJiQWzh96umte5kTEdMKhx0Mge70gSfnI0v7I0PieiAt\" alt=\"\" width=\"684\" height=\"400\"><\/p>\n<p>Looking at relatively similar, smaller Midwestern cities like Cleveland and Cincinnati, Ohio, we found that for Cleveland, for instance, Zillow\u2019s house price estimate falls within 20% of the actual sale price about 78.8% of the time, according to their website.<\/p>\n<p>Although we used the Random Forest model for our case study, our MLR model, with a higher R<sup>2<\/sup> and lower RMSE, is our best model and can predict house prices with 92% accuracy within 13% of the actual value 67% of the time. The following table compares the performance of both types of models in predicting house prices on the Ames housing dataset:<\/p>\n<p><img loading=\"lazy\" src=\"https:\/\/lh4.googleusercontent.com\/gZJZvHaynQMuI3XY3Y-l6wOaMC7eAUGYkztvaXNY9yqaenLiOSiOSLfPNOFegCTg8kOcw_QpVN5OLZPx9IdopceXZJo8_05KhR-qA6frx2xjr7meVxgatzQcmK_wgYGRd7d8Zv4r\" width=\"413\" height=\"158\"><\/p>\n<h2>Conclusion<\/h2>\n<p>Finally, we found from our analysis that although overall square footage is probably the most important factor in predicting house price, the distribution of the total area across the different parts of the house &#8211; first floor, 2nd 
floor, bedroom etc &#8211;\u00a0 matters significantly and contributes differently in impacting the house price.\u00a0 In addition, we saw how the overall material and finish of the house, the quality of certain rooms like the kitchen and features like the basement exposure and height are quite significant in the valuation of the house price.\u00a0 Lastly, we quantified the effect of each individual feature on the house price using our Multiple Linear Regression.\u00a0 This enabled us to compare a house with standard features to a house with targeted features resulting in price differentiation so house owners and buyers can have better expectation and approximation when it comes to renovation and purchase.<\/p>\n<p><!-- \/wp:column --><\/div>\n","protected":false},"excerpt":{"rendered":"<p>https:\/\/nycdatascience.com\/blog\/student-works\/house-price-prediction-in-ames-iowa\/<\/p>\n","protected":false},"author":0,"featured_media":8443,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[2],"tags":[],"_links":{"self":[{"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/posts\/8442"}],"collection":[{"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/types\/post"}],"replies":[{"embeddable":true,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/comments?post=8442"}],"version-history":[{"count":0,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/posts\/8442\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/media\/8443"}],"wp:attachment":[{"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/media?parent=8442"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/categories?post
=8442"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/tags?post=8442"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}