Predictive Modeling – House Price Prediction

Source code – https://github.com/Evinwlin/Kaggle_House_price

Introduction

Real estate has long been regarded as one of the best forms of investment: it can generate passive income and appreciate over the long term. When analyzing real estate, many factors come into play, one of the most important being price. Historical prices show the past valuations of a property and reveal how its value has grown. The purpose of this project is to apply machine learning techniques to explore how different variables affect property value and to use that information to construct a predictive model for house prices.

Background

The dataset used in this project comes from the Kaggle competition House Prices: Advanced Regression Techniques. There are a total of 81 variables and 2,560 observations in the data. Since the goal of this project is to explore variables that have potential effects on house prices and to use that information to make predictions, the target variable is sale price, and the remaining 80 predictor variables are used to predict it.

Table of contents

  • Data Cleaning
  • Model Selection
  • Outlier Removal and Feature Selection
  • Model Performance

Data Cleaning — Exploratory Data Analysis

In statistics, exploratory data analysis is an important approach for exploring the characteristics of a dataset; it also helps uncover the underlying story the data is telling. The first step is to closely examine the target variable, sale price, through a few summary statistics and graphs (Figure 1.1). As the figure below shows, the mean and median of the target variable are 180,921.2 and 163,000 respectively. The histogram shows that most house values are concentrated below roughly 300k, and the distribution is positively (right) skewed, meaning it deviates from a normal distribution. A skewness of 1.88 and a kurtosis of 6.54 confirm this. The QQ-plot shows heavy tails, which is a sign of outliers.

Note: A log transformation of sale price is needed, since its distribution deviates from normal.
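To make this concrete, below is a minimal sketch (not the project's exact code) of how the summary statistics, the skewness/kurtosis check, and the log transform might be done in Python, assuming the Kaggle training file train.csv with a SalePrice column:

```python
import numpy as np
import pandas as pd
from scipy import stats

train = pd.read_csv("train.csv")
price = train["SalePrice"]

print(price.describe())                    # mean ~180,921, median ~163,000
print("skewness:", stats.skew(price))      # ~1.88
print("kurtosis:", stats.kurtosis(price))  # ~6.5 (excess kurtosis)

# Right skew and heavy tails -> model log(1 + SalePrice) instead of the raw price
train["LogSalePrice"] = np.log1p(price)
print("skewness after log:", stats.skew(train["LogSalePrice"]))
```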

Next, an examination regarding correlation was performed (Figure 1.2). Examining the correlations between sale price and the predictor variables helps to understand the data better; it also gives a general sense of how the predictors might affect the target variable.

(Figure 1.2)

With further investigation, there are a total of 10 variables with a correlation coefficient above 0.5, which is considered moderately correlated with sale price. Among the 10, OverallQual (overall quality) and GrLivArea (above-grade living area) are considered highly correlated, with coefficients above 0.7. In practical terms, houses with higher overall quality or more living area tend to sell for higher prices, although correlation alone does not imply a one-unit causal effect.
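As a sketch, the moderately and highly correlated predictors can be pulled out of the correlation matrix directly, continuing from the train DataFrame in the earlier sketch; the thresholds mirror the ones described above:

```python
# Correlation of the numeric predictors with SalePrice
numeric = train.select_dtypes(include="number")
corr = numeric.corr()["SalePrice"].drop("SalePrice")

moderate = corr[corr.abs() > 0.5].sort_values(ascending=False)
print(moderate)    # roughly 10 variables, led by OverallQual and GrLivArea

high = corr[corr.abs() > 0.7]
print(high)        # the highly correlated pair
```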

The last step of this analysis is to examine missingness (Figure 1.3). There are 34 variables in the data that contain some degree of missingness. Among them, the variables PoolQC (pool quality), MiscFeature (miscellaneous feature), Alley, Fence, FireplaceQu (fireplace quality), and LotFrontage (linear feet of street connected to the property) have the most missing values.

(Figure1.3)
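A short sketch of how the missingness counts behind Figure 1.3 might be produced, again using the train DataFrame from the earlier sketch:

```python
# Count missing values per column and show the worst offenders
missing = train.isnull().sum()
missing = missing[missing > 0].sort_values(ascending=False)
print(len(missing), "variables contain missing values")
print(missing.head(6))  # e.g. PoolQC, MiscFeature, Alley, Fence, FireplaceQu, LotFrontage
```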

To remedy this issue, imputation is needed. For most of the missing values in categorical variables, the value "None" was imputed, and for most of the continuous variables, a mix of zero, mode, and median imputation was performed. A detailed summary is shown below (Figure 1.4).

(Figure1.4)
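A simplified sketch of this imputation strategy is below; the specific column-to-method pairings are illustrative assumptions, not the project's full mapping from Figure 1.4:

```python
# "None" for categoricals where a missing value means the feature is absent
for col in ["PoolQC", "MiscFeature", "Alley", "Fence", "FireplaceQu"]:
    train[col] = train[col].fillna("None")

# Illustrative zero / mode / median imputations for the remaining variables
train["GarageArea"] = train["GarageArea"].fillna(0)                                  # zero
train["Electrical"] = train["Electrical"].fillna(train["Electrical"].mode()[0])      # mode
train["LotFrontage"] = train["LotFrontage"].fillna(train["LotFrontage"].median())    # median
```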

Model Selection and performance metrics

For the model selection process, a list of models was chosen to try out different types of linear and non-linear models: linear regression, penalized regression, support vector regression, random forest regression, gradient boosting regression, and XGBoost regression. Mean squared error (MSE) and root mean squared error (RMSE) are used as the scoring metrics for evaluating performance; both provide insight into model performance by summarizing the residuals and their standard deviation.
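For reference, with n observations, true values yᵢ and predictions ŷᵢ, the two metrics are MSE = (1/n) Σᵢ (yᵢ − ŷᵢ)² and RMSE = √MSE, so RMSE is expressed in the same units as the target (here, the log of sale price).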

Note – The purpose of trying out different models is to see which type of model works best with the dataset. Linear models such as linear regression and penalized regression usually work well for predicting continuous variables, and lasso regression provides insight into variable importance. Non-linear models such as SVR, random forest, gradient boosting, and XGBoost use techniques like bagging and boosting to make predictions; they are worth trying in order to see whether non-linear models outperform the linear ones.
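A minimal sketch of how this comparison might be set up with scikit-learn and XGBoost is below; the model list matches the one above, but the hyperparameters are placeholders rather than the project's tuned settings, and the preprocessing continues from the earlier sketches:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.model_selection import cross_val_score
from xgboost import XGBRegressor

# One-hot encode the predictors; the target is the log-transformed sale price
X = pd.get_dummies(train.drop(columns=["Id", "SalePrice", "LogSalePrice"])).fillna(0)
y = train["LogSalePrice"]

models = {
    "linear": LinearRegression(),
    "lasso": Lasso(alpha=0.001),
    "ridge": Ridge(alpha=10),
    "svr": SVR(kernel="rbf", C=10),
    "rf": RandomForestRegressor(n_estimators=300, random_state=0),
    "gbr": GradientBoostingRegressor(random_state=0),
    "xgb": XGBRegressor(n_estimators=500, learning_rate=0.05, random_state=0),
}

# Cross-validated RMSE for each candidate model
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error")
    print(f"{name}: RMSE = {np.sqrt(-scores.mean()):.4f}")
```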

Linear regression as baseline model – Outlier Removal and Feature Selection

Linear regression is well known for being sensitive to outliers, since the fitted line is pulled away from the true underlying relationship when outliers are present. Thus, using linear regression as the baseline model can help detect them. Several detection methods were applied: studentized residuals, leverage, DFFITS, Cook's distance, and the Bonferroni outlier test (one-step correction). Among them, Cook's distance performed best, as it reduced the baseline model's mean squared error and AIC the most (Figure 1.5).

(Figure1.5)
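A sketch of the Cook's distance step with statsmodels is below; the 4/n cutoff is a common rule of thumb and may differ from the threshold used in the project:

```python
import statsmodels.api as sm

# Fit the OLS baseline on the prepared X and y from the earlier sketch
ols = sm.OLS(y, sm.add_constant(X.astype(float))).fit()

influence = ols.get_influence()
cooks_d = influence.cooks_distance[0]   # one Cook's distance per observation

keep = cooks_d < 4 / len(y)             # rule-of-thumb cutoff
X_clean, y_clean = X[keep], y[keep]
print("removed", int((~keep).sum()), "influential observations")
```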

To further improve the baseline model, feature selection is needed. As Figure 1.6 shows, out of the four feature selection methods tried, recursive feature elimination works best with the baseline model. It further reduced the model's RMSE to 0.0764, meaning the gap between observed and predicted sale prices has shrunk even more.

Insights – Although the RMSE decreased, the R-squared is 0.9587, which is a sign of possible overfitting.

(Figure 1.6)
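A sketch of the recursive feature elimination step with scikit-learn is below; RFECV picks the number of features by cross-validation, which may differ from how the project fixed the feature count:

```python
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LinearRegression

# Recursively drop the weakest features from the cleaned baseline data
rfe = RFECV(
    estimator=LinearRegression(),
    step=5,                              # remove 5 features per iteration
    cv=5,
    scoring="neg_mean_squared_error",
)
rfe.fit(X_clean, y_clean)

selected = X_clean.columns[rfe.support_]
print(len(selected), "features kept, e.g.", list(selected[:10]))
```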

Model performance

Using RMSE as the index of performance, XGBoost performed better than the other models (Figure 1.7). Although XGBoost shows a promising accuracy score, training the model and grid-searching for hyperparameters took over 30 minutes, so reducing model complexity would be needed to improve the training time. Moreover, the boosting-based non-linear models outperformed the linear ones: XGBoost and gradient boosting scored better than all linear models, while SVR and random forest underperformed.

(Figure1.7)
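The grid search that drives this training time might look like the sketch below; the grid itself is illustrative, not the project's actual search space:

```python
from sklearn.model_selection import GridSearchCV
from xgboost import XGBRegressor

# Even a small grid with 5-fold CV multiplies the number of fits quickly
param_grid = {
    "n_estimators": [500, 1000],
    "max_depth": [3, 4, 5],
    "learning_rate": [0.01, 0.05],
    "subsample": [0.7, 1.0],
}

search = GridSearchCV(
    XGBRegressor(random_state=0),
    param_grid,
    cv=5,
    scoring="neg_mean_squared_error",
    n_jobs=-1,
)
search.fit(X_clean, y_clean)
print("best RMSE:", (-search.best_score_) ** 0.5)
print("best params:", search.best_params_)
```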

Feature importance

One of the purposes of this project is to extract information about which variables affect a house's sale price. To get that information, feature importance from models such as lasso, random forest, and gradient boosting is the way to go. It provides insight into how much each predictor variable contributes to predicting the target variable and ranks the variables by their importance. In the case of random forest regression, the model ranks the predictor variables by how much each one reduces the residual sum of squares; the higher a variable ranks, the more impact it has in predicting sale price (Figure 1.8). A sketch of this extraction follows Figure 1.8 below.

The top three variables are listed below:

  1. Totalsf – Total square footage of a house; an engineered variable combining basement, first-floor, and second-floor square footage.
  2. OverallQual – Ranking of overall quality of a house
  3. GrLivArea – Above grade (ground) living area in square feet

(Figure1.8)
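As referenced above, here is a sketch of extracting this ranking from a fitted random forest; the engineered variable is named TotalSF here (which may differ from the project's naming), and the other column names are the standard Kaggle ones:

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Engineered variable: basement + first-floor + second-floor square footage
train["TotalSF"] = train["TotalBsmtSF"] + train["1stFlrSF"] + train["2ndFlrSF"]

features = pd.get_dummies(train.drop(columns=["Id", "SalePrice", "LogSalePrice"])).fillna(0)

rf = RandomForestRegressor(n_estimators=500, random_state=0)
rf.fit(features, train["LogSalePrice"])

importance = pd.Series(rf.feature_importances_, index=features.columns)
print(importance.sort_values(ascending=False).head(3))  # expect the total-SF variable, OverallQual, GrLivArea
```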

Conclusion

In conclusion, leveraging regression techniques to predict house prices works well, especially XGBoost regression with further tuning of the model setup and training. That said, for homeowners considering improvements to increase the potential value of a house, the results suggest the following:

  1. Increasing the total square footage or the livable above-ground square footage by building additions or fixtures such as a deck or balcony. (Totalsf, GrLivArea)
  2. Improving the condition of the house through remodeling, renovation, or landscaping. (OverallQual)