{"id":2689,"date":"2020-10-05T16:53:46","date_gmt":"2020-10-05T16:53:46","guid":{"rendered":"https:\/\/data-science.gotoauthority.com\/2020\/10\/05\/your-guide-to-linear-regression-models\/"},"modified":"2020-10-05T16:53:46","modified_gmt":"2020-10-05T16:53:46","slug":"your-guide-to-linear-regression-models","status":"publish","type":"post","link":"https:\/\/wealthrevelation.com\/data-science\/2020\/10\/05\/your-guide-to-linear-regression-models\/","title":{"rendered":"Your Guide to Linear Regression Models"},"content":{"rendered":"<div id=\"post-\">\n<p><b>By <a href=\"https:\/\/www.linkedin.com\/in\/lopezyse\/\" target=\"_blank\" rel=\"noopener noreferrer\">Diego Lopez Yse<\/a>, Data Scientist<\/b><\/p>\n<div>\n<img src=\"https:\/\/miro.medium.com\/max\/4320\/0*-rG6g8xIcKoWnncc\" alt=\"Figure\" width=\"100%\"><br \/><span><\/p>\n<p><\/span>\n<\/div>\n<p>\u00a0<\/p>\n<p>Interpretability is one of the biggest challenges in machine learning. A model has more interpretability than another one if its decisions are easier for a human to comprehend. Some models are so complex and are internally structured in such a way that it\u2019s almost impossible to understand how they reached their final results. These black boxes seem to break the association between raw data and final output, since several processes happen in between.<\/p>\n<p>But in the universe of machine learning algorithms, some models are more transparent than others.\u00a0<a href=\"https:\/\/towardsdatascience.com\/modelling-classification-trees-3607ad43a123\" rel=\"noopener noreferrer\" target=\"_blank\">Decision Trees<\/a>\u00a0are definitely one of them, and Linear Regression models are another one. Their simplicity and straightforward approach turns them into an ideal tool to approach different problems. Let\u2019s see how.<\/p>\n<p>You can use Linear Regression models to analyze how salaries in a given place depend on features like experience, level of education, role, city they work in, and so on. 
Similarly, you can analyze whether real estate prices depend on factors such as area, number of bedrooms, or distance to the city center.<\/p>\n<p>In this post, I\u2019ll focus on Linear Regression models that examine the linear relationship between a\u00a0<strong>dependent variable\u00a0<\/strong>and one (Simple Linear Regression) or more (Multiple Linear Regression) <strong>independent variables<\/strong>.<\/p>\n<p>\u00a0<\/p>\n<h3>Simple Linear Regression (SLR)<\/h3>\n<p>\u00a0<br \/>SLR is the simplest form of Linear Regression, used when there is a single input variable (predictor) for the output variable (target):<\/p>\n<ul>\n<li>The\u00a0<strong>input<\/strong>\u00a0or\u00a0<strong>predictor variable<\/strong>\u00a0is the variable that helps predict the value of the output variable. It is commonly referred to as\u00a0<strong><em>X<\/em><\/strong>.\n<\/li>\n<li>The\u00a0<strong>output\u00a0<\/strong>or<strong>\u00a0target variable<\/strong>\u00a0is the variable that we want to predict. It is commonly referred to as\u00a0<strong><em>y<\/em><\/strong>.\n<\/li>\n<\/ul>\n<p><img alt=\"Image for post\" class=\"aligncenter\" src=\"https:\/\/miro.medium.com\/max\/507\/1*zGkbD-yIkANDwn4i19VYdg.png\" width=\"100%\"><\/p>\n<p>The value of\u00a0<strong>\u03b20, also called the intercept<\/strong>, shows the point where the estimated regression line crosses the\u00a0<strong><em>y<\/em><\/strong>\u00a0axis, while the value of\u00a0<strong>\u03b21<\/strong>\u00a0<strong>determines the slope<\/strong>\u00a0of the estimated regression line. The\u00a0<strong>random error<\/strong>\u00a0describes the random component of the linear relationship between the dependent and independent variables (the disturbance of the model, the part of\u00a0<strong><em>y<\/em><\/strong>\u00a0that\u00a0<strong><em>X\u00a0<\/em><\/strong>is unable to explain). 
The true regression model is almost never known (since we are not able to capture all the effects that impact the dependent variable), and therefore the value of the random error term corresponding to observed data points remains unknown. However, the regression model can be estimated by calculating the parameters of the model for an observed data set.<\/p>\n<p>The idea behind regression is to estimate the parameters\u00a0<strong>\u03b20<\/strong>\u00a0and\u00a0<strong>\u03b21<\/strong>\u00a0from a sample. If we are able to determine the optimum values of these two parameters, then we will have the\u00a0<strong>line of best fit<\/strong>\u00a0that we can use to predict the values of\u00a0<strong><em>y<\/em><\/strong>, given the value of\u00a0<strong><em>X<\/em><\/strong>. In other words, we fit a line that captures the relationship between the input and output variables, and then use it to predict the output for unseen inputs.<\/p>\n<p><img alt=\"Image for post\" class=\"aligncenter\" src=\"https:\/\/miro.medium.com\/max\/707\/1*IbiwnMrmD3Bwe5VwY4dFVg.png\" width=\"100%\"><br \/>\u00a0<\/p>\n<p>How do we estimate\u00a0<strong>\u03b20<\/strong><em>\u00a0<\/em>and\u00a0<strong>\u03b21<\/strong>? We can use a method called\u00a0<strong>Ordinary Least Squares (OLS)<\/strong>.<strong>\u00a0<\/strong>The goal is to bring the distance from the black dots to the red line as close to zero as possible, which is done by minimizing the sum of squared differences between actual and predicted outcomes.<\/p>\n<p>The difference between actual and predicted values is called the\u00a0<strong>residual (e)<\/strong><em>\u00a0<\/em>and can be negative or positive depending on whether the model overpredicted or underpredicted the outcome. Hence, adding all the residuals directly to calculate the net error can lead to positive and negative terms cancelling out, which understates the total error. 
To avoid this, we take the sum of squares of these error terms, which is called the\u00a0<strong><em>Residual Sum of Squares (RSS).<\/em><\/strong><\/p>\n<p><img alt=\"Image for post\" class=\"aligncenter\" src=\"https:\/\/miro.medium.com\/max\/249\/0*1m12le87BriQkEMF.png\"><br \/>\u00a0<\/p>\n<p>The\u00a0<strong>Ordinary Least Squares (OLS) method minimizes the residual sum of squares<\/strong>, and its objective is to fit a regression line that would minimize the distance (measured in quadratic values) from the observed values to the predicted ones (the regression line).<\/p>\n<p>\u00a0<\/p>\n<h3>Multiple Linear Regression (MLR)<\/h3>\n<p>\u00a0<br \/>MLR is the form of Linear Regression used when there are two or more predictors or input variables. Similar to the SLR model described before, it includes additional predictors:<\/p>\n<p><img alt=\"Image for post\" class=\"aligncenter\" src=\"https:\/\/miro.medium.com\/max\/414\/1*Ko7YDmTa_TctiL2Fkm-kGQ.png\" width=\"100%\"><\/p>\n<p>Notice that the equation is just an extension of the Simple Linear Regression one, in which each input\/predictor has its corresponding slope coefficient\u00a0<strong>(\u03b2<em>)<\/em><\/strong>. The first\u00a0<strong>\u03b2<\/strong><em>\u00a0<\/em>term\u00a0<strong>(\u03b20)<\/strong>\u00a0is the intercept constant and is the value of\u00a0<strong><em>y<\/em><\/strong>\u00a0in the absence of all predictors (i.e., when all\u00a0<strong><em>X<\/em><\/strong>\u00a0terms are 0).<\/p>\n<p>As the number of features grows, the complexity of our model increases and it becomes more difficult to visualize, or even comprehend, our data. Because there are more parameters in these models compared to SLR ones, more care is needed when working with them. Adding more terms will inherently improve the fit to the training data, but the new terms may not have any real significance. 
This is dangerous because it can lead to a model that fits the data but doesn\u2019t actually mean anything useful.<\/p>\n<p>\u00a0<\/p>\n<h3>An example<\/h3>\n<p>\u00a0<br \/>The advertising dataset consists of the sales of a product in 200 different markets, along with advertising budgets for three different media: TV, radio, and newspaper. We\u2019ll use the dataset to predict the amount of sales (dependent variable), based on the TV, radio and newspaper advertising budgets (independent variables).<\/p>\n<p>Mathematically, the formula we\u2019ll try to solve is:<\/p>\n<p><img alt=\"Image for post\" class=\"aligncenter\" src=\"https:\/\/miro.medium.com\/max\/414\/1*2OhUDPvd3Y7B4ZtgRUc7Ow.png\" width=\"100%\"><\/p>\n<p>Finding the values of these constants\u00a0<strong>(\u03b2)<\/strong>\u00a0is what the regression model does by minimizing the error function and fitting the best line or hyperplane (depending on the number of input variables). Let\u2019s code.<\/p>\n<p>\u00a0<\/p>\n<h3>Load data and describe dataset<\/h3>\n<p>\u00a0<br \/>You can download the dataset from\u00a0<a href=\"https:\/\/github.com\/dlopezyse\/Medium\" rel=\"noopener noreferrer\" target=\"_blank\">this link<\/a>. 
Before loading the data, we\u2019ll import the necessary libraries:<\/p>\n<div>\n<pre><code>import pandas as pd\r\nimport numpy as np\r\nimport seaborn as sns\r\nimport matplotlib.pyplot as plt\r\nfrom sklearn.model_selection import train_test_split\r\nfrom sklearn.linear_model import LinearRegression\r\nfrom sklearn import metrics\r\nfrom sklearn.metrics import r2_score\r\nimport statsmodels.api as sm<\/code><\/pre>\n<\/div>\n<p>Now we load the dataset:<\/p>\n<div>\n<pre><code>df = pd.read_csv('Advertising.csv')<\/code><\/pre>\n<\/div>\n<p>Let\u2019s understand the dataset and describe it:<\/p>\n<p><img alt=\"Image for post\" class=\"aligncenter\" src=\"https:\/\/miro.medium.com\/max\/1077\/1*fPwZ_gX20AtN-knnHGTIBg.png\" width=\"100%\"><br \/>\u00a0<\/p>\n<p>We\u2019ll drop the first column (\u201cUnnamed: 0\u201d) since we don\u2019t need it:<\/p>\n<div>\n<pre><code>df = df.drop(['Unnamed: 0'], axis=1)\r\ndf.info()<\/code><\/pre>\n<\/div>\n<p><img alt=\"Image for post\" class=\"aligncenter\" src=\"https:\/\/miro.medium.com\/max\/1070\/1*x6TRgBQLiHaucn9cE32U2A.png\" width=\"100%\"><br \/>\u00a0<\/p>\n<p>Our dataset now contains 4 columns (including the target variable \u201csales\u201d), 200 records and no missing values. Let\u2019s visualize the relationship between the independent and target variables.<\/p>\n<p><img alt=\"Image for post\" class=\"aligncenter\" src=\"https:\/\/miro.medium.com\/max\/1089\/1*OrnAQH8eXXq3oHcUp50ndQ.png\" width=\"100%\"><br \/>\u00a0<\/p>\n<p>The relationship between TV and sales seems to be pretty strong, and while there seems to be some trend between radio and sales, the relationship between newspaper and sales seems to be nonexistent. 
We can also verify that numerically through a correlation map:<\/p>\n<div>\n<pre><code>mask = np.tril(df.corr())\r\nsns.heatmap(df.corr(), fmt='.1g', annot=True, cmap='cool', mask=mask)<\/code><\/pre>\n<\/div>\n<p><img alt=\"Image for post\" class=\"aligncenter\" src=\"https:\/\/miro.medium.com\/max\/848\/1*PL6X-0nllnafSun0TXgmNA.png\" width=\"100%\"><br \/>\u00a0<\/p>\n<p>As we expected, the strongest positive correlation happens between sales and TV, while the relationship between sales and newspaper is close to 0.<\/p>\n<p>\u00a0<\/p>\n<h3>Select features and target variable<\/h3>\n<p>\u00a0<br \/>Next, we divide the variables into two sets: dependent (or target variable \u201cy\u201d) and independent (or feature variables \u201cX\u201d).<\/p>\n<div>\n<pre><code>X = df.drop(['sales'], axis=1)\r\ny = df['sales']<\/code><\/pre>\n<\/div>\n<p>\u00a0<\/p>\n<h3>Split the dataset<\/h3>\n<p>\u00a0<br \/>To understand model performance, dividing the dataset into a training set and a test set is a good strategy. By splitting the dataset into two separate sets, we can train using one set and test the model performance using unseen data on the other one.<\/p>\n<div>\n<pre><code>X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)<\/code><\/pre>\n<\/div>\n<p>We split our dataset into 70% train and 30% test. The random_state parameter is used for initializing the internal random number generator, which decides how the data is split into train and test indices. 
I set random_state=0 so that the split is reproducible and you can compare your output across multiple runs of the code.<\/p>\n<div>\n<pre><code>print(X_train.shape,y_train.shape,X_test.shape,y_test.shape)<\/code><\/pre>\n<\/div>\n<p><img alt=\"Image for post\" class=\"aligncenter\" src=\"https:\/\/miro.medium.com\/max\/1044\/1*O3wLPGQ1lZE6xgW56Mg3UQ.png\" width=\"100%\"><br \/>\u00a0<\/p>\n<p>By printing the shape of the split sets, we see that we created:<\/p>\n<ul>\n<li>2 datasets of 140 records each (70% of total records), one with 3 independent variables and one with just the target variable, that will be used for\u00a0<strong>training<\/strong>\u00a0and producing the linear regression model.\n<\/li>\n<li>2 datasets of 60 records each (30% of total records), one with 3 independent variables and one with just the target variable, that will be used for\u00a0<strong>testing<\/strong>\u00a0the performance of the linear regression model.\n<\/li>\n<\/ul>\n<p>\u00a0<\/p>\n<h3>Build model<\/h3>\n<p>\u00a0<br \/>Building the model is as simple as:<\/p>\n<div>\n<pre><code>mlr = LinearRegression()<\/code><\/pre>\n<\/div>\n<p>\u00a0<\/p>\n<h3>Train model<\/h3>\n<p>\u00a0<br \/>Fitting your model to the training data represents the training part of the modelling process. 
After it is trained, the model can be used to make predictions with a call to its predict method. Training itself is a single fit call:<\/p>\n<div>\n<pre><code>mlr.fit(X_train, y_train)<\/code><\/pre>\n<\/div>\n<p>Let\u2019s see the output of the model after being trained, and take a look at the value of\u00a0<strong>\u03b20<\/strong>\u00a0(the intercept):<\/p>\n<p><img alt=\"Image for post\" class=\"aligncenter\" src=\"https:\/\/miro.medium.com\/max\/939\/1*feaLlmrnj92ADvZ3Frz21w.png\" width=\"100%\"><\/p>\n<p>We can also print the values of the coefficients\u00a0<strong>(\u03b2)<\/strong>:<\/p>\n<div>\n<pre><code>coeff_df = pd.DataFrame(mlr.coef_, X.columns, columns=['Coefficient'])\r\ncoeff_df<\/code><\/pre>\n<\/div>\n<p><img alt=\"Image for post\" class=\"aligncenter\" src=\"https:\/\/miro.medium.com\/max\/984\/1*5NqHYDLSlQ2qnrglIFKDCA.png\" width=\"100%\"><\/p>\n<p>This way we can now estimate the value of \u201csales\u201d based on different budget values for TV, radio and newspaper:<\/p>\n<p><img alt=\"Image for post\" class=\"aligncenter\" src=\"https:\/\/miro.medium.com\/max\/1194\/1*raNZs5Bd9ISvgfn3QA4ngQ.png\" width=\"100%\"><\/p>\n<p>For example, if we set a budget of 50 for TV, 30 for radio and 10 for newspaper, the estimated value of \u201csales\u201d will be:<\/p>\n<div>\n<pre><code>example = [50, 30, 10]\r\noutput = mlr.intercept_ + sum(example*mlr.coef_)\r\noutput<\/code><\/pre>\n<\/div>\n<p><img alt=\"Image for post\" class=\"aligncenter\" src=\"https:\/\/miro.medium.com\/max\/878\/1*vF4DFBq8IGEXaoamiiX3Hw.png\" width=\"100%\"><\/p>\n<p>\u00a0<\/p>\n<h3>Test model<\/h3>\n<p>\u00a0<br \/>A test dataset is a dataset that is independent of the training dataset. 
Because the model has never seen this data, it gives you a better view of the model\u2019s ability to generalize:<\/p>\n<div>\n<pre><code>y_pred = mlr.predict(X_test)<\/code><\/pre>\n<\/div>\n<p>\u00a0<\/p>\n<h3><strong>Evaluate Performance<\/strong><\/h3>\n<p>\u00a0<br \/>The quality of a model is related to how well its predictions match up against the actual values of the testing dataset:<\/p>\n<div>\n<pre><code>print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))\r\nprint('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))\r\nprint('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))\r\nprint('R Squared Score is:', r2_score(y_test, y_pred))<\/code><\/pre>\n<\/div>\n<p><img alt=\"Image for post\" class=\"aligncenter\" src=\"https:\/\/miro.medium.com\/max\/1008\/1*tteI8_puZtTefFmpokyNIQ.png\" width=\"100%\"><\/p>\n<p>After validating our model against the testing set, we get an R\u00b2 of 0.86 which seems like a pretty decent performance score. But although a higher R\u00b2 indicates a better fit for the model, a high value is not always a good sign. We\u2019ll see below some ways to interpret and improve our regression models.<\/p>\n<p>\u00a0<\/p>\n<h3><strong>How to interpret and improve your model?<\/strong><\/h3>\n<p>\u00a0<br \/>OK, we created our model, and now what? 
Let\u2019s take a look at the model statistics over the training data to get some answers:<\/p>\n<div>\n<pre><code>X2 = sm.add_constant(X_train)\r\nmodel_stats = sm.OLS(y_train.values.reshape(-1,1), X2).fit()\r\nmodel_stats.summary()<\/code><\/pre>\n<\/div>\n<p><img alt=\"Image for post\" class=\"aligncenter\" src=\"https:\/\/miro.medium.com\/max\/1109\/1*5v33qvGfs5LEq0OVU7WLyw.png\" width=\"100%\"><\/p>\n<p>Let\u2019s see below what these numbers mean.<\/p>\n<p>\u00a0<\/p>\n<h3><strong>Hypothesis Test<\/strong><\/h3>\n<p>\u00a0<br \/>One of the fundamental questions you should answer when running an MLR model is whether\u00a0<a href=\"https:\/\/towardsdatascience.com\/multiple-linear-regression-8cf3bee21d8b\" rel=\"noopener noreferrer\" target=\"_blank\">at least one of the predictors is useful in predicting the output<\/a>. What if the relationship between the independent variables and target is just by chance and there is no actual impact on sales due to any of the predictors?<\/p>\n<p>We need to perform a Hypothesis Test to answer this question and check our assumptions. 
It all starts by forming a\u00a0<strong>Null Hypothesis (H0)<\/strong>, which states that all the coefficients are equal to zero, and there\u2019s no relationship between predictors and target (meaning that a model with no independent variables fits the data as well as your model):<\/p>\n<div>\n<img src=\"https:\/\/miro.medium.com\/max\/355\/0*zKszcsewUKm5IIWf.png\" alt=\"Figure\" width=\"100%\"><br \/><span><\/p>\n<p><\/span>\n<\/div>\n<p>\u00a0<\/p>\n<p>On the other hand, we need to define an\u00a0<strong>Alternative Hypothesis (Ha)<\/strong>, which states that at least one of the coefficients is not zero, and there is a relationship between predictors and target (meaning that your model fits the data better than the intercept-only model):<\/p>\n<div>\n<img src=\"https:\/\/miro.medium.com\/max\/362\/0*3S8ecOREimbhlIRC.png\" alt=\"Figure\" width=\"100%\"><br \/><span><\/p>\n<p><\/span>\n<\/div>\n<p>\u00a0<\/p>\n<p>If we want to reject the Null Hypothesis and have confidence in our regression model, we need to find strong statistical evidence. To do this we perform a hypothesis test, for which we use the\u00a0<strong>F-Statistic<\/strong>.<\/p>\n<blockquote>\n<p>\n<em>If the value of F-statistic is equal to or very close to 1, then the results are in favor of the Null Hypothesis and we fail to reject it.<\/em>\n<\/p>\n<\/blockquote>\n<p>As we can see in the table above (marked in yellow), the F-statistic is 439.9, thus providing strong evidence against the Null Hypothesis (that all coefficients are zero). Next, we also need to check the\u00a0<strong>probability of occurrence of the F-statistic<\/strong>\u00a0(also marked in yellow) under the assumption that the null hypothesis is true, which is 8.76e-70, an exceedingly small number lower than 1%. 
This means that there is much less than 1% probability that an F-statistic as large as 439.9 could have occurred by chance under the assumption of a valid Null Hypothesis.<\/p>\n<p>Having said this, we can reject the Null Hypothesis and be confident that at least one predictor is useful in predicting the output.<\/p>\n<p>\u00a0<\/p>\n<h3>Generate models<\/h3>\n<p>\u00a0<br \/>Running a Linear Regression model with many variables, including irrelevant ones, will lead to a needlessly complex model. Which of the predictors are important? Are all of them significant to our model? To find that out, we need to perform a process called\u00a0<strong>feature selection<\/strong>.<em>\u00a0<\/em>The 2 main methods for feature selection are:<\/p>\n<ol>\n<li>\n<strong>Forward Selection:\u00a0<\/strong>where predictors are added one at a time beginning with the predictor with the highest correlation with the dependent variable. Then, variables of greater theoretical importance are incorporated into the model sequentially, until a stopping rule is reached.\n<\/li>\n<li>\n<strong>Backward Elimination:<\/strong>\u00a0where you start with all variables in the model, and remove the least statistically significant variables (those with the highest p-values) one at a time, until a stopping rule is reached.\n<\/li>\n<\/ol>\n<p>Although both methods can be used, unless the number of predictors is larger than the sample size (or number of events), it\u2019s usually preferred to use a backward elimination approach.<\/p>\n<p>You can find a full example and implementation of these methods in\u00a0<a href=\"https:\/\/towardsdatascience.com\/multiple-linear-regression-8cf3bee21d8b\" rel=\"noopener noreferrer\" target=\"_blank\">this link<\/a>.<\/p>\n<p>\u00a0<\/p>\n<h3><strong>Compare models<\/strong><\/h3>\n<p>\u00a0<br \/>Every time you add an independent variable to a model, the R\u00b2 on the training data never decreases (and almost always increases), even if the independent variable is insignificant. In our model, are all predictors contributing to an increase in sales? 
And if so, are they all doing it to the same extent?<\/p>\n<p>As opposed to R\u00b2,<strong>\u00a0Adjusted R\u00b2\u00a0<\/strong>is a measure that increases only when the added independent variable improves the model more than would be expected by chance. So, if your R\u00b2 score increases but the Adjusted R\u00b2 score decreases as you add variables to the model, then you know that some features are not useful and you should remove them.<\/p>\n<p>An interesting finding in the table above is that the\u00a0<strong>p-value<\/strong>\u00a0for newspaper is very high (0.789, marked in red). Finding the p-value for each coefficient will tell if the variable is statistically significant to predict the target or not.<\/p>\n<blockquote>\n<p>\nAs a general rule of thumb, if the p-value for a given variable is less than 0.05 then there is statistically significant evidence of a relationship between that variable and the target.\n<\/p>\n<\/blockquote>\n<p>Given this, including the variable newspaper doesn\u2019t seem to be appropriate to reach a robust model, and removing it may improve the performance and generalization of the model.<\/p>\n<p>Besides the Adjusted R\u00b2 score, you can use other criteria to compare different regression models:<\/p>\n<ul>\n<li>\n<strong>Akaike Information Criterion (AIC):<\/strong>\u00a0a technique used to estimate how well a model is likely to predict\/estimate future values. It rewards models that achieve a high goodness-of-fit score and penalizes them if they become overly complex. 
The preferred model is the one with the lowest AIC among the candidate models.\n<\/li>\n<li>\n<strong>Bayesian Information Criterion (BIC):<\/strong>\u00a0another criterion for model selection that measures the trade-off between model fit and complexity, penalizing overly complex models even more than AIC.\n<\/li>\n<\/ul>\n<p>\u00a0<\/p>\n<h3>Assumptions<\/h3>\n<p>\u00a0<br \/>Because Linear Regression models are an approximation of the long-term sequence of any event, they require some assumptions to be made about the data they represent in order to remain appropriate. Most statistical tests rely upon certain assumptions about the variables used in the analysis, and when these assumptions are not met, the results may not be trustworthy (e.g. resulting in Type I or Type II errors).<\/p>\n<p>Linear Regression models are linear in the sense that the output is a linear combination of the input variables, and are only suited for modeling relationships that are linear in this sense. Linear Regression models work under various assumptions that must hold in order to produce a proper estimation and not to depend solely on accuracy scores:<\/p>\n<ul>\n<li>\n<strong>Linearity<\/strong>: the relationship between the features and target must be linear. One way to check the linear relationships is to visually inspect scatter plots for linearity. If the relationship displayed in the scatter plot is not linear, then we\u2019d need to run a non-linear regression or transform the data.\n<\/li>\n<li>\n<strong>Homoscedasticity<\/strong>: the variance of the residuals must be the same for any value of x. Multiple linear regression assumes that the amount of error in the residuals is similar at each point of the linear model. This scenario is known as homoscedasticity. Scatter plots are a good way to check whether the data are homoscedastic, and several tests also exist to validate the assumption numerically (e.g. 
Goldfeld-Quandt, Breusch-Pagan, White).\n<\/li>\n<\/ul>\n<div>\n<img src=\"https:\/\/miro.medium.com\/max\/1050\/0*Pf8JAomWSiIAICyA.png\" alt=\"Figure\" width=\"100%\"><br \/><span><\/p>\n<p><\/span>\n<\/div>\n<p>\u00a0<\/p>\n<ul>\n<li>\n<strong>No multicollinearity:<\/strong>\u00a0data should not show multicollinearity, which occurs when the independent variables (explanatory variables) are highly correlated to one another. If this happens, there will be problems in figuring out the specific variable that contributes to the variance in the dependent\/target variable. This assumption can be tested with the Variance Inflation Factor (VIF) method, or through a correlation matrix. Alternatives to solve this issue may be centering the data (subtracting the mean), or conducting a factor analysis and rotating the factors to ensure independence of the factors in the linear regression analysis.\n<\/li>\n<li>\n<strong>No autocorrelation<\/strong>: the values of the residuals should be independent of one another. The presence of correlation in residuals drastically reduces the model\u2019s accuracy. If the error terms are correlated, the estimated standard errors tend to underestimate the true standard error. To test for this assumption, you can use the Durbin-Watson statistic.\n<\/li>\n<li>\n<strong>Normality of residuals<\/strong>: residuals must be normally distributed. Normality can be checked with a goodness of fit test (e.g. Kolmogorov-Smirnov or Shapiro-Wilk tests), and if the residuals are not normally distributed, a non-linear transformation (e.g. log transformation) might fix the issue.\n<\/li>\n<\/ul>\n<div>\n<img src=\"https:\/\/miro.medium.com\/max\/486\/0*febzA62009Qxvl51.png\" alt=\"Figure\" width=\"100%\"><br \/><span><\/p>\n<p><\/span>\n<\/div>\n<p>\u00a0<\/p>\n<p>Assumptions are critical because if they are not valid, then the analytical process can be considered unreliable, unpredictable, and out of control. 
Failing to meet the assumptions can lead to conclusions that are invalid or not supported by the data.<\/p>\n<p>You can find a full testing of the assumptions in\u00a0<a href=\"https:\/\/www.kaggle.com\/shrutimechlearn\/step-by-step-assumptions-linear-regression\" rel=\"noopener noreferrer\" target=\"_blank\">this link<\/a>.<\/p>\n<p>\u00a0<\/p>\n<h3>Final thoughts<\/h3>\n<p>\u00a0<br \/>Although MLR models extend the scope of SLR models, they are still linear models, meaning that the terms included in the model are incapable of showing any non-linear relationships between each other or representing any sort of non-linear trend. You should also be careful when predicting a point outside the observed range of features since the relationship among variables may change as you move outside the observed range (a fact that you can\u2019t know because you don\u2019t have the data).<\/p>\n<blockquote>\n<p>\nThe observed relationship may be locally linear, but there may be unobserved non-linear relationships outside the range of your data.\n<\/p>\n<\/blockquote>\n<p><strong>Linear models can also model curvatures<\/strong>\u00a0by including non-linear terms such as polynomials and transformed exponential functions. The linear regression equation is\u00a0<em>linear in the\u00a0<\/em><a href=\"https:\/\/statisticsbyjim.com\/glossary\/parameter\/\" rel=\"noopener noreferrer\" target=\"_blank\"><em>parameters<\/em><\/a>, meaning you can raise an independent variable to a power to fit a curve, and still remain in the \u201clinear world\u201d. 
Linear Regression models can contain log terms and inverse terms to follow different kinds of curves and yet continue to be linear in the parameters.<\/p>\n<div>\n<img src=\"https:\/\/miro.medium.com\/max\/960\/1*N5viBqbRIUKGz5TjRQ3lOA.png\" alt=\"Figure\" width=\"100%\"><br \/><span><\/p>\n<p>While the independent variable is squared, the model is still linear in the parameters<\/p>\n<p><\/span>\n<\/div>\n<p>\u00a0<\/p>\n<p>Regressions like\u00a0<a href=\"https:\/\/towardsdatascience.com\/machine-learning-with-python-easy-and-robust-method-to-fit-nonlinear-data-19e8a1ddbd49\" rel=\"noopener noreferrer\" target=\"_blank\"><strong>Polynomial Regression<\/strong><\/a>\u00a0can model\u00a0<em>non-linear relationships<\/em>, and while a linear equation has one basic form, non-linear equations can take many different forms. The reason you might consider using\u00a0<a href=\"https:\/\/towardsdatascience.com\/how-to-choose-between-a-linear-or-nonlinear-regression-for-your-dataset-e58a568e2a15\" rel=\"noopener noreferrer\" target=\"_blank\"><strong>Non-linear Regression Models<\/strong><\/a>\u00a0is that, while linear regression can model curves, it might not be able to model the specific curve that exists in your data.<\/p>\n<p>You should also know that OLS is not the only method to fit your Linear Regression model, and other optimization methods like\u00a0<a href=\"https:\/\/towardsdatascience.com\/linear-regression-simplified-ordinary-least-square-vs-gradient-descent-48145de2cf76\" rel=\"noopener noreferrer\" target=\"_blank\"><strong>Gradient Descent<\/strong><\/a>\u00a0are more adequate to fit large datasets. 
Applying OLS to complex and non-linear algorithms might not be scalable, and Gradient Descent can be computationally cheaper (faster) for finding the solution.\u00a0<em>Gradient Descent is an algorithm that minimizes functions<\/em>, and given a function defined by a set of parameters, the algorithm starts with an initial set of parameter values and iteratively moves toward a set of parameter values that minimize the function. This\u00a0<strong>iterative minimization<\/strong>\u00a0is achieved using derivatives, taking steps in the negative direction of the function gradient.<\/p>\n<div>\n<img src=\"https:\/\/miro.medium.com\/max\/648\/0*NrINywH3bS8-po_T.gif\" alt=\"Figure\" width=\"100%\"><br \/><span><\/p>\n<p><\/span>\n<\/div>\n<p>\u00a0<\/p>\n<p>Another key thing to take into account is that\u00a0<strong>outliers can have a dramatic effect on regression lines\u00a0<\/strong>and the correlation coefficient. In order to identify them it\u2019s essential to perform\u00a0<a href=\"https:\/\/towardsdatascience.com\/the-basics-of-data-prep-7bb5f3af77ac\" rel=\"noopener noreferrer\" target=\"_blank\">Exploratory Data Analysis (EDA)<\/a>, examining the data to detect unusual observations, since they can impact the results of our analysis and statistical modeling in a drastic way. In case you recognize any, outliers can be imputed (e.g. with mean \/ median \/ mode), capped (replacing those outside certain limits), or replaced by missing values and predicted.<\/p>\n<p>Finally, some\u00a0<a href=\"https:\/\/www.imf.org\/external\/pubs\/ft\/fandd\/2006\/03\/basics.htm\" rel=\"noopener noreferrer\" target=\"_blank\">limitations of Linear Regression models<\/a>\u00a0are:<\/p>\n<ul>\n<li>\n<strong>Omitted variables<\/strong>. It is necessary to have a good theoretical model to suggest variables that explain the dependent variable. 
In the case of a simple two-variable regression, one has to think of other factors that might explain the dependent variable, since there may be other \u201cunobserved\u201d variables that explain the output.\n<\/li>\n<li>\n<strong>Reverse causality<\/strong>. Many theoretical models predict bidirectional causality \u2014 that is, a dependent variable can cause changes in one or more explanatory variables. For instance, higher earnings may enable people to invest more in their own education, which, in turn, raises their earnings. This complicates the way regressions should be estimated, calling for special techniques.\n<\/li>\n<li>\n<strong>Mismeasurement<\/strong>. Factors might be measured incorrectly. For example, aptitude is difficult to measure, and there are well-known problems with IQ tests. As a result, the regression using IQ might not properly control for aptitude, leading to inaccurate or biased correlations between variables like education and earnings.\n<\/li>\n<li>\n<strong>Too limited a focus<\/strong>. A regression coefficient provides information only about how small changes \u2014 not large changes \u2014 in one variable relate to changes in another. It will show how a small change in education is likely to affect earnings but it will not allow the researcher to generalize about the effect of large changes. If everyone became college educated at the same time, a newly minted college graduate would be unlikely to earn a great deal more because the total supply of college graduates would have increased dramatically.\n<\/li>\n<\/ul>\n<blockquote>\n<p>\nInterested in these topics? 
Follow me on\u00a0<a href=\"https:\/\/www.linkedin.com\/in\/lopezyse\/\" rel=\"noopener noreferrer\" target=\"_blank\">Linkedin<\/a>\u00a0or\u00a0<a href=\"https:\/\/twitter.com\/lopezyse\" rel=\"noopener noreferrer\" target=\"_blank\">Twitter<\/a>\n<\/p>\n<\/blockquote>\n<p>\u00a0<br \/><b>Bio: <a href=\"https:\/\/www.linkedin.com\/in\/lopezyse\/\" target=\"_blank\" rel=\"noopener noreferrer\">Diego Lopez Yse<\/a><\/b> is an experienced professional with a solid international background acquired in different industries (capital markets, biotechnology, software, consultancy, government, agriculture). Always a team member. Skilled in Business Management, Analytics, Finance, Risk, Project Management and Commercial Operations. MS in Data Science and Corporate Finance.<\/p>\n<p><a href=\"https:\/\/towardsdatascience.com\/your-guide-to-linear-regression-models-df1d847185db\" target=\"_blank\" rel=\"noopener noreferrer\">Original<\/a>. Reposted with permission.<\/p>\n<p><b>Related:<\/b><\/p>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>https:\/\/www.kdnuggets.com\/2020\/10\/guide-linear-regression-models.html<\/p>\n","protected":false},"author":0,"featured_media":2690,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[2],"tags":[],"_links":{"self":[{"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/posts\/2689"}],"collection":[{"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/types\/post"}],"replies":[{"embeddable":true,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/comments?post=2689"}],"version-history":[{"count":0,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/posts\/2689\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/media\/2690"}],"wp:a
ttachment":[{"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/media?parent=2689"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/categories?post=2689"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/tags?post=2689"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}