{"id":910,"date":"2020-09-02T15:17:47","date_gmt":"2020-09-02T15:17:47","guid":{"rendered":"https:\/\/data-science.gotoauthority.com\/2020\/09\/02\/which-methods-should-be-used-for-solving-linear-regression\/"},"modified":"2020-09-02T15:17:47","modified_gmt":"2020-09-02T15:17:47","slug":"which-methods-should-be-used-for-solving-linear-regression","status":"publish","type":"post","link":"https:\/\/wealthrevelation.com\/data-science\/2020\/09\/02\/which-methods-should-be-used-for-solving-linear-regression\/","title":{"rendered":"Which methods should be used for solving linear regression?"},"content":{"rendered":"<div id=\"post-\">\n<p><b>By <a href=\"https:\/\/medium.com\/@ahmadbinshafiq\" target=\"_blank\" rel=\"noopener noreferrer\">Ahmad BinShafiq<\/a>, Machine Learning Student<\/b>.<\/p>\n<p data-selectable-paragraph=\"\">Linear Regression\u00a0is a supervised machine learning algorithm. It models a\u00a0linear relationship\u00a0between a\u00a0dependent variable (y) and the given\u00a0independent variables (x), such that the predictions of the\u00a0dependent variable (y)\u00a0have the\u00a0<strong>lowest cost<\/strong>.<\/p>\n<p>\u00a0<\/p>\n<h3>Different approaches to solve linear regression models<\/h3>\n<p>\u00a0<\/p>\n<p data-selectable-paragraph=\"\">There are many different methods that we can apply to our linear regression model in order to make it more efficient. 
But we will discuss the most common of them here.<\/p>\n<ol>\n<li>Gradient Descent<\/li>\n<li>Least Square Method \/ Normal Equation Method<\/li>\n<li>Adam\u2019s Method<\/li>\n<li>Singular Value Decomposition (SVD)<\/li>\n<\/ol>\n<p data-selectable-paragraph=\"\">Okay, so let\u2019s begin\u2026<\/p>\n<p>\u00a0<\/p>\n<h3>Gradient Descent<\/h3>\n<p>\u00a0<\/p>\n<p data-selectable-paragraph=\"\">One of the most common and easiest methods for\u00a0beginners\u00a0to solve linear regression problems is gradient descent.<\/p>\n<p data-selectable-paragraph=\"\"><strong>How Gradient Descent works<\/strong><\/p>\n<p data-selectable-paragraph=\"\">Now, let&#8217;s suppose we have our data plotted out in the form of a scatter graph, and our model makes a prediction for it. This prediction can be very good, or it can be far away from our ideal prediction (meaning its cost, as measured by a cost function, will be high). So, in order to minimize that cost (error), we apply gradient descent to it.<\/p>\n<p data-selectable-paragraph=\"\">Gradient descent gradually moves our hypothesis towards the global minimum, where the\u00a0<strong>cost<\/strong>\u00a0is lowest. To do so, we have to manually set the value of\u00a0<strong>alpha\u00a0<\/strong>(the learning rate), which controls how much the parameters of the hypothesis change at each step. If the value of alpha is large, the algorithm takes big steps. If alpha is small, our hypothesis converges slowly, through small baby steps.<\/p>\n<p data-selectable-paragraph=\"\"><img class=\"aligncenter size-large\" src=\"https:\/\/miro.medium.com\/max\/875\/1*PbfwvqyWXOB3nB4IrMt-Fg.png\" width=\"90%\"><\/p>\n<p><em>Hypothesis converging towards a global minimum. 
Image from\u00a0<a href=\"https:\/\/medium.com\/@ahmadbinshafiq\/linear-regression-simplified-for-beginners-dcd3afe0b23f\" target=\"_blank\" rel=\"noopener noreferrer\">Medium<\/a>.<\/em><\/p>\n<p>The equation for Gradient Descent is<\/p>\n<p><img loading=\"lazy\" class=\"aligncenter size-large\" src=\"https:\/\/miro.medium.com\/max\/264\/1*FPY3t9boKODUi6x1YZ3UEA.png\" width=\"211\" height=\"73\"><\/p>\n<p><em>Source:\u00a0<a href=\"https:\/\/ruder.io\/optimizing-gradient-descent\/index.html#batchgradientdescent\" target=\"_blank\" rel=\"noopener noreferrer\">Ruder.io<\/a>.<\/em><\/p>\n<p data-selectable-paragraph=\"\"><strong>Implementing Gradient Descent in Python<\/strong><\/p>\n<div>\n<pre>import numpy as np\r\nfrom matplotlib import pyplot\r\n\r\n#creating our data\r\nX = np.random.rand(10,1)\r\ny = np.random.rand(10,1)\r\nm = len(y)\r\ntheta = np.ones((1,1)) #column vector, so that X@theta has the same shape as y\r\n\r\n#applying gradient descent\r\na = 0.1 #learning rate\r\nn_iters = 1000 #number of gradient descent steps\r\ncost_list = []\r\nfor i in range(n_iters):\r\n    \r\n    #step against the gradient of the cost\r\n    theta = theta - a*(1\/m)*np.transpose(X)@(X@theta - y)\r\n           \r\n    #cost (half the mean squared error) of the updated hypothesis\r\n    cost_val = (1\/(2*m))*np.sum((X@theta - y)**2)\r\n    cost_list.append(cost_val)\r\n\r\n#Predicting our Hypothesis\r\nb = theta\r\nyhat = X.dot(b)\r\n\r\n#Plotting our results\r\npyplot.scatter(X, y, color='red')\r\npyplot.plot(X, yhat, color='blue')\r\npyplot.show()\r\n\r\n<\/pre>\n<\/div>\n<p>\u00a0<\/p>\n<p><img class=\"aligncenter size-large\" src=\"https:\/\/miro.medium.com\/max\/699\/1*SPfnDq7H5wZYB5nzK6WmeA.png\" width=\"90%\"><\/p>\n<p><em>Model after Gradient Descent.<\/em><\/p>\n<p>Here, we first created our dataset and then repeatedly applied the gradient descent update in order to minimize the cost of our hypothesis.<\/p>\n<p data-selectable-paragraph=\"\"><strong>Pros:<\/strong><\/p>\n<p data-selectable-paragraph=\"\">Important advantages of Gradient Descent are<\/p>\n<ul>\n<li>Lower computational cost compared to SVD or ADAM<\/li>\n<li>Running time is O(kn\u00b2)<\/li>\n<li>Works well with a large number of 
features<\/li>\n<\/ul>\n<p data-selectable-paragraph=\"\"><strong>Cons:<\/strong><\/p>\n<p data-selectable-paragraph=\"\">Important cons of Gradient Descent are<\/p>\n<ul>\n<li>Need to choose some learning rate\u00a0<strong>\u03b1<\/strong>\n<\/li>\n<li>Needs many iterations to converge<\/li>\n<li>Can get stuck in local minima when the cost function is non-convex (the squared error cost of linear regression is convex, so it has a single global minimum)<\/li>\n<li>If the learning rate\u00a0<strong>\u03b1<\/strong> is not chosen properly, it might not converge.<\/li>\n<\/ul>\n<p>\u00a0<\/p>\n<h3>Least Square Method<\/h3>\n<p>\u00a0<\/p>\n<p data-selectable-paragraph=\"\">The least-square method, also known as the\u00a0<strong>normal equation,<\/strong>\u00a0is also one of the most common approaches to solving linear regression models easily. However, it requires some basic knowledge of linear algebra.<\/p>\n<p data-selectable-paragraph=\"\"><strong>How the least square method works<\/strong><\/p>\n<p data-selectable-paragraph=\"\">In normal LSM, we solve directly for the value of our coefficient. In short, we reach our optimal minimum point in one step, or we can say that in only one step we fit our hypothesis to our data with the lowest cost possible.<\/p>\n<p><img class=\"aligncenter size-large\" src=\"https:\/\/miro.medium.com\/max\/753\/1*oxeqyNwUB0SBVYDGh0zGog.png\" width=\"90%\"><\/p>\n<p><em>Before and after applying LSM to our dataset. 
Image from\u00a0<a href=\"https:\/\/towardsdatascience.com\/complete-guide-to-linear-regression-in-python-d95175447255\" target=\"_blank\" rel=\"noopener noreferrer\">Medium<\/a>.<\/em><\/p>\n<p>The equation for LSM is<\/p>\n<p><img loading=\"lazy\" class=\"aligncenter size-large\" src=\"https:\/\/miro.medium.com\/max\/405\/1*RzXPAZa3dC8powvJ-4WtXQ.png\" width=\"324\" height=\"78\"><\/p>\n<p data-selectable-paragraph=\"\"><strong>Implementing LSM in Python<\/strong><\/p>\n<div>\n<pre>import numpy as np\r\nfrom matplotlib import pyplot\r\n\r\n#creating our data\r\nX = np.random.rand(10,1)\r\ny = np.random.rand(10,1)\r\n\r\n#Computing coefficient\r\nb = np.linalg.inv(X.T.dot(X)).dot(X.T).dot(y)\r\n\r\n#Predicting our Hypothesis\r\nyhat = X.dot(b)\r\n#Plotting our results\r\npyplot.scatter(X, y, color='red')\r\npyplot.plot(X, yhat, color='blue')\r\npyplot.show()\r\n\r\n<\/pre>\n<\/div>\n<p>\u00a0<\/p>\n<p data-selectable-paragraph=\"\"><img class=\"aligncenter size-large\" src=\"https:\/\/miro.medium.com\/max\/539\/1*ZgEGBpll-xKBESNoyXUOIg.jpeg\" width=\"90%\"><\/p>\n<p data-selectable-paragraph=\"\">Here, we first created our dataset and then minimized the cost of our hypothesis using the<\/p>\n<p data-selectable-paragraph=\"\"><em>b = np.linalg.inv(X.T.dot(X)).dot(X.T).dot(y)<\/em><\/p>\n<p data-selectable-paragraph=\"\">code, which is equivalent to our equation.<\/p>\n<p data-selectable-paragraph=\"\"><strong>Pros:<\/strong><\/p>\n<p data-selectable-paragraph=\"\">Important advantages of LSM are:<\/p>\n<ul>\n<li>No learning rate<\/li>\n<li>No iterations<\/li>\n<li>Feature scaling is not necessary<\/li>\n<li>Works really well when the number of features is small.<\/li>\n<\/ul>\n<p data-selectable-paragraph=\"\"><strong>Cons:<\/strong><\/p>\n<p data-selectable-paragraph=\"\">Important cons are:<\/p>\n<ul>\n<li>Is computationally expensive when the dataset is big.<\/li>\n<li>Slow when the number of features is large<\/li>\n<li>Running Time is 
O(n\u00b3)<\/li>\n<li>Sometimes, your X transpose X is non-invertible, i.e., a singular matrix with no inverse. You can use <em>np.linalg.pinv<\/em> instead of\u00a0<em>np.linalg.inv<\/em>\u00a0to overcome this problem.<\/li>\n<\/ul>\n<p>\u00a0<\/p>\n<h3>Adam\u2019s Method<\/h3>\n<p>\u00a0<\/p>\n<p data-selectable-paragraph=\"\">ADAM, which stands for Adaptive Moment Estimation, is an optimization algorithm that is widely used in Deep Learning.<\/p>\n<p data-selectable-paragraph=\"\">It is an iterative algorithm that works well on noisy data.<\/p>\n<p data-selectable-paragraph=\"\">It combines the ideas of RMSProp and gradient descent with momentum.<\/p>\n<p data-selectable-paragraph=\"\">In addition to storing an exponentially decaying average of past squared gradients like Adadelta and RMSprop, Adam also keeps an exponentially decaying average of past gradients, similar to momentum.<\/p>\n<p data-selectable-paragraph=\"\">We compute the decaying averages of past and past squared gradients respectively as follows:<\/p>\n<p data-selectable-paragraph=\"\"><img loading=\"lazy\" class=\"aligncenter size-large\" src=\"https:\/\/miro.medium.com\/max\/350\/1*MhIKaOrJhqkLFguZ3q_UhA.png\" width=\"280\" height=\"81\"><\/p>\n<p><em>Credit:\u00a0<a href=\"https:\/\/ruder.io\/optimizing-gradient-descent\/index.html#adam\" target=\"_blank\" rel=\"noopener noreferrer\">Ruder.io<\/a>.<\/em><\/p>\n<p>As\u00a0<em>m<sub>t<\/sub><\/em> and\u00a0<em>v<sub>t<\/sub><\/em> are initialized as vectors of 0\u2019s, the authors of Adam observe that they are biased towards zero, especially during the initial time steps, and especially when the decay rates are small (i.e., \u03b2<sub>1<\/sub> and \u03b2<sub>2<\/sub> are close to 1).<\/p>\n<p data-selectable-paragraph=\"\">They counteract these biases by computing bias-corrected first and second-moment estimates:<\/p>\n<p data-selectable-paragraph=\"\"><img loading=\"lazy\" class=\"aligncenter size-large\" 
src=\"https:\/\/miro.medium.com\/max\/295\/1*fRR-qpxZ910AbjSG-opwjA.png\" width=\"236\" height=\"148\"><\/p>\n<p><em>Credit:\u00a0<a href=\"https:\/\/ruder.io\/optimizing-gradient-descent\/#adam\" target=\"_blank\" rel=\"noopener noreferrer\">Ruder.io<\/a>.<\/em><\/p>\n<p>They then update the parameters with:<\/p>\n<p><img loading=\"lazy\" class=\"aligncenter size-large\" src=\"https:\/\/miro.medium.com\/max\/375\/1*ZcJMpjU3eAP5NXSCepyNgg.png\" width=\"300\" height=\"94\"><\/p>\n<p><em>Credit:\u00a0<a href=\"https:\/\/ruder.io\/optimizing-gradient-descent\/#adam\" target=\"_blank\" rel=\"noopener noreferrer\">Ruder.io<\/a>.<\/em><\/p>\n<p data-selectable-paragraph=\"\">You can learn the theory behind Adam\u00a0<a href=\"https:\/\/towardsdatascience.com\/adam-latest-trends-in-deep-learning-optimization-6be9a291375c\" target=\"_blank\" rel=\"noopener noreferrer\">here<\/a>\u00a0or\u00a0<a href=\"https:\/\/ruder.io\/optimizing-gradient-descent\/#adam\" target=\"_blank\" rel=\"noopener noreferrer\">here<\/a>.<\/p>\n<p data-selectable-paragraph=\"\"><strong>Pseudocode for Adam<\/strong>\u00a0is<\/p>\n<p data-selectable-paragraph=\"\"><img loading=\"lazy\" class=\"aligncenter size-large\" src=\"https:\/\/miro.medium.com\/max\/875\/1*lxzXyvMyXQSiIkwK9mi9Vw.png\" width=\"759\" height=\"422\"><\/p>\n<p><em>Source:\u00a0<a href=\"https:\/\/arxiv.org\/pdf\/1412.6980v9.pdf\" target=\"_blank\" rel=\"noopener noreferrer\">Arxiv Adam<\/a>.<\/em><\/p>\n<p>Let\u2019s see its code in pure Python.<\/p>\n<div>\n<pre>#Creating the Dummy Data set and importing libraries\r\nimport math\r\nimport seaborn as sns\r\nimport numpy as np \r\nfrom scipy import stats\r\nfrom matplotlib import pyplot\r\nx = np.random.normal(0,1,size=(100,1))\r\ny = np.random.random(size=(100,1))\r\n\r\n<\/pre>\n<\/div>\n<p data-selectable-paragraph=\"\">Now let\u2019s find the actual Linear Regression line and the values of the slope and intercept for our dataset.<\/p>\n<div>\n<pre>print(\"Intercept is \" 
,stats.mstats.linregress(x,y).intercept)\r\nprint(\"Slope is \", stats.mstats.linregress(x,y).slope)\r\n\r\n<\/pre>\n<\/div>\n<p>\u00a0<\/p>\n<p data-selectable-paragraph=\"\"><img loading=\"lazy\" class=\"aligncenter size-large\" src=\"https:\/\/miro.medium.com\/max\/463\/1*_YoDTEhrkEvXuYgebaOHmA.png\" width=\"370\" height=\"42\"><\/p>\n<p data-selectable-paragraph=\"\">Now let us see the Linear Regression line using the Seaborn\u00a0<em>regplot\u00a0<\/em>function.<\/p>\n<div>\n<pre>pyplot.figure(figsize=(15,8))\r\n#recent seaborn versions require x and y as keyword arguments; ravel() flattens them to 1-D\r\nsns.regplot(x=x.ravel(), y=y.ravel())\r\npyplot.show()\r\n\r\n<\/pre>\n<\/div>\n<p>\u00a0<\/p>\n<p data-selectable-paragraph=\"\"><img class=\"aligncenter size-large\" src=\"https:\/\/miro.medium.com\/max\/875\/1*nQ7nLURLsnA_tO82nMOMIA.png\" width=\"90%\"><\/p>\n<p data-selectable-paragraph=\"\">Let us now code the Adam optimizer in pure Python.<\/p>\n<div>\n<pre>h = lambda theta_0, theta_1, x: theta_0 + np.dot(x,theta_1) #equation of a straight line\r\n\r\n# the cost function (for the whole batch. for comparison later)\r\ndef J(x, y, theta_0, theta_1):\r\n    m = len(x)\r\n    returnValue = 0\r\n    for i in range(m):\r\n        returnValue += (h(theta_0, theta_1, x[i]) - y[i])**2\r\n    returnValue = returnValue\/(2*m)\r\n    return returnValue\r\n\r\n# finding the gradient per each training example\r\ndef grad_J(x, y, theta_0, theta_1):\r\n    returnValue = np.array([0., 0.])\r\n    returnValue[0] += (h(theta_0, theta_1, x) - y)\r\n    returnValue[1] += (h(theta_0, theta_1, x) - y)*x\r\n    return returnValue\r\n\r\nclass AdamOptimizer:\r\n    def __init__(self, weights, alpha=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8):\r\n        self.alpha = alpha\r\n        self.beta1 = beta1\r\n        self.beta2 = beta2\r\n        self.epsilon = epsilon\r\n        self.m = 0\r\n        self.v = 0\r\n        self.t = 0\r\n        self.theta = weights\r\n        \r\n    def backward_pass(self, gradient):\r\n        self.t = self.t + 1\r\n        self.m = self.beta1*self.m + (1 - 
self.beta1)*gradient\r\n        self.v = self.beta2*self.v + (1 - self.beta2)*(gradient**2)\r\n        m_hat = self.m\/(1 - self.beta1**self.t)\r\n        v_hat = self.v\/(1 - self.beta2**self.t)\r\n        #epsilon is added to the denominator to avoid division by zero\r\n        self.theta = self.theta - self.alpha*(m_hat\/(np.sqrt(v_hat) + self.epsilon))\r\n        return self.theta\r\n\r\n<\/pre>\n<\/div>\n<p data-selectable-paragraph=\"\">Here, we have implemented all the equations mentioned in the pseudocode above using an object-oriented approach and some helper functions.<\/p>\n<p data-selectable-paragraph=\"\">Let us now set the hyperparameters for our model.<\/p>\n<div>\n<pre>epochs = 1500\r\nprint_interval = 100\r\nm = len(x)\r\ninitial_theta = np.array([0., 0.]) # initial value of theta, before gradient descent\r\ninitial_cost = J(x, y, initial_theta[0], initial_theta[1])\r\n\r\ntheta = initial_theta\r\nadam_optimizer = AdamOptimizer(theta, alpha=0.001)\r\nadam_history = [] # to plot out path of descent\r\nadam_history.append(dict({'theta': theta, 'cost': initial_cost})) #to check theta and cost function\r\n\r\n<\/pre>\n<\/div>\n<p data-selectable-paragraph=\"\">And finally, the training process.<\/p>\n<div>\n<pre>for j in range(epochs):\r\n    for i in range(m):\r\n        gradients = grad_J(x[i], y[i], theta[0], theta[1])\r\n        theta = adam_optimizer.backward_pass(gradients)\r\n    \r\n    if ((j+1)%print_interval == 0 or j==0):\r\n        cost = J(x, y, theta[0], theta[1])\r\n        print ('After {} epochs, Cost = {}, theta = {}'.format(j+1, cost, theta))\r\n        adam_history.append(dict({'theta': theta, 'cost': cost}))\r\n        \r\nprint ('\\nFinal theta = {}'.format(theta))\r\n\r\n<\/pre>\n<\/div>\n<p>\u00a0<\/p>\n<p data-selectable-paragraph=\"\"><img class=\"aligncenter size-large\" src=\"https:\/\/miro.medium.com\/max\/875\/1*RUDEWUoLuhkn72oqUaTTkQ.png\" width=\"90%\"><\/p>\n<p data-selectable-paragraph=\"\">Now, if we compare the\u00a0<em>Final theta<\/em>\u00a0values to the slope and intercept values, 
calculated earlier using\u00a0<em>scipy.stats.mstats.linregress<\/em>, they are nearly equal, and they can be brought even closer by adjusting the hyperparameters.<\/p>\n<p data-selectable-paragraph=\"\">Finally, let us plot it.<\/p>\n<div>\n<pre>b = theta\r\nyhat = b[0] + x.dot(b[1])\r\npyplot.figure(figsize=(15,8))\r\npyplot.scatter(x, y, color='red')\r\npyplot.plot(x, yhat, color='blue')\r\npyplot.show()\r\n\r\n<\/pre>\n<\/div>\n<p>\u00a0<\/p>\n<p data-selectable-paragraph=\"\"><img class=\"aligncenter size-large\" src=\"https:\/\/miro.medium.com\/max\/875\/1*Qs9HaEyhHH4KZ9mAKgzy0g.png\" width=\"90%\"><\/p>\n<p data-selectable-paragraph=\"\">And we can see that our plot is similar to the plot obtained using\u00a0<em>sns.regplot<\/em>.<\/p>\n<p data-selectable-paragraph=\"\"><strong>Pros:<\/strong><\/p>\n<ul>\n<li>Straightforward to implement.<\/li>\n<li>Computationally efficient.<\/li>\n<li>Little memory requirements.<\/li>\n<li>Invariant to diagonal rescaling of the gradients.<\/li>\n<li>Well suited for problems that are large in terms of data and\/or parameters.<\/li>\n<li>Appropriate for non-stationary objectives.<\/li>\n<li>Appropriate for problems with very noisy and\/or sparse gradients.<\/li>\n<li>Hyper-parameters have intuitive interpretation and typically require little tuning.<\/li>\n<\/ul>\n<p data-selectable-paragraph=\"\"><strong>Cons:<\/strong><\/p>\n<ul>\n<li>Adam and RMSProp are highly sensitive to certain values of the learning rate (and, sometimes, other hyper-parameters like the batch size), and they can catastrophically fail to converge if e.g., the learning rate is too high. 
(Source:\u00a0<a href=\"https:\/\/ai.stackexchange.com\/questions\/11455\/when-should-we-use-algorithms-like-adam-as-opposed-to-sgd\" target=\"_blank\" rel=\"noopener noreferrer\">stackexchange<\/a>)<\/li>\n<\/ul>\n<p>\u00a0<\/p>\n<h3>Singular Value Decomposition<\/h3>\n<p>\u00a0<\/p>\n<p data-selectable-paragraph=\"\">Singular value decomposition, shortened as SVD, is one of the most famous and widely used dimensionality reduction methods, and it can also be used to solve linear regression.<\/p>\n<p data-selectable-paragraph=\"\">SVD is used (amongst other uses) as a preprocessing step to reduce the number of dimensions for our learning algorithm. SVD decomposes a matrix into a product of three other matrices (U, S, V).<\/p>\n<p data-selectable-paragraph=\"\"><img loading=\"lazy\" class=\"aligncenter size-large\" src=\"https:\/\/miro.medium.com\/max\/459\/1*c-EKLiCGtikl2CBl1v03gA.jpeg\" width=\"367\" height=\"84\"><\/p>\n<p data-selectable-paragraph=\"\">Once our matrix has been decomposed, the coefficients for our hypothesis can be found by calculating the pseudoinverse of the input matrix\u00a0<strong>X<\/strong>\u00a0and multiplying that by the output vector\u00a0<strong>y<\/strong>. 
After that, we fit our hypothesis to our data, and that gives us the lowest cost.<\/p>\n<p data-selectable-paragraph=\"\"><strong>Implementing SVD in Python<\/strong><\/p>\n<div>\n<pre>import numpy as np\r\nfrom matplotlib import pyplot\r\n\r\n#Creating our data\r\nX = np.random.rand(10,1)\r\ny = np.random.rand(10,1)\r\n\r\n#Computing coefficient\r\nb = np.linalg.pinv(X).dot(y)\r\n\r\n#Predicting our Hypothesis\r\nyhat = X.dot(b)\r\n\r\n#Plotting our results\r\npyplot.scatter(X, y, color='red')\r\npyplot.plot(X, yhat, color='blue')\r\npyplot.show()\r\n\r\n<\/pre>\n<\/div>\n<p><br class=\"blank\"><img class=\"aligncenter size-large\" src=\"https:\/\/miro.medium.com\/max\/695\/1*Wjx4edNZKNlUa7gAxJ2FCQ.png\" width=\"90%\"><\/p>\n<p data-selectable-paragraph=\"\">Though the line does not fit the data very well (this model has no intercept term, so the line must pass through the origin), it is still pretty good.<\/p>\n<p data-selectable-paragraph=\"\">Here, we first created our dataset and then minimized the cost of our hypothesis using\u00a0b = np.linalg.pinv(X).dot(y), which computes the pseudoinverse of X through its SVD.<\/p>\n<p data-selectable-paragraph=\"\"><strong>Pros:<\/strong><\/p>\n<ul>\n<li>Works better with higher dimensional data<\/li>\n<li>Good for Gaussian-distributed data<\/li>\n<li>Really stable and efficient for a small dataset<\/li>\n<li>While solving linear equations for linear regression, it is more stable and the preferred approach.<\/li>\n<\/ul>\n<p data-selectable-paragraph=\"\"><strong>Cons:<\/strong><\/p>\n<ul>\n<li>Running time is O(n\u00b3)<\/li>\n<li>Multiple risk factors<\/li>\n<li>Really sensitive to outliers<\/li>\n<li>May get unstable with a very large dataset<\/li>\n<\/ul>\n<p>\u00a0<\/p>\n<h3>Learning Outcome<\/h3>\n<p>\u00a0<\/p>\n<p data-selectable-paragraph=\"\">As of now, we have learned and implemented gradient descent, LSM, ADAM, and SVD. 
And now, we have a very good understanding of all of these algorithms, and we also know what their pros and cons are.<\/p>\n<p data-selectable-paragraph=\"\">One thing we noticed was that the ADAM optimization algorithm was the most accurate, and according to the original ADAM research paper, ADAM compares favorably to other stochastic optimization methods.<\/p>\n<p>\u00a0<\/p>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>https:\/\/www.kdnuggets.com\/2020\/09\/solving-linear-regression.html<\/p>\n","protected":false},"author":0,"featured_media":911,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[2],"tags":[],"_links":{"self":[{"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/posts\/910"}],"collection":[{"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/types\/post"}],"replies":[{"embeddable":true,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/comments?post=910"}],"version-history":[{"count":0,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/posts\/910\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/media\/911"}],"wp:attachment":[{"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/media?parent=910"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/categories?post=910"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/tags?post=910"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}