{"id":5230,"date":"2020-10-19T14:15:04","date_gmt":"2020-10-19T14:15:04","guid":{"rendered":"https:\/\/data-science.gotoauthority.com\/2020\/10\/19\/how-to-explain-key-machine-learning-algorithms-at-an-interview\/"},"modified":"2020-10-19T14:15:04","modified_gmt":"2020-10-19T14:15:04","slug":"how-to-explain-key-machine-learning-algorithms-at-an-interview","status":"publish","type":"post","link":"https:\/\/wealthrevelation.com\/data-science\/2020\/10\/19\/how-to-explain-key-machine-learning-algorithms-at-an-interview\/","title":{"rendered":"How to Explain Key Machine Learning Algorithms at an Interview"},"content":{"rendered":"<div id=\"post-\">\n<p><b>By <a href=\"https:\/\/towardsdatascience.com\/@terenceshin\" target=\"_blank\" rel=\"noopener noreferrer\">Terence Shin<\/a>, Data Scientist | MSc Analytics &amp; MBA student<\/b>.<\/p>\n<p><img class=\"aligncenter size-large\" src=\"https:\/\/miro.medium.com\/max\/875\/1*c14OKEbBA4Pc-4wjRnZ1eg.png\" width=\"90%\"><\/p>\n<p><em>Created by <a href=\"http:\/\/www.freepik.com\" target=\"_blank\" rel=\"noopener noreferrer\">katemangostar<\/a>.<\/em><\/p>\n<p>\u00a0<\/p>\n<h3>Linear Regression<\/h3>\n<p>\u00a0<\/p>\n<p>Linear Regression involves finding a \u2018line of best fit\u2019 that represents a dataset using the least squares method. The least squares method involves finding a linear equation that minimizes the sum of squared residuals. 
A residual is equal to the actual value minus the predicted value.<\/p>\n<p>To give an example, the red line is a better line of best fit than the green line because it is closer to the points, and thus, the residuals are smaller.<\/p>\n<p><img class=\"aligncenter size-large\" src=\"https:\/\/miro.medium.com\/max\/875\/0*GHApjgH_c6EeLIKo.png\" width=\"90%\"><\/p>\n<p><em>Image Created by Author.<\/em><\/p>\n<p>\u00a0<\/p>\n<h3>Ridge Regression<\/h3>\n<p>\u00a0<\/p>\n<p>Ridge Regression, also known as L2 Regularization, is a regression technique that introduces a small amount of bias to reduce overfitting. It does this by minimizing the sum of squared residuals\u00a0<strong>plus\u00a0<\/strong>a penalty, where the penalty is equal to lambda times the slope squared. Lambda controls the severity of the penalty.<\/p>\n<p><img class=\"aligncenter size-large\" src=\"https:\/\/miro.medium.com\/max\/875\/1*vnHI8OyZqzPwgKM8nK7UcQ.png\" width=\"90%\"><\/p>\n<p><em>Image Created by Author.<\/em><\/p>\n<p>Without a penalty, the line of best fit has a steeper slope, which means that it is more sensitive to small changes in X. By introducing a penalty, the line of best fit becomes less sensitive to small changes in X. This is the idea behind ridge regression.<\/p>\n<p>\u00a0<\/p>\n<h3>Lasso Regression<\/h3>\n<p>\u00a0<\/p>\n<p>Lasso Regression, also known as L1 Regularization, is similar to Ridge Regression. The only difference is that the penalty is calculated with the absolute value of the slope instead of its square.<\/p>\n<p><img class=\"aligncenter size-large\" src=\"https:\/\/miro.medium.com\/max\/875\/1*rNknsxFrZO8vGcwedTCiEA.png\" width=\"90%\"><\/p>\n<p>\u00a0<\/p>\n<h3>Logistic Regression<\/h3>\n<p>\u00a0<\/p>\n<p>Logistic Regression is a classification technique that also finds a \u2018line of best fit.\u2019 However, unlike linear regression, where the line of best fit is found using least squares, logistic regression finds the line (logistic curve) of best fit using maximum likelihood. 
This is done because the <em>y<\/em> value can only be one or zero.\u00a0<a href=\"https:\/\/www.youtube.com\/watch?v=BfKanl1aSG0\" target=\"_blank\" rel=\"noopener noreferrer\"><em>Check out StatQuest\u2019s video to see how the maximum likelihood is calculated<\/em><\/a>.<\/p>\n<p><img class=\"aligncenter size-large\" src=\"https:\/\/miro.medium.com\/max\/863\/1*nBH02qlime9ali_frW63dA.png\" width=\"90%\"><\/p>\n<p><em>Image Created by Author.<\/em><\/p>\n<p>\u00a0<\/p>\n<h3>K-Nearest Neighbours<\/h3>\n<p>\u00a0<\/p>\n<p>K-Nearest Neighbours is a classification technique where a new sample is classified by looking at the <em>k<\/em> nearest classified points and assigning it the most common class among them, hence \u2018K-nearest.\u2019 In the example below, if <em>k=1<\/em>, then an unclassified point would be classified as a blue point.<\/p>\n<p><img class=\"aligncenter size-large\" src=\"https:\/\/miro.medium.com\/max\/875\/1*evDyspMuyttExyDFe6B4-w.png\" width=\"90%\"><\/p>\n<p><em>Image Created by Author.<\/em><\/p>\n<p>If the value of <em>k<\/em> is too low, then the prediction can be sensitive to outliers. 
However, if it\u2019s too high, then it may overlook classes with only a few samples.<\/p>\n<p>\u00a0<\/p>\n<h3>Naive Bayes<\/h3>\n<p>\u00a0<\/p>\n<p>The Naive Bayes Classifier is a classification technique based on Bayes\u2019 Theorem, which is given by the following equation:<\/p>\n<p><img class=\"aligncenter size-large\" src=\"https:\/\/miro.medium.com\/max\/875\/0*cJo6zolk5fZ4M45Q\" width=\"90%\"><\/p>\n<p>Because of the naive assumption (hence the name) that variables are independent given the class, we can rewrite <em>P(X|y)<\/em> as follows:<\/p>\n<p><img class=\"aligncenter size-large\" src=\"https:\/\/miro.medium.com\/max\/875\/0*MIu4cAWvYLXsjNEv\" width=\"90%\"><\/p>\n<p>Also, since we are solving for <em>y<\/em>, <em>P(X)<\/em> is a constant, which means that we can remove it from the equation and introduce a proportionality.<\/p>\n<p>Thus, the probability of each value of <em>y<\/em> is proportional to the prior probability of <em>y<\/em> multiplied by the product of the conditional probabilities of each <em>x<sub>n<\/sub><\/em> given <em>y<\/em>. The value of <em>y<\/em> with the highest result is the predicted class.<\/p>\n<p>\u00a0<\/p>\n<h3>Support Vector Machines<\/h3>\n<p>\u00a0<\/p>\n<p>Support Vector Machines are a classification technique that finds an optimal boundary, called the hyperplane, which is used to separate different classes. The hyperplane is found by maximizing the margin, which is the distance between the hyperplane and the nearest points of each class.<\/p>\n<p><img class=\"aligncenter size-large\" src=\"https:\/\/miro.medium.com\/max\/875\/1*sE-c5O6pkAVofI74MTue4w.png\" width=\"90%\"><\/p>\n<p><em>Image Created by Author.<\/em><\/p>\n<p>\u00a0<\/p>\n<h3>Decision Trees<\/h3>\n<p>\u00a0<\/p>\n<p>A decision tree is essentially a series of conditional statements that determine what path a sample takes until it reaches a leaf at the bottom. 
Decision trees are intuitive and easy to build, but a single tree tends not to be very accurate.<\/p>\n<p><img class=\"aligncenter size-large\" src=\"https:\/\/miro.medium.com\/max\/875\/1*xvPKnEVMKxXfomrm5Zm-kw.png\" width=\"90%\"><\/p>\n<p>\u00a0<\/p>\n<h3>Random Forest<\/h3>\n<p>\u00a0<\/p>\n<p>Random Forest is an ensemble technique, meaning that it combines several models into one to improve its predictive power. Specifically, it builds hundreds or thousands of decision trees, each trained on a bootstrapped dataset (a technique known as bagging) and restricted to a random subset of variables at each split. It then uses a \u2018majority wins\u2019 vote across the trees to determine the value of the target variable.<\/p>\n<p><img class=\"aligncenter size-large\" src=\"https:\/\/miro.medium.com\/max\/875\/0*dKt5G_W1NsL7Gv6R.png\" width=\"90%\"><\/p>\n<p>For example, if we relied on a single decision tree, say the third one, it would predict 0. But if we relied on the mode of all 4 decision trees, then the predicted value would be 1. This is the power of random forests.<\/p>\n<p>\u00a0<\/p>\n<h3>AdaBoost<\/h3>\n<p>\u00a0<\/p>\n<p>AdaBoost is a boosting algorithm that is similar to Random Forest but has a few significant differences:<\/p>\n<ol>\n<li>Rather than a forest of trees, AdaBoost typically makes a forest of stumps (a stump is a tree with only one decision node and two leaves).<\/li>\n<li>Each stump\u2019s decision is not weighted equally in the final decision. Stumps with less total error (high accuracy) will have a higher say.<\/li>\n<li>The order in which the stumps are created is important, as each subsequent stump gives more weight to the samples that were incorrectly classified by the previous stump.<\/li>\n<\/ol>\n<p>\u00a0<\/p>\n<h3>Gradient Boost<\/h3>\n<p>\u00a0<\/p>\n<p>Gradient Boost is similar to AdaBoost in the sense that it builds multiple trees where each tree is built off of the previous tree. 
Unlike AdaBoost, which builds stumps, Gradient Boost usually builds trees with 8 to 32 leaves.<\/p>\n<p>More importantly, Gradient Boost differs from AdaBoost in the way that the decision trees are built. Gradient Boost starts with an initial prediction, usually the average of the target values. Then, a decision tree is built based on the residuals of the samples. A new prediction is made by taking the initial prediction plus a learning rate times the outcome of the residual tree, and the process is repeated on the new residuals.<\/p>\n<p>\u00a0<\/p>\n<h3>XGBoost<\/h3>\n<p>\u00a0<\/p>\n<p>XGBoost is essentially the same as Gradient Boost; the main difference is how the residual trees are built. With XGBoost, the residual trees are built by calculating similarity scores for the leaves and the preceding nodes to determine which variables are used as the root and the subsequent nodes.<\/p>\n<p><a href=\"https:\/\/towardsdatascience.com\/how-to-explain-each-machine-learning-model-at-an-interview-499d82f91470\" target=\"_blank\" rel=\"noopener noreferrer\">Original<\/a>. 
Reposted with permission.<\/p>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>https:\/\/www.kdnuggets.com\/2020\/10\/explain-machine-learning-algorithms-interview.html<\/p>\n","protected":false},"author":0,"featured_media":5231,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[2],"tags":[],"_links":{"self":[{"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/posts\/5230"}],"collection":[{"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/types\/post"}],"replies":[{"embeddable":true,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/comments?post=5230"}],"version-history":[{"count":0,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/posts\/5230\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/media\/5231"}],"wp:attachment":[{"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/media?parent=5230"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/categories?post=5230"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/tags?post=5230"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}