{"id":8305,"date":"2021-06-09T12:13:49","date_gmt":"2021-06-09T12:13:49","guid":{"rendered":"https:\/\/wealthrevelation.com\/data-science\/2021\/06\/09\/fitting-support-vector-machines-via-quadratic-programming\/"},"modified":"2021-06-09T12:13:49","modified_gmt":"2021-06-09T12:13:49","slug":"fitting-support-vector-machines-via-quadratic-programming","status":"publish","type":"post","link":"https:\/\/wealthrevelation.com\/data-science\/2021\/06\/09\/fitting-support-vector-machines-via-quadratic-programming\/","title":{"rendered":"Fitting Support Vector Machines via Quadratic Programming"},"content":{"rendered":"<div>\n<p><em>In this blog post we take a deep dive into the internals of Support Vector Machines. We derive a Linear SVM classifier, explain its advantages, and show what the fitting process looks like when solved via CVXOPT \u2013 a convex optimisation package for Python.<\/em><\/p>\n<p>Support Vector Machines (SVMs) are supervised learning models with a wide range of applications in text classification (Joachims, 1998), image recognition (Decoste and Sch\u00f6lkopf, 2002), image segmentation (Barghout, 2015), anomaly detection (Sch\u00f6lkopf et al., 1999) and more.<\/p>\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-large\"><img loading=\"lazy\" width=\"504\" height=\"504\" src=\"https:\/\/blog.dominodatalab.com\/wp-content\/uploads\/2021\/06\/svm_decision_boundaries.png\" alt=\"Two linearly separable classes and two possible separating hyperplane - the red one passes close to the classes, the blue one leaves plenty of space between the two\" class=\"wp-image-7577\"><figcaption>Figure 1 \u2013 There are infinitely many lines separating the two classes, but a good generalisation is achieved by the one that has the largest distance to the nearest data point of any class.<\/figcaption><\/figure>\n<\/div>\n<p>Figure 1 shows a sample of Fisher\u2019s Iris data set (Fisher, 1936). The sample contains all data points for two of the classes \u2014 <em>Iris setosa<\/em> (-1) and <em>Iris versicolor<\/em> (+1), and uses only two of the four original features \u2014 <em>petal length<\/em> and <em>petal width<\/em>. This selection results in a dataset that is clearly linearly separable, and it is straightforward to confirm that there exist infinitely many hyperplanes that separate the two classes. Selecting the optimal decision boundary, however, is not a straightforward process. Both the red and blue dotted lines fully separate the two classes. The red line, however, is located too closely to the two clusters and such a decision boundary is unlikely to generalise well. If we add a new \u201cunseen\u201d observation (red dot), which is clearly in the neighbourhood of class +1, a classifier using the red dotted line will misclassify it as the observation lies on the negative side of the decision boundary. A classifier using the blue dotted line, however, will have no problem assigning the new observation to the correct class. The intuition here is that a decision boundary that leaves a wider margin between the classes generalises better, which leads us to the key property of support vector machines \u2014 they construct a hyperplane in a such a way that the margin of separation between the two classes is maximised (Haykin, 2009). This is in stark contrast with the perceptron, where we have no guarantee about which separating hyperplane the perceptron will find.<\/p>\n<h2>Derivation of a Linear SVM<\/h2>\n<p>Let\u2019s look at a binary classification dataset (mathcal{D} = {boldsymbol{x}_i, y_i}_{i=1}^N), where (boldsymbol{x_i} in mathbb{R}^2) and (y in {-1,+1}). Note, that we develop the process of fitting a linear SVM in a two-dimensional Euclidean space. This is done for the purposes of brevity, as the generalisation to higher dimensions is trivial.<\/p>\n<p>We can now formalise the problem by starting with an equation for the separating hyperplane:<\/p>\n<p>$$begin{equation}label{eq:svm-hyperplane}<br \/>mathcal{H}_{boldsymbol{w},b} = {boldsymbol{x}:boldsymbol{w}^T boldsymbol{x} + b = 0} ; ; ; text{(1)}<br \/>end{equation}$$<\/p>\n<p>where (boldsymbol{x}) is an input vector, (boldsymbol{w}) is an adjustable weight vector, and (b) is a bias term. We can further define the following decision rule that can be used for assigning class labels:<\/p>\n<p>$$begin{equation}<br \/>begin{aligned}<br \/>boldsymbol{w}^T boldsymbol{x} &amp;+ b geq 0 text{, for } y_i = +1 ; ; ; text{(2)} \\<br \/>boldsymbol{w}^T boldsymbol{x} &amp;+ b &lt; 0 text{, for } y_i = -1<br \/>end{aligned}<br \/>end{equation}$$<\/p>\n<p>We now introduce the notion of <em>margin<\/em> \u2014 the distance of an observation from the separating hyperplane. The distance from an arbitrary data point (boldsymbol{x}_i) to the optimal hyperplane in our case is given by<\/p>\n<p>$$begin{equation}<br \/>boldsymbol{x}_{i_p} = boldsymbol{x}_i \u2013 gamma_i (boldsymbol{w} \/ |boldsymbol{w}|)<br \/>end{equation}$$<\/p>\n<p>where (boldsymbol{x}{i_p}) is the normal projection of (boldsymbol{x}_i) onto (mathcal{H}), and (gamma_i) is an algebraic measure of the margin (see Duda and Hart, 1973). The key idea here is that the line segment connecting (boldsymbol{x}_i) and (boldsymbol{x}{i_p}) is parallel to (boldsymbol{w}), and can be expressed as a scalar (gamma_i) multiplied by a unit-length vector (boldsymbol{w} \/ |boldsymbol{w}|), pointing in the same direction as (boldsymbol{w}).<\/p>\n<p>Because (boldsymbol{x}_{i_p}) lies on (mathcal{H}) it satisfies (1) therefore<\/p>\n<p>$$ begin{align} boldsymbol{w}^T boldsymbol{x}_{i_p} + b = 0 \\<br \/>\n boldsymbol{w}^T (boldsymbol{x}_i \u2013 gamma_i (boldsymbol{w} \/ |boldsymbol{w}|)+ b = 0 end{align} $$<\/p>\n<p>Solving for (gamma_i) yields<\/p>\n<p>$$begin{align}<br \/>gamma_i = (boldsymbol{w}^T boldsymbol{x}_i + b ) \/ |boldsymbol{w}| ; ; ; text{(3)}<br \/>end{align}$$<\/p>\n<p>We apply a further correction to (3), to enable the measure to also handle data points on the negative side of the hyperplane:<\/p>\n<p>$$begin{align}label{eq:svm-margin-final}<br \/>gamma_i = y_i  [(boldsymbol{w}^T boldsymbol{x}_i + b ) \/ |boldsymbol{w}|] ;;;text{(4)}<br \/>end{align}$$<\/p>\n<p>This new definition works for both positive and negative examples, producing a value for (gamma) which is always non-negative. This is easy to see, as ignoring the norm of (boldsymbol{w}) and referring to the decision rule (2) we get<\/p>\n<p>$$begin{equation}<br \/>begin{aligned}<br \/>y_i = +1 ;;; &amp; gamma_i = (+1) (boldsymbol{w}^T boldsymbol{x}_i + b geq 0) \\<br \/>y_i = -1 ;;; &amp; gamma_i = (-1) (boldsymbol{w}^T boldsymbol{x}_i + b &lt; 0) end{aligned} <br \/>end{equation}$$<\/p>\n<p>which makes (gamma_i &gt; 0) in both cases. Finally, we define the margin with respect to the entire dataset (mathcal{D}) as<\/p>\n<p>$$begin{equation}<br \/>gamma = min_{i=1,dots,N} gamma_i<br \/>end{equation}$$<\/p>\n<p>We now turn our attention to the problem of finding the optimal hyperplane. Intuitively, we would like to find such values for (boldsymbol{w}) and (b) that the resulting hyperplane maximises the margin of separation between the positive and the negative samples. If we try to maximise (gamma) directly, we will likely end up with a hyperplane that is far from both the negative and positive samples, but does not separate the two. Therefore, we introduce the following constraints:<\/p>\n<p>$$begin{equation}label{eq:svm-constraints} begin{aligned} boldsymbol{w}^T boldsymbol{x} + b geq 1 text{, for } y_i = +1 \\ boldsymbol{w}^T boldsymbol{x} + b leq -1 text{, for } y_i = -1 end{aligned} end{equation}$$<\/p>\n<p>The specific data points that satisfy the above constraints with an equality sign are called <em>support vectors<\/em> \u2013 these are the data points that lie on the dotted red lines (Figure 2) and are the closest to the optimal hyperplane. Note, that if (2) holds, we can always rescale (boldsymbol{w}) and (b) in such a way that the constraints hold as well. In other words, rescaling (boldsymbol{w}) and (b) doesn\u2019t impact (mathcal{H}).<\/p>\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-large is-resized\"><img loading=\"lazy\" src=\"https:\/\/blog.dominodatalab.com\/wp-content\/uploads\/2021\/06\/svm_margin_distance.png\" alt=\"\" class=\"wp-image-7578\" width=\"383\" height=\"318\"><figcaption>Figure 2 \u2013 Two linearly separable classes and an optimal separating hyperplane. The distance between an arbitrary point and the hyperplane is obtained using the normal projection of the point onto the hyperplane. There are also two support vectors present \u2014 one on the positive and one on the negative side of the hyperplane.<\/figcaption><\/figure>\n<\/div>\n<p>If we now focus on a support vector ({boldsymbol{x}_s, y_s}), then plugging the result from the constraints into equation (4) reveals that the optimal distance between the support vector and (mathcal{H}) is<\/p>\n<p>$$begin{equation}<br \/>gamma_s = y_s [(boldsymbol{w}^T boldsymbol{x}_s + b) \/ |boldsymbol{w}|] = (pm 1) [pm 1 \/ |boldsymbol{w}|] = 1 \/ |boldsymbol{w}|<br \/>end{equation}$$<\/p>\n<p>It then follows, that the optimal margin of separation between the two classes is<\/p>\n<p>$$begin{equation}<br \/>2 gamma_s = 2 \/  |boldsymbol{w}|<br \/>end{equation}$$<\/p>\n<p>Maximising (1 \/  |boldsymbol{w}|), however, is equivalent to minimising (|boldsymbol{w}|). Therefore, maximising the margin subject to the constraints is equivalent to<\/p>\n<p>$$begin{equation}<br \/>begin{aligned}<br \/>min_{boldsymbol{w}, b} quad &amp; |boldsymbol{w}|  ;;;text{(5)} \\<br \/>textrm{such that} quad &amp; y_i (boldsymbol{w}^T boldsymbol{x}_i + b) geq 1 text{, for } forall {boldsymbol{x}_i, y_i} in mathcal{D}<br \/>end{aligned}<br \/>end{equation}$$<\/p>\n<p>The question now is: how can we solve this optimisation problem?<\/p>\n<h2>Learning a Linear SVM with Quadratic Programming<\/h2>\n<p>Quadratic programming (QP) is a technique for optimising a quadratic objective function, subject to certain linear constraints. There is a large number of QP solvers available, for example GNU Octave\u2019s qp, MATLAB\u2019s Optimization Toolbox, Python\u2019s <a href=\"https:\/\/cvxopt.org\/\">CVXOPT<\/a> framework etc., and they are all available within the Domino Data Science Platform. For this tutorial we will use CVXOPT. In order to use convex optimisation, we first need to construct a Lagrangian function of the constrained-optimisation problem (5) We allocate Lagrange multipliers (boldsymbol{alpha}) to the constraints, and construct the following function<\/p>\n<p>$$begin{equation}<br \/>\nbegin{aligned}<br \/>\nJ(boldsymbol{w}, b, boldsymbol{alpha}) &amp;= (1\/2) |boldsymbol{w}|^2 \u2013 sum_{i=1}^N alpha_i [y_i (boldsymbol{w}^T boldsymbol{x}_i + b) \u2013 1]  \\<br \/>\n&amp;= (1\/2) boldsymbol{w}^Tboldsymbol{w} \u2013 sum_{i=1}^N alpha_i y_i (boldsymbol{w}^T boldsymbol{x}_i + b) + sum_{i=1}^N alpha_i ;;; (6)<br \/>\nend{aligned}<br \/>\nend{equation}$$<\/p>\n<p>We are now looking for a minmax point of (J(boldsymbol{w}, b, boldsymbol{alpha})), where the Lagrangian function is minimised with respect to (boldsymbol{w}) and (b), and is maximised with respect to (boldsymbol{alpha}). This is an example of a <em>Lagrangian dual problem<\/em>, and the standard approach is to begin by solving for the <em>primal variables<\/em> that minimise the objective ((boldsymbol{w}) and (b)). This solution provides (boldsymbol{w}) and (b) as functions of the Lagrange multipliers (<em>dual variables<\/em>). The next step is then to maximise the objective with respect to (boldsymbol{alpha}) under the constraints derived on the dual variables.<\/p>\n<p>Given that (J(boldsymbol{w}, b, boldsymbol{alpha})) is convex, the minimisation problem can be solved by differentiating (J(boldsymbol{w}, b, boldsymbol{alpha})) with respect to (boldsymbol{w}) and (b), and then setting the results to zero.<\/p>\n<p>$$begin{align} partial J(boldsymbol{w}, b, boldsymbol{alpha}) \/ partial boldsymbol{w}  = 0 text{, which yields } boldsymbol{w} = sum_{i=1}^N alpha_i y_i boldsymbol{x}_i  label{eq:svm-dual-constr1} \\ partial J(boldsymbol{w}, b, boldsymbol{alpha})\/ partial b = 0 text{, which yields } sum_{i=1}^N alpha_i y_i = 0 label{eq:svm-dual-constr2} ;;; text{(7)}end{align}$$<\/p>\n<p>Expanding (6) and plugging the solutions for w and b yields<\/p>\n<p>$$begin{align}<br \/>\nJ(boldsymbol{w}, b, boldsymbol{alpha}) &amp;= (1\/2) sum_{i=1}^N sum_{j=1}^N alpha_i alpha_j y_i y_j  boldsymbol{x}_i^T boldsymbol{x}_j   \u2013 sum_{i=1}^N sum_{j=1}^N alpha_i alpha_j y_i y_j  boldsymbol{x}_i^T boldsymbol{x}_j  + b  sum_{i=1}^N alpha_i y_i + sum_{i=1}^N alpha_i  \\<br \/>\n&amp;= -(1\/2) sum_{i=1}^N sum_{j=1}^N alpha_i alpha_j y_i y_j  boldsymbol{x}_i^T boldsymbol{x}_j  + b  sum_{i=1}^N alpha_i y_i + sum_{i=1}^N alpha_i ;;;text{(8)}<br \/>\nend{align}<br \/>\n$$<\/p>\n<p>The second term in (8) is zero because of (7), which gives us the final formulation of the problem.<\/p>\n<p>$$begin{equation}label{eq:svm-qp-min-final}<br \/>\nbegin{aligned}<br \/>\nmax_{boldsymbol{alpha}} quad &amp;  sum_{i=1}^N alpha_i -(1\/2) sum_{i=1}^N sum_{j=1}^N alpha_i alpha_j y_i y_j  boldsymbol{x}_i^T boldsymbol{x}_j ;;;text{(9)} \\<br \/>\ntextrm{such that} quad &amp; (1) ; sum_{i=1}^N alpha_i y_i = 0 \\<br \/>\n&amp; (2) ; alpha_i geq 0 ; forall i<br \/>\nend{aligned}<br \/>\nend{equation}$$<\/p>\n<p>Now let\u2019s see how we can apply this in practice, using the modified Iris dataset. Let\u2019s start by loading the needed Python libraries, loading and sampling the data, and plotting it for visual inspection.<\/p>\n<div class=\"wp-block-syntaxhighlighter-code \">\n<pre class=\"brush: python; title: ; notranslate\" title=\"\">\nimport matplotlib\n\nimport matplotlib.pyplot as plt\nimport pandas as pd\nimport numpy as np\n\nfrom cvxopt import matrix, solvers\nfrom sklearn.datasets import load_iris\n\niris = load_iris()\niris_df = pd.DataFrame(data= np.c_[iris[\"data\"], iris[\"target\"]], columns= iris[\"feature_names\"] + [\"target\"])\n\n# Retain only 2 linearly separable classes\niris_df = iris_df[iris_df[\"target\"].isin([0,1])]\niris_df[\"target\"] = iris_df[[\"target\"]].replace(0,-1)\n\n# Select only 2 attributes\niris_df = iris_df[[\"petal length (cm)\", \"petal width (cm)\", \"target\"]]\n\niris_df.head()\n<\/pre>\n<\/div>\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" width=\"306\" height=\"173\" src=\"https:\/\/blog.dominodatalab.com\/wp-content\/uploads\/2021\/06\/svm_iris_df_head.png\" alt=\"output of iris_df.head() showing the first 5 entries in the data frame.\" class=\"wp-image-7580\"><\/figure>\n<p>Let\u2019s convert the data to NumPy arrays, and plot the two classes.<\/p>\n<div class=\"wp-block-syntaxhighlighter-code \">\n<pre class=\"brush: python; title: ; notranslate\" title=\"\">\nX = iris_df[[\"petal length (cm)\", \"petal width (cm)\"]].to_numpy()\ny = iris_df[[\"target\"]].to_numpy()\n\nplt.figure(figsize=(8, 8))\ncolors = [\"steelblue\", \"orange\"]\nplt.scatter(X[:, 0], X[:, 1], c=y.ravel(), alpha=0.5, cmap=matplotlib.colors.ListedColormap(colors), edgecolors=\"black\")\nplt.xlim(x_min, x_max)\nplt.ylim(y_min, y_max)\nplt.show()\n\n<\/pre>\n<\/div>\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" width=\"512\" height=\"481\" src=\"https:\/\/blog.dominodatalab.com\/wp-content\/uploads\/2021\/06\/svm_iris_classes_plot.png\" alt=\"Scatter plot of the two Iris classes\" class=\"wp-image-7581\"><\/figure>\n<p>Recall, that for finding the optimal hyperplane we use the dual Lagrangian formulation given in (9). CVXOPT, however, expects that the problem is expressed in the following form<\/p>\n<p>$$<br \/>\nbegin{equation}<br \/>\nbegin{aligned}<br \/>\nmin quad &amp; (1\/2) boldsymbol{x}^T P boldsymbol{x} + boldsymbol{q}^Tboldsymbol{x}\\<br \/>\ntextrm{subject to} quad &amp;  A boldsymbol{x} = b\\<br \/>\nquad &amp; G boldsymbol{x} preccurlyeq boldsymbol{h}<br \/>\nend{aligned}<br \/>\nend{equation}<br \/>\n$$<\/p>\n<p>and we need to rewrite (9) to match the above format. Let\u2019s define a matrix (mathcal{H}) such that (H_{ij} = y_i y_j x_i cdot xj). We can then rewrite our original problem in vector form. We also multiply both the objective and the constraints by -1, which turns it into a minimisation task.<\/p>\n<p>$$<br \/>\nbegin{aligned}<br \/>\nmin_{alpha} quad &amp;  (1\/2) boldsymbol{alpha}^T H boldsymbol{alpha} -1^T boldsymbol{alpha} \\<br \/>\ntextrm{such that} quad &amp; y^T boldsymbol{alpha} = 0 \\<br \/>\nquad &amp; \u2013 alpha_i leq 0 ; forall i<br \/>\nend{aligned}<br \/>\n$$<\/p>\n<p>From there, we take the number of training samples (n) and construct the matrix (H).<\/p>\n<div class=\"wp-block-syntaxhighlighter-code \">\n<pre class=\"brush: python; title: ; notranslate\" title=\"\">\nn = X.shape[0]\nH = np.dot(y*X, (y*X).T)\n\n<\/pre>\n<\/div>\n<p>This takes care of the first term of the objective. For the second term we simply need a column vector of -1\u2019s.<\/p>\n<div class=\"wp-block-syntaxhighlighter-code \">\n<pre class=\"brush: python; title: ; notranslate\" title=\"\">\nq = np.repeat([-1.0], n)[..., None]\n\n<\/pre>\n<\/div>\n<p>For the first constraint we define (A) as a (1 times n) matrix that contains the labels (boldsymbol{y}), and we set (b) to 0.<\/p>\n<div class=\"wp-block-syntaxhighlighter-code \">\n<pre class=\"brush: python; title: ; notranslate\" title=\"\">\nA = y.reshape(1, -1)\nb = 0.0\n\n<\/pre>\n<\/div>\n<p>For the second constraint we construct an (n times n) matrix (G) with negative ones on the diagonal and zeros elsewhere, and a zero vector (h).<\/p>\n<div class=\"wp-block-syntaxhighlighter-code \">\n<pre class=\"brush: python; title: ; notranslate\" title=\"\">\nG = np.negative(np.eye(n))\nh = np.zeros(n)\n\n<\/pre>\n<\/div>\n<p>Finally, we convert everything to CVXOPT objects<\/p>\n<div class=\"wp-block-syntaxhighlighter-code \">\n<pre class=\"brush: python; title: ; notranslate\" title=\"\">\nP = matrix(H)\nq = matrix(q)\nG = matrix(G)\nh = matrix(h)\nA = matrix(A)\nb = matrix(b)\n\n<\/pre>\n<\/div>\n<p>and call the QP solver.<\/p>\n<div class=\"wp-block-syntaxhighlighter-code \">\n<pre class=\"brush: python; title: ; notranslate\" title=\"\">\nsol = solvers.qp(P, q, G, h, A, b)\nalphas = np.array(sol[\"x\"])\n\n<\/pre>\n<\/div>\n<div class=\"wp-block-syntaxhighlighter-code \">\n<pre class=\"brush: plain; title: ; notranslate\" title=\"\">\n     pcost       dcost       gap    pres   dres\n 0: -5.8693e+00 -1.1241e+01  3e+02  1e+01  2e+00\n 1: -5.9277e+00 -3.6988e+00  4e+01  2e+00  3e-01\n 2: -1.0647e+00 -1.9434e+00  5e+00  2e-01  2e-02\n 3: -6.5979e-01 -1.1956e+00  6e-01  6e-03  8e-04\n 4: -8.3813e-01 -1.2988e+00  5e-01  3e-03  4e-04\n 5: -1.1588e+00 -1.1784e+00  2e-02  8e-05  1e-05\n 6: -1.1763e+00 -1.1765e+00  2e-04  8e-07  1e-07\n 7: -1.1765e+00 -1.1765e+00  2e-06  8e-09  1e-09\n 8: -1.1765e+00 -1.1765e+00  2e-08  8e-11  1e-11\nOptimal solution found.\n\n<\/pre>\n<\/div>\n<p>We can now get (boldsymbol{w}) using<\/p>\n<p>$$begin{equation}<br \/>\nboldsymbol{w} =  sum alpha_i y_i boldsymbol{x}_i<br \/>\nend{equation}<br \/>\n$$<\/p>\n<div class=\"wp-block-syntaxhighlighter-code \">\n<pre class=\"brush: python; title: ; notranslate\" title=\"\">\nw = np.dot((y * alphas).T, X)[0]\n\n<\/pre>\n<\/div>\n<p>Next, we identify the support vectors, and calculate the bias using<\/p>\n<p>$$<br \/>\nbegin{equation}<br \/>\nb = (1\/|S|) sum_{s in S} ( y_s \u2013 sum_{i in S} alpha_i y_i boldsymbol{x}_i^T boldsymbol{x}_s)<br \/>\nend{equation}<br \/>\n$$<\/p>\n<div class=\"wp-block-syntaxhighlighter-code \">\n<pre class=\"brush: python; title: ; notranslate\" title=\"\">\nS = (alphas &gt; 1e-5).flatten()\nb = np.mean(y[S] - np.dot(X[S], w.reshape(-1,1)))\n\n<\/pre>\n<\/div>\n<p>Let\u2019s see the optimal (boldsymbol{w}) and (b) values.<\/p>\n<div class=\"wp-block-syntaxhighlighter-code \">\n<pre class=\"brush: python; title: ; notranslate\" title=\"\">\nprint(\"W:\", w)\nprint(\"b:\", b)\n\n<\/pre>\n<\/div>\n<div class=\"wp-block-syntaxhighlighter-code \">\n<pre class=\"brush: plain; title: ; notranslate\" title=\"\">\nW: [1.29411744 0.82352928]\nb: [-3.78823471]\n\n<\/pre>\n<\/div>\n<p>Next, we plot the separating hyperplane and the support vectors.<br \/>\nThis code is based on the <a href=\"https:\/\/scikit-learn.org\/stable\/auto_examples\/svm\/plot_svm_margin.html\">SVM Margins Example<\/a> from the scikit-learn documentation.<\/p>\n<div class=\"wp-block-syntaxhighlighter-code \">\n<pre class=\"brush: python; title: ; notranslate\" title=\"\">\nx_min = 0\nx_max = 5.5\ny_min = 0\ny_max = 2\n\nxx = np.linspace(x_min, x_max)\na = -w[0]\/w[1]\nyy = a*xx - (b)\/w[1]\n\nmargin = 1 \/ np.sqrt(np.sum(w**2))\nyy_neg = yy - np.sqrt(1 + a**2) * margin\nyy_pos = yy + np.sqrt(1 + a**2) * margin\n\nplt.figure(figsize=(8, 8))\n\nplt.plot(xx, yy, \"b-\")\nplt.plot(xx, yy_neg, \"m--\")\nplt.plot(xx, yy_pos, \"m--\")\n\ncolors = [\"steelblue\", \"orange\"]\n\nplt.scatter(X[:, 0], X[:, 1], c=y.ravel(), alpha=0.5, cmap=matplotlib.colors.ListedColormap(colors), edgecolors=\"black\")\n\nplt.xlim(x_min, x_max)\nplt.ylim(y_min, y_max)\n\nplt.show()\n\n<\/pre>\n<\/div>\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-large\"><img loading=\"lazy\" width=\"512\" height=\"481\" src=\"https:\/\/blog.dominodatalab.com\/wp-content\/uploads\/2021\/06\/svm_iris_fitted.png\" alt=\"\" class=\"wp-image-7582\"><\/figure>\n<\/div>\n<p>The plot is very satisfying, as the solution perfectly identified the support vectors that maximise the margin of separation, and the separating hyperplane is correctly placed between the two. Finally, we can also verify the correctness of our solution by fitting an SVM using the scikit-learn SVM implementation.\n<\/p>\n<div class=\"wp-block-syntaxhighlighter-code \">\n<pre class=\"brush: python; title: ; notranslate\" title=\"\">\nfrom sklearn import svm\n\n# Use the linear kernel and set C to a large value to ensure hard margin fitting\nclf = svm.SVC(kernel=\"linear\", C=10.0)\nclf.fit(X, y.ravel())\nw = clf.coef_[0]\nb = clf.intercept_\n\nprint(\"W:\", w)\nprint(\"b:\", b)\n\n<\/pre>\n<\/div>\n<div class=\"wp-block-syntaxhighlighter-code \">\n<pre class=\"brush: plain; title: ; notranslate\" title=\"\">\nW: [1.29411744 0.82352928]\nb: [-3.78823471]\n\n<\/pre>\n<\/div>\n<p>We get identical values for (boldsymbol{w}) and (b) (within the expected margin of error due to implementation specifics), which confirms that our solution is correct.<\/p>\n<h2>Summary<\/h2>\n<p>In this article we went over the mathematics of the Support Vector Machine and its associated learning algorithm. We did a \u201cfrom scratch\u201d implementation in Python using CVXOPT, and we showed that it yields identical solutions to the sklearn.svm.SVC implementation.<\/p>\n<p>Understanding the internals of SVMs arms us with extra knowledge on the suitability of the algorithm for various tasks, and enables us to write custom implementations for more complex use-cases.<\/p>\n<h2>References<\/h2>\n<ul>\n<li>Barghout, L. (2015). Spatial-taxon information granules as used in iterative fuzzy-decision-making for image segmentation. In W. Pedrycz &amp; S.-M. Chen (Eds.), Granular computing and decision-making: Interactive and iterative approaches (pp. 285\u2013318). Springer International Publishing. https:\/\/doi.org\/10.1007\/978-3-319-16829-6_12<\/li>\n<li>Decoste, D., &amp; Sch\u00f6lkopf, B. (2002). Training invariant support vector machines. Machine learning, 46(1-3), 161\u2013190.<\/li>\n<li>Duda, R. O., &amp; Hart, P. E. (1973). Pattern classification and scene analysis. John Willey &amp; Sons. <\/li>\n<li>Fisher, R. A. (1936). The use of multiple measurements in taxonomic problems. Annals of Eugenics,<br \/>7(2), 179\u2013188. https:\/\/doi.org\/https:\/\/doi.org\/10.1111\/j.1469-1809.1936.tb02137.x <\/li>\n<li>Haykin, S. S. (2009). Neural networks and learning machines (Third). Pearson Education. <\/li>\n<li>Joachims, T. (1998). Text categorization with support vector machines: Learning with many<br \/>relevant features. In C. N\u00e9dellec &amp; C. Rouveirol (Eds.), Machine learning: Ecml-98 (pp. 137\u2013142). Springer Berlin Heidelberg.<\/li>\n<li>Kuhn, H. W., &amp; Tucker, A. W. (1951). Nonlinear programming. Proceedings of the Second Berkeley Symposium on Mathematical Statistics and Probability, 1950, 481\u2013492.<\/li>\n<li>Sch\u00f6lkopf, B., Williamson, R., Smola, A., Shawe-Taylor, J., &amp; Platt, J. (1999). Support vector method for novelty detection. Proceedings of the 12th International Conference on Neural Information Processing Systems, 582\u2013588.<\/li>\n<\/ul>\n<p><!-- relpost-thumb-wrapper --><!-- close relpost-thumb-wrapper -->    <\/div>\n","protected":false},"excerpt":{"rendered":"<p>https:\/\/blog.dominodatalab.com\/fitting-support-vector-machines-quadratic-programming\/<\/p>\n","protected":false},"author":0,"featured_media":8306,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[2],"tags":[],"_links":{"self":[{"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/posts\/8305"}],"collection":[{"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/types\/post"}],"replies":[{"embeddable":true,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/comments?post=8305"}],"version-history":[{"count":0,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/posts\/8305\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/media\/8306"}],"wp:attachment":[{"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/media?parent=8305"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/categories?post=8305"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/tags?post=8305"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}