SVM math
Notes based on _The Hundred-Page Machine Learning Book_.
In machine learning, the boundary separating the examples of different classes is called the decision boundary.
In SVM, a hyperplane is used as the decision boundary to classify a feature vector x as the label y = +1 or y = -1. The hyperplane is expressed with two parameters, a weight vector w and a bias b:

wx - b = 0,

where the expression wx means w(1)x(1) + w(2)x(2) + . . . + w(D)x(D), and D is the number of dimensions of the feature vector x.
The feature vector x is classified as y = +1 or y = -1 by the rule y = sign(wx - b). The parameters w and b are found by solving an optimization problem.
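As a quick illustration, here is a minimal NumPy sketch of that decision rule; the values of w and b are made up for the example, not learned:

```python
import numpy as np

# Hypothetical, hand-picked parameters of a hyperplane wx - b = 0.
w = np.array([0.4, -1.2])   # weight vector, one entry per feature dimension
b = 0.5                     # bias term

def predict(x):
    """Classify a feature vector x as +1 or -1 using y = sign(wx - b)."""
    return 1 if np.dot(w, x) - b >= 0 else -1

print(predict(np.array([3.0, 0.2])))   # -> +1 (positive side of the hyperplane)
print(predict(np.array([0.0, 2.0])))   # -> -1 (negative side)
```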
For SVM, that optimization problem is solved with Lagrange multipliers under the KKT conditions.
It is an optimization with constraints that require every example to lie on the correct side of the hyperplane:

y_i (wx_i - b) >= 1 for i = 1, . . . , N.
We would also prefer that the hyperplane separates positive examples from negative ones with the largest margin.
Since the margin is 2/||w|| (inversely proportional to the norm of w), maximizing the margin means minimizing the norm of w.
Thus, the SVM optimization problem becomes:

minimize 1/2 ||w||^2 subject to y_i (wx_i - b) >= 1 for i = 1, . . . , N.    (1)
What if there are outliers and no hyperplane can perfectly separate positive examples from negative ones?
--> Use a soft margin
What if the data cannot be separated by a hyperplane in its original space (for example, by a plane in 3D)?
--> Use a kernel function
To extend SVM to cases in which the data is not linearly separable, we introduce the hinge loss function:

max(0, 1 - y_i (wx_i - b)).

The hinge loss is zero if the constraints y_i (wx_i - b) >= 1 are satisfied for i = 1, . . . , N.
For data on the wrong side of the decision boundary, the hinge loss value is proportional to the distance from the decision boundary.
Thus, the problem becomes the minimization of the following cost function:

C ||w||^2 + (1/N) * sum_{i=1..N} max(0, 1 - y_i (wx_i - b)),

where C is a hyperparameter.
For sufficiently high values of C, the second term in the cost function becomes negligible, so the SVM algorithm tries to find the largest margin while completely ignoring misclassification (favoring generalization).
As we decrease the value of C, making classification errors becomes more costly, so the SVM algorithm tries to make fewer mistakes by sacrificing margin size (minimizing empirical risk).
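A minimal NumPy sketch of this cost function in the book's form C||w||^2 + mean hinge loss; the dataset and parameter values below are made up purely for illustration:

```python
import numpy as np

def soft_margin_cost(w, b, X, y, C):
    """C * ||w||^2 + mean hinge loss: the soft-margin SVM objective."""
    margins = y * (X @ w - b)               # y_i (w x_i - b) for every example
    hinge = np.maximum(0.0, 1.0 - margins)  # zero when the margin constraint holds
    return C * np.dot(w, w) + hinge.mean()

# Tiny made-up dataset: two features, labels in {-1, +1}.
X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([1, 1, -1, -1])

print(soft_margin_cost(np.array([0.5, 0.5]), 0.0, X, y, C=1.0))
```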
SVMs that optimize hinge loss are called soft-margin SVMs, while the original formulation is referred to as a hard-margin SVM.
The method traditionally used to solve the optimization problem is the method of Lagrange multipliers.
SVM can also be adapted to work with datasets that cannot be separated by a hyperplane in their original space.
When 2D data cannot be separated linearly by a line, it is possible to transform the two-dimensional non-linearly-separable data into linearly-separable three-dimensional data using a specific mapping φ.
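As a concrete illustration, one standard choice of such a mapping (an assumption for this sketch, not the only option) is φ(x) = (x1^2, sqrt(2)·x1·x2, x2^2), which sends 2D points to 3D; points separable only by a circle in 2D become separable by a plane in 3D:

```python
import numpy as np

def phi(x):
    """Map a 2D point to 3D: (x1^2, sqrt(2)*x1*x2, x2^2)."""
    x1, x2 = x
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])

# Points inside a circle of radius 1 vs. outside it: not separable by a line in 2D.
inside  = np.array([[0.2, 0.3], [-0.4, 0.1]])
outside = np.array([[1.5, 0.5], [-1.2, 1.1]])

# In the mapped space, the first + third coordinate equals x1^2 + x2^2,
# so a plane (threshold at 1) separates the two groups.
print([phi(p)[0] + phi(p)[2] for p in inside])   # all < 1
print([phi(p)[0] + phi(p)[2] for p in outside])  # all > 1
```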
However, we don’t know a priori which mapping φ would work for our data, and we do not want to try every possible mapping function.
Instead, we can use a kernel function to work efficiently in higher-dimensional spaces without doing this transformation explicitly.
In SVM, instead of directly solving Eq (1), it is convenient to solve an equivalent problem using the method of _Lagrange multipliers_.
First, we have to see how the optimization algorithm for SVM finds the optimal values for w and b.
This optimization problem becomes a convex quadratic optimization problem, efficiently solvable by quadratic programming algorithms.
If we need to transform x into a higher dimension, there are two options:
Option 1) Do the transformation explicitly, then take the dot product φ(x_i) · φ(x_j) // costly process
Option 2) Use the kernel trick
By using the kernel trick, we can get rid of a costly transformation of original feature vectors into higher-dimensional vectors and avoid computing their dot-product. We replace that by a simple operation on the original feature vectors that gives the same result.
Example: using the quadratic kernel k(x, z) = (x · z)^2 for the kernel trick.
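A quick NumPy check, with made-up vectors, that the quadratic kernel gives the same value as explicitly mapping to 3D with φ and then taking a dot product there:

```python
import numpy as np

def phi(x):
    """Explicit quadratic feature map for 2D input."""
    x1, x2 = x
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])

def quadratic_kernel(x, z):
    """k(x, z) = (x . z)^2 -- same result, no explicit 3D vectors needed."""
    return np.dot(x, z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

print(np.dot(phi(x), phi(z)))   # explicit transform, then dot product -> 1.0
print(quadratic_kernel(x, z))   # kernel trick: one dot product and a square -> 1.0
```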
Example: the RBF kernel k(x, x') = exp(-||x - x'||^2 / (2σ^2)).
It can be shown that the feature space of the RBF (for “radial basis function”) kernel has an infinite number of dimensions. By varying the hyperparameter σ, the data analyst can choose between getting a smooth or curvy decision boundary in the original space.
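A minimal sketch of the RBF kernel; the σ values below are arbitrary and only show how the hyperparameter controls how quickly similarity decays with distance (and hence how smooth or curvy the decision boundary is):

```python
import numpy as np

def rbf_kernel(x, z, sigma):
    """RBF kernel: exp(-||x - z||^2 / (2 * sigma^2))."""
    diff = x - z
    return np.exp(-np.dot(diff, diff) / (2.0 * sigma**2))

x = np.array([0.0, 0.0])
z = np.array([1.0, 1.0])

for sigma in (0.5, 1.0, 3.0):
    # Larger sigma -> similarity decays more slowly -> smoother decision boundary.
    print(sigma, rbf_kernel(x, z, sigma))
```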
Aim: Minimize a function f(x) under constraints in the form g_i(x) >= 0
The optimal solution can be found by optimizing the Lagrangian of f(x):

L(x, α) = f(x) - sum_i α_i g_i(x), with α_i >= 0.

Minimizing f(x) under the constraints amounts to finding the stationary point of the Lagrangian, at which we maximize with respect to all α_i and minimize with respect to x.
The stationary point is the point at which the gradient with respect to all α_i and x is zero.
Note that the function can be either a maximum or a minimum at a stationary point.
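A tiny worked example (a toy problem, not from the book) of the same recipe: minimize f(x) = x^2 subject to g(x) = x - 1 >= 0.

```latex
% Lagrangian for the toy problem, using L = f - alpha * g with alpha >= 0:
\mathcal{L}(x, \alpha) = x^2 - \alpha (x - 1), \qquad \alpha \ge 0
% Stationary point: dL/dx = 2x - \alpha = 0 and dL/d\alpha = -(x - 1) = 0
% => x = 1, \alpha = 2, and L(1, 2) = 1 = f(1), the constrained minimum.
```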
The constraints for SVM are inequalities:

g_i(w, b) = y_i (wx_i - b) - 1 >= 0 for i = 1, . . . , N.

Thus, for inequality constraints, the Karush-Kuhn-Tucker (KKT) conditions have to be satisfied for all i: stationarity (the gradient of L with respect to x is zero), primal feasibility (g_i(x) >= 0), dual feasibility (α_i >= 0), and complementary slackness (α_i g_i(x) = 0).
In SVM, we minimize ||w|| with the constraints y_i (wx_i - b) - 1 >= 0 for i = 1, . . . , N.
Minimizing ||w|| is equivalent to minimizing 1/2 ||w||^2, which is more convenient to differentiate.
In the Lagrangian framework this corresponds to f --> 1/2 ||w||^2 and g_i --> y_i (wx_i - b) - 1.
In SVM, the function f = 1/2 ||w||^2 is convex, so it has a single minimum, and the constraints g_i are linear.
For such a case, the value of the Lagrangian at the stationary point is exactly the same as the minimum of f.
@Stationary point: L = f, where f is at its minimum.
It means that at the stationary point the constraint terms vanish: α_i g_i = 0 for all i.
This holds if, for each i, either α_i = 0 or g_i = 0 (or both) — the complementary slackness condition.
The SVM optimization problem (primal),

minimize 1/2 ||w||^2 subject to y_i (wx_i - b) - 1 >= 0 for i = 1, . . . , N,

is solved with the Lagrange multiplier method by expressing it as

L(w, b, α) = 1/2 ||w||^2 - sum_{i=1..N} α_i [y_i (wx_i - b) - 1], with α_i >= 0.
But usually, SVM is rewritten in its dual form, which does not depend on w and uses only the (support) vectors x_i through dot products, so that we can apply kernel functions (the kernel trick).
So it becomes optimizing L with respect to α only: setting the derivatives with respect to w and b to zero gives w = sum_i α_i y_i x_i and sum_i α_i y_i = 0, and substituting these back leaves the dual

maximize L(α) = sum_i α_i - 1/2 sum_i sum_j α_i α_j y_i y_j (x_i · x_j),

i.e., finding the stationary point of L with only the parameters α_i (instead of w, b).
It is a quadratic programming (convex) problem that has a global solution.
The conditions now become (1) one equality condition, sum_i α_i y_i = 0, and (2) N inequality conditions, α_i >= 0 for i = 1, . . . , N.
We can use the kernel trick on the dot product x_i · x_j, replacing it with k(x_i, x_j).
The data points xj corresponding to nonzero αj are the support vectors.
To make predictions, only the support vectors x are used (b is computed from a support vector), so predictions with an SVM are very fast:

y = sign( sum_i α_i y_i k(x_i, x) - b ).

Solving the dual means solving for the N multipliers α_i. There are several algorithms that can be applied, such as:
SMO: sequential minimal optimization
Cutting-plane methods
Coordinate descent
Just use the kernel function on the dot products x_i · x_j in the dual objective and in the prediction function.
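To make the whole pipeline concrete, here is a hedged sketch (not the book's code, and not a production solver) that maximizes the dual for a tiny made-up dataset with scipy.optimize.minimize, reads off the support vectors, computes b from one of them, and predicts with the kernelized decision function; in practice a dedicated algorithm such as SMO would be used instead:

```python
import numpy as np
from scipy.optimize import minimize

# Tiny, made-up, linearly separable dataset (labels in {-1, +1}).
X = np.array([[2.0, 2.0], [2.5, 1.5], [3.0, 3.0],
              [-1.0, -1.0], [-2.0, -0.5], [-1.5, -2.0]])
y = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])
N = len(y)

def kernel(a, b):
    """Linear kernel; any other kernel (e.g. RBF) can be swapped in here."""
    return np.dot(a, b)

K = np.array([[kernel(xi, xj) for xj in X] for xi in X])

def neg_dual(alpha):
    """Negative of the dual objective (minimizing it maximizes the dual)."""
    return 0.5 * alpha @ (np.outer(y, y) * K) @ alpha - alpha.sum()

res = minimize(neg_dual, np.zeros(N), method="SLSQP",
               bounds=[(0.0, None)] * N,                   # alpha_i >= 0
               constraints=[{"type": "eq",                 # sum_i alpha_i y_i = 0
                             "fun": lambda a: a @ y}])
alpha = res.x

# Support vectors are the examples with nonzero alpha.
sv = alpha > 1e-6
x_s, y_s = X[sv][0], y[sv][0]                              # any one support vector
b = sum(alpha[j] * y[j] * kernel(X[j], x_s) for j in range(N)) - y_s

def predict(x_new):
    """sign( sum_j alpha_j y_j k(x_j, x_new) - b ): only support vectors contribute."""
    score = sum(alpha[j] * y[j] * kernel(X[j], x_new) for j in range(N)) - b
    return np.sign(score)

print(predict(np.array([2.0, 1.0])), predict(np.array([-2.0, -1.0])))  # 1.0 -1.0
```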