SVM math
Notes based on _The Hundred-Page Machine Learning Book_.
In machine learning, the boundary separating the examples of different classes is called the decision boundary.
In SVM, a hyperplane is used as the decision boundary to classify a feature vector x as the label y = +1 or y = -1. The hyperplane is expressed with two parameters, a weight vector w and a bias b:

wx - b = 0,

where the expression wx means w(1)x(1) + w(2)x(2) + . . . + w(D)x(D), and D is the number of dimensions of the feature vector x.
The feature vector x is classified as y = +1 or y = -1 by the rule y = sign(wx - b). The parameters w and b are found by solving an optimization problem.
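As a quick illustration, here is a minimal NumPy sketch of that decision rule; the values of w and b are made up for the example, not learned:

```python
import numpy as np

# Hypothetical, hand-picked parameters of a hyperplane wx - b = 0.
w = np.array([0.4, -1.2])   # weight vector, one entry per feature dimension
b = 0.5                     # bias term

def predict(x):
    """Classify a feature vector x as +1 or -1 using y = sign(wx - b)."""
    return 1 if np.dot(w, x) - b >= 0 else -1

print(predict(np.array([3.0, 0.2])))   # -> +1 (positive side of the hyperplane)
print(predict(np.array([0.0, 2.0])))   # -> -1 (negative side)
```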
For SVM, that optimization problem is solved with Lagrange multipliers under the KKT conditions.
It is an optimization with constraints that require every example to lie on the correct side of the hyperplane:

y_i (wx_i - b) >= 1 for i = 1, . . . , N.
We would also prefer that the hyperplane separates positive examples from negative ones with the largest margin.
Since the margin is 2/||w|| (inversely proportional to the norm of w), maximizing the margin means minimizing the norm of w.
Thus, the SVM optimization problem becomes:

minimize 1/2 ||w||^2 subject to y_i (wx_i - b) >= 1 for i = 1, . . . , N.    (1)
What if there are outliers and no hyperplane can perfectly separate positive examples from negative ones?
--> Use a soft margin
What if the data cannot be separated by a hyperplane in its original space (for example, by a plane in 3D)?
--> Use a kernel function
To extend SVM to cases in which the data is not linearly separable, we introduce the hinge loss function:

max(0, 1 - y_i (wx_i - b)).

The hinge loss is zero if the constraints y_i (wx_i - b) >= 1 are satisfied for i = 1, . . . , N.
For data on the wrong side of the decision boundary, the hinge loss value is proportional to the distance from the decision boundary.
Thus, the problem becomes the minimization of the following cost function:

C ||w||^2 + (1/N) * sum_{i=1..N} max(0, 1 - y_i (wx_i - b)),

where C is a hyperparameter.
For sufficiently high values of C, the second term in the cost function becomes negligible, so the SVM algorithm tries to find the largest margin while completely ignoring misclassification (favoring generalization).
As we decrease the value of C, making classification errors becomes more costly, so the SVM algorithm tries to make fewer mistakes by sacrificing margin size (minimizing empirical risk).
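A minimal NumPy sketch of this cost function in the book's form C||w||^2 + mean hinge loss; the dataset and parameter values below are made up purely for illustration:

```python
import numpy as np

def soft_margin_cost(w, b, X, y, C):
    """C * ||w||^2 + mean hinge loss: the soft-margin SVM objective."""
    margins = y * (X @ w - b)               # y_i (w x_i - b) for every example
    hinge = np.maximum(0.0, 1.0 - margins)  # zero when the margin constraint holds
    return C * np.dot(w, w) + hinge.mean()

# Tiny made-up dataset: two features, labels in {-1, +1}.
X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([1, 1, -1, -1])

print(soft_margin_cost(np.array([0.5, 0.5]), 0.0, X, y, C=1.0))
```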
SVMs that optimize hinge loss are called soft-margin SVMs, while the original formulation is referred to as a hard-margin SVM.
The method traditionally used to solve the optimization problem is the method of Lagrange multipliers.
SVM can also be adapted to work with datasets that cannot be separated by a hyperplane in their original space.
When 2D data cannot be separated linearly by a line, it is possible to transform the two-dimensional non-linearly-separable data into linearly-separable three-dimensional data using a specific mapping φ.
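As a concrete illustration, one standard choice of such a mapping (an assumption for this sketch, not the only option) is φ(x) = (x1^2, sqrt(2)·x1·x2, x2^2), which sends 2D points to 3D; points separable only by a circle in 2D become separable by a plane in 3D:

```python
import numpy as np

def phi(x):
    """Map a 2D point to 3D: (x1^2, sqrt(2)*x1*x2, x2^2)."""
    x1, x2 = x
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])

# Points inside a circle of radius 1 vs. outside it: not separable by a line in 2D.
inside  = np.array([[0.2, 0.3], [-0.4, 0.1]])
outside = np.array([[1.5, 0.5], [-1.2, 1.1]])

# In the mapped space, the first + third coordinate equals x1^2 + x2^2,
# so a plane (threshold at 1) separates the two groups.
print([phi(p)[0] + phi(p)[2] for p in inside])   # all < 1
print([phi(p)[0] + phi(p)[2] for p in outside])  # all > 1
```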
However, we don’t know a priori which mapping φ would work for our data, and we do not want to try every possible mapping function.
Instead, we can use a kernel function to work efficiently in higher-dimensional spaces without doing this transformation explicitly.
In SVM, instead of directly solving Eq (1), it is convenient to solve an equivalent problem using the method of _Lagrange multipliers_.
First, we have to see how the optimization algorithm for SVM finds the optimal values for w and b.
This optimization problem becomes a convex quadratic optimization problem, efficiently solvable by quadratic programming algorithms.
If we need to transform x into a higher dimension, there are two options:
Option 1) Do the transformation explicitly, then take the dot product φ(x_i) · φ(x_j) // costly process
Option 2) Use the kernel trick
By using the kernel trick, we can get rid of a costly transformation of original feature vectors into higher-dimensional vectors and avoid computing their dot-product. We replace that by a simple operation on the original feature vectors that gives the same result.
Example: using the quadratic kernel k(x, z) = (x · z)^2 for the kernel trick.
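A quick NumPy check, with made-up vectors, that the quadratic kernel gives the same value as explicitly mapping to 3D with φ and then taking a dot product there:

```python
import numpy as np

def phi(x):
    """Explicit quadratic feature map for 2D input."""
    x1, x2 = x
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])

def quadratic_kernel(x, z):
    """k(x, z) = (x . z)^2 -- same result, no explicit 3D vectors needed."""
    return np.dot(x, z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

print(np.dot(phi(x), phi(z)))   # explicit transform, then dot product -> 1.0
print(quadratic_kernel(x, z))   # kernel trick: one dot product and a square -> 1.0
```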
Example: the RBF kernel k(x, x') = exp(-||x - x'||^2 / (2σ^2)).
It can be shown that the feature space of the RBF (for “radial basis function”) kernel has an infinite number of dimensions. By varying the hyperparameter σ, the data analyst can choose between getting a smooth or curvy decision boundary in the original space.
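A minimal sketch of the RBF kernel; the σ values below are arbitrary and only show how the hyperparameter controls how quickly similarity decays with distance (and hence how smooth or curvy the decision boundary is):

```python
import numpy as np

def rbf_kernel(x, z, sigma):
    """RBF kernel: exp(-||x - z||^2 / (2 * sigma^2))."""
    diff = x - z
    return np.exp(-np.dot(diff, diff) / (2.0 * sigma**2))

x = np.array([0.0, 0.0])
z = np.array([1.0, 1.0])

for sigma in (0.5, 1.0, 3.0):
    # Larger sigma -> similarity decays more slowly -> smoother decision boundary.
    print(sigma, rbf_kernel(x, z, sigma))
```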
Aim: Minimize a function f(x) under constraints in the form g_i(x) >= 0
The optimal solution can be found by optimizing the Lagrangian of f(x):

L(x, α) = f(x) - sum_i α_i g_i(x), with α_i >= 0.

Minimizing f(x) under the constraints amounts to finding the stationary point of the Lagrangian, at which we maximize with respect to all α_i and minimize with respect to x.
The stationary point is the point at which the gradient with respect to all α_i and x is zero.
Note that the function can be either a maximum or a minimum at a stationary point.
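A tiny worked example (a toy problem, not from the book) of the same recipe: minimize f(x) = x^2 subject to g(x) = x - 1 >= 0.

```latex
% Lagrangian for the toy problem, using L = f - alpha * g with alpha >= 0:
\mathcal{L}(x, \alpha) = x^2 - \alpha (x - 1), \qquad \alpha \ge 0
% Stationary point: dL/dx = 2x - \alpha = 0 and dL/d\alpha = -(x - 1) = 0
% => x = 1, \alpha = 2, and L(1, 2) = 1 = f(1), the constrained minimum.
```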
The constraints for SVM are inequalities:

g_i(w, b) = y_i (wx_i - b) - 1 >= 0 for i = 1, . . . , N.

Thus, for inequality constraints, the Karush-Kuhn-Tucker (KKT) conditions have to be satisfied for all i: stationarity (the gradient of L with respect to x is zero), primal feasibility (g_i(x) >= 0), dual feasibility (α_i >= 0), and complementary slackness (α_i g_i(x) = 0).
In SVM, we minimize ||w|| with the constraints y_i (wx_i - b) - 1 >= 0 for i = 1, . . . , N.
Minimizing ||w|| is equivalent to minimizing 1/2 ||w||^2, which is more convenient to differentiate.
In the Lagrangian framework this corresponds to f --> 1/2 ||w||^2 and g_i --> y_i (wx_i - b) - 1.
In SVM, the function f = 1/2 ||w||^2 is convex, so it has a single minimum, and the constraints g_i are linear.
For such a case, the value of the Lagrangian at the stationary point is exactly the same as the minimum of f.
@Stationary point: L = f, where f is at its minimum.
It means that at the stationary point the constraint terms vanish: α_i g_i = 0 for all i.
This holds if, for each i, either α_i = 0 or g_i = 0 (or both) — the complementary slackness condition.
The SVM optimization problem (primal),

minimize 1/2 ||w||^2 subject to y_i (wx_i - b) - 1 >= 0 for i = 1, . . . , N,

is solved with the Lagrange multiplier method by expressing it as

L(w, b, α) = 1/2 ||w||^2 - sum_{i=1..N} α_i [y_i (wx_i - b) - 1], with α_i >= 0.
But usually, SVM is rewritten in its dual form, which does not depend on w and uses only the (support) vectors x_i through dot products, so that we can apply kernel functions (the kernel trick).
So it becomes optimizing L with respect to α only: setting the derivatives with respect to w and b to zero gives w = sum_i α_i y_i x_i and sum_i α_i y_i = 0, and substituting these back leaves the dual

maximize L(α) = sum_i α_i - 1/2 sum_i sum_j α_i α_j y_i y_j (x_i · x_j),

i.e., finding the stationary point of L with only the parameters α_i (instead of w, b).
It is a quadratic programming (convex) problem that has a global solution.
The conditions now become (1) one equality condition, sum_i α_i y_i = 0, and (2) N inequality conditions, α_i >= 0 for i = 1, . . . , N.
We can use the kernel trick on the dot product x_i · x_j, replacing it with k(x_i, x_j).
The data points xj corresponding to nonzero αj are the support vectors.
To make predictions, only the support vectors x are used (b is computed from a support vector), so predictions with an SVM are very fast:

y = sign( sum_i α_i y_i k(x_i, x) - b ).

Solving the dual means solving for the N multipliers α_i. There are several algorithms that can be applied, such as:
SMO: sequential minimal optimization
Cutting-plane methods
Coordinate descent
Just use the kernel function on the dot products x_i · x_j in the dual objective and in the prediction function.
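To make the whole pipeline concrete, here is a hedged sketch (not the book's code, and not a production solver) that maximizes the dual for a tiny made-up dataset with scipy.optimize.minimize, reads off the support vectors, computes b from one of them, and predicts with the kernelized decision function; in practice a dedicated algorithm such as SMO would be used instead:

```python
import numpy as np
from scipy.optimize import minimize

# Tiny, made-up, linearly separable dataset (labels in {-1, +1}).
X = np.array([[2.0, 2.0], [2.5, 1.5], [3.0, 3.0],
              [-1.0, -1.0], [-2.0, -0.5], [-1.5, -2.0]])
y = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])
N = len(y)

def kernel(a, b):
    """Linear kernel; any other kernel (e.g. RBF) can be swapped in here."""
    return np.dot(a, b)

K = np.array([[kernel(xi, xj) for xj in X] for xi in X])

def neg_dual(alpha):
    """Negative of the dual objective (minimizing it maximizes the dual)."""
    return 0.5 * alpha @ (np.outer(y, y) * K) @ alpha - alpha.sum()

res = minimize(neg_dual, np.zeros(N), method="SLSQP",
               bounds=[(0.0, None)] * N,                   # alpha_i >= 0
               constraints=[{"type": "eq",                 # sum_i alpha_i y_i = 0
                             "fun": lambda a: a @ y}])
alpha = res.x

# Support vectors are the examples with nonzero alpha.
sv = alpha > 1e-6
x_s, y_s = X[sv][0], y[sv][0]                              # any one support vector
b = sum(alpha[j] * y[j] * kernel(X[j], x_s) for j in range(N)) - y_s

def predict(x_new):
    """sign( sum_j alpha_j y_j k(x_j, x_new) - b ): only support vectors contribute."""
    score = sum(alpha[j] * y[j] * kernel(X[j], x_new) for j in range(N)) - b
    return np.sign(score)

print(predict(np.array([2.0, 1.0])), predict(np.array([-2.0, -1.0])))  # 1.0 -1.0
```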