SVM math

Reference

The Hundred-Page Machine Learning Book

Concept of SVM Classification

In machine learning, the boundary separating the examples of different classes is called the decision boundary.

In SVM, a hyperplane is used as the decision boundary that classifies a feature vector x as the label y = +1 or y = −1. The hyperplane is expressed with two parameters, w and b:

wx − b = 0

where the expression wx means w^(1)x^(1) + w^(2)x^(2) + . . . + w^(D)x^(D), and D is the number of dimensions of the feature vector x.

A feature input x is classified as y = +1 or y = −1 by the condition

y = sign(wx − b)
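A minimal sketch (not from the original notes) of this decision rule, with hypothetical values of w and b for a 2-dimensional feature vector:

```python
import numpy as np

w = np.array([0.4, -1.2])   # hypothetical learned weights (D = 2)
b = 0.5                     # hypothetical learned bias
x = np.array([2.0, 1.0])    # a feature vector to classify

y = np.sign(np.dot(w, x) - b)   # wx = w(1)x(1) + w(2)x(2), then subtract b and take the sign
print(y)                        # prints -1.0, i.e. class y = -1
```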

How does SVM find the optimal model w* and b*?

By solving an optimization problem.

For SVM, this means solving for the Lagrange multipliers under the KKT conditions (covered below).

For SVM, the optimization is carried out under the constraints y_i(wx_i − b) ≥ 1 for i = 1, . . . , N.

We would also prefer that the hyperplane separates positive examples from negative ones with the largest margin. Since the margin is the inverse of ||w||, we need to minimize the norm of w.

Thus, the concept of the SVM optimization problem becomes <Eq 1>:

Minimize ||w|| with the constraint y_i(wx_i − b) ≥ 1 for i = 1, . . . , N.
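The optimization problem itself was shown as an image; a sketch of <Eq 1> in display form, using the ½||w||² version that appears later in these notes:

```latex
% Primal (hard-margin) SVM problem, <Eq 1>
\min_{\mathbf{w},\, b} \; \frac{1}{2}\|\mathbf{w}\|^{2}
\quad \text{subject to} \quad
y_i(\mathbf{w}\mathbf{x}_i - b) \ge 1, \qquad i = 1, \dots, N
```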

Questions to consider

  1. What if there are outliers and no hyperplane can perfectly separate positive examples from negative ones?

    --> Use a soft margin

  2. What if the data cannot be separated by a hyperplane of the given dimension (for example, by a plane in 3D)?

    --> Use a kernel function

Dealing with Noise/Outlier (Soft margin)

To extend SVM to cases in which the data is not linearly separable, we introduce the hinge loss function.

Hinge Loss Function

The hinge loss function is zero if the constraints y_i(wx_i − b) ≥ 1 for i = 1, . . . , N are satisfied. For data on the wrong side of the decision boundary, the loss value is proportional to the distance from the decision boundary (for example, 1 + (wx_i − b) > 0 for a misclassified example with y_i = −1).
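The hinge-loss formula was shown as an image; a minimal sketch consistent with the description above:

```latex
% Hinge loss for one training example (x_i, y_i)
\ell_{\text{hinge}}(\mathbf{x}_i, y_i) = \max\bigl(0,\; 1 - y_i(\mathbf{w}\mathbf{x}_i - b)\bigr)
```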

Soft-margin SVM

Thus, the problem becomes the minimization of the following cost function,
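The cost function was shown as an image; a sketch following the formulation in the cited Hundred-Page Machine Learning Book, where C multiplies the norm term (this matches the two bullets below):

```latex
% Soft-margin SVM objective
\min_{\mathbf{w},\, b} \;
C\,\|\mathbf{w}\|^{2}
+ \frac{1}{N}\sum_{i=1}^{N} \max\bigl(0,\; 1 - y_i(\mathbf{w}\mathbf{x}_i - b)\bigr)
```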

where C is a hyperparameter.

  • For sufficiently high values of C, the second term in the cost function becomes negligible, so the SVM algorithm will try to find the largest margin while completely ignoring misclassification (favoring generalization).

  • As we decrease the value of C, making classification errors becomes more costly, so the SVM algorithm tries to make fewer mistakes by sacrificing margin size (minimizing empirical risk).

SVMs that optimize hinge loss are called soft-margin SVMs, while the original formulation is referred to as a hard-margin SVM.

The method traditionally used to solve the optimization problem is the method of Lagrange multipliers.

Dealing with Inherent Non-Linearity (Kernel Function)

SVM can be adapted to work with datasets that cannot be separated by a hyperplane in their original space.

It is possible to transform two-dimensional non-linearly-separable data into linearly-separable three-dimensional data using a specific mapping φ.

When the 2D data is hard to classify linearly by a line

However, we don't know a priori which mapping φ would work for our data, and we do not want to try all possible mapping functions.

Instead, we can use a kernel function to work efficiently in higher-dimensional spaces without doing this transformation explicitly.

In SVM, instead of directly solving Eq (1), it is convenient to solve the equivalent Lagrange multiplier problem.

> See Method of Lagrange Multiplier

First, we have to see how the optimization algorithm for SVM finds the optimal values for w and b.

This optimization problem becomes a convex quadratic optimization problem, efficiently solvable by quadratic programming algorithms.

If we need to transform x into a higher dimension, Φ(x), there are two options:

  • Option 1) Do the transformation explicitly, then take the dot product (Φ(x_i) · Φ(x_k)) (a costly process)

  • Option 2) Kernel Trick

By using the kernel trick, we can get rid of a costly transformation of original feature vectors into higher-dimensional vectors and avoid computing their dot-product. We replace that by a simple operation on the original feature vectors that gives the same result.

Example: Using quadratic kernel for Kernel trick
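The worked example was an image; a sketch of the standard quadratic-kernel identity for 2-D inputs (the explicit mapping φ shown is illustrative):

```latex
% A squared dot product in the original space equals a dot product in a 3-D feature space
k(\mathbf{x}, \mathbf{z}) = (\mathbf{x}\cdot\mathbf{z})^{2},
\qquad
\phi(\mathbf{x}) = \bigl(x_1^{2},\; \sqrt{2}\,x_1 x_2,\; x_2^{2}\bigr)
\;\Rightarrow\;
\phi(\mathbf{x})\cdot\phi(\mathbf{z}) = (\mathbf{x}\cdot\mathbf{z})^{2}
```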

Example: RBF Kernel

It can be shown that the feature space of the RBF (for “radial basis function”) kernel has an infinite number of dimensions. By varying the hyperparameter σ, the data analyst can choose between getting a smooth or curvy decision boundary in the original space.
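For reference, a sketch of the RBF kernel with the bandwidth hyperparameter σ mentioned above:

```latex
% RBF (Gaussian) kernel
k(\mathbf{x}, \mathbf{x}') = \exp\!\left(-\frac{\|\mathbf{x} - \mathbf{x}'\|^{2}}{2\sigma^{2}}\right)
```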

Lagrange Multiplier Method in SVM

Lagrange Multiplier Method

Aim: Minimize a function f(x) under constraints of the form g_i(x) ≥ 0.

The optimal solution can be found by optimizing the Lagrangian of f(x):
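The Lagrangian formula image is missing here; a minimal sketch, assuming the sign convention that matches the SVM Lagrangian used later in these notes:

```latex
% Lagrangian of f(x) with one multiplier alpha_i >= 0 per constraint g_i(x) >= 0
L(\mathbf{x}, \boldsymbol{\alpha}) = f(\mathbf{x}) - \sum_{i} \alpha_i\, g_i(\mathbf{x}),
\qquad \alpha_i \ge 0
```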

The minimum of f(x) under the constraints is found at the stationary point of the Lagrangian, in which we maximize with respect to all α_i and minimize with respect to x.

Stationary Point

The stationary point is the point at which the gradient with respect to all α_i and x is zero.

Note that the function can be either a maximum or a minimum at a stationary point.

Lagrange Multipliers for SVM problem

The constraint for SVM, g_i(x) ≥ 0, is the set of inequalities y_i(wx_i − b) ≥ 1.

Thus, for these inequality constraints, the Karush-Kuhn-Tucker (KKT) conditions have to be satisfied for all i:
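The conditions themselves were shown as an image; a sketch of the standard KKT conditions for constraints g_i(x) ≥ 0 with multipliers α_i:

```latex
% KKT conditions, to hold for all i at the optimum
\nabla_{\mathbf{x}} L(\mathbf{x}, \boldsymbol{\alpha}) = 0, \qquad
g_i(\mathbf{x}) \ge 0, \qquad
\alpha_i \ge 0, \qquad
\alpha_i\, g_i(\mathbf{x}) = 0 \quad \text{(complementary slackness)}
```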

The SVM problem was

Minimize ||w|| with the constraint y_i(wx_i − b) ≥ 1 for i = 1, . . . , N.

Minimizing ||w|| is equivalent to minimizing ½||w||².

This can be written as a Lagrangian with f(x) → ½||w||²:
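The resulting Lagrangian was shown as an image; a sketch, with g_i(w, b) = y_i(wx_i − b) − 1:

```latex
% Hard-margin SVM Lagrangian
L(\mathbf{w}, b, \boldsymbol{\alpha})
= \frac{1}{2}\|\mathbf{w}\|^{2}
- \sum_{i=1}^{N} \alpha_i \bigl[\, y_i(\mathbf{w}\mathbf{x}_i - b) - 1 \,\bigr],
\qquad \alpha_i \ge 0
```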

In SVM,

  • f = ½||w||² is a convex function that has one minimum

  • the constraints g_i are linear

For such a case, the value of the Lagrangian at the stationary point is exactly the same as the minimum of f.

At the stationary point, L = f, where f is at its minimum.

This means that at the stationary point, each constraint satisfies α_i g_i = 0.

Then, for such conditions, all conditions are satisfied if either y_i(wx_i − b) = 1 or α_i = 0.

Dual SVM formulation (Derivation)

Concept

The primal SVM optimization problem

Minimize ½||w||² with the constraint y_i(wx_i − b) ≥ 1 for i = 1, . . . , N

is solved using the Lagrange multiplier method, by expressing it as a Lagrangian as in the previous section.

But usually, SVM is re-written in the dual formulation, which is independent of w and uses only the support vectors x, so that we can apply kernel functions (the kernel trick).

  • It is a quadratic programming (convex) problem that has a global solution.

  • The conditions now become (1) one equality constraint and (2) N inequality constraints (see the sketch below).
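The dual formula image is missing here; a sketch of the standard dual problem, consistent with the wx − b notation used above:

```latex
% Dual SVM problem: optimize over alpha only
\max_{\boldsymbol{\alpha}} \;
\sum_{i=1}^{N} \alpha_i
- \frac{1}{2} \sum_{i=1}^{N}\sum_{k=1}^{N} \alpha_i \alpha_k\, y_i y_k\, (\mathbf{x}_i \cdot \mathbf{x}_k)
\quad \text{s.t.} \quad
\alpha_i \ge 0, \;\; \sum_{i=1}^{N} \alpha_i y_i = 0
```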

Support vector in dual form

The data points x_j corresponding to nonzero α_j are the support vectors.

To make predictions, only the support vectors x are used (b is calculated from them), so prediction in SVM is very fast:
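The prediction formula was shown as an image; a sketch written in terms of the support vectors (SV):

```latex
% Prediction uses only the support vectors (the x_j with nonzero alpha_j)
y(\mathbf{x}) = \mathrm{sign}\!\left( \sum_{j \in SV} \alpha_j\, y_j\, (\mathbf{x}_j \cdot \mathbf{x}) - b \right)
```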

Derivation
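The derivation steps were shown as images; a minimal sketch of the standard route (set the gradients of the Lagrangian with respect to w and b to zero, then substitute back):

```latex
% Stationarity of L(w, b, alpha) with respect to w and b
\frac{\partial L}{\partial \mathbf{w}} = 0 \;\Rightarrow\; \mathbf{w} = \sum_{i} \alpha_i y_i \mathbf{x}_i,
\qquad
\frac{\partial L}{\partial b} = 0 \;\Rightarrow\; \sum_{i} \alpha_i y_i = 0

% Substituting these back eliminates w and b, leaving the dual objective
L(\boldsymbol{\alpha}) = \sum_{i} \alpha_i
- \frac{1}{2} \sum_{i}\sum_{k} \alpha_i \alpha_k\, y_i y_k\, (\mathbf{x}_i \cdot \mathbf{x}_k)
```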

So it becomes the optimization of L with respect to α_i only. Thus, we want to solve L for the α_i values that maximize it at the stationary point.

It now becomes finding the stationary point of L(α), with α as the only parameter (instead of w and b).

We can use the kernel trick on x_i · x_k.

How to solve the optimal parameters of the Dual SVM formula?

This means solving for the N values of α_i. There are several algorithms that can be applied, such as the following (a library usage sketch follows the list):

  • SMO: sequential minimal optimization

  • Cutting plane methods

  • Coordinate descent
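As a usage sketch (not the original author's code): scikit-learn's SVC solves this dual problem internally with an SMO-type solver (libsvm). The data and parameter values below are made up for illustration.

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[0., 0.], [1., 1.], [0., 1.], [1., 0.]])  # toy XOR-like 2-D features
y = np.array([-1, -1, 1, 1])                            # labels in {-1, +1}

clf = SVC(kernel="rbf", C=1.0, gamma=0.5)  # for the RBF kernel, gamma corresponds to 1/(2*sigma^2)
clf.fit(X, y)

print(clf.support_vectors_)        # the x_j with nonzero alpha_j
print(clf.dual_coef_)              # alpha_j * y_j for the support vectors
print(clf.predict([[0.9, 0.1]]))   # prediction uses only the support vectors
```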

Non-Linear SVM

Just use the kernel function on x_i · x_k.
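A sketch of what this means concretely: replace every dot product x_i · x_k with a kernel k(x_i, x_k), both in the dual objective and in the prediction rule.

```latex
% Kernelized prediction rule (dot products replaced by the kernel)
y(\mathbf{x}) = \mathrm{sign}\!\left( \sum_{j \in SV} \alpha_j\, y_j\, k(\mathbf{x}_j, \mathbf{x}) - b \right)
```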

Reference

[집콕]딥러닝 시대에도 필요한 고급기계학습 - Support Vector Machine (video | K-MOOC)
