SVM concept
For classification, the shortest distance between an observation and the threshold is called the margin.
Maximal Margin Classifier
It uses the threshold that gives the largest margin to make classifications.
If the threshold is halfway between the two edge observations (red and green), the margin is the largest.
If the threshold is moved to the left or right, the margin becomes smaller.
If there is an outlier and misclassification of the outlier is NOT allowed, then the Maximal Margin Classifier would NOT classify new observations properly.
The Maximal Margin Classifier is very sensitive to outliers in the training data.
Solution: Allow misclassification
If the threshold is placed halfway between the two clusters, it allows a misclassification of the outlier, but it classifies new observations more sensibly.
The new observation will be classified as obese, which makes sense because it is closer to most of the obese observations.
Soft Margin: the distance between the observations and the threshold when misclassification is allowed.
How do we choose a better Soft Margin? Use cross validation to determine how many misclassifications and observations to allow inside the Soft Margin.
Support Vector Classifier (SVC): using a Soft Margin to find the best threshold for classification.
Support Vectors: Observations on the edge and within the soft margin
The observations within the Soft Margin are misclassified in this example. Cross Validation determines whether allowing this misclassification gives better classification overall; a sketch of this tuning is shown below.
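A minimal sketch of this tuning, assuming scikit-learn and a made-up two-class dataset (the parameter C controls how soft the margin is; smaller C allows more misclassifications):

```python
# Sketch: pick the soft margin by cross validation (smaller C = softer margin).
from sklearn.datasets import make_blobs
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Toy two-class data standing in for the mass/obesity example
X, y = make_blobs(n_samples=60, centers=2, random_state=0)

search = GridSearchCV(
    SVC(kernel="linear"),
    param_grid={"C": [0.01, 0.1, 1, 10, 100]},  # candidate soft-margin strengths
    cv=5,                                       # 5-fold cross validation
)
search.fit(X, y)
print("Best C chosen by cross validation:", search.best_params_)
```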
For 3D data, the Support Vector Classifier is a plane; for 4D data or higher, the SVC is a hyperplane.
An SVC may not work for certain data without changing the dimension. Example:
A Support Vector Machine (SVM) can overcome this limitation, as shown in the following example.
Let's transform the data from 1D to 2D ( y = Dosage^2 ).
Then, a new observation can be classified by which side of the SVC (now a line in 2D) it falls on.
(1) With low-dimensional data, the classes may not divide nicely. (2) Move the data into a higher dimension using a Kernel function. (3) Find an SVC that separates the higher-dimensional data. A small sketch of these steps follows.
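A minimal sketch of steps (1)-(3), assuming made-up 1D Dosage values where only the middle doses are effective; adding the y = Dosage^2 coordinate makes the classes separable by a straight line (an SVC in 2D):

```python
# Sketch: 1D data that cannot be split by a single point becomes separable
# by a line after adding a y = Dosage^2 coordinate.
import numpy as np
from sklearn.svm import SVC

dosage = np.array([1, 2, 3, 9, 10, 11, 20, 21, 22], dtype=float)  # hypothetical doses
cured  = np.array([0, 0, 0, 1, 1,  1,  0,  0,  0])                # only middle doses cure

X_2d = np.column_stack([dosage, dosage ** 2])   # move the data into 2D
clf = SVC(kernel="linear").fit(X_2d, cured)     # a linear SVC in the new 2D space

new_dose = 8.0
print(clf.predict([[new_dose, new_dose ** 2]])) # classify a new observation
```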
Question: Which function should we use to move the data into a higher dimension? x^2? x^3?
Use Kernel Functions to systematically find an SVC in higher dimensions.
Commonly used Kernels
Polynomial kernel with degree d (d=1: point, d=2: line, d=3: plane, etc.)
Radial Basis Function (RBF)
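For reference, both kernels are available in scikit-learn's SVC (a hedged mapping: degree corresponds to d, coef0 plays the role of r, and gamma scales the RBF):

```python
from sklearn.svm import SVC

poly_svm = SVC(kernel="poly", degree=3, coef0=1.0)  # polynomial kernel, d=3
rbf_svm  = SVC(kernel="rbf", gamma=0.5)             # radial basis function kernel
```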
Example. Polynomial kernel with d=3
Under construction
Kernel functions only calculate the relationships between every pair of points as if they were in the higher dimension; they do not actually do the transformation. This trick is known as the Kernel Trick.
The Kernel Trick reduces the computation required for SVM by avoiding the math that transforms the data from low to high dimension.
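A small sketch of the trick: every pairwise relationship can be filled into a matrix using only the kernel function, without ever building the high-dimensional coordinates (the r and d values here are illustrative):

```python
# Sketch: compute all pairwise "high-dimensional" relationships directly
# from the kernel function; no transformation of the data is performed.
import numpy as np

def poly_kernel(a, b, r=0.5, d=2):
    """Polynomial kernel (a*b + r)**d for 1D observations a and b."""
    return (a * b + r) ** d

observations = np.array([9.0, 14.0, 2.5, 16.0])           # example 1D observations
gram = np.array([[poly_kernel(a, b) for b in observations]
                 for a in observations])                   # pairwise relationships
print(gram)
```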
The Polynomial kernel computes the relationship between pairs of observations (a, b). The parameters r and d are determined by cross-validation.
a, b are the two observations we want to calculate the high-dimensional relationship for
r determines the coefficient of the polynomial
d is the degree of the polynomial
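As a worked illustration (r = 1/2 and d = 2 are chosen only as an example), expanding the kernel shows that it equals a Dot Product of new coordinates:

$$(ab + \tfrac{1}{2})^2 = a^2b^2 + ab + \tfrac{1}{4} = (a,\ a^2,\ \tfrac{1}{2}) \cdot (b,\ b^2,\ \tfrac{1}{2})$$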
The Kernel Trick obtains the dot product in the new high-dimensional space H by using the Kernel function, without directly mapping the data into H.
The condition is that the computations in the space H must be expressible as dot products.
Using the Polynomial kernel, a 1D observation a or b is converted to 2D (ignoring the constant z-axis coordinate).
Applying the Polynomial kernel is also equivalent to taking the Dot Product between each pair of points.
For observations a=9 and b=14, the dot product gives the value of the high-dimensional relationship, without actually transforming the data into the higher dimension.
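Continuing with the same illustrative r = 1/2 and d = 2:

$$(9 \cdot 14 + \tfrac{1}{2})^2 = 126.5^2 = 16002.25$$

This single number is the high-dimensional relationship between a = 9 and b = 14.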
Another popular kernel is the Radial Basis Function (RBF) kernel.
The RBF kernel finds an SVC in infinite dimensions, so the process is hard to visualize.
It behaves like a Weighted Nearest Neighbor model: nearby observations have more influence than observations farther away.
The RBF kernel is e^(-gamma (a-b)^2): the squared difference between a and b, scaled by gamma.
Example: a=2.5, b=16.
The kernel gives a number very close to zero when the observations are far from each other.
We get the high-dimensional relationship by plugging the values into the Radial Kernel.
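For example, with gamma = 1 (an illustrative value; gamma is normally chosen by cross validation):

$$e^{-\gamma(a-b)^2} = e^{-(2.5-16)^2} = e^{-182.25} \approx 10^{-79} \approx 0$$

so these two far-apart observations have essentially no influence on each other.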
Let's start from the Polynomial kernel with r=0.
It shifts the data along the original axis (in the original dimension), WITHOUT moving it into a higher dimension.
For example, when r=0 and d=2, (ab+0)^2 = (ab)^2 = a^2 b^2, which is still a relationship in a single (squared) dimension.
We can use the Polynomial Kernel with r=0 to explain the Radial Kernel.
Let's add polynomial kernels with r=0, with d increasing by one for each term.
For example, adding two terms with d=1 and d=2 gives ab + a^2 b^2 = (a, a^2) · (b, b^2).
We can keep adding terms, with r=0 and d ranging from 1 to infinity.
Then, it gives a Dot Product with an infinite number of dimensions, as written out below.
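Written out, the sum of r = 0 polynomial kernels from d = 1 up to infinity is a Dot Product with infinitely many coordinates:

$$ab + (ab)^2 + (ab)^3 + \cdots = (a,\ a^2,\ a^3,\ \ldots) \cdot (b,\ b^2,\ b^3,\ \ldots)$$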
Let's go back to the Radial Kernel.
The term e^(ab) can be written as a Taylor Series, which is a sum of Polynomial Kernels with r=0 and d ranging from 0 to infinity.
The term e^(ab) can therefore be expressed as a Dot Product of vectors with infinitely many coordinates.
Thus, the Radial Kernel becomes a Dot Product as well:
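Sketching the algebra with gamma = 1/2 (the value that keeps the expansion simple):

$$e^{-\frac{1}{2}(a-b)^2} = e^{-\frac{1}{2}(a^2+b^2)}\, e^{ab}, \qquad e^{ab} = \sum_{k=0}^{\infty}\frac{(ab)^k}{k!} = 1 + ab + \frac{(ab)^2}{2!} + \cdots$$

$$e^{-\frac{1}{2}(a-b)^2} = \Big(s,\ sa,\ \tfrac{sa^2}{\sqrt{2!}},\ \ldots\Big) \cdot \Big(s',\ s'b,\ \tfrac{s'b^2}{\sqrt{2!}},\ \ldots\Big), \quad s = e^{-\frac{1}{2}a^2},\ s' = e^{-\frac{1}{2}b^2}$$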
The Radial Kernel is equal to a Dot Product whose coordinates span an infinite number of dimensions.
Thus, the value we get from the Radial Kernel is the relationship between the two observations in infinite dimensions.
An example:
How does the Radial Kernel determine how much influence each observation in the Training Dataset has on classifying a new observation?
If r=0, d=d,