SVM concept
For classification, the shortest distance between an observation and the threshold is called the margin.
Maximal Margin Classifier
It uses the threshold that gives the largest margin to make classifications.
If the threshold is halfway between the two edge observations (red and green), the margin is the largest.
If the threshold is moved to the left or right, the margin becomes smaller.
If there is an outlier and misclassification of the outlier is NOT allowed, then the Maximal Margin Classifier would NOT classify new observations properly.
The Maximal Margin Classifier is very sensitive to outliers in the training data.
Solution: Allow misclassification
If the threshold is placed halfway between the two clusters, it allows a misclassification of the outlier, but it classifies new observations more sensibly.
The new observation will be classified as obese, which makes sense because it is closer to most of the obese observations.
Soft Margin: the distance between the observations and the threshold when misclassification is allowed.
How do we choose a better Soft Margin? Use cross validation to determine how many misclassifications and observations to allow inside the Soft Margin.
Support Vector Classifier (SVC): using a Soft Margin to find the best threshold for classification.
Support Vectors: Observations on the edge and within the soft margin
The observations within the Soft Margin are misclassified in this example. Cross Validation determines whether allowing this misclassification gives better classification overall; a sketch of this tuning is shown below.
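A minimal sketch of this tuning, assuming scikit-learn and a made-up two-class dataset (the parameter C controls how soft the margin is; smaller C allows more misclassifications):

```python
# Sketch: pick the soft margin by cross validation (smaller C = softer margin).
from sklearn.datasets import make_blobs
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Toy two-class data standing in for the mass/obesity example
X, y = make_blobs(n_samples=60, centers=2, random_state=0)

search = GridSearchCV(
    SVC(kernel="linear"),
    param_grid={"C": [0.01, 0.1, 1, 10, 100]},  # candidate soft-margin strengths
    cv=5,                                       # 5-fold cross validation
)
search.fit(X, y)
print("Best C chosen by cross validation:", search.best_params_)
```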
For 3D data, the Support Vector Classifier is a plane; for 4D data or higher, the SVC is a hyperplane.
An SVC may not work for certain data without changing the dimension. Example:
A Support Vector Machine (SVM) can overcome this limitation, as shown in the following example.
Let's transform the data from 1D to 2D ( y = Dosage^2 ).
Then, a new observation can be classified by which side of the SVC (now a line in 2D) it falls on.
(1) With low-dimensional data, the classes may not divide nicely. (2) Move the data into a higher dimension using a Kernel function. (3) Find an SVC that separates the higher-dimensional data. A small sketch of these steps follows.
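A minimal sketch of steps (1)-(3), assuming made-up 1D Dosage values where only the middle doses are effective; adding the y = Dosage^2 coordinate makes the classes separable by a straight line (an SVC in 2D):

```python
# Sketch: 1D data that cannot be split by a single point becomes separable
# by a line after adding a y = Dosage^2 coordinate.
import numpy as np
from sklearn.svm import SVC

dosage = np.array([1, 2, 3, 9, 10, 11, 20, 21, 22], dtype=float)  # hypothetical doses
cured  = np.array([0, 0, 0, 1, 1,  1,  0,  0,  0])                # only middle doses cure

X_2d = np.column_stack([dosage, dosage ** 2])   # move the data into 2D
clf = SVC(kernel="linear").fit(X_2d, cured)     # a linear SVC in the new 2D space

new_dose = 8.0
print(clf.predict([[new_dose, new_dose ** 2]])) # classify a new observation
```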
Question: Which function should we use to move the data into a higher dimension? x^2? x^3?
Use Kernel Functions to systematically find an SVC in higher dimensions.
Commonly used Kernels
Polynomial kernel with degree d (d=1: point, d=2: line, d=3: plane, etc.)
Radial Basis Function (RBF)
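For reference, both kernels are available in scikit-learn's SVC (a hedged mapping: degree corresponds to d, coef0 plays the role of r, and gamma scales the RBF):

```python
from sklearn.svm import SVC

poly_svm = SVC(kernel="poly", degree=3, coef0=1.0)  # polynomial kernel, d=3
rbf_svm  = SVC(kernel="rbf", gamma=0.5)             # radial basis function kernel
```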
Example. Polynomial kernel with d=3
Under construction
Kernel functions only calculate the relationships between every pair of points as if they were in the higher dimension; they do not actually do the transformation. This trick is known as the Kernel Trick.
The Kernel Trick reduces the computation required for SVM by avoiding the math that transforms the data from low to high dimension.
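A small sketch of the trick: every pairwise relationship can be filled into a matrix using only the kernel function, without ever building the high-dimensional coordinates (the r and d values here are illustrative):

```python
# Sketch: compute all pairwise "high-dimensional" relationships directly
# from the kernel function; no transformation of the data is performed.
import numpy as np

def poly_kernel(a, b, r=0.5, d=2):
    """Polynomial kernel (a*b + r)**d for 1D observations a and b."""
    return (a * b + r) ** d

observations = np.array([9.0, 14.0, 2.5, 16.0])           # example 1D observations
gram = np.array([[poly_kernel(a, b) for b in observations]
                 for a in observations])                   # pairwise relationships
print(gram)
```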
The Polynomial kernel computes the relationship between pairs of observations (a, b). The parameters r and d are determined by cross-validation.
a, b are the two observations we want to calculate the high-dimensional relationship for
r determines the coefficient of the polynomial
d is the degree of the polynomial
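As a worked illustration (r = 1/2 and d = 2 are chosen only as an example), expanding the kernel shows that it equals a Dot Product of new coordinates:

$$(ab + \tfrac{1}{2})^2 = a^2b^2 + ab + \tfrac{1}{4} = (a,\ a^2,\ \tfrac{1}{2}) \cdot (b,\ b^2,\ \tfrac{1}{2})$$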
The Kernel Trick obtains the dot product in the new high-dimensional space H by using the Kernel function, without directly mapping the data into H.
The condition is that the computations in the space H must be expressible as dot products.
Using the Polynomial kernel, a 1D observation a or b is converted to 2D (ignoring the constant z-axis coordinate).
Applying the Polynomial kernel is also equivalent to taking the Dot Product between each pair of points.
For observations a=9 and b=14, the dot product gives the value of the high-dimensional relationship, without actually transforming the data into the higher dimension.
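Continuing with the same illustrative r = 1/2 and d = 2:

$$(9 \cdot 14 + \tfrac{1}{2})^2 = 126.5^2 = 16002.25$$

This single number is the high-dimensional relationship between a = 9 and b = 14.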
Another popular kernel is the Radial Basis Function (RBF) kernel.
The RBF kernel finds an SVC in infinite dimensions, so the process is hard to visualize.
It behaves like a Weighted Nearest Neighbor model: nearby observations have more influence than observations farther away.
The RBF kernel is e^(-gamma (a-b)^2): the squared difference between a and b, scaled by gamma.
Example: a=2.5, b=16.
The kernel gives a number very close to zero when the observations are far from each other.
We get the high-dimensional relationship by plugging the values into the Radial Kernel.
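For example, with gamma = 1 (an illustrative value; gamma is normally chosen by cross validation):

$$e^{-\gamma(a-b)^2} = e^{-(2.5-16)^2} = e^{-182.25} \approx 10^{-79} \approx 0$$

so these two far-apart observations have essentially no influence on each other.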
Let's start from the Polynomial kernel with r=0.
It shifts the data along the original axis (in the original dimension), WITHOUT moving it into a higher dimension.
For example, when r=0 and d=2, (ab+0)^2 = (ab)^2 = a^2 b^2, which is still a relationship in a single (squared) dimension.
We can use the Polynomial Kernel with r=0 to explain the Radial Kernel.
Let's add polynomial kernels with r=0, with d increasing by one for each term.
For example, adding two terms with d=1 and d=2 gives ab + a^2 b^2 = (a, a^2) · (b, b^2).
We can keep adding terms, with r=0 and d ranging from 1 to infinity.
Then, it gives a Dot Product with an infinite number of dimensions, as written out below.
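Written out, the sum of r = 0 polynomial kernels from d = 1 up to infinity is a Dot Product with infinitely many coordinates:

$$ab + (ab)^2 + (ab)^3 + \cdots = (a,\ a^2,\ a^3,\ \ldots) \cdot (b,\ b^2,\ b^3,\ \ldots)$$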
Let's go back to the Radial Kernel.
The term e^(ab) can be written as a Taylor Series, which is a sum of Polynomial Kernels with r=0 and d ranging from 0 to infinity.
The term e^(ab) can therefore be expressed as a Dot Product of vectors with infinitely many coordinates.
Thus, the Radial Kernel becomes a Dot Product as well:
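Sketching the algebra with gamma = 1/2 (the value that keeps the expansion simple):

$$e^{-\frac{1}{2}(a-b)^2} = e^{-\frac{1}{2}(a^2+b^2)}\, e^{ab}, \qquad e^{ab} = \sum_{k=0}^{\infty}\frac{(ab)^k}{k!} = 1 + ab + \frac{(ab)^2}{2!} + \cdots$$

$$e^{-\frac{1}{2}(a-b)^2} = \Big(s,\ sa,\ \tfrac{sa^2}{\sqrt{2!}},\ \ldots\Big) \cdot \Big(s',\ s'b,\ \tfrac{s'b^2}{\sqrt{2!}},\ \ldots\Big), \quad s = e^{-\frac{1}{2}a^2},\ s' = e^{-\frac{1}{2}b^2}$$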
The Radial Kernel is equal to a Dot Product whose coordinates span an infinite number of dimensions.
Thus, the value we get from the Radial Kernel is the relationship between the two observations in infinite dimensions.
An example:
How does the Radial Kernel determine how much influence each observation in the Training Dataset has on classifying a new observation?
If r=0, d=d,