• The margin between the classes equals 2 / ||w||_2.
• Minimizing ||w||_2 corresponds to maximizing the margin.
• Note that: w’x1* + b = 1 and w’x2* + b = −1
⇒ w’(x1* − x2*) = 2 ⇒ w’(x1* − x2*)/||w||_2 = 2/||w||_2
Here x1* and x2* are the nearest points lying on the margin hyperplanes of the two classes, and ||w||_2 is the L2 norm of the weight vector.
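The derivation above can be checked numerically. This is a minimal sketch with a hand-picked weight vector and two hand-picked points on the margin hyperplanes (not a real SVM fit):

```python
import numpy as np

# Hypothetical weight vector w and bias b of a linear SVM (hand-picked).
w = np.array([3.0, 4.0])   # ||w||_2 = 5
b = -2.0

# Two points lying exactly on the margin hyperplanes w'x + b = +1 / -1.
x1_star = np.array([1.0, 0.0])    # w'x1* + b = 3 - 2 = +1
x2_star = np.array([0.2, 0.1])    # w'x2* + b = 0.6 + 0.4 - 2 = -1

norm_w = np.linalg.norm(w)

# Subtracting the two hyperplane equations gives w'(x1* - x2*) = 2,
# so the margin (distance between the hyperplanes) is 2 / ||w||_2.
assert np.isclose(w @ (x1_star - x2_star), 2.0)
margin = (w @ (x1_star - x2_star)) / norm_w
print(margin, 2.0 / norm_w)   # both equal 0.4
```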
Here we see the final form of the classifier; keep in mind this holds for perfectly linearly separable data. But now comes the question of how to obtain this weight vector w.
Let’s talk about the above optimization problem. It is a MIN(MAX) problem: we minimize over the weights w and bias b while the Lagrange multipliers (alphas) are maximized. Concretely, we minimize the product w’w (w transposed times w) subject to y_k*[w’x_k + b] >= 1 for every training point. Before looking at the second equation, a word on constrained versus ordinary optimization: constrained optimization is ordinary optimization with an added condition, as shown in the above diagram. The Lagrangian is a way to fold the objective and its constraints into one single equation, which a computer can solve far more easily than optimizing first and applying the constraints afterwards.
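In symbols (reconstructing the standard equations the text refers to), the primal problem and its Lagrangian are:

```latex
\min_{w,\,b}\ \tfrac{1}{2}\, w^{\top} w
\quad \text{subject to} \quad y_k \left( w^{\top} x_k + b \right) \ge 1 \;\; \forall k
```

```latex
L(w, b, \alpha) = \tfrac{1}{2}\, w^{\top} w
- \sum_{k} \alpha_k \left[ y_k \left( w^{\top} x_k + b \right) - 1 \right],
\qquad \alpha_k \ge 0
```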
Now let’s look at the alpha parameters: these are the Lagrange multipliers. They also let us define support vectors. Support vectors are the data points whose alpha value is strictly greater than 0. They are the points closest to the decision margin, and they alone set the orientation of the separating hyperplane.
So, in our above equation, we have two unknowns: the weight vector and the alphas. To calculate the alphas, we use a technique called Sequential Minimal Optimization (SMO). This algorithm is based on coordinate ascent (similar in spirit to gradient descent, but it optimizes one coordinate at a time while holding the rest fixed). What SMO does is take pairs of alphas and optimize them jointly until it reaches the global optimum of the dual objective.
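Full SMO is beyond this sketch, but the coordinate-ascent idea behind it can be shown on a toy concave quadratic, which is the same shape as the SVM dual objective. Note that real SMO must update alphas in pairs so the constraint sum(alpha_k * y_k) = 0 stays satisfied; this simplified sketch ignores that constraint:

```python
import numpy as np

# Toy coordinate ascent on a concave quadratic f(a) = c'a - 0.5*a'Qa.
# We maximize along one coordinate at a time, holding the others fixed.
Q = np.array([[2.0, 0.5],
              [0.5, 1.0]])        # positive definite
c = np.array([1.0, 1.0])

a = np.zeros(2)
for _ in range(100):
    for i in range(len(a)):
        # Closed-form maximizer along coordinate i (others fixed):
        # set df/da_i = c_i - Q[i] @ a = 0 and solve for a_i.
        a[i] = (c[i] - Q[i] @ a + Q[i, i] * a[i]) / Q[i, i]

# Compare with the exact optimum a* = Q^{-1} c.
a_star = np.linalg.solve(Q, c)
print(a, a_star)   # the two agree after enough sweeps
```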
Once the alphas are calculated, we solve the Lagrangian by setting its partial derivatives with respect to the weights and bias to zero. The derivative with respect to w shows that the weight vector is a weighted sum over the data points, where each term is the product of a Lagrange multiplier, a class label, and a data point. The derivative with respect to b tells us that the linear combination of Lagrange multiplier*class_label must always sum to 0.
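Both stationarity conditions can be verified on a tiny hand-picked example (the alphas, labels, and points below are hypothetical, chosen so that the bias condition holds):

```python
import numpy as np

# Hypothetical support vectors, labels, and Lagrange multipliers.
X = np.array([[1.0, 2.0],
              [2.0, 0.5],
              [0.0, 1.0]])
y = np.array([1.0, 1.0, -1.0])
alpha = np.array([0.3, 0.2, 0.5])

# dL/db = 0  =>  sum of multiplier * class label is zero.
assert np.isclose(alpha @ y, 0.0)

# dL/dw = 0  =>  w = sum_k alpha_k * y_k * x_k, a weighted sum of the points.
w = (alpha * y) @ X
print(w)   # [0.7, 0.2]
```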
The Lagrangian equation shows this form is very costly for high-dimensional data, since every data point must be multiplied with the entire weight vector. Applying a kernel directly is worse still: the mapping would have to be applied to each data point X first, and those mapped points put into the Lagrangian. Even a 5th-order polynomial map generates every monomial of degree up to 5 for each data point (X⁵, X⁴X₂, X³X₂X₁, and so on), a feature count that grows combinatorially with the input dimension, leading to huge time complexity.
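The blow-up can be quantified: the number of monomials of total degree at most d over n input variables is choose(n + d, d). A quick sketch:

```python
from math import comb

# Number of monomial features of total degree <= d over n input variables.
# Materializing this feature map explicitly is what the kernel trick avoids.
def n_poly_features(n, d):
    return comb(n + d, d)

print(n_poly_features(2, 2))     # 6  -> 1, x1, x2, x1^2, x1*x2, x2^2
print(n_poly_features(100, 5))   # 96560646 features for just 100 inputs
```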
The above equation is a maximization problem, basically a (MAX) problem. It contains no term for the weights or even the margin, so how can it solve the same problem? This optimization looks completely different from the primal form. To make sure both optimizations solve the same problem, the KKT conditions must hold. Satisfying these conditions guarantees that MAX(MIN) = MIN(MAX); without them, we only have MAX(MIN) <= MIN(MAX). In our case, the KKT conditions are satisfied by the constraints we obtain after taking the partial derivatives of the Lagrangian.
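In symbols, the dual problem being described (reconstructed in standard notation) is:

```latex
\max_{\alpha}\ \sum_{i} \alpha_i
- \tfrac{1}{2} \sum_{i} \sum_{j} \alpha_i \alpha_j \, y_i y_j \, x_i^{\top} x_j
\quad \text{subject to} \quad \alpha_i \ge 0,\ \ \sum_{i} \alpha_i y_i = 0
```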
Let’s look at the dual Lagrangian: it is a sum over products of each data point with every other data point, weighted by their respective Lagrange multipliers and class labels. Here too we take the partial derivatives with respect to the weights and bias and get two conditions: first, the linear combination of Lagrange multiplier*class_label is 0; second, every Lagrange multiplier is either 0 or greater than 0. We can clearly see why this is unsuitable for large datasets, as solving the optimization requires on the order of N*(N−1) dot products between data points. After solving it, we get the same solution as in the primal form. In the dual form, however, we can easily apply a kernel. There is a reason it is called the kernel trick and not the kernel method: we never actually compute the high-order mappings. We compute the ordinary dot product between two points first and then apply the high-order function to that single scalar.
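The trick is easy to demonstrate for the homogeneous degree-2 polynomial kernel k(x, z) = (x’z)², which equals the dot product of explicit quadratic feature maps that we never need to build:

```python
import numpy as np

# Explicit degree-2 feature map for 2-D inputs:
# phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2), so that phi(x)'phi(z) = (x'z)^2.
def phi(x):
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

x = np.array([1.0, 2.0])
z = np.array([3.0, 0.5])

explicit = phi(x) @ phi(z)   # map first, then take the dot product
trick = (x @ z) ** 2         # dot product first, then one scalar power

print(explicit, trick)       # identical values
```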
OPTIMIZATION FOR NON-LINEAR CASE
For the non-separable case, we relax the constraints with slack variables, adding a penalty term for every margin violation; this is called soft-margin SVM. Everything else remains the same. Similarly, the dual problem gains one more constraint and is solved the same way, by taking partial derivatives with respect to the weights and bias.
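The soft-margin objective can be sketched directly with the hinge penalty. The weights, points, and penalty parameter C below are hand-picked for illustration, not from a real fit:

```python
import numpy as np

# Soft-margin objective: 0.5*||w||^2 + C * sum of slacks, where each
# slack xi_k = max(0, 1 - y_k*(w'x_k + b)) penalizes margin violations.
w = np.array([1.0, -1.0])
b = 0.0
C = 1.0

X = np.array([[2.0, 0.0],    # correctly classified, outside the margin
              [0.5, 0.0],    # inside the margin -> small penalty
              [-1.0, 0.0]])  # misclassified -> larger penalty
y = np.array([1.0, 1.0, 1.0])

slack = np.maximum(0.0, 1.0 - y * (X @ w + b))
objective = 0.5 * (w @ w) + C * slack.sum()
print(slack, objective)   # [0. 0.5 2.] and 3.5
```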
NOTE: We can’t choose just any kernel to push our data to a higher dimension; only kernels whose Gram matrix is positive semi-definite (every eigenvalue greater than or equal to 0) are acceptable, a requirement known as Mercer’s condition. Otherwise, the optimization will break.
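A quick empirical check of this condition: build a Gram matrix from a candidate kernel on random points and inspect its eigenvalues. This is a sketch, not a proof of validity:

```python
import numpy as np

# Mercer check sketch: a valid kernel must yield a positive semi-definite
# Gram matrix (all eigenvalues >= 0) on any finite set of points.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))

gram = (X @ X.T + 1.0) ** 2          # polynomial kernel (x'z + 1)^2
eigvals = np.linalg.eigvalsh(gram)   # eigenvalues in ascending order

print(eigvals.min() >= -1e-6)        # True: PSD up to numerical error
```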
If you have understood these optimizations properly, you can apply the same machinery to PCA optimization and function estimation as well.
Finally, I hope you now understand what SVM is all about. There is still more to explore, such as interpreting kernels and tuning hyperparameters. Understanding these gives you real advantages in algorithm selection and customization, and in optimizing the computational efficiency of SVM models.
Understanding the kernel trick and these optimizations is a great way to build a solid foundation for the optimization problems you will surely encounter in other algorithms as well. Ultimately, all machine learning is some form of optimization.
Thanks for your time and patience.