Chapter 8 Support Vector Machines

The quadratic programming differs from the linear programming only in that the objective function also include xj2 and xixj (i  j) terms. If we used matrix notation, the problem is to find x so as to Maximize f(x) = cx – (1/2)xTQx, subject to Ax  b and x  0 where c is a row vector, x and b are column vectors, Q and A are matrices. The qij (elements of Q) are given constants such that qij = qji. By performing vector and matrix multiplications, the objective function is expressed in terms of these qij, the cj (elements of c) and the variables as follows:

ppt45 trang | Chia sẻ: vutrong32 | Ngày: 17/10/2018 | Lượt xem: 174 | Lượt tải: 0download
Bạn đang xem trước 20 trang tài liệu Chapter 8 Support Vector Machines, để xem tài liệu hoàn chỉnh bạn click vào nút DOWNLOAD ở trên
Chapter 8 Support Vector MachinesAssoc. Prof. Dr. Duong Tuan AnhFaculty of Computer Science and Engineering, HCMC Univ. of Technology3/20151Outline1. Introduction2. Support vector machine for binary classification3. Multi-class classifiction4. Learning with soft margin5. SVM Tools6. Applications of SVMs 21.IntroductionA new method for the classification of both linear and nonlinear data.SVM is an algorithm that works as follows:It uses a nonlinear mapping to transform the original training data into a higher dimension.Within this new dimension, it searches for the linear optimal separating hyperplane.With an appropriate nonlinear mapping to a sufficiently high dimension, data from two classes can be separated by a hyperplane.SVM finds this hyperplane using support vectors (“essential” training examples) and margins (defined by the support vectors)3Introduction (cont.)The first paper on SVM: 1992 by V. Vapnik et al.Although the training time of SVMs can be very slow, they are highly accurate.SVMs are much less prone to overfitting.SVMs can be used for prediction and classification.SVMs can be applied to:Handwritten digit recognition, object recognition, speaker identification, time series prediction.42. Support vector machine for binary classification The case when the data are linearly separableExample: Two-class problem, linearly separateble.D = {(X1,y1),(X2, y2),,(Xm, ym)}Xi is a training tuple, yi is associated class label.yi  {+1, -1}Assume each tuple has two input attributes A1, A2.From the graph, the 2-D data are linearly separatable.There are an infinite number of separating lines that could be drawn to separate all of the tuples of class +1 from all of the tuples of class -1.5Linearly separable dataThere are an infinite number of separating hyperplanes or “decision boundaries”. Which one is best?Figure 8.1 The 2-D data are linearly separable6SVM - Maximum marginal hyperplaneGeneralizing to n dimensions, we want to find the best hyperplane.Hyperplane = “decision boundary”How can we find the best hyperplane?An SVM approaches this problem by searching for the maximum marginal hyperplane.The following figure shows two possible separating hyperplanes and their associated margins.Both hyperplanes can correctly classify all of the given data tuples.However, we expect the hyperplane with the larger margin to be more accurate at classifying future data tuples than the hyperplane with smaller margin. 7Two possible separating hyperplanes and their associated margins. Which one is better? The one with the larger margins should have greater generalization accuracy.Figure 8.2 Two possible hyperplanes and their associated margins8MarginThe SVM searches for the hyperplane with the largest margin (maximum marginal hyperplane - MMH).What is margin?The shortest distance from a hyperplane to one side of its margin is equal to the shortest distance from the hyperplane to the other side of its margin.When dealing with the MMH, this distance is the shortest distance from the MMH to the closest training tuple of either class.9Maximum marginA separating hyperplane can be writen as W.X + b = 0 (1) where W is a weight vector, W = {w1,w2,,wn}; n is the number of attributes; and b is a scalar (a bias).Let’s consider two input attributes, A1 and A2. Training tuples are 2-D. X = (x1,x2). We can rewrite the above separating hyperplane as: w0 + w1x1 + w2x2 = 0 (2)Any point that lines above the separating hyperplane satisfies w0 + w1x1 + w2x2 >0 Any point that lines below the separating hyperplane satisfies w0 + w1x1 + w2x2 0 and these tuples are the support vectors.All the other points have i* = 0, and the decision function is independent with these points.If we remove some of the non-support vectors, the current separating hyperplane would remain uneffected. 32Kernel functions and SVMEach of the three kernel functions results in a different nonlinear classifier in the (original) input space.There are no rules for determining which kernel will result in the most accurate SVM.In practice, the kernel chosen does not make a large difference in resulting accuracy.SVM training always finds a global solution; unlike neural networks, where many local minima usually exist.A major research goal in SVMs is to improve the speed in training and testing so that SVMs may becomes a more feasible option for large datasets.Other issues include determining the best kernel for a given dataset.333. SVM as Multiclass classifierLinear nonlinear SVMs for binary classification have been described.SVM classifiers can be combined for the multi-class case.Given m classes, trains m classifiers, one for each class (where classifier j learns to return a positive value for class j and a negative value for the rest).A test tuple is assigned the class corresponding to the largest positive distance. di(x) = wi.x + bi (SVMi)344. Learning with noise: soft marginsIn practical applications, datasets contain noise and the two classes are not completely separable, but a hyperplane that maximize the margin while minimizing a quantity proportional to the mis-classification error can still be determined.Figure 8.735This can be done by introducing non-negative slack variables i in constraint (7), which becomes: yi(w.  (xi) + b)  1 - i , i (19)If an error occurs, the corresponding i must exceed 1, so 1mi is an upper bound for the number of misclassification errors. Here the objective function to be minimized can changed into:where C is a parameter chosen by the user that controls the tradeoff between the margin and the misclassification errors. A larger C means that a higher penalty to misclassification errors is assigned.Minimizing (20) under constraint (19) gives the generalized separating hyperplanes. This still remains a quadratic programming problem. We can pose the dual problem (using Lagrange multipliers) and solve it.(20)36SVM with soft marginsFinally, the dual problem to be solved issubject to the constraintsand the classifier becomes37How to solve the Quadratic Programming ProblemTraditional QP algorithms are not suitable for large size problems.A number of algorithms have been suggested to solve the dual problems.Osuma et al. proved a theorem which proves that the large QP problem can be broken down into a series of smaller QP sub-problems.Pratt proposed a Sequential Minimal Optimization (SMO) to quickly solve SVM QP problem without any extra matrix storage and without using numerical QP optimization steps at all. Using Osuma’s theorem to ensure convergence, SMO decomposes the overall QP problem into QP sub-problems. SMO chooses to solve the smallest possible optimization problems at every step. At each step:(1) SMO chooses two Lagrange multipliers to jointly optimize. (2) Finds the optimal values for these multipliers, and(3) Updates the SVMs to reflect the new optimal values.386. SVM toolsIn WEKA, the data mining software, there are ANN and SVM tools for classification.LIBSVM – A library for support vector machines (C. C. Chang & C. J. Lin) vector machine - MATLAB397. Applications of SVMsFace detectionFace recognitionObject recognitionHandwritten character/digit recognitionSpeech recognitionImage retrievalPredictionFinancial time series predictionBankruptcy predictionAnomaly detection in time series data 40ReferenceJ. Han & M. Kamber, Data Mining: Concepts and Techniques, 2nd Edition, Morgan Kaufmann Publisher, CA, 2006.C. Campbell & Y. Ying, Learning with Support Vector Machines, Morgan & Claypool Publishers, 2011.F.E.H. Tay & L. Cao, Application of Support Vector Machines in Financial Time Series Forecasting, Omega, The International Journal of Management Science, Vol. 29, 2001, pp. 309-317.41Appendix: Method of Lagrange Multipliers for maxima and minimaA method for obtaining the relative maximum or minimum values of a function F(x,y,z) subject to a constraint condition (x,y,z)=0 consists of the formation of the auxiliary function G(x,y,z)  F(x,y,z) + (x,y,z) (A1) subject to the conditions:(A2)which are necessary conditions for a relative for a relative maximum or minimum. The parameter  which is independent of x, y, z is called a Lagrange multiplier.The condition (A2) are equivalent to G = 0And hence: F +  = 0 42The method can be generalized. If we wish to find the relative maximum or minimum values of a function F(x1,x2,,xn) subject to the constraint conditions 1(x1,x2,,xn) = 0, 2(x1,x2,,xn) = 0,, k(x1,x2,,xn) = 0, we form the auxiliary function: G(x1,x2,,xn)  F + 11+ 22 + + kk subject to the necessary conditionswhere 1, 2,, k which are independent of x1, x2,,xn are called Lagrange multipliers.43Quadratic ProgrammingThe quadratic programming differs from the linear programming only in that the objective function also include xj2 and xixj (i  j) terms.If we used matrix notation, the problem is to find x so as to Maximize f(x) = cx – (1/2)xTQx, subject to Ax  b and x  0 where c is a row vector, x and b are column vectors, Q and A are matrices. The qij (elements of Q) are given constants such that qij = qji.By performing vector and matrix multiplications, the objective function is expressed in terms of these qij, the cj (elements of c) and the variables as follows: f(x) = cx – (1/2)xTQx44Example of Quadratic ProgrammingConsider an example of a QP problem Maximize f(x1,x2) = 15x1 + 30x2 + 4x1x2 - 2x12 – 4x22, subject to x1 + 2x2  30 and x1  0, x2  0.In this caseNote that = 4x12 – 4x2x1 – 4x1x2 + 8x22 = q11x12 + q21x2x1 + q12x1x2 + q22x22Multiplying through by –(1/2) gives -(1/2) xTQx = -2x12 + 4x1x2 – 4x22 which is the nonlinear part of the objective function.45

Các file đính kèm theo tài liệu này:

  • pptchapter_8_6914.ppt