Chapter 9 Combination of classifiers


Assoc. Prof. Dr. Duong Tuan Anh
Faculty of Computer Science and Engineering, HCMC Univ. of Technology
3/2015

Outline
1 Introduction
2 Bagging
3 Boosting
4 ROC curves

Introduction
A combination, or ensemble, of classifiers is a set of classifiers that are combined to classify new examples. A combination of classifiers is often more accurate than the individual classifiers that make it up; it is a way of compensating for imperfect classifiers. Each classifier is also called an expert. Bagging and boosting are general techniques that can be applied to classification as well as prediction problems. The methods for constructing ensembles of classifiers include sub-sampling the training set.

Bagging and boosting each generate a set of classification models, M1, M2, ..., Mk. Voting strategies are used to combine the classifications for a given unknown sample (Figure 9.1).

Bagging
Given a set D of d tuples, bagging works as follows. For iteration i (i = 1, 2, ..., k), a training set Di of d tuples is sampled with replacement from the original set of tuples, D. (The term bagging stands for bootstrap aggregation; each training set Di is a bootstrap sample.) Because sampling with replacement is used, some of the original tuples of D may not be included in Di, whereas others may occur more than once. A classifier model, Mi, is learned for each training set, Di. To classify an unknown tuple, X, each classifier, Mi, returns its class prediction, which counts as one vote. The bagged classifier, M*, counts the votes and assigns the class with the most votes to X.

Algorithm: Bagging
1. for i = 1 to k do // create k models
2.   Create a bootstrap sample, Di, by sampling D with replacement;
3.   Use Di to derive the model Mi;
4. endfor
To use the composite model to classify a tuple X:
1. Let each of the k models classify X and return the majority vote;
(D: a set of training tuples; k: the number of models in the ensemble.)

Example
Consider the data set given in Figure 9.2, which consists of 10 data points belonging to 2 classes, X and O, and a test pattern P.

The original training set is divided into a number of disjoint subsets; different overlapping training sets are then constructed by dropping some of these subsets. For the dataset in the figure, let us take the disjoint subsets as S1 = {1, 2}, S2 = {4, 5}, S3 = {3}, S4 = {6, 7}, S5 = {8, 10}, S6 = {9}.

1. If subsets S1 and S4 are left out, the training set is {3, 4, 5, 8, 9, 10} and P is classified as belonging to Class O.
2. If subsets S1 and S5 are left out, the training set is {3, 4, 5, 6, 7, 9} and P is classified as belonging to Class X.
3. If subsets S1 and S6 are left out, the training set is {3, 4, 5, 6, 7, 8, 10} and P is classified as belonging to Class O.
4. If subsets S2 and S4 are left out, the training set is {1, 2, 3, 8, 9, 10} and P is classified as belonging to Class O.
5. If subsets S2 and S5 are left out, the training set is {1, 2, 3, 6, 7, 9} and P is classified as belonging to Class X.
6. If subsets S2 and S6 are left out, the training set is {1, 2, 3, 6, 7, 8, 10} and P is classified as belonging to Class O.
7. If subsets S3 and S4 are left out, the training set is {1, 2, 4, 5, 8, 9, 10} and P is classified as belonging to Class O.
8. If subsets S3 and S5 are left out, the training set is {1, 2, 4, 5, 6, 7, 9} and P is classified as belonging to Class X.
9. If subsets S3 and S6 are left out, the training set is {1, 2, 4, 5, 6, 7, 8, 10} and P is classified as belonging to Class O.

The final decision: P is classified as belonging to Class O (6 of the 9 votes).

Note: Instead of obtaining independent datasets from the domain, bagging just resamples from the original training set.
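This resample-train-vote procedure can be sketched in a few lines of Python. The sketch below is illustrative only: the 1-nearest-neighbor base learner, the function names, and the toy two-class dataset are all assumptions, not part of the chapter.

```python
import random
from collections import Counter

def one_nn(train, x):
    # Hypothetical base learner: 1-nearest-neighbor by squared Euclidean distance.
    nearest = min(train, key=lambda p: sum((a - b) ** 2 for a, b in zip(p[0], x)))
    return nearest[1]  # return the label of the closest training point

def bagging_predict(D, x, k=9, seed=42):
    """Train k models on bootstrap samples of D and classify x by majority vote."""
    rng = random.Random(seed)
    votes = []
    for _ in range(k):
        # Bootstrap sample Di: draw |D| tuples from D with replacement.
        Di = [rng.choice(D) for _ in D]
        votes.append(one_nn(Di, x))          # each model Mi casts one vote
    return Counter(votes).most_common(1)[0][0]  # class with the most votes

# Toy dataset in the spirit of Figure 9.2: two classes X and O in the plane.
D = [((1, 1), 'X'), ((1, 2), 'X'), ((2, 1), 'X'), ((2, 2), 'X'),
     ((5, 5), 'O'), ((5, 6), 'O'), ((6, 5), 'O'), ((6, 6), 'O')]
print(bagging_predict(D, (5.5, 5.5)))
```

Because sampling is done with replacement, each bootstrap sample Di omits some tuples and repeats others, which is exactly what makes the k models differ.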
The datasets generated by resampling are different from each other, but they are not independent, because they are all based on one dataset. Bagging does not help classifiers that are stable, i.e., ones whose output is insensitive to small changes in the input.

Difference between bagging and boosting
Like bagging, boosting combines models of the same type. However, whereas in bagging the individual models are built separately, in boosting each new model is influenced by the performance of those built previously. Boosting encourages new models to become experts for instances classified incorrectly by earlier ones, and it weights a model's contribution by its performance, rather than giving equal weight to all models.

Boosting
In boosting, weights are assigned to each training tuple. A series of k classifiers is iteratively learned. After a classifier Mi is learned, the weights are updated to allow the subsequent classifier, Mi+1, to "pay more attention" to the training tuples that were misclassified by Mi. The final boosted classifier, M*, combines the votes of the individual classifiers, where the weight of each classifier's vote is a function of its accuracy.
To compute the error rate of model Mi, we sum the weights of the tuples in Di that Mi misclassified. That is,

    error(Mi) = (1/d) * Σj wj * err(Xj)

where err(Xj) is the misclassification error of tuple Xj: err(Xj) is 1 if the tuple was misclassified, and 0 otherwise (the weights wj are kept normalized so that they sum to d).

Algorithm: Adaboost
1. Initialize the weight of each tuple in D to 1;
2. for i = 1 to k do // for each round
3.
  Sample D with replacement according to the tuple weights to obtain Di;
4.   Use training set Di to derive a model Mi;
5.   Compute error(Mi), the error rate of Mi;
6.   if error(Mi) > 0.5 then
7.     Reinitialize the weights to 1;
8.     Go to step 3 and try again;
9.   endif
10.  for each tuple in Di that was correctly classified do
11.    Multiply the weight of the tuple by βi = error(Mi)/(1 - error(Mi));
12.  Normalize the weight of each tuple;
13. endfor
(D: a set of training tuples; k: the number of models in the ensemble.)

To use the composite model to classify a tuple X:
1. Initialize the weight of each class to 0;
2. for i = 1 to k do // for each classifier
3.   wi = log(1/βi); // weight for the classifier's vote
4.   ci = Mi(X); // get class prediction for X from Mi
5.   add wi to the weight for class ci;
6. endfor
7. Return the class with the largest weight;

Example: Reuse the dataset given in Figure 9.2, which consists of 10 data points belonging to 2 classes, X and O. Let a weight of 1 be assigned to all the samples, i.e., weight(i) = 1, i = 1, ..., 10.

Consider three classifiers, based on Hypothesis 1, Hypothesis 2 and Hypothesis 3, respectively.

Classifier 1
This classifier is based on Hypothesis 1: if x1 ≤ 3, the pattern belongs to Class X, and to Class O otherwise. This classifier misclassifies pattern 3, which means:
error(M1) = 1/10 = 0.1
β1 = 0.1/(1 - 0.1) = 0.1/0.9 = 0.1111
weight(1) = 1 × 0.1111 = 0.1111
Similarly, the weights of the other patterns, except pattern 3, become 0.1111; only the weight of pattern 3 remains 1.
Normalizing:
weight(1) = 0.1111 × 10/(9 × 0.1111 + 1) = 0.1111 × 10/1.9999 = 0.5555
weight(2) = weight(4) = weight(5) = weight(6) = weight(7) = weight(8) = weight(9) = weight(10) = 0.5555
weight(3) = 1 × 10/1.9999 = 5.0002

Classifier 2
This classifier is based on Hypothesis 2: if x1 ≤ 5, the pattern belongs to Class X, and to Class O otherwise.
This classifier misclassifies patterns 6, 8 and 10, which means:
error(M2) = (1/10)(0.5555 + 0.5555 + 0.5555) = 0.16665
β2 = 0.16665/(1 - 0.16665) = 0.2000
weight(1) = 0.5555 × 0.2 = 0.1111; weight(2) = 0.1111; weight(3) = 5.0002 × 0.2 = 1.00004; weight(4) = 0.1111; weight(5) = 0.1111; weight(6) = 0.5555 × 1 = 0.5555; weight(7) = 0.1111; weight(8) = 0.5555; weight(9) = 0.1111; weight(10) = 0.5555.
Normalizing:
weight(1) = 0.1111 × 10/(6 × 0.1111 + 1.00004 + 3 × 0.5555) = 0.1111 × 10/3.33314 = 0.333319
weight(2) = weight(4) = weight(5) = weight(7) = weight(9) = 0.333319
weight(3) = 1.00004 × 10/3.33314 = 3.0003
weight(6) = weight(8) = weight(10) = 0.5555 × 10/3.33314 = 1.6666

Classifier 3
This classifier is based on Hypothesis 3: if x1 + x2 ≤ 3.5, the pattern belongs to Class X, and to Class O otherwise. This classifier misclassifies patterns 3 and 5, which means:
error(M3) = (1/10)(3.0003 + 0.333319) = 0.3334
β3 = 0.3334/(1 - 0.3334) = 0.5002
weight(1) = 0.333319 × 0.5002 = 0.16673; weight(2) = 0.16673; weight(3) = 3.0003; weight(4) = 0.16673; weight(5) = 0.333319; weight(6) = 1.6666 × 0.5002 = 0.8336; weight(7) = 0.16673; weight(8) = 0.8336; weight(9) = 0.16673; weight(10) = 0.8336.
Normalizing:
weight(1) = 0.16673 × 10/(5 × 0.16673 + 3.0003 + 0.333319 + 3 × 0.8336) = 0.16673 × 10/6.668069 = 0.2502
weight(2) = weight(4) = weight(7) = weight(9) = 0.2502
weight(3) = 3.0003 × 10/6.668069 = 4.4995
weight(5) = 0.333319 × 10/6.668069 = 0.4999
weight(6) = weight(8) = weight(10) = 0.8336 × 10/6.668069 = 1.2501

Suppose we have a test pattern P = (4, 2). According to classifier 1, it belongs to Class O; according to classifier 2, it belongs to Class X; and according to classifier 3, it belongs to Class O.
For Class X (using base-10 logarithms):
w2 = log(1/β2) = log(1/0.2) = 0.699
For Class O:
log(1/i)= log(1/1)+log(1/3) = 1.2551P will be classified as belonging to Class O.Note: While both bagging and boosting can significantly improve accuracy in comparison to a single model, boosting tends to achieve greater accuracy.174. ROC curvesROC curves are a useful visual tool for comparing two classification models.ROC stands for Receiver Operating Characteristic.ROC curve is a plot of the true positive fraction (TPF) against the false positive fraction, FPF.TPF: proportion of positive tuples that are correctly identified. FPF: proportion of negative tuples that are incorrectly identified as possitive.The vertical axis of an ROC curve represents the true positive fraction. The horizontal axis of an ROC curve represents the false positive fraction.To plot an ROC curve for a given classification model, M, the model must be able to return a probability for the predicted class of each test tuple. That is, we need to rank the test tuples in decreasing order, where the one the classifier thinks is most likely to belong to the positive class appears at the top of the list.18How to plot ROC curveAn ROC curve is plotted as follows.Starting at the bottom-left hand corner (where the TPF and FPP are both 0), we check the actual class label of the tuple at the top of the list. If we have a true positive then on the ROC curve, we move up and plot a point. If, the tuple really belongs to the ‘no’ class, we have a false positive, on the ROC curve, we move right and plot a point.The process is repeated for each of the test tuples, each time moving up on the curve for a true positive or toward the right for a false positive.19The figure 9.3 shows the ROC curves of two classification models. The closer the ROC curve of a model is to the diagonal, the less accurate the model.To access the accuracy of a model, we can measure the area under the curve. The closer the area is to 0.5, the less accurate the model is. 
A model with perfect accuracy will have an area of 1.

Figure 9.3 The ROC curves of two classification models.

Several software packages support plotting ROC curves and computing the area under the curve (e.g., MATLAB).

Tools for bagging and boosting
Bagging and AdaBoost are available in the WEKA software.

References
M. N. Murty and V. S. Devi, Pattern Recognition – An Algorithmic Approach, Springer, 2011.
J. Han and M. Kamber, Data Mining: Concepts and Techniques, 2nd Edition, Morgan Kaufmann Publishers, 2006.
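As a closing illustration, the composite-classification rule from the boosting section (each classifier Mi votes for its prediction with weight wi = log(1/βi)) can be sketched as follows. The β values and votes are the ones computed in the worked example for P = (4, 2); the function name is an assumption for illustration.

```python
import math

def adaboost_vote(predictions, betas):
    """Combine classifier votes: each classifier votes for its predicted
    class with weight wi = log(1/beta_i); return the heaviest class."""
    class_weight = {}                       # step 1: weight of each class is 0
    for ci, beta in zip(predictions, betas):
        wi = math.log10(1.0 / beta)         # step 3 (base-10 log, as in the example)
        class_weight[ci] = class_weight.get(ci, 0.0) + wi   # step 5
    return max(class_weight, key=class_weight.get)          # step 7

# Values from the worked example for test pattern P = (4, 2):
betas = [0.1111, 0.2, 0.5002]        # beta_1, beta_2, beta_3
preds = ['O', 'X', 'O']              # votes of classifiers 1, 2, 3 on P
print(adaboost_vote(preds, betas))   # prints: O
```

Class O accumulates log(1/0.1111) + log(1/0.5002) = 1.2551 against 0.699 for Class X, matching the hand computation.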
