Unsupervised Learning – Clustering


Chapter 6: Unsupervised Learning – Clustering
Assoc. Prof. Dr. Duong Tuan Anh
Faculty of Computer Science and Engineering, HCMC Univ. of Technology
3/2015

Outline
1. Introduction to unsupervised learning and clustering
2. Partitional clustering (the k-Means algorithm)
3. Hierarchical clustering
4. The Expectation Maximization (EM) algorithm
5. Incremental clustering
6. Clustering evaluation criteria

1. Introduction to clustering
Clustering is the process of grouping a set of patterns. It generates a partition consisting of groups, or clusters, from a given collection of patterns. Representations or descriptions of the clusters formed are used in decision making – classification, prediction, outlier detection. A clustering-based classification scheme is very useful in solving large-scale pattern classification problems in data mining.
Patterns to be clustered are either labeled or unlabeled. We therefore have:
- Clustering algorithms that group sets of unlabeled patterns. These approaches are so popular that clustering is commonly viewed as unsupervised learning from unlabeled patterns.
- Algorithms that cluster labeled patterns. These approaches are practically important and are called supervised clustering. Supervised clustering is helpful in identifying clusters within collections of labeled patterns, and in building abstractions in the form of cluster representatives/descriptions that are useful for efficient classification (e.g., in data reduction for classification).

Clustering
The process of clustering is carried out so that patterns in the same cluster are similar in some sense and patterns in different clusters are dissimilar in a corresponding sense. Figure 6.1 illustrates this with the Euclidean distance between points characterizing similarity: intra-cluster distances are small and inter-cluster distances are large.

Centroid and medoid
Clustering is useful for generating data abstraction. A cluster of points is represented by its centroid or its medoid.
The centroid is the sample mean of the points in cluster C; it is given by (1/N_C) \sum_{X_i \in C} X_i, where N_C is the number of points in cluster C.
The medoid is the most centrally located point in the cluster, i.e. the point in the cluster from which the sum of the distances to the other points in the cluster is minimum.
There may also be a point that is far off from all the points in the cluster; such a point is an outlier. Clustering algorithms that use medoids are more robust in the presence of noisy patterns or outliers.

2. Partitional clustering
Partitional clustering algorithms generate a partition of the data. The most popular algorithm of this kind is the k-means algorithm. A simple description of the k-means algorithm is given below.
Step 1: Select k out of the given n patterns as the initial cluster centers. Assign each of the remaining n - k patterns to one of the k clusters; a pattern is assigned to its closest center/cluster.
Step 2: Compute the cluster centers based on the current assignment of patterns.
Step 3: Assign each of the n patterns to its closest center/cluster.
Step 4: If there is no change in the assignment of patterns to clusters during two successive iterations, then stop; else go to Step 2.
Selecting the initial cluster centers is a very important issue.
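Before the worked example, here is a minimal NumPy sketch of Steps 1 to 4 (an illustration rather than the book's code; the function name, the init_idx parameter and the empty-cluster guard are choices of this sketch):

```python
import numpy as np

def k_means(X, k, init_idx=None, max_iter=100, seed=0):
    """Minimal k-means sketch. X: (n, d) array of patterns, k: number of clusters."""
    rng = np.random.default_rng(seed)
    # Step 1: select k of the n patterns as the initial cluster centers.
    if init_idx is None:
        init_idx = rng.choice(len(X), size=k, replace=False)
    centers = X[np.asarray(init_idx)].astype(float)
    labels = np.full(len(X), -1)
    for _ in range(max_iter):
        # Steps 1/3: assign every pattern to its closest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Step 4: stop when no assignment changes between two iterations.
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 2: recompute each center as the mean of the patterns assigned to it.
        centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                            else centers[j] for j in range(k)])
    return centers, labels

# The seven points of the example that follows; seeding with A, D and F
# (indices 0, 3, 5) reproduces the partition {A, B, C}, {D, E}, {F, G}.
X = np.array([[1., 1], [1, 2], [2, 2], [6, 2], [7, 2], [6, 6], [7, 6]])
centers, labels = k_means(X, k=3, init_idx=[0, 3, 5])
```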
Example of k-Means
We illustrate k-means with the two-dimensional data set of 7 points shown in Figures 6.2 and 6.3. The points are to be grouped into 3 clusters (k = 3) and are located at A = (1, 1), B = (1, 2), C = (2, 2), D = (6, 2), E = (7, 2), F = (6, 6), G = (7, 6). Suppose A, D and F are selected as the initial centers, so Cluster 1 has (1, 1) as its center, Cluster 2 has (6, 2) and Cluster 3 has (6, 6). Then B, C go to Cluster 1; E to Cluster 2; and G to Cluster 3. The new center of Cluster 1 is the mean of the patterns in Cluster 1, which is (1.33, 1.67); the new center of Cluster 2 is (6.5, 2) and the new center of Cluster 3 is (6.5, 6). Reassigning, A, B, C remain in Cluster 1, D, E in Cluster 2 and F, G in Cluster 3. Since there is no change in the clusters formed, this is the final set of clusters.
Figure 6.2: Optimal partition when A, D and F are the initial centers.
This gives a good partition into three clusters: {A, B, C}, {D, E} and {F, G}.

If we start with A, B and C as the initial centers instead, we end up with the clusters shown in Figure 6.3. That partition has small variances in two clusters and a large variance in the third.
Figure 6.3: A non-optimal partition when A, B and C are the initial centers.

An important property of the k-means algorithm is that it minimizes the sum of squared deviations of the patterns in a cluster from the cluster center. If C_i is the i-th cluster and \mu_i is its center, the criterion function minimized by the algorithm is

J = \sum_{i=1}^{k} \sum_{X \in C_i} \| X - \mu_i \|^2.

The time complexity of the algorithm is O(nkdl), where l is the number of iterations and d is the dimensionality; the space requirement is O(kd). These features make the algorithm very attractive, and k-means is one of the most frequently used clustering algorithms in a variety of applications.

3. Hierarchical algorithms
Hierarchical algorithms produce a nested sequence of data partitions. The sequence can be depicted using a tree structure known as a dendrogram. The algorithms are either divisive or agglomerative.
A divisive algorithm starts with a single cluster containing all the patterns; at each successive step a cluster is split. This process continues until each pattern is in its own cluster, i.e. we end up with a collection of singleton clusters. Divisive algorithms use a top-down strategy for generating partitions of the data.
Agglomerative algorithms, on the other hand, use a bottom-up strategy. They start with n singleton clusters when the input data set is of size n, where each input pattern is in a different cluster. At successive levels, the most similar pair of clusters is merged, reducing the size of the partition by one.
An important property of agglomerative algorithms is that once two patterns are placed in the same cluster at some level, they remain in the same cluster at all subsequent levels. Similarly, in divisive algorithms, once two patterns are placed in different clusters at some level, they remain in different clusters at all subsequent levels.

Agglomerative clustering
Typically, an agglomerative clustering algorithm goes through the following steps:
Step 1: Compute the similarity/dissimilarity matrix between all pairs of patterns. Initialize each cluster with a distinct pattern.
Step 2: Find the closest pair of clusters and merge them. Update the proximity matrix to reflect the merge.
Step 3: If all the patterns are in one cluster, stop; else go to Step 2.
Step 1 requires O(n^2) time to compute the pairwise similarities and O(n^2) space to store the values, where n is the number of patterns in the data set.
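A from-scratch sketch of this agglomerative loop may help (illustrative only; the linkage argument, passed as Python's min or max, anticipates the single-link and complete-link variants described next):

```python
import numpy as np

def agglomerative(X, num_clusters=1, linkage=min, metric=None):
    """Merge the closest pair of clusters until num_clusters remain.

    linkage: min -> single-link, max -> complete-link, applied to the
    pairwise pattern distances between two clusters.
    """
    X = np.asarray(X, dtype=float)
    if metric is None:                          # default: Manhattan distance
        metric = lambda a, b: float(np.abs(a - b).sum())
    # Step 1: pairwise dissimilarity matrix, one singleton cluster per pattern.
    n = len(X)
    d = [[metric(X[i], X[j]) for j in range(n)] for i in range(n)]
    clusters = [[i] for i in range(n)]
    # Steps 2-3: repeatedly find and merge the closest pair of clusters.
    while len(clusters) > num_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                dist = linkage(d[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or dist < best[0]:
                    best = (dist, a, b)
        _, a, b = best
        clusters[a] += clusters.pop(b)
    return clusters
```

With linkage=min this behaves as the single-link method, and with linkage=max as the complete-link method; an efficient implementation would update the proximity matrix after each merge instead of rescanning all pairs.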
There are several ways of implementing Step 2. In the single-link algorithm, the distance between two clusters C1 and C2 is the minimum of the distances d(X, Y), where X is in C1 and Y is in C2. In the complete-link algorithm, the distance between two clusters C1 and C2 is the maximum of the distances d(X, Y), where X is in C1 and Y is in C2.

Example of HAC
Consider the data set of 8 points shown in Figure 6.4:
A = (0.5, 0.5); B = (2, 1.5); C = (2, 0.5); D = (5, 1); E = (5.75, 1); F = (5, 3); G = (5.5, 3); H = (2, 3).
Figure 6.4: The data set.
There are 8 clusters to start with, and the Manhattan distance is used in this example. The initial distance matrix is:

      A      B      C      D      E      F      G      H
A     0      2.5    1.5    5      5.75   7      7.5    4
B     2.5    0      1.0    3.5    4.25   4.5    5      1.5
C     1.5    1.0    0      3.5    4.25   5.5    6      2.5
D     5      3.5    3.5    0      0.75   2      2.5    5
E     5.75   4.25   4.25   0.75   0      2.75   2.25   5.75
F     7      4.5    5.5    2      2.75   0      0.5    3
G     7.5    5      6      2.5    2.25   0.5    0      3.5
H     4      1.5    2.5    5      5.75   3      3.5    0

Since the clusters {F} and {G} are the closest to each other, with a distance of 0.5, they are merged first. The updated (single-link) matrix after merging {F} and {G} into one cluster is:

      A      B      C      D      E      F,G    H
A     0      2.5    1.5    5      5.75   7      4
B     2.5    0      1.0    3.5    4.25   4.5    1.5
C     1.5    1.0    0      3.5    4.25   5.5    2.5
D     5      3.5    3.5    0      0.75   2      5
E     5.75   4.25   4.25   0.75   0      2.25   5.75
F,G   7      4.5    5.5    2      2.25   0      3
H     4      1.5    2.5    5      5.75   3      0

Continuing in the same way, {D} merges with {E}; {B} merges with {C}; {B, C} merges with {A}; {A, B, C} merges with {H}; and {D, E} merges with {F, G}. At this stage there are 2 clusters and the process can be stopped. The dendrogram in Figure 6.5 shows the merging of clusters at the various levels.
Figure 6.5: Dendrogram for the single-link algorithm.

Applying the complete-link algorithm to the data set of Figure 6.4 gives the dendrogram shown in Figure 6.6.
Figure 6.6: Dendrogram for the complete-link algorithm.

Complete-link and single-link
In general, the complete-link algorithm generates compact clusters, as it relates every pattern in a cluster to every other pattern in the cluster. The single-link algorithm characterizes the presence of a pattern in a cluster based on its nearest neighbour in the cluster; it is highly versatile and can generate clusters of different shapes.
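The worked example can be reproduced with SciPy (a sketch assuming SciPy is available; it uses the city-block metric of the example):

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

# The eight points of Figure 6.4, in the order A, B, C, D, E, F, G, H.
X = np.array([[0.5, 0.5], [2, 1.5], [2, 0.5], [5, 1],
              [5.75, 1], [5, 3], [5.5, 3], [2, 3]])

d = pdist(X, metric='cityblock')         # condensed Manhattan distance matrix
single = linkage(d, method='single')     # merge tree of Figure 6.5 (single-link)
complete = linkage(d, method='complete') # merge tree of Figure 6.6 (complete-link)

# Cutting the single-link tree into two clusters gives {A, B, C, H} and {D, E, F, G};
# scipy.cluster.hierarchy.dendrogram(single) would plot the corresponding dendrogram.
print(fcluster(single, t=2, criterion='maxclust'))
```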
Divisive clustering
Divisive algorithms are either polythetic (the division is based on more than one feature) or monothetic (only one feature is considered at a time).
A scheme for polythetic clustering requires finding all possible 2-partitions of the data and choosing the best one: the partition with the least sum of the sample variances of the two clusters. From the resulting partition, the cluster with the maximum sample variance is selected and is split into an optimal 2-partition. This process is repeated until we get singleton clusters.
The sum of the sample variances is calculated as follows. If the patterns are split into two partitions with m patterns X_1, ..., X_m in one cluster and n patterns Y_1, ..., Y_n in the other, with centroids C_1 and C_2 respectively, then the sum of the sample variances is

\sum_{i=1}^{m} \| X_i - C_1 \|^2 + \sum_{j=1}^{n} \| Y_j - C_2 \|^2.

Example of polythetic clustering
Consider again the data set of Figure 6.4. At the top of the hierarchy there is a single cluster consisting of all 8 patterns. By considering all possible 2-partitions (2^7 - 1 = 127), the best 2-partition, {{A, B, C, H}, {D, E, F, G}}, is obtained. At the next level, {D, E, F, G} is split into {D, E} and {F, G}, giving 3 clusters. At the next level, {A, B, C, H} is split into {A, B, C} and {H}, and at the subsequent level {A, B, C} is split into {A} and {B, C}, giving 5 clusters. Similarly, the dendrogram depicts partitions having 6, 7 and 8 clusters of the data at successive levels; at the final level each cluster has only one point, i.e. the clusters are singletons. The dendrogram is given in Figure 6.7.
Figure 6.7: Dendrogram for a polythetic clustering of the eight 2-dimensional points.

Monothetic clustering
It is possible to use one feature at a time to partition the given data set (monothetic clustering). In this case a feature direction is considered and the data is partitioned into clusters based on the gap in the projected values along that feature direction. That is, the data set is split into two parts at the point corresponding to the mid-point of the maximum gap found among the values of the chosen feature. Each of the resulting clusters is then partitioned further using the remaining features, one after another.
Example: Consider the 8 two-dimensional patterns of Figure 6.4, described by the two features x1 and x2. Taking feature x1, the data is split into two clusters based on the maximum inter-pattern gap between successive patterns in the x1 direction. The x1 values of the patterns in increasing order are: A: 0.5, B: 2, H: 2, C: 2, D: 5, F: 5, G: 5.5, E: 5.75.
The maximum inter-pattern gap (5 - 2 = 3) occurs between C and D. We select the mid-point between 2 and 5, which is 3.5, and use it to split the data into the two clusters C1 = {A, B, C, H} and C2 = {D, E, F, G}. Each of these clusters is split further into two clusters based on the values of x2. Ordering the patterns in C1 by their x2 values gives A: 0.5, C: 0.5, B: 1.5, H: 3.0; the maximum gap of 1.5 occurs between B and H, so splitting C1 at the mid-point 2.25 along x2 gives the two clusters C11 = {H} and C12 = {A, B, C}. Similarly, splitting C2 at the value 2 of x2 gives C21 = {F, G} and C22 = {D, E}.
Figure 6.8: Monothetic divisive clustering.
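A small sketch of a single monothetic split (the function name and return convention are illustrative, not from the slides): it sorts the values of the chosen feature, finds the largest gap between successive values, and splits the data at the midpoint of that gap.

```python
import numpy as np

def monothetic_split(X, feature):
    """Split patterns X (n, d) on one feature at the largest inter-pattern gap."""
    values = np.sort(np.unique(X[:, feature]))
    gaps = np.diff(values)
    g = int(np.argmax(gaps))                      # index of the largest gap
    threshold = (values[g] + values[g + 1]) / 2   # midpoint of that gap
    left = X[X[:, feature] <= threshold]
    right = X[X[:, feature] > threshold]
    return threshold, left, right

# The eight points of Figure 6.4; splitting on x1 gives the threshold 3.5,
# i.e. {A, B, C, H} versus {D, E, F, G}.
X = np.array([[0.5, 0.5], [2, 1.5], [2, 0.5], [5, 1],
              [5.75, 1], [5, 3], [5.5, 3], [2, 3]])
print(monothetic_split(X, feature=0)[0])
```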
Complexity of hierarchical clustering
Hierarchical clustering algorithms are computationally expensive. Agglomerative algorithms require the computation and storage of a similarity or dissimilarity matrix, which has O(n^2) time and space requirements. They can be used in applications where hundreds of patterns are to be clustered; when the data sets are larger, however, these algorithms are not feasible because of the quadratic time and space demands.
Divisive algorithms require time that is exponential in the number of patterns or in the number of features, so they too do not scale up to large-scale problems involving millions of patterns.

4. Expectation Maximization algorithm
Gaussian Mixture Models. Here we look at a special case: suppose that the different classes each come from their own Gaussian distribution. This is known as multi-modal data, since there is one distribution (mode) for each different class.
If we know how many classes there are in the data, then we can try to estimate the parameters for that many Gaussians all at once. If we do not know, we can try different numbers and see which one works best. It is possible to use any other probability distribution instead of a Gaussian, but Gaussians are by far the most common choice.
For any particular datapoint x, the output of the model is the sum of the values predicted by all M Gaussians:

f(x) = \sum_{m=1}^{M} \alpha_m \, \phi(x; \mu_m, \Sigma_m),

where \phi(x; \mu_m, \Sigma_m) is a Gaussian function with mean \mu_m and covariance matrix \Sigma_m, and the \alpha_m are membership weights with the constraint \sum_{m=1}^{M} \alpha_m = 1.

Example
Figure 6.9 shows an example where the data come from two different Gaussians and the model is computed as the sum, or mixture, of the two Gaussians together. The probability that input x_i belongs to class m can be written as

p(m \mid x_i) = \frac{\alpha_m \, \phi(x_i; \mu_m, \Sigma_m)}{\sum_{k=1}^{M} \alpha_k \, \phi(x_i; \mu_k, \Sigma_k)}.

Maximum likelihood method
The problem is how to choose the weights \alpha_m. The common approach is to aim for the maximum likelihood solution (the likelihood is the conditional probability of the data given the model, and the maximum likelihood solution varies the model to maximize this conditional probability). In fact, it is common to work with the log likelihood (equivalently, to minimize the negative log likelihood): the log likelihood is guaranteed to be negative, since probabilities are all less than 1, and the logarithm spreads out the values, making the optimization more effective. The algorithm that is used is an example of a very general one known as the expectation maximization (EM) algorithm. Here we will see how to use EM for fitting Gaussian mixtures.

The Expectation Maximization algorithm
In order to see how it works, we consider the simplest interesting case of the Gaussian mixture model: a combination of just two Gaussians. The assumption is that the data were created by randomly choosing one of two possible Gaussians and then drawing a sample from that Gaussian. If the probability of picking Gaussian one is p, then the entire model looks like this (where N(\mu, \sigma^2) specifies a Gaussian distribution with mean \mu and standard deviation \sigma):

G_1 = N(\mu_1, \sigma_1^2), \quad G_2 = N(\mu_2, \sigma_2^2), \quad y = p\,G_1 + (1 - p)\,G_2.

If the mixing probability p is written as \pi, then the probability density is

P(y) = \pi \, \phi(y; \mu_1, \sigma_1) + (1 - \pi) \, \phi(y; \mu_2, \sigma_2).

Finding the maximum likelihood solution (i.e. the minimum of the negative log likelihood) is then a case of computing the sum of the logarithm of this density over all of the training data and differentiating it, which would be difficult. There is a way around it. The key insight is that if we knew which of the two Gaussian components each data point came from, the computation would be easy. Although we do not know which component each datapoint came from, we pretend we do by introducing a new indicator variable f: if f = 0 the data point came from Gaussian one, and if f = 1 it came from Gaussian two. After adding this latent variable, we just need to work out how to optimize over it; that is why the algorithm is called expectation maximization.

The two steps of the EM algorithm
We can compute the expectation of the variable f from the data:

\gamma_i = E(f_i \mid \mu'_1, \mu'_2, \sigma'_1, \sigma'_2, \pi', D) = P(f_i = 1 \mid \mu'_1, \mu'_2, \sigma'_1, \sigma'_2, \pi', D),

where D denotes the data and a prime denotes the current estimate of a parameter. Computing the value of this expectation is known as the E-step. This estimate of the expectation is then maximized over the model parameters (the parameters of the two Gaussians and the mixing parameter \pi): the M-step. These two steps are simply iterated until the algorithm converges. Note that the likelihood estimate never decreases from one iteration to the next, and the EM algorithm is guaranteed to reach a local maximum.
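A compact NumPy sketch of this two-Gaussian EM loop (a hedged illustration of the E- and M-steps just described; the initialization follows the algorithm summarized next, while the fixed iteration count stands in for a proper convergence test):

```python
import numpy as np

def normal_pdf(y, mu, sigma):
    return np.exp(-(y - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

def em_two_gaussians(y, n_iter=100, seed=0):
    """Fit 1-D data y with pi*N(mu1, s1^2) + (1 - pi)*N(mu2, s2^2)."""
    rng = np.random.default_rng(seed)
    mu1, mu2 = rng.choice(y, size=2, replace=False)  # random data points as means
    s1 = s2 = y.std()                                # spread of the whole dataset
    pi = 0.5
    for _ in range(n_iter):
        # E-step: gamma_i = P(f_i = 1 | parameters, data), the responsibility
        # of component two for each point.
        p1 = pi * normal_pdf(y, mu1, s1)
        p2 = (1 - pi) * normal_pdf(y, mu2, s2)
        gamma = p2 / (p1 + p2)
        # M-step: re-estimate the parameters from the responsibility-weighted data.
        mu1 = np.sum((1 - gamma) * y) / np.sum(1 - gamma)
        mu2 = np.sum(gamma * y) / np.sum(gamma)
        s1 = np.sqrt(np.sum((1 - gamma) * (y - mu1) ** 2) / np.sum(1 - gamma))
        s2 = np.sqrt(np.sum(gamma * (y - mu2) ** 2) / np.sum(gamma))
        pi = 1 - gamma.mean()                        # probability of Gaussian one
    return mu1, s1, mu2, s2, pi
```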
The Gaussian Mixture Model EM algorithm (two-Gaussian mixture)
Initialization: set \mu'_1 and \mu'_2 to randomly chosen values from the dataset, set \sigma'_1 and \sigma'_2 from the variance of the whole dataset, and set \pi' = 0.5.
Repeat until convergence:
E-step: for each data point y_i, compute

\gamma_i = \frac{(1 - \pi')\,\phi(y_i; \mu'_2, \sigma'_2)}{\pi'\,\phi(y_i; \mu'_1, \sigma'_1) + (1 - \pi')\,\phi(y_i; \mu'_2, \sigma'_2)}.

M-step: re-estimate the parameters from the weighted data:

\mu'_1 = \frac{\sum_i (1 - \gamma_i)\, y_i}{\sum_i (1 - \gamma_i)}, \quad \mu'_2 = \frac{\sum_i \gamma_i\, y_i}{\sum_i \gamma_i}, \quad
\sigma'^2_1 = \frac{\sum_i (1 - \gamma_i)(y_i - \mu'_1)^2}{\sum_i (1 - \gamma_i)}, \quad \sigma'^2_2 = \frac{\sum_i \gamma_i (y_i - \mu'_2)^2}{\sum_i \gamma_i}, \quad \pi' = \frac{1}{N}\sum_i (1 - \gamma_i).

These two steps mirror the two steps of k-means: (1) assigning a cluster to each data point and (2) re-estimating the cluster centers.

Figure 6.10 shows the negative log likelihood dropping as the EM algorithm learns, for the case of a two-Gaussian mixture (the x axis is the number of iterations). The training is quite expensive: O(NM^2 + M^3), where M is the number of Gaussians and N is the number of data points. The trick in applying EM algorithms to problems is identifying the correct latent variables to include, and then simply working through the steps.

EM for a mixture of Gaussians (general case: K Gaussians)
A mixture of K Gaussians is

p(x) = \sum_{k=1}^{K} \pi_k \, \phi(x \mid \Theta_k), \quad \text{with } \sum_{k=1}^{K} \pi_k = 1,

where \phi(x \mid \Theta) is the Gaussian distribution with parameters \Theta = \{\mu, \Sigma\}.
If all points x \in X are generated by this mixture of K Gaussians, then

P(X) = \prod_{n=1}^{N} \sum_{k=1}^{K} \pi_k \, \phi(x_n \mid \Theta_k).

Goal: find \pi_1, ..., \pi_K and \Theta_1, ..., \Theta_K such that P(X) is maximized, or equivalently such that

\ln P(X) = \sum_{n=1}^{N} \ln\Big(\sum_{k=1}^{K} \pi_k \, \phi(x_n \mid \Theta_k)\Big)

is maximized.

Mixtures of Gaussians – notes
Every point x_i is probabilistically assigned to (generated by) the k-th Gaussian. The probability that point x_i is generated by the k-th Gaussian is

\gamma_k(x_i) = \frac{\pi_k \, \phi(x_i \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \phi(x_i \mid \mu_j, \Sigma_j)}.

Every Gaussian (cluster) C_k then has an effective number of points assigned to it,

N_k = \sum_{i=1}^{N} \gamma_k(x_i),

with mean

\mu_k = \frac{1}{N_k} \sum_{i=1}^{N} \gamma_k(x_i)\, x_i

and covariance

\Sigma_k = \frac{1}{N_k} \sum_{i=1}^{N} \gamma_k(x_i)\,(x_i - \mu_k)(x_i - \mu_k)^T.

EM for Gaussian mixtures
Initialize the means \mu_k, covariances \Sigma_k (\Theta_k = (\mu_k, \Sigma_k)) and mixing coefficients \pi_k, and evaluate the initial value of the log likelihood.
Expectation step: evaluate the weights \gamma_k(x_i) using the current parameter values.
Maximization step: re-evaluate the parameters \mu_k and \Sigma_k as above and set \pi_k = N_k / N using the new weights. Evaluate the log likelihood L(\Theta^{new}) and stop if converged.

Tools for k-Means and EM
The EM clustering algorithm is available in the WEKA software. X-Means, an improved variant of k-Means, is also available in WEKA.

5. Incremental clustering
Incremental clustering is based on the assumption that it is possible to consider patterns one at a time and assign them to existing clusters; a new data pattern is assigned to a cluster without significantly affecting the existing clusters.
An incremental clustering algorithm (the Leader algorithm):
Step 1: Assign the first data pattern P1 to cluster C1; set i = 1 and j = 1 and let the leader L_i be P_j.
Step 2: Set j = j + 1; consider clusters C1 to Ci in increasing order of index and assign P_j to cluster C_m (1 <= m <= i) if the distance between L_m and P_j is less than a user-specified threshold T; if no cluster satisfies this property, set i = i + 1, assign P_j to the new cluster C_i and set the leader L_i to P_j.
Repeat Step 2 until all the data patterns are assigned to clusters.
This incremental algorithm requires only a single database scan. Unfortunately, it is order-dependent: the clusters obtained depend on the order in which the patterns are presented.
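A direct Python transcription of the Leader algorithm (a sketch; the Euclidean distance and the threshold used in the example below are illustrative choices, not from the slides):

```python
import numpy as np

def leader(patterns, threshold):
    """Single-scan incremental (Leader) clustering.

    Returns a list of clusters; each cluster is (leader, [member patterns]).
    """
    patterns = np.asarray(patterns, dtype=float)
    clusters = [(patterns[0], [patterns[0]])]           # Step 1: P1 starts C1
    for p in patterns[1:]:                              # Step 2: one scan
        for leader_pt, members in clusters:             # clusters in index order
            if np.linalg.norm(p - leader_pt) < threshold:
                members.append(p)
                break
        else:                                           # no cluster close enough:
            clusters.append((p, [p]))                   # p leads a new cluster
    return clusters

# Example: the seven points of the k-means example with threshold T = 3
# gives the three clusters {A, B, C}, {D, E}, {F, G}.
X = [[1, 1], [1, 2], [2, 2], [6, 2], [7, 2], [6, 6], [7, 6]]
print([len(members) for _, members in leader(X, threshold=3)])
```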
6. Clustering evaluation criteria
We can use classified datasets and compare how well the clustering results fit the class labels; five clustering evaluation criteria, Jaccard, Rand, FM, CSM and NMI, can be used in this case (external evaluation). Besides, we can evaluate the quality of a clustering by using the objective function

J = \sum_{i=1}^{k} \sum_{x \in C_i} \| x - c_i \|^2,

where the x are the objects and the c_i are the cluster centers.

Consider G = {G_1, G_2, ..., G_M} as the clusters of a classified dataset and A = {A_1, A_2, ..., A_M} as the clusters obtained by a clustering algorithm, and denote by D the dataset of patterns. For all pairs of patterns (D_i, D_j) in D, we count the following quantities:
- a: the number of pairs that belong to one cluster in G and are clustered together in A;
- b: the number of pairs that belong to one cluster in G but are not clustered together in A;
- c: the number of pairs that are clustered together in A but do not belong to one cluster in G;
- d: the number of pairs that are neither clustered together in A nor belong to the same cluster in G.
The clustering evaluation criteria are defined as follows.
1. Jaccard score: Jaccard = \frac{a}{a + b + c}.
2. Rand statistic: Rand = \frac{a + d}{a + b + c + d}.
3. Fowlkes and Mallows index: FM = \frac{a}{\sqrt{(a + b)(a + c)}}.
4. Cluster Similarity Measure: CSM(G, A) = \frac{1}{M} \sum_{i=1}^{M} \max_{1 \le j \le M} sim(G_i, A_j), where sim(G_i, A_j) = \frac{2\,|G_i \cap A_j|}{|G_i| + |A_j|}.
5. Normalized Mutual Information:

NMI(G, A) = \frac{\sum_{i=1}^{M} \sum_{j=1}^{M} N_{i,j} \log\frac{N \cdot N_{i,j}}{|G_i|\,|A_j|}}{\sqrt{\Big(\sum_{i=1}^{M} |G_i| \log\frac{|G_i|}{N}\Big)\Big(\sum_{j=1}^{M} |A_j| \log\frac{|A_j|}{N}\Big)}},

where N is the number of patterns in the dataset, |A_j| the number of patterns in A_j, |G_i| the number of patterns in G_i, and N_{i,j} = |G_i \cap A_j|.
All the evaluation criteria take values ranging from 0 to 1, where 1 corresponds to the case when G and A are identical: the larger the criterion value, the better the clustering quality.

References
- M. N. Murty and V. S. Devi, Pattern Recognition – An Algorithmic Approach, Springer, 2011.
- S. Marsland, Machine Learning: An Algorithmic Perspective, Chapman & Hall/CRC, 2009.
