Chapter 2: Data Mining

Sports IBM Advanced Scout analyzed NBA game statistics (shots blocked, assists, and fouls) to gain competitive advantage for New York Knicks and Miami Heat Astronomy JPL and the Palomar Observatory discovered 22 quasars with the help of data mining Internet Web Surf-Aid IBM Surf-Aid applies data mining algorithms to Web access logs for market-related pages to discover customer preference and behavior pages, analyzing effectiveness of Web marketing, improving Web site organization, etc.

154 trang | Chia sẻ: vutrong32 | Lượt xem: 1747 | Lượt tải: 0

Bạn đang xem trước 20 trang tài liệu Chapter 2: Data Mining, để xem tài liệu hoàn chỉnh bạn click vào nút DOWNLOAD ở trên

Chapter 2Data MiningFaculty of Computer Science and EngineeringHCM City University of Technology October- 20101OutlineOverview of data miningAssociation rulesClassificationRegressionClusteringOther Data Mining problemsApplications of data mining2DATA MINING Data mining refers to the mining or discovery of new information in terms of patterns or rules from vast amount of data. To be practically useful, data mining must be carried out efficiently on large files and databases.This chapter briefly reviews the state-of-the-art of this extensive field of data mining.Data mining uses techniques from such areas as machine learning, statistics, neural networks genetic algorithms. 3OVERVIEW OF DATA MINING Data Mining as a Part of the Knowledge Discovery Process.Knowledge Discovery in Databases, abbreviated as KDD, encompasses more than data mining. The knowledge discovery process comprises six phases: data selection, data cleansing, enrichment, data transformation or encoding, data mining and the reporting and displaying of the discovered information.4ExampleConsider a transaction database maintained by a specially consumer goods retails. Suppose the client data includes a customer name, zip code, phone number, date of purchase, item code, price, quantity, and total amount. A variety of new knowledge can be discovered by KDD processing on this client database. During data selection, data about specific items or categories of items, or from stores in a specific region or area of the country, may be selected. The data cleansing process then may correct invalid zip codes or eliminate records with incorrect phone prefixes. Enrichment enhances the data with additional sources of information. For example, given the client names and phone numbers, the store may purchases other data about age, income, and credit rating and append them to each record. Data transformation and encoding may be done to reduce the amount of data. 5Example (cont.)The result of mining may be to discover the following type of “new” information:Association rules – e.g., whenever a customer buys video equipment, he or she also buys another electronic gadget.Sequential patterns – e.g., suppose a customer buys a camera, and within three months he or she buys photographic supplies, then within six months he is likely to buy an accessory items. This defines a sequential pattern of transactions. A customer who buys more than twice in the regular periods may be likely buy at least once during the Christmas period.Classification trees – e.g., customers may be classified by frequency of visits, by types of financing used, by amount of purchase, or by affinity for types of items, and some revealing statistics may be generated for such classes.6We can see that many possibilities exist for discovering new knowledge about buying patterns, relating factors such as age, income group, place of residence, to what and how much the customers purchase. This information can then be utilized to plan additional store locations based on demographics, to run store promotions, to combine items in advertisements, or to plan seasonal marketing strategies. As this retail store example shows, data mining must be preceded by significant data preparation before it can yield useful information that can directly influence business decisions.The results of data mining may be reported in a variety of formats, such as listings, graphic outputs, summary tables, or visualization.7Goals of Data Mining and Knowledge DiscoveryData mining is carried out with some end goals. These goals fall into the following classes: Prediction – Data mining can show how certain attributes within the data will behave in the future.Identification – Data patterns can be used to identify the existence of an item, an event or an activity.Classification – Data mining can partition the data so that different classes or categories can be identified based on combinations of parameters.Optimization – One eventual goal of data mining may be to optimize the use of limited resources such as time, space, money, or materials and to maximize output variables such as sales or profits under a given set of constraints.8Data Mining: On What Kind of Data?Relational databasesData warehousesTransactional databasesAdvanced DB and information repositoriesObject-oriented and object-relational databasesSpatial databasesTime-series data and temporal dataText databases and multimedia databasesHeterogeneous and legacy databasesWorld Wide Web9Types of Knowledge Discovered During Data Mining. Data mining addresses inductive knowledge, which discovers new rules and patterns from the supplied data. Knowledge can be represented in many forms: In an unstructured sense, it can be represented by rules. In a structured form, it may be represented in decision trees, semantic networks, or hierarchies of classes or frames.It is common to describe the knowledge discovered during data mining in five ways:Association rules – These rules correlate the presence of a set of items with another range of values for another set of variables.10Types of Knowledge Discovered (cont.)Classification hierarchies – The goal is to work from an existing set of events or transactions to create a hierarchy of classes. Patterns within time seriesSequential patterns: A sequence of actions or events is sought. Detection of sequential patterns is equivalent to detecting associations among events with certain temporal relationship.Clustering – A given population of events can be partitioned into sets of “similar” elements.11Main function phases of the KD processLearning the application domain:relevant prior knowledge and goals of applicationCreating a target data set: data selectionData cleaning and preprocessing: (may take 60% of effort!)Data reduction and transformation:Find useful features, dimensionality/variable reduction, invariant representation.Choosing functions of data mining summarization, classification, regression, association, clustering.Choosing the mining algorithm(s)Data mining: search for patterns of interestPattern evaluation and knowledge presentationvisualization, transformation, removing redundant patterns, etc.Use of discovered knowledge12Main phases of data miningData CleaningData IntegrationData SourcesData WarehouseKnowledgeTask-relevant DataSelection/TransformationData MiningPattern Evaluation/PresentationPatterns132. ASSOCIATION RULESWhat Is Association Rule Mining?Association rule mining is finding frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories.Applications:Basket data analysis, cross-marketing, catalog design, clustering, classification, etc.Rule form: “Body  Head [support, confidence]”.14Association rule miningExamples. buys(x, “diapers”)  buys(x, “beers”) [0.5%, 60%]major(x, “CS”)  takes(x, “DB”)  grade(x, “A”) [1%, 75%]Association Rule Mining Problem:Given: (1) database of transactions, (2) each transaction is a list of items (purchased by a customer in a visit)Find: all rules that correlate the presence of one set of items with that of another set of itemsE.g., 98% of people who purchase tires and auto accessories also get automotive services done.15Rule Measures: Support and ConfidenceLet J = {i1, i2,,im} be a set of items. Let D, the task-relevant data, be a set of database transactions where each transaction T is a set of items such that T  J. Each transaction T is said to contain A if and only if A  T.An association rule is an implication of the form A  B where A  J, B  J and A  B = . The rule A  B holds in the transaction set D with support s, where s is the percentage of transactions in D that contain A  B (i.e. both A and B). This is taken to be the probability P(A  B ). The rule A  B has the confidence c in the transaction set D if c is the percentage of transactions in D containing A that also contain B. 16Support and confidenceThat is. support, s, probability that a transaction contains {A  B } s = P(A  B ) confidence, c, conditional probability that a transaction having A also contains B. c = P(A|B).Rules that satisfy both a minimum support threhold (min_sup) and a mimimum confidence threhold (min_conf) are called strong. 17Frequent item setA set of items is referred as an itemset. An itemset that contains k items is a k-itemset. The occurrence frequency of an itemset is the number of transactions that contain the itemset. An itemset satisfies minimum support if the occurrence frequency of the itemset is greater than or equal to the product of min_suf and the total number of transactions in D. The number of transactions required for the itemset to satisfy minimum support is referred to as the minimum support count. If an itemset satisfies minimum support, then it is a frequent itemset. The set of frequent k-itemsets is commonly denoted by Lk.18Example 2.1Transaction-ID Items_bought-------------------------------------------2000 A, B, C1000 A, C4000 A, D5000 B, E, FLet minimum support 50%, and minimum confidence 50%, we have A  C (50%, 66.6%) C  A (50%, 100%)19Types of Association RulesBoolean vs. quantitative associations (Based on the types of values handled) buys(x, “SQLServer”)  buys(x, “DMBook”)  buys(x, “DBMiner”) [0.2%, 60%]age(x, “30..39”)  income(x, “42..48K”)  buys(x, “PC”) [1%, 75%]Single dimension vs. multiple dimensional associations The rule that references two or more dimensions, such as the dimensions buys, income and age is a multi-dimensional association rule.Single level vs. multiple-level analysisSome methods for association rule mining can find rules at different levels of abstractions. For example, suppose that a set of association rule mined includes the following rules: age(x, “30..39”)  buys(x, “laptop computer”) age(x, “30..39”)  buys(x, “ computer”)in which “computer” is a higher level abstraction of “laptop computer”.20How to mine association rules from large databases?Association rule mining is a two-step process: 1. Find all frequent itemsets (the sets of items that have minimum support) A subset of a frequent itemset must also be a frequent itemset. (Apriori principle) i.e., if {AB} is a frequent itemset, both {A} and {B} should be a frequent itemset Iteratively find frequent itemsets with cardinality from 1 to k (k-itemset) 2.Generate strong association rules from the frequent itemsets.The overall performance of mining association rules is determined by the first step.21The Apriori AlgorithmApriori is an important algorithm for mining frequent itemsets for Boolean association rules. Apriori algorithm employs an iterative approach known as a level-wise search, where k-itemsets are used to explore (k+1)-itemsets. First, the set of frequent 1-itemsets is found. This set is denoted L1. L1 is used to find L2, the set of frequent 2-itemsets, which is used to find L3, and so on, until no more frequent k-itemsets can be found. The finding of each Lk requires one full scan of the database.To improve the efficiency of the level-wise generation of frequent itemsets, an important property called the Apriori property is used to reduce the search space.22Apriori propertyApriori property: All nonempty subsets of a frequent itemset must also be frequent. The Apriori property is based on the following observation. By definition, if an itemset I does not satisfy the minimum support threhold, min_sup, then I is not frequent, that is, P(I) 40”:s13 = 3 s23 = 2 I(s13, s23) = -(3/5)log2(3/5) – (2/5)log2(2/5)= 0.971Using Equation (2), the expected information needed to classify a given sample if the samples are partitioned according to age isE(age) = (5/14)I(s11, s21) + (4/14) I(s12, s22) + (5/14)I(s13, s23) = (10/14)*0.971 = 0.694.45Hence, the gain in information from such a partitioning would be Gain(age) = I(s1, s2) – E(age) = 0.940 – 0.694 = 0.246Similarly, we can compute Gain(income) = 0.029, Gain(student) = 0.151, and Gain(credit_rating) = 0.048. Since age has the highest information gain among the attributes, it is selected as the test attribute. A node is created and labeled with age, and branches are grown for each of the attribute’s values. The samples are then partitioned accordingly, as shown in Figure 3.46age?403140income student credit_rating class high no fair no high no excellent no medium no fair no low yes fair yes medium yes excellent yes income student credit_rating class high no fair yes low yes excellent yes medium no excellent yes high yes fair yes income student credit_rating class medium no fair yes low yes fair yes low yes excellent no medium yes fair yes medium no excellent no 47Extracting Classification Rules from Trees Represent the knowledge in the form of IF-THEN rules One rule is created for each path from the root to a leaf Each attribute-value pair along a path forms a conjunction The leaf node holds the class prediction Rules are easier for humans to understand.ExampleIF age = “40” AND credit_rating = “excellent” THEN buys_computer = “no”IF age = “>40” AND credit_rating = “fair” THEN buys_computer = “yes”48Neural Networks and Classification Neural network is a technique derived from AI that uses generalized approximation and provides an iterative method to carry it out. ANNs use the curve-fitting approach to infer a function from a set of samples. This technique provides a “learning approach”; it is driven by a test sample that is used for the initial inference and learning. With this kind of learning method, responses to new inputs may be able to be interpolated from the known samples. This interpolation depends on the model developed by the learning method.49ANN and classificationANNs can be classified into 2 categories: supervised and unsupervised networks. Adaptive methods that attempt to reduce the output error are supervised learning methods, whereas those that develop internal representations without sample outputs are called unsupervised learning methods.ANNs can learn from information on a specific problem. They perform well on classification tasks and are therefore useful in data mining. 50Information processing at a neuron in an ANN51A multilayer Feed-Forward Neural Network52Backpropagation algorithm5354Classification with ANNOutput nodesInput nodesHidden nodesOutput vectorInput vector: xiwij55Example:56Example:5758Other Classification Methods k-nearest neighbor classifier case-based reasoning Genetic algorithm Rough set approach Fuzzy set approaches59The k-Nearest Neighbor Algorithm All instances (samples) correspond to points in the n-dimensional space. The nearest neighbor are defined in terms of Euclidean distance. The Euclidean distance of two points, X = (x1, x2, ,xn) and Y = (y1, y2, ,yn) is n d(X,Y) =   (xi –yi)2 1When given an unknown sample, the k-Nearest Neighbor classifier search for the space for the k training samples that are closest to the unknown sample xq. The unknown sample is assigned the most common class among its k nearest neighbors. The algorithm has to vote to determine the most common class among the k nearest neighbor. When k = 1, the unknown sample is assigned the class of the training sample that is closest to it in the space.Once we have obtained xq’s k-nearest neighbors using the distance function, it is time for the neighbors to vote in order to determine xq’s class. 60Two approaches are common.Majority voting: In this approach, all votes are equal For each class, we count how many of the k neighbors have that class. We return the class with the most votes.Inverse distance-weighted voting: In this approach, closer neighbors get higher votes. While there are better-motivated methods, the simplest version is to take a neighbor’s vote to be the inverse of its distance to xq:Then we sum the votes and return the class with the highest vote.61Genetic AlgorithmsGA: based on an analogy to biological evolutionEach rule is represented by a string of bits.Example: The rule “IF A1 and Not A2 then C2“ can be encoded as the bit string “100”, where the two left bits represent attributes A1 and A2, respectively, and the rightmost bit represents the class. Similarly, the rule “IF NOT A1 AND NOT A2 THEN C1” can be encoded as “001”.An initial population is created consisting of randomly generated rulesBased on the notion of survival of the fittest, a new population is formed to consists of the fittest rules and their offsprings The fitness of a rule is represented by its classification accuracy on a set of training examplesOffsprings are generated by crossover and mutation.624. RegressionPredictive data miningDescriptive data miningDefinition: (J. Han et al., 2001&2006) Regression is a method used to predict continuous values for given input.63RegressionRegression analysis can be used to model the relationship between one or more independent or predictor variables and one or more response or dependent variables.Categories Linear regression and nonlinear regressionUni-variate and multi-variate regressionparametric, nonparametric and semi-parametric64Regression functionRegression function: Y = f(X, β)X: predictor/independent variablesY: response/dependent variablesβ: regression coefficientsX: used to explain the changes of response variable Y.Y: used to describe the target phenomenon.The relationship between Y and X can be representeb y the functional dependence of Y to X.β describes the influence of X to Y.65 Regression with a single predictor variableGiven N observed objects, this linear model is described in the following form with εi representing the part of response Y that can not be explained from X: Line form:Parabola form:66Linear regression with single predictor variableEstimate the parameter set β ( ) in order to obtain linear regression model:residualSum of squared residuals minimizeEstimated value β67Linear multiple regressionThis linear regression model analyses the relationship between response/dependent variable and two or more independent variables.yi = b0 + b1xi1 + b2xi2 + + bkxik i = 1..n (n is the number of observed objects)k = the number of predictor variables (the number of attributes)Y = dependent variablesX = independent variablesb0 = Y’s value when X = 0b1..k = regression coefficients68Linear multiple regression Estimated value of YEstimated values of b69Linear multiple regressionExample: a sales manager of Tackey Toys, needs to predict sales of Tackey products in selected market area. He believes that advertising expenditures and the population in each market area can be used to predict sales. He gathered sample of toy sales, advertising expenditures and the population as below. Find the linear multiple regression equation which the best fit to the data. 70Linear multiple regressionMarket AreaAdvertising Expenditures (Thousands of Dollars) x1Population (Thousands) x2Toy sales (Thousands of Dollars) yA1.0200100B5.0700300C8.0800400D6.0400200E3.0100100F10.060040071Linear multiple regressionSoftwares: SPSS, SAS, R72Nonlinear regressionY = f(X, β)Y is a nonlinear function for the combining of the coefficients β.Examples: exponential function, logarithmic function, Gauss function, Determine the optimal β: optimization algorithmsLocal optimizationGlobal optimization for the sum of squared residuals73Applications of regressionData MiningPreprocessing stageData Mining stageDescriptive data miningPredictive data miningApplication areas: biology, agriculture, social issues, economy, business, 74Some problems with regressionSome assumptions going along with regression.Danger of extrapolation.Evaluation of regression models.Other advanced techniques for regression:Artificial Neural Network (ANN)Support Vector Machine (SVM)755. CLUSTERINGWhat is Cluster Analysis? Cluster: a collection of data objectsSimilar to one another within the same clusterDissimilar to the objects in other clusters.Cluster analysisGrouping a set of data objects into clusters.Clustering is unsupervised learning: no predefined classes, no class-labeled training samples.Typical applicationsAs a stand-alone tool to get insight into data distribution As a preprocessing step for other algorithms76General Applications of Clustering Pattern RecognitionSpatial Data Analysis create thematic maps in GIS by clustering feature spacesdetect spatial clusters and explain them in spatial data miningImage ProcessingEconomic Science (especially market research)World Wide WebDocument classificationCluster Weblog data to discover groups of similar access patterns77Examples of Clustering ApplicationsMarketing: Help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs.Land use: Identification of areas of similar land use in an earth observation database.Insurance: Identifying groups of motor insurance policy holders with a high average claim cost.City-planning: Identifying groups of houses according to their house type, value, and geographical location.Earth-quake studies: Observed earth quake epicenters should be clustered along continent faults.78Similarity and Dissimilarity Between Objects Distances are normally used to measure the similarity or dissimilarity between two data objectsSome popular ones include: Minkowski distance:where i = (xi1, xi2, , xip) and j = (xj1, xj2, , xjp) are two p-dimensional data objects, and q is a positive integerIf q = 1, d is Manhattan distance79Euclid distanceIf q = 2, d is Euclidean distance:Propertiesd(i,j) 0 d(i,i) = 0 d(i,j) = d(j,i) d(i,j)  d(i,k) + d(k,j)Also one can use weighted distance, or other disimilarity measures.80Type of data in cluster analysisInterval-scaled variables/attributesBinary variables/attributesCategorical variables/attributesOrdinal variables/attributesRatio-scaled variables/attributesVariables/attributes of mixed types81Type of dataInterval-scaled variables/attributesMean absolute deviationMeanZ-score measurement82Type of dataBinary variables/attributesObject iObject j (= a + b + c + d)Dissimilarity (if symmetric):Dissimilarity (if asymmetric):83Type of dataBinary variables/attributes Examplegender: symmetric Other binary attributes: asymmetricY, P  1, N  084Type of dataVariables/attributes of mixed typesGeneral case:if xif or xjf is missing) then f (variable/attribute): binary (categorical)dij(f) = 0 if xif = xjf , or dij(f) = 1 otherwisef : interval-scaled dij(f) = |xif-xjf|/(maxhxhf-minhxhf)f : ordinalcompute ranks rif and zif becomes interval-scaled85Partitioning Algorithms: Basic Concept Partitioning method: Construct a partition of a database D of n objects into a set of k clustersGiven a k, find a partition of k clusters that optimizes the chosen partitioning criterion. - Global optimal: exhaustively enumerate all partitions - Heuristic methods: k-means and k-medoids algorithmsk-means (MacQueen’67): Each cluster is represented by the center of the clusterk-medoids or PAM (Partition around medoids) (Kaufman & Rousseeuw’87): Each cluster is represented by one of the objects in the cluster 86The K-Means Clustering MethodInput: a database D, of m records, r1, r2,,rm and a desired number of clusters k.Output: set of k clusters that minimizes the square error criterion.Given k, the k-means algorithm is implemented in 4 steps:Step 1: Randomly choose k records as the initial cluster centers.Step 2: Assign each records ri, to the cluster such that the distance between ri and the cluster centroid (mean) is the smallest among the k clusters. Step 3: recalculate the centroid (mean) of each cluster based on the records assigned to the cluster. Step 4: Go back to Step 2, stop when no more new assignment.87The algorithm begins by randomly choosing k records to represent the centroids (means), m1, m2,,mk of the clusters, C1, C2,,Ck. All the records are placed in a given cluster based on the distance between the record and the cluster mean. If the distance between mi and record rj is the smallest among all cluster means, then record is placed in cluster Ci. Once all records have been placed in a cluster, the mean for each cluster is recomputed. Then the process repeats, by examining each record again and placing it in the cluster whose mean is closest. Several iterations may be needed, but the algorithm will converge, although it may terminate at a local optimum. 88Square-error criterionThe terminating condition is usually the squared-error criterion.Typically, the square-error criterion is used, defined as: k E = i=1pCi |p – mi|2 where E is the sum of square-error for all objects in the database, p is the point in space representing a given object, and mi is the mean of cluster Ci. This criterion tries to make the resulting clusters as compact and as separate as possible.89Example 4.1: Consider the K-means clustering algorithm that works with the (2-dimensional) records in Table 2. Assume that the number of desired clusters k is 2.RID Age Years of Service--------------------------------------1 30 52 50 253 50 154 25 55 30 106 30 25Let the algorithm choose records with RID 3 for cluster C1 and RID 6 for cluster C2. as the initial cluster centroids. The first iteration: distance(r1, C1) =  (50-30)2+(15-5)2 = 22.4; distance(r1, C2) = 32.0, so r1 C1. distance(r2, C1) = 10.0 and distance(r2, C2) = 5.0 so r2  C2. distance(r4, C1) = 25.5 and distance(r4, C2) = 36.6 so r4  C1 distance(r5, C1) = 20.6 and distance(r5, C2) = 29.2 so r5  C1Now the new means (centroids) for the two clusters are computed. 90The means for a cluster Ci with n records of m dimensions is the vector: (1/n rj Ci rj1, 1/n rj Ci rjm)The new mean for C1 is (33.75, 8.75) and the new mean for C2 is (52.5, 25).The second iteration: r1, r4, r5  C1 and r2, r3, r6  C2.The mean for C1 and C2 are recomputed as (28.3, 6.7) and (51.7, 21.7). In the next iteration, all records stay in their previous clusters and the algorithm terminates.91Clustering of a set of objects based on the k-means method.92Comments on the K-Means MethodStrength Relatively efficient: O(tkn), where n is # objects, k is # of clusters, and t is # of iterations. Normally, k, t << n. Often terminates at a local optimum. The global optimum may be found using techniques such as: deterministic annealing and genetic algorithmsWeakness Applicable only when mean is defined, then what about categorical data? Need to specify k, the number of clusters, in advance Unable to handle noisy data and outliers Quality of clustering is dependent on the choice of initial cluster centroids.93Partitioning around metroids (k-metroids)94K-metroidsCompute “total cost S of swapping Oj và Orandom” = ΣpCp/OiOrandom95K-metroidsCp/OiOrandom = d(p,Oi) – d(p,Oj)Cp/OiOrandom = d(p,Orandom) – d(p,Oj)Cp/OiOrandom = 0Cp/OiOrandom = d(p,Orandom) – d(p,Oi)Compute “total cost S of swapping Oj và Orandom” = ΣpCp/OiOrandom96Properties of k-metroids algorithmEach cluster has its representative objective, the medoid, or most centrally located objects, of its cluster.Reduce the influence of noise (outlier/irregularitites/extrema).The number of clusters k needs to be predetermined.Complexity for each iteration: O(k(n-k)2)Algorithm becomes very costly for large values of n and k.97Hierarchical ClusteringA hierarchical clustering method works by grouping data objects into a tree of clusters.In general, there are two types of hierarchical clustering methods:  Agglomerative hierarchical clustering: This bottom-up strategy starts by placing each object in its own cluster and then merges these atomic clusters into larger and larger clusters, until all of the objects are in a single cluster or until a certain termination conditions are satisfied. Most hierarchical clustering methods belong to this category. They differ only in their definition of intercluster similarity.  Divisive hierarchical clustering: This top-down strategy does the reverse of agglomerative hierarchical clustering by starting with all objects in one cluster. It subdivides the cluster into smaller and smaller pieces, until each object forms a cluster on its own or until it satisfied certain termination condition, such as a desired number clusters is obtained or the distance between two closest clusters is above a certain threshold distance.98Agglomerative algorithm Assume that we are given n data records r1, r2,,rm and a function D(Ci, Cj) for measuring the distance between two clusters Ci and Cj. Then an agglomerative algorithm for clustering can be as follows:for i = 1,,n do let Ci = {ri};while there is more than one cluster left dobegin Let Ci and Cj be the clusters which minimize the distance D(Ck, Ch) between any two clusters Ck, Ch; Ci = Ci  Cj; Remove cluster Cjend99ExampleExample 4.2: Figure 4 shows the application of AGNES (Agglomerative NESting), an agglomerative hierarchical clustering method, and DIANA (DIvisive ANAlysis), a divisive hierarchical clustering method to a data set of five objects, {a, b, c, d, e}. Initially, AGNES places each object into a cluster of its own. The clusters are then merged step-by-step according to some criterion. For example, clusters C1 and C2 may be merged if an object in C1 and an object in C2 form the minimum Euclidean distance between any two objects from different clusters. This is a single-link approach in that each cluster is represented by all of the objects in the cluster, and the similarity between two clusters is measured by the similarity of the closest pair of data points belonging to different clusters. The cluster merging process repeats until all of the objects are eventually merged to form one cluster.100Agglomerative and divisive hierarchical clustering on data objects {a, b, c, d, e}101Hierarchical ClusteringIn DIANA, all of the objects are used to form one initial cluster. The cluster is split according to some principle, such as the maximum Euclidean distance between the closest neighboring objects in the cluster. The cluster splitting process repeats until, eventually, each new cluster contains only a single objects.In general, divisive methods are more computationally expensive and tend to be less widely used than agglomerative methods.There are a variety of methods for defining the intercluster distance D(Ck, Ch). However, local pairwise distance measures (i.e., between pairs of clusters) are especially suited to hierarchical methods. 102Hierarchical ClusteringOne of the most important of intercluster distances is the nearest neighbor or single link method. This defines the distance between two clusters as the distance between the two closest points, one from each cluster; Dsl(Ci, Cj) = min{d(x,y)| x Ci, y Cj} where d(x, y) is the distance between objects x and y.If the distance between two clusters is defined as the distance between the two farthest points, one from each cluster: Dcl(Ci, Cj) = max{d(x,y)| x Ci, y Cj} where d(x, y) is the distance between objects x and y The method is called a complete-linkage algorithm.1031234123412341234Single-linkageComplete-linkageCriteria to merge two clusters: single-linkage and complete-linkage104Major weakness of hierarchical clustering methods:1. They do not scale well: time complexity of at least O(n2), where n is the number of total objects. The agglomerative algorithm requires in the first iteration that we locate the closest pair of objects. This takes O(n2) time and so, in most cases, the algorithm require O(n2) time, and frequently much more.2. We can never undo what was done previously.105Some advanced hierarchical clustering algorithmsBIRCH (Balanced Iterative Reducing and Clustering using Hierarchies): partitions the objects using two concepts: cluster-feature and cluster-feature tree.ROCK (Robust Clustering using linKs): clustering algorithm for categorical/discrete attributes., Chameleon: A hierarchical clustering algorithm using dynamic modeling.106Introduction to BIRCH (1996) Hierarchial representation of a clustering The data set needs to be scanned only once Clustering decisions on-the-fly when new data points are inserted. Outlier removal Good for very large datasetLimited computational resources 107Introduction to BIRCHFeatures of a cluster:Centroid: Euclidean meanRadius: average distance from any member point to centroid.Diameter: Average pairwise distance between two data points108Alternative of measures for closeness of two clustersAny of the following can be used as distancemetric to compare a new data point toexisting clusters: in BIRCH algorithm: D0: Euclidean distance between centroidsD1: Manhattan distance between centroids109Alternative of measures forcloseness of two clustersANd for deciding whether to merge clusters: D2=Average Inter-cluster distance between 2 clusters D3=Average intra-cluster distance inside a cluster (geometrically, the diameter of new cluster if 2 clusters are merged) D4=Variance increase distance: amount by which the intracluster distance variance changes if 2 clusters are merged 110Clustering featureMaintained for each subcluster Enough information to calculate intra-cluster distances111Additivity theoremClustering features are additive Allows us to merge two subclusters112CF treeHierarchial representation of a clusteringUpdated dynamically when new data points are insertedEach entry in a leaf node is not a data point, but a subcluster113CF tree parameters The diameter of a leaf node has to be less than T A nonleaf node contains at most B entries A leaf node contains at most L entries The tree size is a function of T B and L are determined by the requirement that each node fits into a memory page of given size114Example CF tree115CF-Tree116The BIRCH algorithm1. Build an in-memory CF tree2. Optional: condense into a smaller CF tree3. Global clustering4. Optional: cluster refining117Phase 1Insert data points one at a time, building a CF-tree dynamicallyIf the tree grows too large, increase the threshold T and rebuild from the current treeOptionally remove outliers when rebuilding the tree118Insertion algorithmStart from the root node and find the closest leaf node and leaf entry If the distance to the centroid is less than the threshold T, the entry will absorb the new data point Otherwise create a new leaf entry If there is no space on the leaf for new entry, split the leaf node119CF-Tree InsertionChoose the farthest pair of entries, and redistribute the remaining entriesInsert a new nonleaf entry into the parent nodeWe may have to split the parent node as wellIf the root is split, the tree height increases by one120CF-Tree Insertion121CF-Tree Insertion122CF-Tree Insertion123CF-Tree Insertion124CF-Tree Insertion125CF-Tree Insertion126CF-Tree Insertion127Phase 2Scan the leaf entries to rebuild a smaller CF-tree Remove outliers Group more crowded subclusters into larger ones The idea is to make next phase more efficient128Phase 3Leaf nodes do not necessarily represent clusters A regular clustering algorithm is used to cluster the leaf entries129Phase 4Redistribute data points to their closest centroid1306. OTHER DATA MINING PROBLEMS Discovering of Sequential Patterns The discovery of sequential patterns is based on the concept of a sequence of itemsets. We assume that transactions are ordered by time of purchase. That ordering yields a sequence of itemsets.For example, {milk, bread, juice}, {bread, eggs}, {cookies, milk, coffee} may be such a sequence of itemsets based on three visits of the same customer to the store. The support for a sequence S of itemsets is the percentage of the given set U of sequences of which S is a subsequence.In this example, {milk, bread, juice}, {bread, eggs} and {bread, eggs}, {cookies, milk, coffee} are considered subsequences. 131The problem of identifying sequential patterns, then, is to find all subsequences from the given sets of sequences that have a user-defined minimum support. The sequence S1, S2, S3, is a predictor of the fact that a customer who buys itemset S1 is likely to buy itemset S2 and then S3, and so on. This prediction is based on the frequency (support) of this sequence in the past. Various algorithms have been investigated for sequence detection.132Mining Time seriesDiscovering of Patterns in Time SeriesA time series database consists of sequences of values or events changing with time. The values are typically measured at equal time intervals. Time series databases are popular in many applications, such as studying daily fluctuations of a stock market, traces of scientific experiments, medical treatments, and so on.A time series can be illustrated as a time-series graph which describes a point moving with the passage of time.133Time series data: Stock price of IBM over time134Categories of Time-Series Movements: Long-term or trend movements Cyclic movements or cycle variations, e.g., business cycles Seasonal movements or seasonal variations i.e, almost identical patterns that a time series appears to follow during corresponding months of successive years. Irregular or random movements135Similarity Search in Time-Series AnalysisNormal database query finds exact match. Similarity search finds data sequences that differ only slightly from the given query sequence.Two categories of similarity queries Whole matching: find a sequence that is similar to the query sequence Subsequence matching: find all pairs of similar sequencesTypical Applications Financial market Transaction data analysis Scientific databases (e.g. power consumption analysis) Medical diagnosis (e.g. cardiogram analysis)136Data transformation For similarity analysis of time series data, Euclidean distance is typically used as a similarity measure.Many techniques for signal analysis require the data to be in the frequency domain. Therefore, distance-preserving transformations are often used to transform the data from time domain to frequency domain.Usually data-independent transformations are used where the transformation matrix is determined a priori.E.g., discrete Fourier transform (DFT), discrete wavelet transform (DWT)The distance between two signals in the time domain is the same as their Euclidean distance in the frequency domainDFT does a good job of concentrating energy in the first few coefficients. If we keep only first a few coefficients in DFT, we can compute the lower bounds of the actual distance.137Multidimensional IndexingMultidimensional indexConstructed for efficient accessing using the first few Fourier coefficientsUse the index to retrieve the sequences that are at most a certain small distance away from the query sequence.Perform post-processing by computing the actual distance between sequences in the time domain and discard any false matches.138Subsequence Matching Break each sequence into a set of pieces of window with length w. Extract the features of the subsequence inside the windowMap each sequence to a “trail” in the feature spaceDivide the trail of each sequence into “subtrails” and represent each of them with minimum bounding rectangle. (R-trees and R*-trees have been used to store minimal bounding rectangles so as to speed up the similarity search.)Use a multipiece assembly algorithm to search for longer sequence matches.139R1R2R5R3R7R9R8R6R4We can group clusters of datapoints with “boxes”, called Minimum Bounding Rectangles (MBR).We can further recursively group MBRs into larger MBRs. 140R10 R11 R12R1 R2 R3R4 R5 R6R7 R8 R9Data nodes containing pointsR10R11R12these nested MBRs are organized as a tree (called a spatial access tree or a multidimensional tree). Examples include R-tree, Hybrid-Tree etc.141DiscretizationDiscretization of a time series is tranforming it into a symbolic string. The main benefit of this discretization is that there is an enormous wealth of existing algorithms and data structures that allow the efficient manipulations of symbolic representations. Lin and Keogh et al. (2003) proposed a method called Symbolic Aggregate Approximation (SAX), which allows the descretization of original time series into symbolic strings. 142Symbolic Aggregate Approximation (SAX) [Lin et al. 2003]baabccbcThe first symbolic representation of time series, that allow Lower bounding of Euclidean distance Dimensionality Reduction Numerosity Reduction 143How do we obtain SAX0 20 40 60 80 100 120 C C 0 - - 020406080100120b b b ac cc abaabccbcFirst convert the time series to PAA representation, then convert the PAA to symbolsIt take linear time144Two parameter choices 0 - - 020406080100120b b b ac cc a0 20 40 60 80 100 120 C C 123456718The word size, in this case 8The alphabet size (cardinality), in this case 3321145Time Series Data Mining tasksSimilarity SearchClassificationClusteringMotif DiscoveryNovelty/Anomaly DetectionTime series visualizationTime series prediction146Some representative data mining toolsAcknosoft (Kate) Decision trees, case-based reasoningDBMiner Technology (DBMiner) OLAP analysis, Associations, classification, clustering.IBM (Intelligent Miner) Classification, Association rules, predictive models.NCR (Management Discovery Tool) Association rulesSAS (Enterprise Miner) Decision trees, Association rules, neural networks, Regression, clusteringSilicon Graphics (MineSet) Decision trees, Association rulesOracle (Oracle Data Mining) classification, prediction, regression, clustering, association, feature selection, feature extraction, anomaly selection. Weka system ( University of Waikato, Newzealand. The system is written in Java. The platforms: Linux, Windows, Macintosh.1477. POTENTIAL APPLICATIONS OF DM Database analysis and decision supportMarket analysis and managementtarget marketing, customer relation management, market basket analysis, cross selling, market segmentationRisk analysis and managementForecasting, customer retention, improved underwriting, quality control, competitive analysisFraud detection and managementOther ApplicationsText mining (news group, email, documents) and Web analysis.Intelligent query answering148Market Analysis and Management Where are the data sources for analysis?Credit card transactions, loyalty cards, discount coupons, customer complaint calls, plus (public) lifestyle studiesTarget marketingFind clusters of “model” customers who share the same characteristics: interest, income level, spending habits, etc.Determine customer purchasing patterns over timeConversion of single to a joint bank account: marriage, etc.149Cross-market analysisAssociations/co-relations between product salesPrediction based on the association informationCustomer profilingdata mining can tell you what types of customers buy what products (clustering or classification)Identifying customer requirementsidentifying the best products for different customersuse prediction to find what factors will attract new customersProvides summary informationvarious multidimensional summary reportsstatistical summary information (data central tendency and variation)150Corporate Analysis and Risk Management Finance planning and asset evaluationcash flow analysis and predictioncontingent claim analysis to evaluate assets cross-sectional and time series analysis (financial-ratio, trend analysis, etc.)Resource planning:summarize and compare the resources and spendingCompetition:monitor competitors and market directions group customers into classes and a class-based pricing procedureset pricing strategy in a highly competitive market151Fraud Detection and Management Applicationswidely used in health care, retail, credit card services, telecommunications (phone card fraud), etc.Approachuse historical data to build models of fraudulent behavior and use data mining to help identify similar instancesExamplesauto insurance: detect a group of people who stage accidents to collect on insurancemoney laundering: detect suspicious money transactions (US Treasury's Financial Crimes Enforcement Network) medical insurance: detect professional patients and ring of doctors and ring of references152Fraud Detection and Management (cont.)Detecting inappropriate medical treatmentAustralian Health Insurance Commission identifies that in many cases blanket screening tests were requested (save Australian $1m/yr).Detecting telephone fraudTelephone call model: destination of the call, duration, time of day or week. Analyze patterns that deviate from an expected norm.British Telecom identified discrete groups of callers with frequent intra-group calls, especially mobile phones, and broke a multimillion dollar fraud. RetailAnalysts estimate that 38% of retail shrink is due to dishonest employees. 153Other Applications SportsIBM Advanced Scout analyzed NBA game statistics (shots blocked, assists, and fouls) to gain competitive advantage for New York Knicks and Miami HeatAstronomyJPL and the Palomar Observatory discovered 22 quasars with the help of data miningInternet Web Surf-AidIBM Surf-Aid applies data mining algorithms to Web access logs for market-related pages to discover customer preference and behavior pages, analyzing effectiveness of Web marketing, improving Web site organization, etc.154

Các file đính kèm theo tài liệu này:

data_mining_7853.ppt