Quản trị mạng - Chapter 7: Neural networks

In RBF network, responses of neurons are “locally tuned” or “selective” for some part of the input space. The parameters of MLP networks are nonlinear while those of RBF networks are linear. RBF networks guarantees convergence to globally optimum parameters (without local minima). RBF networks learn faster than MLP networks.

75 trang | Chia sẻ: nguyenlam99 | Lượt xem: 757 | Lượt tải: 0

Bạn đang xem trước 20 trang tài liệu Quản trị mạng - Chapter 7: Neural networks, để xem tài liệu hoàn chỉnh bạn click vào nút DOWNLOAD ở trên

Chapter 7: Neural NetworksAssoc. Prof. Dr. Duong Tuan AnhHCMC University of TechnologyJuly 20151Outline1. Neural Networks Representation2. Appropriate problems for Neural Network Learning3. Perceptrons4. Multilayer Networks and the Backpropagation algorithm5. Remarks on the Backpropagation algorithm6. Neural network application development7. Benefits and Limitations of Neural networks8. Neural network applications9. RBF neural network21. NEURAL NETWORK REPRESENTATIONAn ANN is composed of processing elements called or perceptrons, organized in different ways to form the network’s structure.Processing ElementsAn ANN consists of perceptrons. Each of the perceptrons receives inputs, processes inputs and delivers a single output.The input can be raw input data or the output of other perceptrons. The output can be the final result (e.g. 1 means yes, 0 means no) or it can be inputs to other perceptrons.3The networkEach ANN is composed of a collection of perceptrons grouped in layers. A typical structure is shown in Fig.2. Note the three layers: input, intermediate (called the hidden layer) and output. Several hidden layers can be placed between the input and output layers.Figure 242. Appropriate Problems for Neural Network ANN learning is well-suited to problems in which the training data corresponds to noisy, complex sensor data. It is also applicable to problems for which more symbolic representations are used. The backpropagation (BP) algorithm is the most commonly used ANN learning technique. It is appropriate for problems with the characteristics: Input is high-dimensional discrete or real-valued (e.g. raw sensor input) Output is discrete or real valued Output is a vector of values Possibly noisy data Long training times accepted Fast evaluation of the learned function required. Not important for humans to understand the weightsExamples:Speech phoneme recognition Image classification Financial prediction53. PERCEPTRONSA perceptron takes a vector of real-valued inputs, calculates a linear combination of these inputs, then outputs a 1 if the result is greater than some threshold –1 otherwise. Given real-valued inputs x1 through xn, the output o(x1, , xn) computed by the perceptron is o(x1, , xn) = 1 if w0 + w1x1 + + wnxn > 0 -1 otherwise where wi is a real-valued constant, or weight.Notice the quantify (-w0) is a threshold that the weighted combination of inputs w1x1 + + wnxn must surpass in order for perceptron to output a 1.6Figure 3. A perceptron7To simplify notation, we imagine an additional constant input x0 = 1, allowing us to write the above inequality as Learning a perceptron involves choosing values for the weights w0, w1,, wn. or in vector form as For brevity, we will sometimes write the perceptron function as8Representation Power of PerceptronsWe can view the perceptron as representing a hyperplane decision surface in the n-dimensional space of instances (i.e. points). The perceptron outputs a 1 for instances lying on one side of the hyperplane and outputs a –1 for instances lying on the other side, as in Figure 4. The equation for this decision hyperplane is Some sets of positive and negative examples cannot be separated by any hyperplane. Those that can be separated are called linearly separated set of examples. Figure 4. Decision surface9Perceptron training ruleAlthough we are interested in learning networks of many interconnected units, let us begin by understanding how to learn the weights for a single perceptron. Here learning is to determine a weight vector that causes the perceptron to produce the correct +1 or –1 for each of the given training examples.Several algorithms are known to solve this learning problem. Here we consider two: the perceptron training rule and the delta rule. 10One way to learn an acceptable weight vector is to begin with random weights, then iteratively apply the perceptron to each training example, modifying the perceptron weights whenever it misclassifies an example. This process is repeated, iterating through the training examples as many as times needed until the perceptron classifies all training examples correctly. Weights are modified at each step according to the perceptron training rule, which revises the weight wi associated with input xi according to the rule. wi  wi + wi where wi = (t – o) xiHere: t is target output value for the current training example o is perceptron output  is small constant (e.g., 0.1) called learning rate11Perceptron training rule (cont.)The role of the learning rate is to moderate the degree to which weights are changed at each step. It is usually set to some small value (e.g. 0.1) and is sometimes made to decrease as the number of weight-tuning iterations increases.We can prove that the algorithm will convergeIf training data is linearly separableand  sufficiently small.If the data is not linearly separable, convergence is not assured.12Gradient Descent and the Delta RuleAlthough the perceptron training rule finds a successful weight vector when the training examples are linearly separable, it can fail to converge if the examples are not linearly separatable. A second training rule, called the delta rule, is designed to overcome this difficulty.The key idea of delta rule: to use gradient descent to search the space of possible weight vectors to find the weights that best fit the training examples. This rule is important because it provides the basis for the backpropagration algorithm, which can learn networks with many interconnected units.The delta training rule: considering the task of training an un-thresholded perceptron, that is a linear unit, for which the output o is given by: o = w0 + w1x1 + ··· + wnxn (1)Thus, a linear unit corresponds to the first stage of a perceptron, without the threhold.13In order to derive a weight learning rule for linear units, let specify a measure for the training error of a weight vector, relative to the training examples. The Training Error can be computed as the following squared errorwhere D is set of training examples, td is the target output for the training example d and od is the output of the linear unit for the training example d.Here we characterize E as a function of weight vector because the linear unit output O depends on this weight vector.(2)14Hypothesis SpaceTo understand the gradient descent algorithm, it is helpful to visualize the entire space of possible weight vectors and their associated E values, as illustrated in Figure 5. Here the axes wo,w1 represents possible values for the two weights of a simple linear unit. The wo,w1 plane represents the entire hypothesis space. The vertical axis indicates the error E relative to some fixed set of training examples. The error surface shown in the figure summarizes the desirability of every weight vector in the hypothesis space.For linear units, this error surface must be parabolic with a single global minimum. And we desire a weight vector with this minimum.15Figure 5. The error surfaceHow can we calculate the direction of steepest descent along the error surface? This direction can be found by computing the derivative of E w.r.t. each component of the vector w.16Derivation of the Gradient Descent RuleThis vector derivative is called the gradient of E with respect to the vector , written E .Notice E is itself a vector, whose components are the partial derivatives of E with respect to each of the wi. When interpreted as a vector in weight space, the gradient specifies the direction that produces the steepest increase in E. The negative of this vector therefore gives the direction of steepest decrease. Since the gradient specifies the direction of steepest increase of E, the training rule for gradient descent is(3)(4)17Here  is a positive constant called the learning rate, which determines the step size in the gradient descent search. The negative sign is present because we want to move the weight vector in the direction that decreases E. This training rule can also be written in its component form wi wi + wi wherewhich makes it clear that steepest descent is achieved by altering each component wi of weight vector in proportion to E/wi. The vector of E/wi derivatives that form the gradient can be obtained by differentiating E from Equation (2), as(5)18where xid denotes the single input component xi for the training example d. We now have an equation that gives E/wi in terms of the linear unit inputs xid, output od and the target value td associated with the training example. Substituting Equation (6) into Equation (5) yields the weight update rule for gradient descent.(6)19The gradient descent algorithm for training linear units is as follows: Pick an initial random weight vector. Apply the linear unit to all training examples, then compute wi for each weight according to Equation (7). Update each weight wi by adding wi , them repeat the process. The algorithm is given in Figure 6.Because the error surface contains only a single global minimum, this algorithm will converge to a weight vector with minimum error, regardless of whether the training examples are linearly separable, given a sufficiently small  is used. If  is too large, the gradient descent search runs the risk of overstepping the minimum in the error surface rather than settling into it. For this reason, one common modification to the algorithm is to gradually reduce the value of  as the number of gradient descent steps grows.(7)20Figure 6. Gradient Descent algorithm for training a linear unit. (8)(9)21Stochastic Approximation to Gradient DescentThe key difficulties in applying gradient descent are:Converging to a local minimum can sometimes be quite slow (i.e., it can require many thousands of steps).If there are multiple local minima in the error surface, then there is no guarantee that the procedure will find the global minimum.One common variation on gradient descent intended to alleviate these difficulties is called incremental gradient descent (or stochastic gradient descent). The key differences between standard gradient descent and stochastic gradient descent are:In standard gradient descent, the error is summed over all examples before upgrading weights, whereas in stochastic gradient descent weights are updated upon examining each training example. The modified training rule is like the training rule given by Equation (7) except that as we iterate through each example we update the weight according to wi = (t – o) xi (10) where t, o and xi are the target value, unit output, and the ith input.22To modify the gradient descent algorithm in Figure 6 to implement this stochastic approximation, Equation wi wi + wi is simply deleted and Equation wi  wi + (t - o)xi is replaced by wi wi + (t - o)xi.One way to view this stochastic gradient descent is to consider a distinct error function defined for each individual training example d as followswhere td and od are the target value and the unit output value for training example d.We come to the stochastic gradient descent algorithm (Figure. 7)23Summing over multiple examples in standard gradient descent requires more computation per weight update step. On the other hand, because it uses the true gradient, standard gradient descent is often used with a larger step size per weight update than stochastic gradient descent.(11)Figure 7. Stochastic gradient descent algorithm24Stochastic gradient descent (i.e. incremental mode) can sometimes avoid falling into local minima because it uses the various gradient of E rather than overall gradient of E to guide its search.Both stochastic and standard gradient descent methods are commonly used in practice.SummaryPerceptron training rulePerfectly classifies training dataConverge, provided the training examples are linearly separable Delta Rule using gradient descentConverge asymptotically to minimum error hypothesisConverge regardless of whether training data are linearly separable254. MULTILAYER NETWORKS AND THE BACKPROPOGATION ALGORITHMSingle perceptrons can only express linear decision surfaces. In contrast, the kind of multilayer networks learned by the backpropagation algorithm are capaple of expressing a rich variety of nonlinear decision surfaces.This section discusses how to learn such multilayer networks using a gradient descent algorithm similar to that discussed in the previous section.A Differentiable Threshold UnitWhat type of unit as the basis for multilayer networks ? Perceptron : not differentiable -> can’t use gradient descent Linear Unit : multi-layers of linear units -> still produce only linear function Sigmoid Unit : smoothed, differentiable threshold function26Figure 7. The sigmoid threshold unit.(12)27Like the perceptron, the sigmoid unit first computes a linear combination of its inputs, then applies a threshold to the result. In the case of sigmoid unit, however, the threshold output is a continuous function of its input.The sigmoid function (x) is also called the logistic function.Interesting property: Output ranges between 0 and 1, increasing monotonically with its input.We can derive gradient decent rules to train One sigmoid unit Multilayer networks of sigmoid units  Backpropagation28The Backpropagation (BP)Algorithm The BP algorithm learns the weights for a multilayer network, given a network with a fixed set of units and interconnections. It employs a gradient descent to attempt to minimize the squared error between the network output values and the target values for these outputs.Because we are considering networks with multiple output units rather than single units as before, we begin by redefining E to sum the errors over all of the network output unitswhere outputs is the set of output units in the network, and tkd and okd are the target and output values associated with the k-th output unit and training example d.(13)29The Backpropagation Algorithm (cont.)The BP algorithm is presented in Figure 8. The algorithm applies to layered feed-forward networks containing 2 layers of sigmoid units, with units at each layer connected to all units from the preceding layer. This is an incremental gradient descent version of Backpropagation. The notation is as follows: xji denotes the input from node i to unit j, and wji denotes the corresponding weight. n denotes the error term associated with unit n. It plays a role analogous to the quantity (t – o) in our earlier discussion of the delta training rule.30Initialize all weights to small random numbersUntil satisfied, do for each training example, do 1. Input the training example to the network and compute the network outputs 2. For each output k k  ok(1 - ok)(tk – ok) (14) 3. For each hidden unit h 4. Update each network weight wji wji  wji + wji where wji =  jxji (16) (15)The Backpropagation algorithm (Fig. 8)31In the BP algorithm, step1 propagates the input forward through the network. And the steps 2, 3 and 4 propagates the errors backward through the network.The main loop of BP repeatedly iterates over the training examples. For each training example, it applies the ANN to the example, calculates the error of the network output for this example, computes the gradient w. r. t. the error on the example, then updates all weights in the network. This gradient descent step is iterated until ANN performs acceptably well.A variety of termination conditions can be used to halt the procedure. One may choose to halt after a fixed number of iterations through the loop, or once the error on the training examples falls below some threshold, or once the error on a separate validation set of examples meets some criteria.32Figure 8: An example of a multilayer feed-forward neural network. Assume that the learning rate  is 0.9 and the first training example, X = (1,0,1) whose class label is 1.Note: The sigmoid function is applied to hidden layer and output layer.An Example of Back-propagation algorithm33Table 1: Initial input and weight valuesx1 x2 x3 w14 w15 w24 w25 w34 w35 w46 w56 w04 w05 w06 -----------------------------------------------------------------------------------1 0 1 0.2 -0.3 0.4 0.1 -0.5 0.2 -0.3 -0.2 -0.4 0.2 0.1Table 2: The net input and output calculationUnit j Net input Ij Output Oj-----------------------------------------------------------------------------------4 0.2 + 0 -0.5 -0.4 = -0.7 1/(1+e0.7)=0.3325 -0.3 +0+0.2 +0.2 =0.1 1/(1+e0.1)=0.5256 (-0.3)(0.332)-(0.2)(0.525)+0.1 = -0.105 1/(1+e0.105)=0.474Table 3: Calculation of the error at each nodeUnit j j-----------------------------------------------------------------------------6 (0.474)(1-0.474)(1-0.474)=0.13115 (0.525)(1-0.525)(0.1311)(-0.2)=-0.00654 (0.332)(1-0.332)(0.1311)(-0.3)=-0.008734Table 4: Calculation for weight updatingWeight New value------------------------------------------------------------------------------w46 -03+(0.9)(0.1311)(0.332)= -0.261w56 -0.2+(0.9)(0.1311)(0.525)= -0.138w14 0.2 +(0.9)(-0.0087)(1) = 0.192w15 -0.3 +(0.9)(-0.0065)(1) = -0.306w24 0.4+ (0.9)(-0.0087)(0) = 0.4w25 0.1+ (0.9)(-0.0065)(0) = 0.1w34 -0.5+ (0.9)(-0.0087)(1) = -0.508w35 0.2 + (0.9)(-0.0065)(1) = 0.194w06 0.1 + (0.9)(0.1311) = 0.218w05 0.2 + (0.9)(-0.0065)=0.194w04 -0.4 +(0.9)(-0.0087) = -0.408 35Adding MomentumBecause BP is a widely used algorithm, many variations have been developed. The most common is to alter the weight-update rule in Step 4 in the algorithm by making the weight update on the nth iteration depend partially on the update that occurred during the (n -1)-th iteration, as follows:Here wi,j(n) is the weight update performed during the n-th iteration through the main loop of the algorithm. - n-th iteration update depend on (n-1)th iteration- : constant between 0 and 1 is called the momentum.Role of momentum term: - keep the ball rolling through small local minima in the error surface. - Gradually increase the step size of the search in regions where the gradient is unchanging, thereby speeding convergence.(18)36Derivation of the Backpropagation Rule Recall from the equation: Stochastic gradient descent involves iterating through the training examples one at a time.In other words, for each training example d, every wji is updated by adding to it ji:where Ed is the error on training example d, summed over all ouput units.(21)37Notationxji = the ith input to unit jwji = the weight associated with the ith input to unit jnetj = i wjixji (the weighted sum of input for unit j)oj = the output computed by unit jtj = the target output for unit j = the sigmod functionoutputs = the set of units in the final layer of the networkDownstream(j) = the set of units whose immediate inputs include the output of unit j.Now we derive an expression for Ed/ wji in order to implement the stochastic gradient descent rule in Equation (21).38To begin, notice that weight wji can influence the rest of the network through netj. So, we can use the chain rule to write:Now our remaining task is to derive a convenient expression for Ed/ netj.We consider two cases: (1) the case where unit j is an output unit and (2) the case where j is an internal unit.(22)39Case 1: Training rule for output unit weights.Just as wji can influence the rest of the network only through netj, netj can influence the network only through oj. So, we can use the chain rule again to write:To begin, consider the first term in Equation (23)The derivatives in the right hand side will be zero for all output units k except when k = j.(23)40We have:Next consider the second term in Equation (23). Since oj = (netj), the derivative oj/ netj is just the derivative of the sigmod function, which we have already noted is equal to (netj)(1- (netj)). Therefore,(24)(25)41Substituting expressions (24) and (25) into (23), we obtain:And combining this with Equation (21) and (22), we have the stochastic gradient descent rule for output unitsNote this training rule is exactly the weight update rule, implemented by Equation (14) and (15) in the Backpropagation algorithm. Furthermore, we can see that k in Equation(14) is equal to the quantity -  Ed/  netk.(26)(27)42Case 2: Training rule for Hidden Unit WeightsIn the case where j is an hidden unit in the network, the derivation of the training rule for wji must take into account the indirect ways in which wji can influence the network outputs and hence Ed.For this reason, we will find it useful to refer to the set of all units immediately downstream of unit j in the network. We denote this set of units by Downstream(j).Notice that netj can influence the network outputs (and therefore Ed) only through the units in Downstream(j). Therefore, we can write43Rearranging terms and using j to denote - Ed/ netj, we haveandwji =  j xji445. REMARKS ON THE BACKPROPAGATION ALGORITHM Convergence and Local MinimaGradient descent to some local minimum Perhaps not global minimum...Heuristics to alleviate the problem of local minima Add momentum Use stochastic gradient descent rather than true gradient descent. Train multiple nets with different initial weights using the same data.456. NEURAL NETWORK APPLICATION DEVELOPMENT The development process for an ANN application has eight steps. Step 1: (Data collection) The data to be used for the training and testing of ANN are collected. Important considerations are that the particular problem is amenable to ANN solution and that adequate data exist and can be obtained.Step 2: (Training and testing data separation) Trainning data must be identified, and a plan must be made for testing the performance of ANN. The available data are divided into training and testing data sets. For a moderately sized data set, 80% of the data are randomly selected for training, 10% for testing, and 10% secondary testing.Step 3: (Network architecture) A network architecture and a learning method are selected. Important considerations are the exact number of nodes and the number of layers. 46Step 4: (Parameter tuning and weight initialization) There are parameters for tuning ANN to the desired learning performance level. Part of this step is initialization of the network weights and parameters, followed by modification of the parameters as training performance feedback is received. Often, the initial values are important in determining the effectiveness and length of training. Step 5: (Data transformation) Transforms the application data into the type and format required by the ANN. Step 6: (Training) Training is conducted iteratively by presenting input and known output data to the ANN. The ANN computes the outputs and adjusts the weights until the computed outputs are within an acceptable tolerance of the known outputs for the input cases.47Step 7: (Testing) Once the training has been completed, it is necessary to test the network. The testing examines the performance of ANN using the derived weights by measuring the ability of the network to classify the testing data correctly. Black-box testing (comparing test results to historical results) is the primary approach for verifying that inputs produce the appropriate outputs.Step 8: (Implementation) Now a stable set of weights are obtained. Now ANN can reproduce the desired output given inputs like those in the training set. The ANN is ready to use as a stand-alone system or as part of another software system where new input data will be presented to it and its output will be a recommended decision.487. BENEFITS AND LIMITATIONS OF NEURAL NETWORKS 6.1 Benefits of ANNsUsefulness for pattern recognition, classification, generalization, abstraction and interpretation of imcomplete and noisy inputs. (e.g. handwriting recognition, image recognition, voice and speech recognition, weather forecasing).Providing some human characteristics to problem solving that are difficult to simulate using the logical, analytical techniques of expert systems and standard software technologies. (e.g. financial applications).Ability to solve new kinds of problems. ANNs are particularly effective at solving problems whose solutions are difficult to define. This opened up a new range of decision support applications formerly either difficult or impossible to computerize.49Robustness. ANNs tend to be more robust than their conventional counterparts. They have the ability to cope with imcomplete or fuzzy data. ANNs can be very tolerant of faults if properly implemented. Fast processing speed. Because they consist of a large number of massively interconnected processing units, all operating in parallel on the same problem, ANNs can potentially operate at considerable speed (when implemented on parallel processors). Flexibility and ease of maintenaince. ANNs are very flexible in adapting their behavior to new and changing environments. They are also easier to maintain, with some having the ability to learn from experience to improve their own performance. 6.2 Limitations of ANNs ANNs do not produce an explicit model even though new cases can be fed into it and new results obtained. ANNs lack explanation capabilities. Justifications for results is difficults to obtain because the connection weights usually do not have obvious interpretations.50Network parametersThe following parameters of the ANN are chosen for a closer inspection: The number of input units: The number of input units determines the number of periods the ANN “looks into the past” when predicting the future. The number of input units is equivalent to the size of the input window.The number of hidden units: Whereas it has been shown that one hidden layer is sufficient to approximate continuous function, the number of hidden units necessary is “not known in general”. Some examples of ANN architectures that have been used for time series prediction can be 8-8-1, 6-6-1, and 5-5-1.51The learning rate:  (0<  < 1) is a scaling factor that tells the learning algorithm how strong the weights of the connections should be adjusted for a given error. A higher  can be used to speed up the learning process, but if  is too high, the algorithm will skip the optimum weights. (The learning rate  is constant across presentations).The momentum parameter  (0 <  < 1) is another number that affects the gradient descent of the weights: to prevent each connection from following every little change in the solution space immediately, the momentum term is added that keeps the direction of the previous step thus avoiding the descent into local minima. (The momentum term is constant across presentations).528. SOME ANN APPLICATIONSANN application areas:Face detectionFace recognitionObject recognitionHandwritten character/digit recognitionSpeech recognitionImage retrieval Tax form processing to identify tax fraud Enhancing auditing by finding irregularites Bankruptcy prediction53ANN applications (cont.)Customer credit scoring Loan approvals Credit card approval and fraud detection Financial prediction Energy forecasting Computer access security (intrusion detection and classification of attacks) Fraud detection in mobile telecommunication networks54ANN software toolsIn WEKA, the data mining software, there is ANN tool for classification.Spice-Neuro softwareNeural Network Toolbox – MATLABNeural Network in R 55RBF Neural NetworksIntroductionRadial Basis Function NetworksTraining of a RBF networksApplicationsConclusion56IntroductionAmong ANN classes, multi-layer-perceptrons (MLP) and radial basis function networks (RBF) are popular.RBF differs from MLP in training procedure.MLP are trained by supervised techniques: the set of weights are computed by solving an optimization problem.The training of RBF networks can be split into an unsupervised part and a supervised part.Unsupervised part is straightforward and relatively fast.Supervised part (solving a linear problem) is also fast.57Radial Basis Function NetworksThe common configuration of an RBF network consists of 3 layers: the input layer, the hidden layer and the output layer.The input layer is directly connected with the hidden layer (without weights).Only the connections between the hidden layer and the output layer are weighted.The activation functions used in hidden layer are locally restricted functions (not sigmoid function).58Radial Basis Function NetworksFigure 9. Configuration of an RBF network59Radial basis functionsConsider an unknown approximation function: f(x): d  . (x is a d-dimensional vector)In RBF network, f(x) is approximated by a set of d-dimensional radial activation functions.These radial basis functions are centered on well-positioned data points (centroids).Suppose we want to approximate function f(x) with a set of M radial basis functions j(x), centered on the centroid cj and defined as: j: d  : j(x) = j(||x – cj||) (1) where ||.|| denotes Euclidean distance, cj  d and 1 j  M.60The approximation of the function f(x) may be expressed as a linear combinations of the radial basis functions:where j are weights.A typical choice for the radial basis functions is the set of multi-dimensional Gaussian kernel:where j is the width of the j-th hidden unit in the hidden layer.(2)(3)61Gaussian functionFigure 10. Illustration of two Gaussian functions in 1-dimensionA Gaussian RBF monotonically decreases with distance from the center.62Training of a RBF networkOnce the number (M) and the shape of the radial basis function is chosen, the RBF has to be trained properly.Given a training data set T of size NT, T = {(xp, yp)  d, 1 p NT: yp = f(xp)} the training algorithm consists of finding the parameters cj, j and j such that the approximation function fits the unknown function f(x) as close as possible. This is realized by minimizing a cost function.63Error criterionAfter the best-fit function is calculated, the performance of the RBF network is estimated by computing an error criterion.Consider a training data set V, containing NV data points: V = {(xq, yq)  d, 1 q  NV: yq = f(xq)}The error criterion can be chosen as the mean square error: where yq are the desired output.(4)64Training AlgorithmOften, the training algorithm is a three-stage procedure:1. determine the center cj of the Gaussian kernels.2. compute the widths of the Gaussian kernels j .3. compute the weights j .During the first two stages only the inputs xp of the training data set T are used. The parameters are adapted according to an unsupervised updating rules.In the third step the weights are updated w.r.t. the corresponding desired outputs. Meanwhile cj and j remained fixed.65Computing the centroidsSeveral algorithms and heuristics are proposed for computation of the centroids cj.Notice that each unit in the hidden layer of RBF networks corresponds to a specific class of objects.So, we can use clustering to determine the centroids cj for RBF networks. We use k-Means algorithm to cluster the set of all xi in the training set. The parameter k here is also the number of units in the hidden layer.66Clustering to determine the centroidsFigure 11. Splitting two-dimensional objects into clusters67K-Means clustering to find centroids68Finding width factorsThere are two alternative heuristics for finding width factors:1. Fix a constant width for all Gaussian functions2. Use r-nearest-neighbor heuristicThe first heuristic: The widths are fixed as follows:  = dmax/sqrt(2M) where M is the number of centers and dmax is the maximum distance between 2 centers. Note: Such a procedure fixes the degree of overlapping of the Gaussian kernels.69Finding width factors (2nd Heuristic)Moody & Darken (1989) proposed to compute the width factors by the r-nearest-neighbors heuristic:where the ci are the r-nearest-neighbors of centroid cj. A suggested value for r is 2.This method offers the advantage of taking the distribution variations of the data into account.70Training the output weights The output weights j is the result of minimization of the error function (4).They can be found by gradient descent optimization of the error function or Least-Square estimation.There are two training schemes:Two-phase learning: Here the two layers of the RBF networks are trained separately, first RBF centers cj and the kernel widths j are determined, and subsequently the output layer is solved.Three-phase learning: After the initialization of RBF network using two-phase training, the whole architecture is adjusted through a further optimization procedure. The further optimization can be an error-backpropagation algorithm as in MPL neural networks. 71RBF network toolsIn WEKA, the data mining software, there areANN and RBF tools for classification.RBF network in R package.Radial Basis Function network - MATLAB72Application of RBF networksClassificationRegressionTime series Prediction73Conclusions on RBF networksIn RBF network, responses of neurons are “locally tuned” or “selective” for some part of the input space.The parameters of MLP networks are nonlinear while those of RBF networks are linear.RBF networks guarantees convergence to globally optimum parameters (without local minima).RBF networks learn faster than MLP networks.74ReferencesTom M. Mitchell, Machine Learning, The McGraw Hill, 1997.M. Verleysen, K. Hlavackova, Learning in RBF Networks, Proc. of Int. Conf. on Neural Networks (ICNN), Washington, D.C., June 3-6, 1996.F. Schwenker, H. A. Kestler, G. Palm, Three learning phases for radial-basis-function networks, Neural Networks 14(2001) 439-458.N.Benoudjit, C. Archambeau, A. Lendasse, J. Lee, M. Verleysen, Width optimization of the Guassian kernels in RBF Networks, Proc. of European Symposium on Artificial Neural Networks, Bruges, Belgium, 24-26 April 2002, pp. 425-432.J. Han and M. Kamber, Data Mining: Concepts and Techniques, 2nd Edition, Morgan Kaufmann Publishers, 200675

Các file đính kèm theo tài liệu này:

chapter_7_3935.ppt