Chapter 10 Hidden Markov Models


Assoc. Prof. Dr. Duong Tuan Anh
Faculty of Computer Science and Engineering, HCMC Univ. of Technology
3/2015

Outline
1. Introduction
2. Discrete Markov processes
3. Hidden Markov models
4. Three basic problems of HMMs
5. Evaluation problem
6. Finding the state sequence
7. Learning model parameters

1. Introduction

Hidden Markov models (HMMs) are important in pattern recognition because they are well suited to classifying patterns in which each pattern is made up of a sequence of sub-patterns and these sub-patterns are dependent.

HMMs can be used to characterize classes of patterns, where each pattern is viewed as a sequence of states. The actual states may be hidden, with only probabilistic variations of them being observed.

HMMs are popular in speech recognition. There are several other important applications, including recognition of protein and DNA sequences/subsequences in bioinformatics.

2. Discrete Markov processes

Consider a system that at any time is in one of a set of N distinct states: S1, S2, ..., SN. The state at time t is denoted qt, t = 1, 2, ..., so, for example, qt = Si means that at time t the system is in state Si.

At regularly spaced discrete times, the system moves to a state with a given probability, depending on the values of the previous states:

    P(qt+1 = Sj | qt = Si, qt-1 = Sk, ...)

In the special case of a first-order Markov model, the state at time t+1 depends only on the state at time t, regardless of the states at earlier times:

    P(qt+1 = Sj | qt = Si, qt-1 = Sk, ...) = P(qt+1 = Sj | qt = Si)    (10.1)

We further simplify the model by assuming that these transition probabilities are independent of time (stationary):

    aij ≡ P(qt+1 = Sj | qt = Si)    (10.2)

satisfying

    aij ≥ 0  and  Σj=1..N aij = 1    (10.3)

So going from Si to Sj has the same probability no matter when it happens, or where it happens in the observation sequence. A = [aij] is an N × N matrix whose rows sum to 1. This can be seen as a stochastic automaton (see Fig. 10.1).
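A stochastic automaton of this kind can be simulated directly. A minimal sketch in Python; the values of π and A below are illustrative assumptions, not read from Fig. 10.1:

```python
import random

# Illustrative parameters for a three-state chain (assumed values):
# pi[i] = P(q1 = S_{i+1}), A[i][j] = a_{ij}; each row of A sums to 1.
pi = [0.5, 0.2, 0.3]
A = [[0.4, 0.3, 0.3],
     [0.2, 0.6, 0.2],
     [0.1, 0.1, 0.8]]

def sample_chain(pi, A, T, rng=None):
    """Sample a state sequence q_1 ... q_T from a stationary first-order chain."""
    rng = rng or random.Random(0)
    N = len(pi)
    seq = [rng.choices(range(N), weights=pi)[0]]   # draw q_1 from pi
    for _ in range(T - 1):
        seq.append(rng.choices(range(N), weights=A[seq[-1]])[0])  # one transition per step
    return seq

seq = sample_chain(pi, A, T=10)
```

Because the chain is stationary, the same row of A is used at every time step, which is exactly the time-independence assumption above.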
From each state Si, the system moves to state Sj with probability aij, and this probability is the same for any time t.

Fig. 10.1 Example of a three-state Markov model as a stochastic automaton. πi is the probability that the system starts in state Si, and aij is the probability that the system moves from state Si to state Sj.

Initial probabilities

The only special case is the first state. We define the initial probability πi, which is the probability that the first state in the sequence is Si:

    πi ≡ P(q1 = Si)    (10.4)

satisfying

    Σi=1..N πi = 1    (10.5)

π = [πi] is a vector of N elements that sum to 1.

Observable Markov model

In an observable Markov model, the states are observable. At any time t we know qt, and as the system moves from one state to another, we get an observation sequence that is a sequence of states. We have an observation sequence O that is the state sequence O = Q = {q1 q2 ... qT}, whose probability is given as

    P(O = Q | A, π) = P(q1) Πt=2..T P(qt | qt-1) = πq1 aq1q2 aq2q3 ... aqT-1qT    (10.6)

πq1 is the probability that the first state is q1, aq1q2 is the probability of going from q1 to q2, and so on. We multiply these probabilities to get the probability of the whole sequence.

Example

Assume we have N urns where each urn contains balls of only one color. So there is an urn of red balls, another of blue balls, and so forth. Somebody draws balls from the urns one by one and shows us their color. Let qt denote the color of the ball drawn at time t. Let us say we have three states:

    S1: red, S2: blue, S3: green

with initial probabilities π = [0.5, 0.2, 0.3]T. aij is the probability of drawing from urn j (a ball of color j) after drawing a ball of color i from urn i. The transition matrix A gives these probabilities; in this example a11 = 0.4, a13 = 0.3, and a33 = 0.8, the entries used below.

Example (cont.)

Given π and A, it is easy to generate K random sequences, each of length T. How do we calculate the probability of a sequence? Assume that the first four balls are "red, red, green, green." This corresponds to the observation sequence O = {S1, S1, S3, S3}.
Its probability is

    P(O | A, π) = P(S1) P(S1|S1) P(S3|S1) P(S3|S3)
                = π1 · a11 · a13 · a33
                = 0.5 × 0.4 × 0.3 × 0.8 = 0.048    (10.7)

How to learn the parameters π and A

Given K sequences of length T, where qt^k is the state at time t of sequence k, the initial probability estimate is the number of sequences starting with Si divided by the number of sequences:

    π̂i = Σk 1(q1^k = Si) / K    (10.8)

where 1(b) is 1 if b is true and 0 otherwise. As for the transition probabilities, the estimate for aij is the number of transitions from Si to Sj divided by the total number of transitions from Si, over all sequences:

    âij = Σk Σt=1..T-1 1(qt^k = Si and qt+1^k = Sj) / Σk Σt=1..T-1 1(qt^k = Si)    (10.9)

3. Hidden Markov models

In a hidden Markov model (HMM), the states are not observable, but when we visit a state, an observation is recorded that is a probabilistic function of the state. We assume a discrete observation in each state, drawn from the set {v1, v2, ..., vM}:

    bj(m) ≡ P(Ot = vm | qt = Sj)

bj(m) is the observation probability that we observe vm, m = 1, ..., M, in state Sj. Again, we assume that these probabilities do not depend on t (homogeneous model). The values thus observed constitute the observation sequence O. The state sequence Q is not observed, but it should be inferred from the observation sequence O.

HMM (cont.)

Note: there are many different state sequences Q that could have generated the same observation sequence O, but with different probabilities. Given a sample, we are interested in the state sequence having the highest likelihood of generating the sample.

Note: in the case of an HMM, there are two sources of randomness: in addition to randomly moving from one state to another, the observation in a state is also random.

Example

The hidden case corresponds to the urn-and-ball example where each urn contains balls of different colors. Let bj(m) denote the probability of drawing a ball of color m from urn j. We again observe a sequence of ball colors, but without knowing the sequence of urns from which the balls were drawn.
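The computation in (10.7) and the count-based estimates (10.8)-(10.9) are easy to check in code. A sketch, assuming 0-indexed states; only a11 = 0.4, a13 = 0.3, and a33 = 0.8 are pinned by the example, the remaining entries of A are illustrative:

```python
def sequence_prob(seq, pi, A):
    """Probability of an observable state sequence, as in eq. (10.6)."""
    p = pi[seq[0]]
    for prev, cur in zip(seq, seq[1:]):
        p *= A[prev][cur]
    return p

pi = [0.5, 0.2, 0.3]
A = [[0.4, 0.3, 0.3],   # a11 = 0.4, a13 = 0.3 (from the example)
     [0.2, 0.6, 0.2],   # illustrative
     [0.1, 0.1, 0.8]]   # a33 = 0.8 (from the example)

# O = {S1, S1, S3, S3}: pi_1 * a11 * a13 * a33 = 0.048, eq. (10.7)
p = sequence_prob([0, 0, 2, 2], pi, A)

def estimate_pi_A(seqs, N):
    """Count-based maximum-likelihood estimates, eqs. (10.8)-(10.9)."""
    pi_hat = [sum(s[0] == i for s in seqs) / len(seqs) for i in range(N)]
    counts = [[0] * N for _ in range(N)]
    for s in seqs:
        for prev, cur in zip(s, s[1:]):
            counts[prev][cur] += 1
    A_hat = []
    for row in counts:
        total = sum(row)                       # total transitions out of this state
        A_hat.append([c / total if total else 0.0 for c in row])
    return pi_hat, A_hat
```

`estimate_pi_A` is exactly the counting described above: sequences starting in Si for π̂, and transition counts normalized per source state for Â.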
So it is as if the urns are now placed behind a curtain, and somebody picks a ball at random from one of the urns and shows us only the ball, without showing us the urn from which it was picked. The ball is returned to the urn to keep the probabilities the same. The number of ball colors may be different from the number of urns. For example, let us have three urns and the observation sequence

    O = {red, red, green, blue, yellow}

Example (cont.)

In the observable case, knowing the observation (ball color), we knew the state (urn) exactly, because there were separate urns for separate colors and each urn contained balls of only one color. The observable model is a special case of the hidden model in which M = N and bj(m) = 1 if j = m and 0 otherwise. But in the case of a hidden model, a ball could have been picked from any urn. In this case, for the same observation sequence O, there may be many possible state sequences Q that could have generated O.

Figure 10.2 An HMM unfolded in time as a lattice showing all the possible trajectories. One path, shown in thicker lines, is the actual (unknown) state trajectory that generated the observation sequence.

Summary

To summarize and formalize, an HMM has the following elements:
1. N: the number of states in the model, S = {S1, S2, ..., SN}
2. M: the number of distinct observation symbols in the alphabet, V = {v1, v2, ..., vM}
3. State transition probabilities: A = [aij], where aij ≡ P(qt+1 = Sj | qt = Si)
4. Observation probabilities: B = [bj(m)], where bj(m) ≡ P(Ot = vm | qt = Sj)
5. Initial state probabilities: π = [πi], where πi ≡ P(q1 = Si)

N and M are implicitly defined in the other parameters, so λ = (A, B, π) is taken as the parameter set of an HMM.
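Packaging these five elements, λ = (A, B, π) is enough to generate data. A sketch; only the structure comes from the text, all numeric values are illustrative assumptions:

```python
import random

# lambda = (A, B, pi) for N = 3 states and M = 3 observation symbols.
# All numeric values are illustrative assumptions.
pi = [0.5, 0.2, 0.3]
A = [[0.4, 0.3, 0.3], [0.2, 0.6, 0.2], [0.1, 0.1, 0.8]]
B = [[0.6, 0.2, 0.2],   # B[j][m] = b_j(m) = P(O_t = v_m | q_t = S_j)
     [0.1, 0.7, 0.2],
     [0.3, 0.3, 0.4]]

def generate(pi, A, B, T, rng=None):
    """Generate one observation sequence O_1 ... O_T from lambda = (A, B, pi)."""
    rng = rng or random.Random(0)
    N, M = len(pi), len(B[0])
    obs = []
    q = rng.choices(range(N), weights=pi)[0]                # draw q_1 from pi
    for _ in range(T):
        obs.append(rng.choices(range(M), weights=B[q])[0])  # emit with b_q(.)
        q = rng.choices(range(N), weights=A[q])[0]          # move with a_q.
    return obs

O = generate(pi, A, B, T=8)
```

Only `obs` is visible to an observer; the state path `q` traverses stays hidden, which is the two-sources-of-randomness point made above.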
Given , the model can be used to generate an arbitrary number of observation sequences of arbitrary length, but as usual, we are interested in the other direction, that estimating the parameters of the model given a training set of parameters.184 Three basic problems of HMMsGiven a number of sequences of observations, we are interested in three problems:1. Given a model , we would like to evaluate the probability of any given observation sequence, O = {O1 O2OT}, namely, P(O|).2. Given a model  and an observation sequence O, we would like to find out the state sequence Q = {q1q2qT}, which has the highest probability of generating O, namely, we want to find Q* that maximizes P(Q|O,).3. Given a training set of observation sequences, X = {Ok},k =1,,K, we would like to learn the model that maximizes the probability of generation X, namely, we want to find  that maximizes P(X |).The first problem is called evaluation problem, the second is finding state sequence and the third is learning model.195. Evaluation problemGiven an observation sequence O = {O1 O2OT} and a state sequence Q = {q1q2qT}, the probability of observing Q given the state sequence Q is simplywhich we cannot calculate because we do not know the state sequence. The probability of the state sequence Q isThen the joint probability is20Forward procedureWe can compute P(O|) by summing up over all possible Q:However, this is not practical since there are NT possible Q, assuming that all the probabilities are nonzero. Fortunately, there is an efficient procedure to calculate P(O|), which is called the forward procedure. 
It is based on the idea of dividing the observation sequence into two parts: the first starting from time 1 until time t, and the second from time t+1 until T.

We define the forward variable αt(i) as the probability of observing the partial sequence {O1 O2 ... Ot} up to time t and being in Si at time t, given the model λ:

    αt(i) ≡ P(O1 ... Ot, qt = Si | λ)

Forward procedure (cont.)

The nice thing about this is that it can be calculated recursively, accumulating results on the way:

Initialization:
    α1(i) ≡ P(O1, q1 = Si | λ) = P(O1 | q1 = Si, λ) P(q1 = Si | λ) = πi bi(O1)

Recursion:
    αt+1(j) ≡ P(O1 ... Ot+1, qt+1 = Sj | λ)
            = P(O1 ... Ot, qt+1 = Sj | λ) P(Ot+1 | qt+1 = Sj, λ)
            = [Σi=1..N αt(i) aij] bj(Ot+1)

Figure 10.3 (a) Computation of αt(j); (b) computation of βt(i)

αt(i) explains the first t observations and ends in state Si. We multiply this by the probability aij to move to state Sj, and because there are N possible previous states, we sum over all such possible previous states Si. bj(Ot+1) is the probability that we generate the (t+1)st observation while in state Sj at time t+1.

When we have calculated the forward variables, it is easy to calculate the probability of the observation sequence:

    P(O | λ) = Σi=1..N αT(i)

The backward procedure

αT(i) is the probability of generating the full observation sequence and ending up in state Si.
We need to sum over all such possible final states. Computing the αt(i) is O(N²T), and this solves our evaluation problem in a reasonable amount of time.

We do not need it yet, but let us similarly define the backward variable βt(i), which is the probability of observing the partial sequence Ot+1 ... OT given that the system is in Si at time t:

    βt(i) ≡ P(Ot+1 ... OT | qt = Si, λ)

This can again be calculated recursively, this time going in the backward direction:

The backward procedure (cont.)

Initialization (arbitrarily to 1):
    βT(i) = 1

Recursion (see Figure 10.3b):
    βt(i) = Σj=1..N aij bj(Ot+1) βt+1(j)

When in state Si, we can go to N possible next states Sj, each with probability aij. While there, we generate the (t+1)st observation, and βt+1(j) explains all the observations after time t+1, continuing from there.

One word of caution about implementation is necessary here: both the αt and βt values are calculated by multiplying small probabilities, and with long sequences we risk underflow. To avoid this, at each time step we normalize αt(i) by dividing it by ct = Σj αt(j). We also normalize βt(i) by dividing it by the same ct.
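The scaled forward pass just described can be sketched as follows; it accumulates log P(O | λ) as the sum of log ct, avoiding underflow (a sketch, with 0-indexed states and symbols, and illustrative toy parameters):

```python
import math

def forward_scaled(obs, pi, A, B):
    """Forward procedure with per-step normalization by c_t = sum_j alpha_t(j).
    Returns the normalized alphas, the scale factors, and log P(O | lambda)."""
    N, T = len(pi), len(obs)
    alphas, cs = [], []
    alpha = [pi[i] * B[i][obs[0]] for i in range(N)]        # initialization
    for t in range(T):
        c = sum(alpha)                                      # scale factor c_t
        alpha = [a / c for a in alpha]                      # normalized alpha_t
        alphas.append(alpha)
        cs.append(c)
        if t + 1 < T:                                       # recursion step
            alpha = [B[j][obs[t + 1]] * sum(alpha[i] * A[i][j] for i in range(N))
                     for j in range(N)]
    return alphas, cs, sum(math.log(c) for c in cs)         # log P(O | lambda)

# Toy two-state model (illustrative values)
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.5, 0.5], [0.1, 0.9]]
alphas, cs, log_prob = forward_scaled([0, 1, 0], pi, A, B)
```

Since P(O | λ) = Π ct, exponentiating `log_prob` recovers the same value the unscaled forward pass (or a brute-force sum over all N^T paths) would give; for this toy model that value is 0.069616.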
6. Finding the state sequence

Now we move on to the second problem: finding the state sequence Q = {q1 q2 ... qT} having the highest probability of generating the observation sequence O = {O1 O2 ... OT}, given the model λ.

Let us define γt(i) as the probability of being in state Si at time t, given O and λ, which can be computed as

    γt(i) ≡ P(qt = Si | O, λ) = P(O | qt = Si, λ) P(qt = Si | λ) / P(O | λ)

[Since P(O, qt = Si | λ) = P(qt = Si | O, λ) P(O | λ) = P(O | qt = Si, λ) P(qt = Si | λ), dividing both sides by P(O | λ) gives the expression above.]

In terms of the forward and backward variables,

    γt(i) = αt(i) βt(i) / Σj=1..N αt(j) βt(j)

Finding the state sequence (cont.)

Here we see how nicely αt(i) and βt(i) split the sequence between them: the forward variable αt(i) explains the starting part of the sequence until time t, ending in Si, and the backward variable βt(i) takes it from there and explains the ending part until time T. The numerator αt(i) βt(i) explains the whole sequence given that at time t the system is in state Si.

The Viterbi algorithm

We need to normalize by dividing this by the sum over all possible intermediate states that can be traversed at time t, which guarantees that Σi γt(i) = 1. To find the state sequence, for each time step t we could choose the state that has the highest probability:

    qt* = arg maxi γt(i)

But this may choose Si and Sj as the most probable states at times t and t+1 even when aij = 0. To find the single best state sequence (path), we use the Viterbi algorithm, which is based on dynamic programming and takes such transition probabilities into account.

Given the state sequence Q = q1 q2 ... qT and the observation sequence O = O1 ... OT, we define δt(i) as the probability of the highest-probability path at time t that accounts for the first t observations and ends in Si:

    δt(i) ≡ max{q1 q2 ... qt-1} P(q1 q2 ... qt-1, qt = Si, O1 ... Ot | λ)

The Viterbi algorithm (cont.)

We can calculate δt+1(i) recursively, and the optimal path can be read by backtracking from T, choosing the most probable state at each instant. The algorithm is as follows:
1. Initialization:
       δ1(i) = πi bi(O1)
       ψ1(i) = 0
   [ψt(j) keeps track of the state that maximizes δt(j) at time t-1, i.e., the best previous state.]
2. Recursion:
       δt(j) = maxi [δt-1(i) aij] bj(Ot)
       ψt(j) = arg maxi δt-1(i) aij
3. Termination:
       p* = maxi δT(i)
       qT* = arg maxi δT(i)

The Viterbi algorithm (cont.)

4. Path (state sequence) backtracking:
       qt* = ψt+1(q*t+1),  t = T-1, T-2, ..., 1

Using the lattice structure of Figure 10.2, ψt(j) records the best previous state for each δt(j). The values δt(i) are stored in an N × T dynamic programming matrix. The Viterbi algorithm has the same complexity as the forward procedure: instead of the sum, we take the maximum at each step. Like the forward procedure, it is far more efficient than enumerating all N^T possible paths.

7. Learning model parameters

The approach to this problem is maximum likelihood: we would like to calculate the λ* that maximizes the likelihood of the set of training sequences X = {Ok}, k = 1, ..., K, namely P(X | λ).

We define ξt(i, j) as the probability of being in Si at time t and in Sj at time t+1, given the whole observation sequence O and λ:

    ξt(i, j) ≡ P(qt = Si, qt+1 = Sj | O, λ)

which can be computed as (see Figure 10.4)

    ξt(i, j) = αt(i) aij bj(Ot+1) βt+1(j) / Σk Σl αt(k) akl bl(Ot+1) βt+1(l)

Figure 10.4 Computation of the arc probabilities ξt(i, j)

[Writing a for the event (qt = Si, qt+1 = Sj): since P(a, O | λ) = P(O | a, λ) P(a | λ) and also P(a, O | λ) = P(a | O, λ) P(O | λ), we get P(a | O, λ) = P(O | a, λ) P(a | λ) / P(O | λ).]
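Stepping back to the Viterbi procedure for a moment, steps 1-4 can be sketched in code (0-indexed states; in practice the products would be replaced by sums of log-probabilities to avoid underflow; the toy parameters are illustrative):

```python
def viterbi(obs, pi, A, B):
    """Single best state path for obs under lambda = (A, B, pi), steps 1-4 above."""
    N, T = len(pi), len(obs)
    delta = [[0.0] * N for _ in range(T)]
    psi = [[0] * N for _ in range(T)]
    for i in range(N):                                    # 1. initialization
        delta[0][i] = pi[i] * B[i][obs[0]]
    for t in range(1, T):                                 # 2. recursion
        for j in range(N):
            best = max(range(N), key=lambda i: delta[t - 1][i] * A[i][j])
            psi[t][j] = best                              # best previous state
            delta[t][j] = delta[t - 1][best] * A[best][j] * B[j][obs[t]]
    p_star = max(delta[T - 1])                            # 3. termination
    path = [max(range(N), key=lambda i: delta[T - 1][i])]
    for t in range(T - 1, 0, -1):                         # 4. backtracking
        path.append(psi[t][path[-1]])
    path.reverse()
    return path, p_star

# Toy two-state model (illustrative values)
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.5, 0.5], [0.1, 0.9]]
path, p_star = viterbi([0, 1, 0], pi, A, B)
```

The structure mirrors the forward procedure exactly, with max in place of the sum, so the cost is again O(N²T).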
P(O| ) P(a|O, ) = P(O|a, ).P(a| ) /P(O| )34t(i) explains the first t observations and ends in state Si at time t.35We move on to state Sj with probability aij, generate the (t +1)st observation, and continue from Sj at time t +1 to generate the rest of the observation sequence.We normalize by dividing for all such possible pairs that can be visited at time t and t +1.If we want, we can also calculate the probability of being in state Si at time t by summing over the arc possibilities for all possible next states.In unsupervised clustering using EM, not knowing the class labels, we estimated them first (in the E-step) and calculated the parameters with these estimates (in the M-step). Similarly here we have the Baum-Welch algorithm, which is an EM procedure. At each iteration, first in E-step, we compute t(i, j) and t(i). These two steps are alternated until convergence during which, it has been shown that P(O | ) never decreases.36These are 0/1 in the case of observable Markov model and are hidden random variables in the case of an HMM. In the latter case, we estimate them in the E-step as E[zit ] = t(i). E[zijt ] = t(i, j) In the M-step, we calculate the parameters given these estimated values. The estimated number of transitions from Si to Sj is j t(i, j) and the total number of transitions from Si is j t(j). 
The ratio of these two gives us the estimated probability of a transition from Si to Sj at any time:

    âij = Σt=1..T-1 ξt(i, j) / Σt=1..T-1 γt(i)

Note that this is the same as equation 10.9, except that the actual counts are replaced by estimated soft counts.

The probability of observing vm in Sj is the expected number of times vm is observed when the system is in Sj, divided by the total number of times the system is in Sj:

    b̂j(m) = Σt 1(Ot = vm) γt(j) / Σt γt(j)

When there are multiple observation sequences X = {Ok}, k = 1, ..., K, which we assume to be independent, the parameters are averages over all observations in all sequences.

Example

Consider two printed characters, each of size 3 × 3, as follows:

    Printed characters 1 and 7
    --------------------------
    0 0 1        1 1 1
    0 0 1        0 0 1
    0 0 1        0 0 1

In the 3 × 3 pattern corresponding to digit 1, the first row is "001"; let us label this state S1. So the states corresponding to digit 1 are S1, S1, and S1. Similarly, labeling "111" as S2, the sequence of states for digit 7 is S2, S1, S1. Here, each digit has three rows and each state is a row.

Consider two noisy versions of digit 1:

    0 0 1        1 0 1
    0 0 1        0 0 1
    1 0 1        0 0 1

We update the model with a richer probabilistic structure to deal with noisy digits, viewing the observations as noisy instantiations of the state sequences. In such a model the states may not be observable; they are hidden. So we use a hidden Markov model in this case.

Now we consider the example of digit recognition. The training examples of digit "1" and of digit "7" are given in Figure 10.5. Here, we consider a noisy variant of a state as a possible observation. For example, "101" may be viewed as a noisy version of the state S1 (i.e., "001").
In this simple case, we permit at most one bit to be affected by noise.

Figure 10.5 (a) Training data for digit "1", including noisy versions (12 examples); (b) training data for digit "7", including noisy versions

Learning the HMM

In the example, we have two classes, corresponding to the digits "1" and "7", with 12 training patterns for each class. Digit "1" is characterized by the state sequence S1, S1, S1 and digit "7" by the sequence S2, S1, S1. There are only two states: S1 = "001" and S2 = "111". Each row of each of the 12 training patterns is one observation symbol.

We learn the parameters as follows, starting with the patterns of digit "1". Treating each row of a pattern as an observation symbol gives a total of 36 observation symbols. There are two states, "001" and "111", and two clusters corresponding to these two states: cluster 1 is represented by "001" (S1) and cluster 2 by "111" (S2).

We cluster the 36 observations into the two clusters based on the distance between the cluster representative and the observation considered. We use the Hamming distance, defined as the number of mismatching bits.

Out of the 36 observations (rows), 27 are "001"; they are assigned to cluster 1 because they are closer to S1 ("001") than to S2 ("111"). Three observations are "000"; these are assigned to cluster 1, as the distance between "000" and S1 is 1 whereas its distance to S2 is 2. So 30 observations are assigned to cluster 1. The remaining 6 observations could be assigned to either cluster 1 or cluster 2, since S1 and S2 are equally distant (distance 1) from these observations.
From here on, we assume that they are assigned to cluster 2. We calculate the various HMM parameters as follows.

(a) Initial state probability:

    π1 = 10/12 = 5/6  and  π2 = 2/12 = 1/6

(b) State transition probabilities:

    a11 = 16/20 = 4/5  and  a12 = 4/20 = 1/5
    a21 = 4/4 = 1      and  a22 = 0/4 = 0

There are different ways of estimating bj(k), the probability of generating observation symbol k when the state is j. Here we compute the probabilities based on the distance of the observation symbol from each of the states (clusters). The specific algorithm is as follows.

Step 1: Compute the distance between the observation vector k and each of the N states, using the Hamming distance. Let the distances be d1, d2, ..., dN, and let dmax be the maximum possible distance. Then the non-normalized value is

    bj(k) = 1 - dj/dmax,   j = 1, 2, ..., N
The normalized b values are: b1(111) = 1/4 and b2(111) = ¾ b1(001) = 3/4 and b2(001) = 1/4 b1(011) = 1/2 and b2(011) = 1/2 b1(101) = 1/2 and b2(101) = ½ b1(110) = 0 and b2(110) = 1 b1(000) = 1 and b2(000) = 0Note: bi(k) depends on the observation symbol k and the state i; it does not depend on the digit here. So, we get the same value of bi(k) for either class 1 (digit 1) or class 2 (digit 2).47Figure 10.6 Parameters of the HMMs of two classes: digit 1 and digit 7Figure 10.7: Test patterns a) test1 b) test2 c) test348Classification Using HMMConsider the test pattern, test1 (Fig. 10.7) for classification. We compute the probability for 1 to generate test1 and for 7 to generate test1; we assign test1 to that class which has a higher probability of generating it.More specifically, we have: P(test1|) = 1  b1(001)  a11  b1(001) a11  b1(001).We plug in the values of various parameters from the HMM for digit 1 (1) given in Figure 10.6 to compute P(test1|1). This is given by P(test1|1) = (5/6)*(3/4)*(4/5)*(3/4)*(4/5)*(3/4)=0.225Similarly, we can compute P(test1|7) by plugging in the values of the parameters of HMM for digit 7 from Figure 10.6 and it is given by P(test1|7) = 0*(3/4)*(4/5)*(3/4)*(4/5)*(3/4) = 0Here, P(test1| 1) is greater than P(test1| 7). So, test1 is classified as digit 1 as it has a higher probability of being generated by 1 than that of 7.49Now, consider the pattern, test2 (Figure 10.7) for classification. The first row of the pattern is “001” which is assigned to cluster 1 represented by S1; the second row “001” is assigned to cluster 1 represented by S1; and the third row “101” is assigned to cluster 2 represented by S2. We calculate the probabilities for test2 to have generated by the two HMMs as follows: P(test2|) = 1  b1(001)  a11  b1(001)  a12  b2(101).We plug in the values of various parameters from the HMM for digit 1 (1) given in Figure 10.6 to compute P(test2| 1). 
This is given by P(test2|1) = (5/6)*(3/4)*(4/5)*(3/4)*(1/5)*(1/2)=0.0375Similarly, we can compute P(test2| 7) by plugging in the values of the parameters of HMM for digit 7 from Figure 10.6 and it is given by P(test2|7) = 0*(3/4)*(4/5)*(3/4)*(1/5)*(1/2) = 0Here, again P(test2|1) is greater than P(test2|7). So, test2 is classified as digit 150Now, consider the pattern, test3 (Figure 10.7) for classification. The first row “101” which is assigned to cluster 2; the second row “001” is assigned to cluster 1; and the third row “001” is assigned to cluster 2.So, the corresponding probabilities are computed as follows: P(test3|) = 2  b2(101)  a21  b1(001)  a11  b1(001).Correspondingly, for 1 we have P(test3|1) = (1/6)*(1/2)*1*(3/4)*(4/5)*(3/4) = 0.0375And for 7 we have P(test3|7) = 1*(1/2)*(6/7)*(3/4)*(4/5)*(3/4) = 0.193In this case, P(test3|7) is greater than P(test3|1) . So we classify test3 as digit 7.Note that the first row “101” is equidistant from S1 and S2 and we have resolved ties in favor of S2 in case of a tie. 51HMM software tools HMM tool box in MATLAB.HMM tool in R software.HTK tool kit (Cambrige University Engineering Department) S.J. Young Alpaydin, 2004, Introduction to Machine Learning, The MIT Press.M. N. Murty and V. S. Devi, 2011, Pattern Recognition – An Algorithmic Approach, Springer.53
