Viterbi Training - Part 2: The Viterbi Algorithm

We saw in the previous post that it is relatively easy to generate a data series from a sequence of states that possesses the Markov property. Given a data set that we think might be usefully analyzed using an HMM, what sort of things do we want to know? A couple of things come to mind. Can we determine the state associated with each data point? Can we determine the parameters (the transition and emission probabilities) for each state?

Why do we want to know these things? Consider finding new, unannotated genes as an example. Given a genomic sequence, we want to know which regions are likely to contain genes and which are inter-genic. If we knew the HMM parameters, we could calculate a score for a particular set of state assignments that identifies gene states and non-gene states. If we're clever, maybe we can find a set of state assignments that is optimal in some sense.

Let's look at how we might find the state assignments if we happen, through some magic, to know the parameters of the states. An assignment of state labels to the data sequence is called a path. The first problem that we encounter is that there are a lot of paths. In our data from the previous post, there were 1000 data items, each of which could be assigned one of two labels. That means that there are \[{2^{1000}} \approx 1.07 \times {10^{301}}\] possible paths. That's a really big number. One thing that will not work is to assign paths at random and hope to stumble on one that is good in some sense; most will be bad.

What do we mean by a good path? Depending on the situation, we might want the "best" path or maybe the most typical path. How do we assign values to various paths? When we generated our toy data, we sampled emissions from a probability distribution that depended on the state at the current point in the sequence and sampled the next state from the transition distribution given the current state. This notion of the underlying process emitting data and choosing states according to some known probability distribution gives us a way to assign a value to a path. We can calculate the probability of a path from the states and the emission and transition probabilities. Each path has a probability, and the probabilities of all possible paths sum to one.

A transition probability is the probability of moving from one state at a given position in the sequence to a state at the next position. One problem with reading about HMMs is that authors use different notations for quantities like transition and emission probabilities. We are going to slip into the notation used in Durbin et al. to save typing, but I'll try to relate the notation to our probability model.

The transition matrix entry ${a_{kl}}$ is the probability of transitioning from state k at position i-1 to state l (el) at position i, thus ${a_{kl}} = P({\pi _i} = l|{\pi _{i - 1}} = k)$.

${\pi _{i - 1}} = k$ means that the state label at position i-1 is k.

The probability of emitting a value, x, from a state, k, is ${e_k}(x) = P({x_i} = x|{\pi _i} = k)$.

The probability of starting in a particular state, k, is ${a_{0k}} = P({\pi _1} = k)$.

In our case, the transition and emission parameters do not vary with position in the sequence. Our HMM is stationary.
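
To make the notation concrete, here is one way these quantities might be laid out in Python for a two-state model with normal emissions, like our toy data. The numbers are purely illustrative (the actual parameters used to generate the toy data live in data.json), and the names a0, a, and em are my own.

import numpy as np

# a0[k] = P(pi_1 = k): probability of jumping from the start state into state k
a0 = np.array([0.5, 0.5])

# a[k, l] = P(pi_i = l | pi_{i-1} = k): transition probabilities, each row sums to 1
a = np.array([[0.9, 0.1],
              [0.1, 0.9]])

# em[k] = (mean, standard deviation) of the normal emission distribution for state k,
# so e_k(x) is the normal density with these parameters evaluated at x
em = np.array([[0.0, 1.0],
               [3.0, 1.0]])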

With these definitions, we can get the probability of a particular sequence of states together with the data it emits. Note that the state at position i depends only on the state at position i-1, not on all of the preceding states. The probability of the path and the data is given by the following, after a couple of applications of Bayes rule.

\[P(x,\pi ) = {a_{0,{\pi _1}}}{e_{{\pi _1}}}({x_1})\prod\limits_{i = 2}^L {{a_{{\pi _{i - 1}},{\pi _i}}}{e_{{\pi _i}}}({x_i})} \]
Notice that we're indexing the sequence starting at 1. Position 0 is a start state. We start in state 0 and jump into some state at the first position in our sequence.

We can also calculate a couple of related quantities: the probability of a particular path and the probability of the data given a path.

\[\begin{gathered}
  P(\pi ) = P({\pi _1})\prod\limits_{i = 2}^L {P({\pi _i}|{\pi _{i - 1}}) = {a_{0,{\pi _1}}}} \prod\limits_{i = 2}^L {{a_{{\pi _{i - 1}},{\pi _i}}}}  \hfill \\
  P(x|\pi ) = \prod\limits_{i = 1}^L {P({x_i}|{\pi _i}) = \prod\limits_{i = 1}^L {{e_{{\pi _i}}}({x_i})} }  \hfill \\
\end{gathered} \]
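
As a sanity check on these formulas, here is a small sketch that evaluates them in log space (logs avoid underflow when many probabilities are multiplied together). It assumes the illustrative a0, a, and em arrays from the sketch above, and the helper name log_path_prob is my own.

import numpy as np
from scipy.stats import norm

def log_path_prob(x, path, a0, a, em):
    """Return log P(pi), log P(x|pi), and log P(x, pi) for one particular path.
    x and path are numpy arrays of the same length; path holds state indices."""
    log_p_path = np.log(a0[path[0]]) + np.sum(np.log(a[path[:-1], path[1:]]))
    log_p_data = np.sum(norm.logpdf(x, em[path, 0], em[path, 1]))
    return log_p_path, log_p_data, log_p_path + log_p_data

# a short made-up data series and one of its 2**4 possible paths
x = np.array([0.2, 2.9, 3.4, -0.5])
path = np.array([0, 1, 1, 0])
print(log_path_prob(x, path, a0, a, em))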

What does all this semi-scary math do for us? We would like to find the most probable path. In mathematical terms, we would like to find a path that maximizes $P(x,\pi )$. That is, we want to find

\[{\pi ^*} = \mathop {\arg \max }\limits_\pi  P(x,\pi )\]

${\pi ^*}$ represents the optimal assignment of state labels for the data, in the sense of having the highest probability.

As we have seen, there are way too many paths to try them all.
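
For a very short sequence, though, we can afford to enumerate every path and pick the best one, which can serve as a sanity check on the Viterbi code later in this post. Here is a minimal sketch, reusing the log_path_prob helper and the illustrative parameters from above (the function name is my own):

from itertools import product
import numpy as np

def brute_force_best_path(x, a0, a, em):
    """Enumerate all num_states**len(x) paths and return the best one by log P(x, pi).
    Only feasible when the sequence is very short."""
    num_states = a.shape[0]
    best_logp, best_path = -np.inf, None
    for labels in product(range(num_states), repeat=len(x)):
        path = np.array(labels)
        logp = log_path_prob(x, path, a0, a, em)[2]
        if logp > best_logp:
            best_logp, best_path = logp, path
    return best_logp, best_path

print(brute_force_best_path(x, a0, a, em))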

Dynamic Programming to the Rescue


Dynamic programming is a method for solving a problem by breaking it down into a collection of simpler subproblems and storing their solutions. The idea is that the solution to a given optimization problem can be obtained by combining optimal solutions to its sub-problems. Dynamic programming gives us a way to avoid having to calculate all of those paths.

We'll take a look at a particular dynamic programming algorithm called the Viterbi algorithm. The Viterbi algorithm is named after Andrew Viterbi, one of the founders of Qualcomm. However, it was invented independently several times in different fields.

Suppose that we magically knew the probability of the most probable path ending in state k at position i, taking into account the observations up to that point. Call this quantity

\[{v_k}(i) = \mathop {\max }\limits_{{\pi _1}, \ldots ,{\pi _{i - 1}}} P({x_1}, \ldots ,{x_i},{\pi _1}, \ldots ,{\pi _{i - 1}},{\pi _i} = k)\]

We can then calculate the maximum probability of being in any given state at position i+1 from the transition and emission probabilities. To get the maximum at the next step, we maximize over the states at position i:

\[{v_l}(i + 1) = {e_l}({x_{i + 1}})\mathop {\max }\limits_k ({v_k}(i){a_{kl}})\]

This is an example of induction. We need a base case, i.e. a way to get started, and then we can calculate the maximum probability path all the way to the end of the sequence. At each position we have to fill in the Viterbi array, v, for each state, and each entry requires a maximization over all of the states at the previous position, so the number of operations is proportional to (length of sequence) x (number of states) squared, while the array itself takes (number of states) x (length of sequence) memory. Once we have the Viterbi array, we can obtain the state assignments of the maximum probability path by backtracking through the array.

Here is the full algorithm as presented in Durbin et al.
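
In outline, the standard statement of the algorithm looks roughly like this (the end transition ${a_{k0}}$ can be taken to be 1 when, as in our model, there is no explicit end state):

\[\begin{gathered}
  \text{Initialization }(i = 0):\quad {v_0}(0) = 1,\quad {v_k}(0) = 0\text{ for }k > 0 \hfill \\
  \text{Recursion }(i = 1 \ldots L):\quad {v_l}(i) = {e_l}({x_i})\mathop {\max }\limits_k ({v_k}(i - 1){a_{kl}}),\quad {\text{ptr}_i}(l) = \mathop {\arg \max }\limits_k ({v_k}(i - 1){a_{kl}}) \hfill \\
  \text{Termination:}\quad P(x,{\pi ^*}) = \mathop {\max }\limits_k ({v_k}(L){a_{k0}}),\quad \pi _L^* = \mathop {\arg \max }\limits_k ({v_k}(L){a_{k0}}) \hfill \\
  \text{Traceback }(i = L \ldots 2):\quad \pi _{i - 1}^* = {\text{ptr}_i}(\pi _i^*) \hfill \\
\end{gathered} \]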


The numbers involved in calculating v can get very small, so in practice it's a good idea to use logs rather than multiplying directly. Here's some code to calculate the Viterbi path.


import numpy as np
from scipy.stats import norm


def log_emit(x, mu, sd):
    # log of the normal emission density e_k(x)
    return norm.logpdf(x, mu, sd)


def LogViterbi(data, em, tr):
    """
    LogViterbi - calculate log P(max_path, x) and the maximum probability path.
    arguments: data - a numpy array of floats or doubles
               em - an n x 2 numpy array of emission parameters. In each row
                    the first element is the mean of a normal distribution and the second
                    is the standard deviation.
               tr - transition matrix.
    requires: Each row of tr must sum to 1. em must be n x 2, tr must be n x n,
              where n is the number of states. This routine assumes each state
              generates data drawn from a normal distribution.
    returns: P - log P(max_path, x)
             path - the state labels (1-based) for each position in data
    """
    def safe_log(d):
        # element-wise log that maps zero probabilities to -inf instead of raising warnings
        l = -1 * np.ones(d.shape) * np.inf
        for i in range(d.shape[0]):
            for j in range(d.shape[1]):
                if d[i, j] > 0:
                    l[i, j] = np.log(d[i, j])

        return l

    num_states = tr.shape[0]

    # pad a dummy value at position 0 so the observations are 1-based,
    # matching the indexing used in the text
    x = np.zeros(data.shape[0] + 1, dtype=float)
    x[1:] = data
    L = x.shape[0]

    # v[k, i] holds the log of the best path probability ending in state k at position i
    v = -1 * np.ones((num_states, L)) * np.inf
    v[0, 0] = 0        # start in state 0 with log probability 0 (probability 1)
    ptr = np.zeros((num_states, L), dtype=int)

    t = safe_log(tr)

    # recursion: fill in the Viterbi array column by column
    for i in range(1, L):
        for l in range(num_states):
            m = v[:, i-1] + t[:, l]
            v[l, i] = log_emit(x[i], em[l, 0], em[l, 1]) + m.max()
            ptr[l, i] = m.argmax()

    # termination: weight each final state by the transition into the last state
    # and take the best one
    m = v[:, L-1] + t[:, num_states-1]
    P = m.max()

    # traceback: follow the stored pointers from the best final state back to the start
    states = np.zeros(L, dtype=int)
    states[L-1] = m.argmax()

    for i in range(L-1, 0, -1):
        states[i-1] = ptr[states[i], i]

    # drop the dummy start position and convert to 1-based state labels
    return P, [states[1:] + 1]
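
As a quick illustration of how the routine might be called, here is a made-up example using the same illustrative two-state parameters sketched earlier (these are not the values in data.json):

if __name__ == '__main__':
    tr = np.array([[0.9, 0.1],
                   [0.1, 0.9]])           # illustrative transition matrix
    em = np.array([[0.0, 1.0],
                   [3.0, 1.0]])           # illustrative (mean, sd) emission parameters
    data = np.array([0.2, 2.9, 3.4, -0.5, 0.1])

    P, path = LogViterbi(data, em, tr)
    print("log P(max_path, x) =", P)
    print("state labels:", path[0])       # 1-based labels, one per data point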


You can download the full code and data from the link at the end of this post.

Let's see how our routine did on the toy data generated by the last post's code. We'll use the transition and emission parameters used to generate the data to test the program. The parameters can be found in the file data.json. Remember that the path found depends on these parameters; use different parameters, and you will find a different path. We don't expect the results of our predictions to be perfect. Even though the means of the emission parameters are distinct, any particular data value can be generated by either state.

Using pandas we can read the "official" states and the Viterbi-generated states and compare them. We see that we missed 13 out of 1000 state predictions. Not too bad.

In [32]: import pandas as pd

In [33]: states = pd.read_csv('../data/hmm.states', header=None)

In [34]: vt_states = pd.read_csv('../output/viterbi.out', header=None)

In [35]: # get the sum of the differences

In [36]: states.subtract(vt_states, fill_value=None).transpose().abs().sum()
Out[36]: 
0    13
dtype: int64


We can plot the states and our predictions.

The light blue points (hard to see) are the ones where the algorithm's prediction did not match the state used to generate the data.
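
If you want to recreate a plot like this yourself, a minimal matplotlib sketch (not the code used for the figure above) might look like the following. It assumes, as the pandas session suggests, that hmm.states and viterbi.out each hold a single row of state labels.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# each file holds a single row of state labels
states = pd.read_csv('../data/hmm.states', header=None).values.ravel()
vt_states = pd.read_csv('../output/viterbi.out', header=None).values.ravel()

pos = np.arange(len(states))
miss = states != vt_states

plt.step(pos, states, where='mid', label='generating states')
plt.step(pos, vt_states, where='mid', alpha=0.5, label='Viterbi states')
plt.scatter(pos[miss], states[miss], color='lightblue', zorder=3, label='mismatches')
plt.xlabel('position')
plt.ylabel('state')
plt.legend()
plt.show()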

Download the full Viterbi program and data.
