Viterbi Training - Part 3: Training

In previous posts (Part 1, Part 2), we saw that we could model a data sequence with an HMM, and using the Viterbi algorithm, find an optimal assignment of state labels for the data elements. The Viterbi algorithm depended on having a set of emission and transition parameters for the hidden states. If all we have is a data sequence, where do we get the parameters?

The two most common methods for estimating the state parameters of an HMM are the Baum-Welch (BW) method and Viterbi training (VT). The Baum-Welch algorithm is a form of expectation maximization. BW is typically more accurate but more compute intensive; VT trades a little accuracy for speed and is usually faster and easier to program.

VT is an iterative method. The idea is simple. You start with a "guess" for the emission and transition parameters and run the Viterbi algorithm to obtain a path, i.e. an assignment of state labels to the data. Given the state assignments, new transition probabilities can be calculated by counting the state changes from position to position, and new emission parameters can be obtained from the data assigned to each of the predicted states. For the example from the previous post, we would need to calculate the mean and standard deviation of the data in each state. Once you have new parameters, re-run the Viterbi algorithm to get new state assignments. Repeat until the changes in the parameters are below some tolerance or the maximum number of iterations is exhausted.
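To make the re-estimation step concrete, here is a minimal sketch of a single update, using a made-up state path and data values purely for illustration (the actual training code appears later in the post):

import numpy as np

# A toy state path (1-based labels, as produced by the Viterbi algorithm)
# and the corresponding data values.
states = np.array([1, 1, 2, 2, 2, 1])
data   = np.array([0.1, -0.3, 2.8, 3.1, 2.9, 0.2])
num_states = 2

# New transition probabilities: count the state changes from position to
# position, then normalize each row so it sums to 1.
tr = np.zeros([num_states, num_states])
for t in range(len(states) - 1):
    tr[states[t] - 1][states[t + 1] - 1] += 1
tr = tr / tr.sum(axis=1)[:, np.newaxis]

# New emission parameters: mean and standard deviation of the data
# assigned to each state.
em = np.zeros([num_states, 2])
for k in range(num_states):
    em[k][0] = np.mean(data[states == k + 1])
    em[k][1] = np.std(data[states == k + 1])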

It's possible that VT will converge to a suboptimal set of parameters, so it is typically run several times with different starting parameters, accepting the solution that produces the maximum $P(\pi ,x)$.

As with most iterative methods that require a starting point, the better your initial guess, the more likely the algorithm is to converge quickly.

Here's one way to get a reasonable starting point from the data. To get starting estimates of the emission parameters, build a histogram of the data (it doesn't need to be plotted). Emission means are sampled in proportion to the data distribution given by the histogram. Standard deviations are sampled uniformly between 0 and the standard deviation of the whole data array.

The transition matrix can be initialized to an identity matrix. Off-diagonal elements are sampled uniformly between 0 and some small number less than 1, and the matrix is then row-normalized. This creates a "sticky" transition matrix where the HMM is more likely to stay in the current state than to jump to another state.

It turns out that with our toy data from the previous post, the algorithm is largely insensitive to the initial parameter values. In situations where the differences among states are more subtle, good starting values are important.

"""
InitialParams - get initial emission and transition parameters for an HMM with Normal emission states
                The routine creates a histogram, not plotted, of the data. Emission means are
                sampled in proportion to the data distribution from the histogram.
                Standard deviations are sampled uniformly from 0 to the  standard deviation of
                the whole data array.
                The transition matrix is initialized to an identity matrix.
                Off diagonal elements of the transition matrix are sampled uniformly between 0
                and shift. The transition matrix is normalized.
arguments: data - a 1d numpy data array of floats or doubles
           num_states - the  number of hidden states
returns: tr - a normaized numpy array of transition parameters
         em - a num_states x 2 numpy array. Element 0 in a row is the normal mean for the state
              and element 1 is the standard deviation.
"""
def InitialParams(data, num_states, shift = 0.1):
    h, b = np.histogram(data)
    
    em = np.zeros([num_states, 2])
    s  = np.random.choice(b[1:], size=num_states, replace=False, p=h/sum(h))
    for i in range(num_states):
        em[i][0] = s[i]
        em[i][1] = np.random.uniform(low=0, high=np.std(data))
    
    tr = np.identity(num_states)
    for i in range(num_states):
        for j in range(num_states):
            if i != j:
                tr[i][j] = np.random.uniform(low=0, high=shift)
    row_sums = tr.sum(axis=1)
    tr = tr / row_sums[:, np.newaxis]

    print(tr)
    print(em)

    return tr, em
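As a quick sanity check, you can call InitialParams on some synthetic two-component data. The data below is generated on the spot for illustration; it is not the toy data from the previous post:

np.random.seed(0)
# Two Normal components, roughly mimicking a two-state HMM's emissions.
data = np.concatenate([np.random.normal(0.0, 1.0, 500),
                       np.random.normal(3.0, 1.0, 500)])

tr0, em0 = InitialParams(data, num_states=2)
# tr0 is a row-normalized 2 x 2 "sticky" transition matrix;
# em0 holds an initial (mean, standard deviation) guess for each state.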


To do the training, we estimate initial parameters, run the Viterbi algorithm, and re-estimate the parameters as described above. The training loop stops when the norm of the change in the emission and transition parameters falls below a fixed tolerance or the iteration limit is reached. The whole process is re-run ten times, keeping the parameters that maximize the joint probability of the data and states.

Here is the code. You can download the entire program and data from the link at the end of the post.

"""
VtTrain - perform Viterbi training.
arguments: data - a 1d numpy data array of floats or doubles
           num_states - the  number of hidden states
           max_iter - the maximum allowable iterations
returns: em - the estimated emissions parameters
         tr - the estimated transition parameters
         iteration - the number of iterations executes
         P - the final P(max_states, data)
         states - a numpy array of state assignments
"""
def VtTrain(data, num_states, max_iter=100):
    pseudo = 1        # pseudocount so no transition count is ever zero
    tol = 1.0e-5

    tr, em = InitialParams(data, num_states)

    for i in range(max_iter):
        iteration = i
        # LogViterbi (the log-space Viterbi implementation) is in the full program;
        # see the download link at the end of the post.
        P, states = LogViterbi(data, em, tr)
        print(P)

        em_prev = np.array(em, copy=True)
        tr_prev = np.array(tr, copy=True)
        states_prev = np.array(states, copy=True)

        # Re-estimate the transition matrix by counting state changes along the path.
        states = states[0]
        tr = np.ones([num_states, num_states]) * pseudo
        for t in range(len(states)-1):
            tr[states[t]-1][states[t+1]-1] += 1
        row_sums = tr.sum(axis=1)
        tr = tr / row_sums[:, np.newaxis]

        # Re-estimate the emission parameters from the data assigned to each state.
        for k in range(num_states):
            if any(states == k+1):
                em[k][0] = np.mean(data[states == k+1])
                em[k][1] = np.std(data[states == k+1])
            else:
                # No data points were assigned to this state; use a small placeholder.
                em[k][0] = 1.0e-05
                em[k][1] = 1.0e-05
            if math.isnan(em[k][1]):
                em[k][1] = 1.0e-05

        # Stop when the parameters have essentially stopped changing.
        if np.linalg.norm(em - em_prev) <= tol and \
           np.linalg.norm(tr - tr_prev) <= tol:
            break

    return em, tr, iteration, P, states
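The ten-restart loop described above is part of the downloadable program and isn't shown here. A minimal sketch of what it might look like, assuming data is the 1d numpy array of observations and using the functions above:

# Run Viterbi training several times from random starting points and keep
# the parameters that give the largest joint probability of states and data.
best_P = -np.inf
best = None
for run in range(10):
    em, tr, iteration, P, states = VtTrain(data, num_states=2)
    if P > best_P:
        best_P = P
        best = (em, tr, states)

em, tr, states = best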


How well does it work on toy data? Here are the predicted transition parameters.


           State 1       State 2
State 1    0.90671642    0.09328358
State 2    0.03265306    0.96734694

The emission parameters. 

           Mean           Standard Deviation
State 1    -0.07301107    0.88281407
State 2    2.98924518     0.99649979

As you would expect, these parameters don't exactly match the parameters used to generate the original data. We can't expect an exact match from a finite data sample, especially since any value could have been generated by either state. However, the estimates are close. If we were being rigorous, we would use separate training and testing data to help avoid overfitting.

How well does the method do at predicting the states? We can compare the predicted states to the known states used to generate the data.

In [41]: import pandas as pd

In [42]: states = pd.read_csv('../data/hmm.states', header=None)

In [43]: vt_train_states = pd.read_csv('../output/hmm_train.states', header=None)

In [44]: states.subtract(vt_train_states, fill_value=None).transpose().abs().sum()
Out[44]: 
0    12
dtype: int64


We misassigned 12 states out of 1000. Not too bad.


The results match those from the R library HiddenMarkov using Baum-Welch.

Download the code and data.

