Posts

Showing posts from November, 2017

Viterbi Training Part 4 - Why Doesn't This Work?

We saw previously that Viterbi training can predict the parameters of a simple hidden Markov model (HMM). The toy example that we examined had two fairly distinct states. The emissions were drawn from two normal distributions with well-separated means. Let's see how the Viterbi training method does when the states are not so different. Here are the parameters that were used to generate the new data. { "states": 2, "transition": [ { "from": 0, "to": 0, "prob": 0.9 }, { "from": 0, "to": 1, "prob": 0.1 }, { "from": 1, "to": 1, "prob": 0.95 }, { "from": 1, "to": 0, "prob": 0.05 } ], "emission": [ { "state": 0, "mean": 0, "std...
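As a rough sketch of the setup, here is one way to sample a sequence from a two-state Gaussian HMM with the transition probabilities shown above. The emission means and standard deviations are cut off in the excerpt, so the overlapping values below are assumptions for illustration only.

```python
import random

# Transition probabilities from the post's parameter JSON; the emission
# (mean, std) pairs are NOT from the post -- they are assumed values
# chosen so the two states overlap heavily.
TRANS = {0: {0: 0.9, 1: 0.1}, 1: {1: 0.95, 0: 0.05}}
EMIT = {0: (0.0, 1.0), 1: (0.5, 1.0)}  # assumed: means only 0.5 apart

def sample_hmm(n, seed=1):
    rng = random.Random(seed)
    s, states, obs = 0, [], []
    for _ in range(n):
        states.append(s)
        mu, sd = EMIT[s]
        obs.append(rng.gauss(mu, sd))          # emission depends only on s
        s = s if rng.random() < TRANS[s][s] else 1 - s  # Markov step
    return states, obs

states, obs = sample_hmm(500)
```

With means this close, the two emission distributions overlap almost completely, which is exactly the regime the post goes on to test.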

Viterbi Training - Part 3 Training

In previous posts (Part 1, Part 2), we saw that we could model a data sequence with an HMM and, using the Viterbi algorithm, find an optimal assignment of state labels for the data elements. The Viterbi algorithm depended on having a set of emission and transition parameters for the hidden states. If all we have is a data sequence, where do we get the parameters? The two most common methods for estimating the state parameters for an HMM are the Baum-Welch (BW) method and Viterbi training (VT). The Baum-Welch algorithm is a form of expectation maximization. In comparing the two methods, BW is typically more accurate but more compute-intensive. Viterbi training offers a trade-off between accuracy and speed: it is often faster and easier to program. VT is an iterative method, and the idea is simple. You start with a "guess" for the emission and transition parameters, then run the Viterbi algorithm to obtain a path, i.e., an assignment of state labels to the data. Given ...
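The iterate-decode-re-estimate loop can be sketched as below. This is a minimal version, not the post's actual code: it assumes Gaussian emissions with a known, shared standard deviation and fixed transition probabilities, and re-estimates only the state means from each decoded path.

```python
import math
import random

def viterbi(obs, means, std, log_trans):
    """Most likely state path for Gaussian emissions (log space)."""
    n = len(means)
    def log_emit(x, s):
        return -0.5 * ((x - means[s]) / std) ** 2
    V = [[log_emit(obs[0], s) for s in range(n)]]
    back = []
    for x in obs[1:]:
        row, ptr = [], []
        for s in range(n):
            p = max(range(n), key=lambda q: V[-1][q] + log_trans[q][s])
            row.append(V[-1][p] + log_trans[p][s] + log_emit(x, s))
            ptr.append(p)
        V.append(row)
        back.append(ptr)
    s = max(range(n), key=lambda k: V[-1][k])
    path = [s]
    for ptr in reversed(back):
        s = ptr[s]
        path.append(s)
    return path[::-1]

def viterbi_train(obs, init_means, std, log_trans, iters=10):
    """Viterbi training: decode a path, re-estimate means, repeat."""
    means = list(init_means)
    for _ in range(iters):
        path = viterbi(obs, means, std, log_trans)
        for s in range(len(means)):
            pts = [x for x, lab in zip(obs, path) if lab == s]
            if pts:
                means[s] = sum(pts) / len(pts)
    return means, path

# toy run: data simulated from well-separated means 0.0 and 5.0
rng = random.Random(0)
s, obs = 0, []
for _ in range(300):
    obs.append(rng.gauss([0.0, 5.0][s], 1.0))
    if rng.random() < 0.1:
        s = 1 - s
log_trans = [[math.log(0.9), math.log(0.1)],
             [math.log(0.1), math.log(0.9)]]
means, path = viterbi_train(obs, [1.0, 4.0], 1.0, log_trans)
```

Starting from the deliberately wrong guess of 1.0 and 4.0, the re-estimated means land close to the true 0.0 and 5.0 after a few iterations, which is the well-separated case where VT works nicely.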

Viterbi Training - Part 2 The Viterbi Algorithm

We saw in the previous post that it is relatively easy to generate a data series from a sequence of states that possess a Markov property. Given a data set that we think might be usefully analyzed using an HMM, what sort of things do we want to know? A couple of things come to mind. Can we determine the state associated with each data point? Can we determine the parameters, transition and emission probabilities, for each state? Why do we want to know these things? Consider finding new unannotated genes as an example. Given a genomic sequence, we want to know which regions are likely to contain genes and which are inter-genic. If we knew the HMM parameters, we could possibly calculate a score for a particular set of state assignments that identifies gene states and non-gene states. If we're clever, maybe we can find a set of state assignments that is optimal in some sense. Let's look at how we might find the state assignments if we happen, through some magic, to know t...
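To make the gene-finding analogy concrete, here is a toy Viterbi decode over a two-state, discrete-emission model, assuming the parameters are known. Every probability below is made up for illustration (a GC-rich "gene" state versus an AT-rich "intergenic" state); none of it comes from the post.

```python
import math

log = math.log
start = [log(0.5), log(0.5)]
trans = [[log(0.9), log(0.1)],   # gene -> gene, gene -> intergenic
         [log(0.2), log(0.8)]]   # intergenic -> gene, intergenic -> intergenic
emit = [{"A": log(0.2), "C": log(0.3), "G": log(0.3), "T": log(0.2)},  # GC-rich
        {"A": log(0.3), "C": log(0.2), "G": log(0.2), "T": log(0.3)}]  # AT-rich

def viterbi(seq):
    # dynamic program: best log score of any path ending in state s
    V = [[start[s] + emit[s][seq[0]] for s in (0, 1)]]
    back = []
    for sym in seq[1:]:
        row, ptr = [], []
        for s in (0, 1):
            p = max((0, 1), key=lambda q: V[-1][q] + trans[q][s])
            row.append(V[-1][p] + trans[p][s] + emit[s][sym])
            ptr.append(p)
        V.append(row)
        back.append(ptr)
    s = max((0, 1), key=lambda k: V[-1][k])
    path = [s]            # trace the best final state back to the start
    for ptr in reversed(back):
        s = ptr[s]
        path.append(s)
    return path[::-1]

path = viterbi("GCGCGCATATAT")
```

The score of a path is the sum of log transition and log emission terms, so maximizing it over all paths is exactly the "optimal in some sense" assignment mentioned above.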

Viterbi Training - Part 1 Generating Data

Hidden Markov models (HMMs) are workhorses in computational biology. An HMM is useful when you believe the data you are observing is being generated by a process that can be in one of several distinct unobserved "states". We can observe the data, but the states are hidden. The states have different characteristics, and the data generated by the system in each state differs in some sense. The use of an HMM implies certain assumptions. Typically, the states are assumed to have a Markov property. That is, you can predict the future of the process based solely on its present state, independent of its history. We can say that the future is conditionally independent of the past, given the present state. A second common assumption is that, given the state, the data elements are independent. That is, the data generated depends only on the state of the system, not the preceding data items. It's the states that have a Markov property, not the data. This second assumption is somet...
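The Markov property is easy to see empirically: simulate a long chain and check that the observed transition frequencies match the transition matrix, regardless of any earlier history. The 0.9/0.2 values below are illustrative choices, not parameters from the post.

```python
import random

# A two-state Markov chain; each step depends only on the current state.
P = [[0.9, 0.1],   # state 0 stays with prob 0.9
     [0.2, 0.8]]   # state 1 stays with prob 0.8

def step(s, rng):
    return s if rng.random() < P[s][s] else 1 - s

rng = random.Random(0)
s, chain = 0, []
for _ in range(20000):
    chain.append(s)
    s = step(s, rng)

# empirical frequency of 0 -> 0 transitions should approach P[0][0]
from_0 = [(a, b) for a, b in zip(chain, chain[1:]) if a == 0]
freq_00 = sum(1 for a, b in from_0 if b == 0) / len(from_0)
```

To turn this chain into HMM data, you would attach an emission distribution to each state, with each observation drawn using only the current state, which is exactly the second assumption above.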

A Brief Tale of Two ggplots

Probably the most fundamental aspect of exploratory data analysis is plotting your data. Just getting a look at the spread or distribution of the data can sometimes provide insight into the processes that generated it. When plotting a data set, my "go-to" tool is R. The traditional R graphics packages are easy to use. Although not fancy, they get the job done, and it's possible to produce publication-quality plots with the basic plotting functions. In fact, to my eye, R's standard plots look better than the default plots provided by something like matplotlib for Python. Lately, I have been generating most of my plots in R using ggplot2. ggplot2 is a plotting package based on the grammar of graphics developed by Leland Wilkinson. ggplot2 is a set of independent components that can be combined in many different ways. Plots are constructed in a layered manner: you start with a description of the data and add descriptions of how you want to ...