David Bowie - Changes
Suppose you have some data that looks like this.
It might be a time series or some other form of sequential data. It looks like there might be several different regions in the data. The range of counts seems to change somewhere around 25 and maybe again around 75 or 80. Assuming there is some underlying process generating the data, we would like to know where the parameters that control the process change and what the parameter values are in each region.
This is toy data generated by this function:
"""
generateSeq - generate a sequence of Poisson distributed data with changes in mean at
specified points
arguments:
length - the total sequence length
cps - change point location
lambdas - mean of data for each segment
returns;
seq - a numpy array of values at each position in the sequence
Note:
This function does no error checking. len(lambdas) must = len(cps) +1
"""
def generateSeq(length, cps, lambdas):
seq = np.zeros(length)
cps2 = [0] + cps + [length-1] # add a fictitious cp at beginning and end of sequence
for i in range(len(cps2)-1):
seq[cps2[i]:(cps2[i+1]+1)] = np.random.poisson(lambdas[i], size = cps2[i+1] - cps2[i] + 1)
return seq
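For reference, the data plotted at the top could have been produced by a call along these lines. The change point positions and segment means are the ones quoted later in the post; the sequence length, random seed, and plotting details are my own guesses, so your plot will differ in detail.

import matplotlib.pyplot as plt

# change points at 24 and 74, segment means 5, 14.26, and 8.75; length 100 is assumed
seq = generateSeq(100, [24, 74], [5, 14.26, 8.75])

plt.bar(range(len(seq)), seq)
plt.xlabel("position")
plt.ylabel("counts")
plt.show()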
This problem is called change point detection. The grim classic example of this type of problem is determining when the rate of coal mining disasters changed during the period from 1851 to 1962. The idea behind change point detection is to assume that the data is drawn from some underlying probability distribution, maybe normal, Poisson, or some other probability distribution. The underlying distribution has some parameters. At some points, the parameter values change. Think of a machine spitting out values. Every once in a while someone changes the settings on the machine.
Since the y axis is labeled "counts", a natural choice for the underlying distribution would be Poisson. The Poisson distribution expresses the probability of a given number of events occurring in a fixed interval if these events occur with a known rate and independently of the time since the last event. In fact, since this is toy data, we know the values were generated by simulating draws from a Poisson distribution.
The Poisson probability mass function is
\[P(k;\lambda ) = \frac{{{\lambda ^k}{e^{ - \lambda }}}}{{k!}},\quad k = 0,1,2...\]
$\lambda$ is a parameter representing the average number of events per interval. k takes on integer values starting at zero. $P(k;\lambda )$ represents the probability of seeing k events in some fixed interval.
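As a quick illustration of the formula (not part of the original post), here it is evaluated directly in Python:

from math import exp, factorial

def poisson_pmf(k, lam):
    """P(k; lambda), the probability of seeing k events in an interval with mean lam."""
    return lam**k * exp(-lam) / factorial(k)

# with lambda = 5, counts of 4 and 5 are the most likely
print(poisson_pmf(4, 5), poisson_pmf(5, 5))   # both roughly 0.175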
With the data shown above, the goal will be to determine where, if anywhere, the value of $\lambda$ changes and what the $\lambda$ values are in each region. To accomplish this, I'll try a Bayesian approach similar to the one outlined by Liu and Lawrence. Liu and Lawrence applied their method to DNA sequences, but it works with other types of discrete data as well. You can see its use in the context of Gibbs sampling for DNA motif finding here.
If we knew how many different values of $\lambda$ were involved, we could use a hidden Markov model and Baum-Welch or Viterbi training to find the regions and their parameter values. The advantage of the Bayesian approach is that we don't need to know in advance how many different regions there are in the data. It allows us to infer the number of change points, where they occur, and the values of $\lambda$ in the regions. In addition, it gives us an idea of the uncertainty in these values.
We have some observed data, ${\mathbf{X}} = ({x_1},{x_2}...{x_n})$; some missing data, ${\mathbf{C}} = ({c_1},{c_2}...{c_k})$, representing the change point positions; some unknown parameters, ${\mathbf{\lambda }} = ({\lambda _0},{\lambda _1}...{\lambda _k})$, representing the means of the data in each segment; and $\kappa$, the number of change points. For k change points, the data is broken into k+1 segments.
Ideally we would like to sample change points from the joint probability distribution of all of these quantities, $P({\mathbf{X}},{\mathbf{C}},{\mathbf{\lambda }},\kappa )$, but that's hard. Instead, we use a combination of dynamic programming and Gibbs sampling to estimate where the change points occur and the values of $\lambda$ in each segment.
Expanding the joint probability by Bayes rule, we get:
\[P({\mathbf{X}},{\mathbf{C}},{\mathbf{\lambda }},\kappa ) = P({\mathbf{X}}|{\mathbf{C}},{\mathbf{\lambda }})P({\mathbf{C}}|\kappa )P(\kappa )P({\mathbf{\lambda }})\]
Let's make a few simplifying assumptions. For starters, we'll assume that k can only take on values $k = 0,1...{k_{\max }}$. That is, we'll choose a value for the maximum number of change points.
Consider a single segment ${{\mathbf{X}}_{[i:j]}} = ({x_i},{x_{i + 1}}...{x_{j-1}})$. Since each element is independent of the preceding elements, if we knew the value of $\lambda$, we could calculate the probability of the data in a segment by multiplying the individual probabilities.
\[P({{\mathbf{X}}_{[i:j]}}|\lambda ) = \frac{1}{{\prod\limits_i {{x_i}!} }}{e^{ - n\lambda }}{\lambda ^{\sum\limits_i {{x_i}} }}\]
However, we don't know $\lambda$. Instead, we marginalize over all possible values of $\lambda$ to get the probability of the data.
\[P({{\mathbf{X}}_{[i:j]}}) = \int {P(} {{\mathbf{X}}_{[i:j]}},\lambda )d\lambda = \int {P({{\mathbf{X}}_{[i:j]}}|{\mathbf{\lambda }})P({\mathbf{\lambda }})d{\mathbf{\lambda }}} \]
${P({{\mathbf{X}}_{[i:j]}}|{\mathbf{\lambda }})}$ is the Poisson probability mass function shown above. ${P({\mathbf{\lambda }})}$ is called the prior for $\lambda$. There are a lot of possible prior distributions for $\lambda$. In Bayesian analysis, it's common to choose a prior, called a conjugate prior, that has a form that makes it possible to evaluate the integral analytically. For a Poisson distribution, the conjugate prior is a Gamma distribution.
\[P(x;\alpha ,\beta ) = \frac{{{\beta ^\alpha }}}{{\Gamma (\alpha )}}{x^{\alpha - 1}}{e^{ - \beta x}}\]
for $x > 0$. $\alpha > 0$ is called the shape and $\beta > 0$ is called the rate. $\Gamma$ is the Gamma function, not the Gamma distribution. Notice that the form of this distribution looks similar to the Poisson distribution.
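If you want to sample from or plot this prior with scipy (not something the post does explicitly), note that scipy.stats.gamma is parameterized by shape and scale, where scale is 1/rate:

from scipy import stats

alpha, beta = 0.01, 0.01                        # a nearly flat prior, as discussed below
prior = stats.gamma(a=alpha, scale=1.0/beta)    # scipy uses shape a and scale = 1/rate
print(prior.mean())                             # alpha / beta = 1.0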
With this prior, we can calculate the probability of the data in the segment.
\[P({X_{[i:j]}}) = \frac{1}{{\prod\limits_i {{x_i}!} }}\frac{{\Gamma (\alpha + \sum\limits_i {{x_i})} }}{{\Gamma (\alpha )}}{\left( {\frac{\beta }{{\beta + n}}} \right)^\alpha }{\left( {\frac{1}{{\beta + n}}} \right)^{\sum\limits_i {{x_i}} }}\]
n is the number of elements in the segment.
This looks messy, but it's tractable. It does add two constants $\alpha$ and $\beta$. We will assume that each observation $x_i$ is independent of the others and the segments are independent of one another. We also will assume that the Gamma prior is the same for each possible segment, so we only need to assign values for one pair of $\alpha$ and $\beta$.
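The code excerpt further down calls a helper named poisson_dist for this segment probability. Its definition isn't shown in the post, but a sketch of what it might compute, working in log space for numerical stability and using names of my own choosing, is:

from scipy.special import gammaln
import numpy as np

def segment_prob(x, a, b):
    """Marginal probability of the counts x in one segment, with a Gamma(a, b) prior on lambda."""
    x = np.asarray(x)
    n = len(x)
    s = x.sum()
    logp = (gammaln(a + s) - gammaln(a)          # Gamma(alpha + sum x) / Gamma(alpha)
            - np.sum(gammaln(x + 1))             # 1 / prod(x_i!)
            + a * (np.log(b) - np.log(b + n))    # (beta / (beta + n))^alpha
            - s * np.log(b + n))                 # (1 / (beta + n))^(sum x)
    return np.exp(logp)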
Another useful property arises from using the Gamma distribution. The conditional distribution
\[P(\lambda |{{\mathbf{X}}_{[i:j]}}) = Gamma(\alpha + \sum\limits_i {{x_i}} ,\beta + n)\]
is a Gamma distribution with shape $\alpha + \sum\limits_i {{x_i}}$ and rate $\beta + n$.
The estimate of $\lambda$ for a segment is then given by
\[\bar \lambda = \frac{{\alpha + \sum\limits_i {{x_i}} }}{{\beta + n}}\]
If we use small values for $\alpha$ and $\beta$, the data in the segment will "do the talking", not the prior. That is, we will have what is called an uninformative prior.
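For instance, with a nearly flat prior the posterior mean is essentially the sample mean of the segment (the numbers here are just illustrative, not from the post's data):

x = [4, 6, 5, 7, 3, 5, 6, 4, 5, 6]       # a hypothetical segment of counts, sample mean 5.1
a, b = 0.01, 0.01                        # small alpha and beta -> uninformative prior
lam_hat = (a + sum(x)) / (b + len(x))    # posterior mean of lambda
print(lam_hat)                           # about 5.10, essentially the sample mean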
Suppose a change point is at position v in the sequence. We can calculate the marginal probability of a change point occurring at that point as
\[P({c_k} = v|{\mathbf{X}}) = \frac{1}{{P({\mathbf{X}})}}\sum\limits_\kappa {\sum\limits_k {P({{\mathbf{X}}_{[:v]}}|k)P({{\mathbf{X}}_{[v:]}}|{\mathbf{\kappa }} - k)} } \]
We'll see that we can use this expression to sample change point positions. Suppose the last of the k change points in ${{\mathbf{X}}_{[:j]}}$ occurs at location v; then the data up to v contains k-1 change points and the final segment from v to j contains none. The conditional probability of the data up to some point j can therefore be found by summing over all possible values of v.
\[P({{\mathbf{X}}_{[:j]}}|k) = \sum\limits_v {P({{\mathbf{X}}_{[:v]}}|k - 1)P({{\mathbf{X}}_{[v:j]}}|k = 0)} \]
To put this in more familiar programming notation:
S = np.zeros((kmax, len(x), len(x)))

# calculate P(x[i:j], k=0) for all possible subsequences
for i in range(len(x)):
    for j in range(i, len(x)):
        S[0, i, j] = poisson_dist(x[i:j+1], a, b)

# calculate P(x[:j]|k) for k > 0 for all possible ending positions j,
# summing over the position v of the last change point
for k in range(1, kmax):
    for j in range(len(x)):
        s = 0
        for v in range(j):
            s += S[k-1, 0, v] * S[0, v+1, j]
        S[k, 0, j] = s
Before we can sample, we need to make some assumptions about the number of change points. We will assume all numbers of change points from 0 to $k_{\max }$ are equally likely, $P(k) = \frac{1}{{{k_{\max }}}}$. We will also assume that all segmentations with exactly k change points are equally likely. That is, there are no preferred positions for change points given that there are k change points.
\[P(C|k) = \frac{1}{\binom{N}{k}}\]
N is the length of the sequence and $\binom{N}{k}$ is N choose k, the number of ways k things can be chosen from N items.
With these assumptions in place, we can get the conditional probability of the number of change points.
\[P(k|{\mathbf{X}}) \propto \sum\limits_c {P(} {\mathbf{X}}|{\mathbf{C}})P({\mathbf{C}}|k)P(k)\]
# get P(k|x): assume all values of k are equally likely and all segmentations
# with k change points are equally likely
Pkdata = np.zeros(kmax)
for k in range(kmax):
    Pkdata[k] = (S[k, 0, len(x)-1] / kmax) / sm.comb(len(x), k)
Finally, after all that math, here's how we proceed: we sample a value of k, the number of change points. Next, we sample a position for a change point. We update L, which keeps track of the end of the current segment, and repeat until k change points have been selected. This process is repeated 1000 times, and we keep track of the number of times each value of k and each change point position is selected.
for i in range(samples):
    # sample the number of change points from P(k|x)
    k = int(np.random.choice(kmax, size=1, replace=True, p=normalize(Pkdata)))
    k_samples[k] += 1
    L = len(seq)                         # current segment ends at index L-1
    if k == 0:                           # only one segment - the whole sequence
        pos_samples[L-1] += 1
    else:                                # sample change points backwards from the end of the sequence
        for l in range(int(k-1), -1, -1):
            probs = np.array([S[l, 0, v-1] * S[0, v, L-1] for v in range(1, L)])
            v = int(np.random.choice(len(probs), size=1, replace=True, p=normalize(probs))) + 1  # probs[0] corresponds to v = 1
            lambdas[v:L] += (a + sum(seq[v:L])) / (b + len(seq[v:L]))               # posterior estimate of lambda
            stdevs[v:L] += np.sqrt((a + sum(seq[v:L])) / ((b + len(seq[v:L]))**2))  # posterior standard deviation
            L = v                        # the segment before this change point ends at index v-1
            pos_samples[v] += 1
    lambdas[0:L] += (a + sum(seq[0:L])) / (b + len(seq[0:L]))                       # first segment
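Since the loop accumulates the per-sample estimates with +=, they presumably get averaged over the number of samples afterwards, something along these lines:

# average the accumulated per-sample estimates before plotting
lambdas /= samples
stdevs /= samples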
For the data shown at the top, the method is pretty sure there are two change points.
The samples of the change point locations show some uncertainty.
The original data had change points at positions 24 and 74. The change point at 24 is confidently sampled, but there is some uncertainty about the second. That's not so surprising given what we see in the plot of the data.
For each segment, we calculate the posterior estimate of $\lambda$ and average over the total samples. The grey areas represent $\pm $ 2 standard deviations from the mean value of $\lambda$.
In the generated data, the $\lambda$ values for the segments were 5, 14.26, and 8.75. As you can see, the algorithm found reasonable values, with some uncertainty at the ends of the segments.
You can download the code from here.