A Knotty Problem

A pseudoknot is an RNA secondary structure formed by pairing between a loop and a region located outside of the stem flanking the loop. Pseudoknots fold into knot-shaped 3D conformation.

The image below shows a simple pseudoknot structure found in human telomerase.

Pseudoknot
https://en.wikipedia.org/wiki/Pseudoknot

Pseudoknots are involved in a number of biological functions including gene expression and viral replication. Pseudoknots are known to exist in ribosomal RNA, viral RNA and a number of ribosomes. Pseudobase++ is probably the largest database of experimentally determined or predicted RNA sequences containing pseudoknots.

Computational methods for predicting RNA secondary structures without pseudoknots have been around for quite some time. Among the more popular are Mfold, Sfold, and RNAstructure. A common method for predicting secondary structure is to use a thermodynamic energy model to find a structure with the minimum free energy, the MFE structure. A dynamic program is used to optimize the free energy. Alternatively, rather than finding a single optimal structure, structures can be sampled from the dynamic program. Since RNA energies are assumed to follow a Boltzmann distribution, structures can be sampled from the dynamic program in proportion to their probability of assuming a particular shape. By generating a large number of samples, you can obtain a representative distribution of the ensemble of RNA secondary structures. See this paper for some details.

The thermodynamic dynamic programs for predicting pseudoknot free secondary structures have runtime complexity of \(O({n^3})\) and memory use of \(O({n^2})\), where n is the length of the RNA sequence.

Pseudoknots

Predicting structures with pseudoknots ups the computational ante considerably. Pseudoknots are formed when two base pairs i,j and d,e  with i
RNA Secondary Structure - no pseudoknot
RNA Secondary Structure with a pseudoknot
There are a variety of algorithms for calculating pseudoknot structures. Wikipedia lists a number of secondary structure prediction programs. Predicting lowest free energy structures with pseudoknots in general has been shown to be NP-complete. It is possible to create a dynamic program of runtime complexity  \(O({n^6})\) or \(O({n^5})\) with some approximations. The high computational complexity of these algorithms limits their use to short sequences. Most practical algorithms are restricted to a subset of pseudoknots.

Several years ago, I worked with Chip Lawrence on an approximate method taht involved sampling from the \(O({n^3})\) dynamic program with a search to find RNA structures with pseudoknots. Recently, I continued some this work with RPI students Chris May, Joseph Hitchcock, and Paul Revere.

When you sample a set of structures using the non-pseudoknot dynamic program, it's common to cluster the samples. From the clusters, you can derive a centroid structure that consists of base pairs that occur in more than 50% of the samples making up that cluster. In this process, you sometimes see that the clusters form structures that are incompatible with one another. An example is shown below.


Like the previous arc diagrams, the arcs in the circle diagrams connect base pairs. Since the basic dynamic program does not allow non-nested arcs (no crossing lines), no single predicted structure could contain both sets of arcs. The question then arises, in a collection of samples from an RNA that naturally contains a pseudoknot, will we see alternate structures like those above indicating a pseudoknot? For example, the arc diagram for the pseudoknot structure shown in the second figure from the top could be decomposed into to two non-pseudoknot diagrams. By combining sampled structures, can we find pseudoknot structures?


Digging Through the Samples

There are a number of programs that will sample structures from a thermodynamic dynamic program. I like the stochastic program from the RNAstructure package. RNAstructure is open source and has an up to date thermodynamic model. stochastic generates a set of structures sampled in proportion to each structure's Boltzmann probability \({p_i} = \frac{{{e^{ - \beta {E_i}}}}}{{\sum\limits_j {{e^{ - \beta {E_j}}}} }}\). The denominator is a sum over all possible structures for the given sequence and is called the partition function. Rather than calculating the partition function for structures containing pseudoknots, we propose to use the set of samples to generate additional secondary structures, possibly containing pseudoknots.

To generate new structures, we sample a pair of structures from the collection of sampled structures. Each structure consists of a list of base pairs. Pairs of structures are combined into a single list of base pairs. Duplicate base pairs are removed. In the case where a there is a single base on the list that is paired with two different bases, a fair coin is flipped to decide which pair to include  in the new structure. Combining the structures in this way may produce pseudoknotted structures. We sample new structures in this manner to generate a proposal from the distribution of RNA secondary structures, including those with pseudoknots.
 We use a Metropolis-Hastings (MH) algorithm to generate a sequence of structures from the distribution of secondary structures.

1. Sample a pair of structures uniformly, a,b, from the sampled structures.

2. Combine the structures into a new structure, current. Let \({t_{current}}\) be the number of coin flips required to break ties.
 

3. Set \(sample\_list = \{ \} \).
 
4. Repeat steps 5 through 7 for a fixed burn-in period, followed by a fixed sampling period.

5. Sample a new pair of structures, i,j. Combine the structures into new structure, proposal. \({t_{proposal}}\) is the number of coin flips required to break ties.

6. Calculate the transition functions 

\(T(proposal|current) = T(propsal) = P({S_i})P({S_j}){(0.5)^{{t_{proposal}}}}\) and \(T(current|proposal) = T(current) = P({S_a})P({S_b}){(0.5)^{{t_{current}}}}\). The proposal does not depend on the current structure pair.

7. With probability \(\min (1,\frac{{P(proposal)T(current)}}{{P(current)T(proposal)}})\) set current=proposal. If the burn-in iterations have been completed, add current to the sample-list.


8. Calculate a centroid structure from \(sample\_list\) by combining all base pairs with a sampling frequency greater than 0.5.


One problem with algorithm as outlined above is that it depends on the probabilities of the predicted structures. As shown above, the Boltzmann probability includes the partition function as the denominator. Calculation of the  partition function is the expensive part of the structure calculations. However, since the Metropolis Hastings calculation only depends on ratios of probabilities, the partition function cancels out  and we only need to calculate the numerator of the Boltzmann probailities. The free energy of the structure can be calculated by the energy function of the nupack program. Nupack will calculate the energy of a structure containing a pseudoknot. To avoid numerical difficulties, we can do most of the calculations, except for the Metropolis-Hasting step in log space.
 

Putting It All Together

I wrote a Python3 program to implement the algorithm described above. The program requires The RNAstructure package to generate the initial samples as a .ct file. The Python program requires Nupack 3.0 or greater and Numpy.

Download the code.

Does It Work?

Of course, the big question is "Does this idea actually work?". Here's the reported structure of antiHIV1-RT_1.3a, an RNA aptamer against HIV-1 reverse transcriptase from Pseudobase++ in bracket notation:

.......(((((..[[[[..)))))....]]]]....

And the predicted structure from the algorithm:

.......(((((..[[[[..)))))....]]]]....
 
Although the program is stochastic, the predictions are consistent over multiple runs. This is one small example, so obviously much more testing is needed. Please take all of this with a large grain of NaCL. A number of other small sequences yielded prediction matching those reported in Pseuodobase++.

However, not all structures were correctly predicted.  Pseudobase++ lists the following structures for Cas-Br-E-MuLV:

...........((((((((.[[[[[[[)))).))))..................]]]]]]]..

with an energy of -27.9 kcal/mole, but the algorithm predicts

..((((((((........)).))))))((((....))))....(((((......)))))....

with a free energy of -26.1.

The bigger question is whether there is any reason to believe that pseudoknots will show up as alternate structures in a large number of samples from the standard dynamic program. If they do, and the pseudoknot structure has lower free energy than the mfe, non-pseudoknot structure will the algorithm find the pseudoknots? I haven't proved to myself under what conditions this is true. I hope to work on that next.

Much thanks to Chris, Joseph, and Paul who implemented an earlier version of this program. I re-wrote the program because I wanted to use the new version of Nupack.





No comments:

Post a Comment