The prediction of RNA secondary structure from the primary sequence is a well developed art, at least in the absence of pseudoknots, and if the sequence is not too long. Biologically relevant prediction models depend on energy models of the interactions among the nucleotides. For details of predicting RNA structures, see the Mathews Lab RNAStructure programs or the Vienna RNA Package.
We will look at a simpler task, getting a ballpark estimate of the total allowable number of secondary structures available to a sequence. This estimate is likely physically wrong in that it doesn't take the interaction energies into account, so probably allows for non-biological structures. It ignores psuedocounts, although so do most prediction programs. The only constraint on the structure is the base pairing rules and a minimum loop size.
I wrote the program for counting secondary structures some time ago. I needed a rough estimate of how many structures were allowed by a given sequence. Writing the program was easy enough. The problem was to know if it was counting correctly given the small number of constraints on the structures: C-G, A-U, G-U pairs, and a minimum loop size. It is possible to count the number of structured by hand for very small sequences, but the number of sequences grows rapidly with sequence length.For longer structures, I didn't have a reference to compare my program's results to. Fortunately, I stumbled upon this contest. They asked for a method of counting RNA secondary structures and gave a count of 284850219977421 for this 64 nucleotide sequence:
>Test_sequence
AUGCUAGUACGGAGCGAGUCUAGCGAGCGAUGUCGUGAGUACUAUAUAUGCGCAUAAGCCACGU
I never looked at the contest results, so I don't know how the following code compares to the solution at the web site. I was just happy my results matched the posted result. The program is built around a dynamic programming routine based on the Nussinov algorithm. The Nussinov algorithm is one of the earliest secondary structure prediction programs. Given the constraints on base pairing the Nussinov algorithm finds the structure with the maximum number of base pairs. This structure is probably not biologically realistic because it ignores energy considerations so the structure may be unstable.
Here's the basic counting routine. It's passed a sequence and a minimum loop size. The is_complement() routine checks to see if a given pair of nucleotides can form a legitimate pair (A-U, C-G, G-U). The routine recursively builds a table of the count of structures for each substring in the sequence. The table c[i][j] contains the count of structures in the substring i-j. c[0][sequence_length-1] is the count of all the possible structures for the full sequence.
def count_structures(seq, m):
'''
count_structures
count the number of valid RNA structures
based on a modified Nussinov algorithm
see math.mit.edu/classes/18.417/Slides/rna-ensembles.pdf
input:
seq - a sequence string, uppercase, RNA format
m - the minimum loop size
returns:
the number of valid (i.e. no non-valid pairs) non-crossing structures
'''
c = [[0]*(len(seq)) for x in xrange(len(seq))]
for i in xrange(len(seq)):
for j in xrange(len(seq)):
if j - i <= m:
c[i][j] = 1
for i in xrange(len(seq)-1, -1, -1):
for j in xrange(i+1, len(seq)):
s = 0
for k in xrange(i, j-m):
if is_complement(seq[k], seq[j]):
if k - 1 < 0:
s += c[k+1][j-1]
else:
s += c[i][k-1] * c[k+1][j-1]
c[i][j] = c[i][j-1] + s
return c[0][len(seq)-1]
'''
count_structures
count the number of valid RNA structures
based on a modified Nussinov algorithm
see math.mit.edu/classes/18.417/Slides/rna-ensembles.pdf
input:
seq - a sequence string, uppercase, RNA format
m - the minimum loop size
returns:
the number of valid (i.e. no non-valid pairs) non-crossing structures
'''
c = [[0]*(len(seq)) for x in xrange(len(seq))]
for i in xrange(len(seq)):
for j in xrange(len(seq)):
if j - i <= m:
c[i][j] = 1
for i in xrange(len(seq)-1, -1, -1):
for j in xrange(i+1, len(seq)):
s = 0
for k in xrange(i, j-m):
if is_complement(seq[k], seq[j]):
if k - 1 < 0:
s += c[k+1][j-1]
else:
s += c[i][k-1] * c[k+1][j-1]
c[i][j] = c[i][j-1] + s
return c[0][len(seq)-1]
Here's the full program: download
#! /usr/bin/python
import sys
from Bio import SeqIO
from Bio.SeqRecord import SeqRecord
def print_usage():
print "usage: " + sys.argv[0] + " fasta_file"
sys.exit(1)
def read_fasta(fasta_file):
'''
read_fasta
reads a single Fasta record from fasta_file
It dies an ugly death if the file is not there or is not a Fasta file
returns:
a sequence string in RNA format
'''
rec = SeqIO.read(fasta_file, 'fasta')
seq = str(rec.seq)
seq = seq.upper()
seq = seq.replace('T', 'U')
return seq
def is_complement(c1, c2):
comps = {'A':['U'],
'C':['G'],
'G':['C','U'],
'U':['A','G']}
if c2 in comps[c1]:
return True
else:
return False
def count_structures(seq, m):
'''
count_structures
count the number of valid RNA structures
based on a modified Nussinov algorithm
see math.mit.edu/classes/18.417/Slides/rna-ensembles.pdf
input:
seq - a sequence string, uppercase, RNA format
m - the minimum loop size
returns:
the number of valid (i.e. no non-valid pairs) non-crossing structures
'''
c = [[0]*(len(seq)) for x in xrange(len(seq))]
for i in xrange(len(seq)):
for j in xrange(len(seq)):
if j - i <= m:
c[i][j] = 1
for i in xrange(len(seq)-1, -1, -1):
for j in xrange(i+1, len(seq)):
s = 0
for k in xrange(i, j-m):
if is_complement(seq[k], seq[j]):
if k - 1 < 0:
s += c[k+1][j-1]
else:
s += c[i][k-1] * c[k+1][j-1]
c[i][j] = c[i][j-1] + s
return c[0][len(seq)-1]
def main(fasta_file, m):
seq = read_fasta(fasta_file)
count = count_structures(seq, m)
print 'count = ' + str(count)
if __name__ == '__main__':
if len(sys.argv) < 2:
print_usage()
elif len(sys.argv) == 2:
main(sys.argv[1], 3)
else:
main(sys.argv[1], int(sys.argv[2]))
import sys
from Bio import SeqIO
from Bio.SeqRecord import SeqRecord
def print_usage():
print "usage: " + sys.argv[0] + " fasta_file
sys.exit(1)
def read_fasta(fasta_file):
'''
read_fasta
reads a single Fasta record from fasta_file
It dies an ugly death if the file is not there or is not a Fasta file
returns:
a sequence string in RNA format
'''
rec = SeqIO.read(fasta_file, 'fasta')
seq = str(rec.seq)
seq = seq.upper()
seq = seq.replace('T', 'U')
return seq
def is_complement(c1, c2):
comps = {'A':['U'],
'C':['G'],
'G':['C','U'],
'U':['A','G']}
if c2 in comps[c1]:
return True
else:
return False
def count_structures(seq, m):
'''
count_structures
count the number of valid RNA structures
based on a modified Nussinov algorithm
see math.mit.edu/classes/18.417/Slides/rna-ensembles.pdf
input:
seq - a sequence string, uppercase, RNA format
m - the minimum loop size
returns:
the number of valid (i.e. no non-valid pairs) non-crossing structures
'''
c = [[0]*(len(seq)) for x in xrange(len(seq))]
for i in xrange(len(seq)):
for j in xrange(len(seq)):
if j - i <= m:
c[i][j] = 1
for i in xrange(len(seq)-1, -1, -1):
for j in xrange(i+1, len(seq)):
s = 0
for k in xrange(i, j-m):
if is_complement(seq[k], seq[j]):
if k - 1 < 0:
s += c[k+1][j-1]
else:
s += c[i][k-1] * c[k+1][j-1]
c[i][j] = c[i][j-1] + s
return c[0][len(seq)-1]
def main(fasta_file, m):
seq = read_fasta(fasta_file)
count = count_structures(seq, m)
print 'count = ' + str(count)
if __name__ == '__main__':
if len(sys.argv) < 2:
print_usage()
elif len(sys.argv) == 2:
main(sys.argv[1], 3)
else:
main(sys.argv[1], int(sys.argv[2]))
No comments:
Post a Comment