The Analytic Garden: Counting yet again...This time RNA secondary structures

The secondary structure of a sequence of nucleotides refers to the basepairing interactions within a single molecule or interacting molecules. In an RNA molecule, nucleotides can basepair with nucleotides on the same sequence to form C-G, A-U, or G-U pairs. This pairing can yield a complicated 2D structure. For example,

The prediction of RNA secondary structure from the primary sequence is a well developed art, at least in the absence of pseudoknots, and if the sequence is not too long. Biologically relevant prediction models depend on energy models of the interactions among the nucleotides. For details of predicting RNA structures, see the Mathews Lab RNAStructure programs or the Vienna RNA Package.

We will look at a simpler task, getting a ballpark estimate of the total allowable number of secondary structures available to a sequence. This estimate is likely physically wrong in that it doesn't take the interaction energies into account, so probably allows for non-biological structures. It ignores psuedocounts, although so do most prediction programs. The only constraint on the structure is the base pairing rules and a minimum loop size.

I wrote the program for counting secondary structures some time ago. I needed a rough estimate of how many structures were allowed by a given sequence. Writing the program was easy enough. The problem was to know if it was counting correctly given the small number of constraints on the structures: C-G, A-U, G-U pairs, and a minimum loop size. It is possible to count the number of structured by hand for very small sequences, but the number of sequences grows rapidly with sequence length.For longer structures, I didn't have a reference to compare my program's results to. Fortunately, I stumbled upon this contest. They asked for a method of counting RNA secondary structures and gave a count of 284850219977421 for this 64 nucleotide sequence:

>Test_sequence
AUGCUAGUACGGAGCGAGUCUAGCGAGCGAUGUCGUGAGUACUAUAUAUGCGCAUAAGCCACGU

I never looked at the contest results, so I don't know how the following code compares to the solution at the web site. I was just happy my results matched the posted result. The program is built around a dynamic programming routine based on the Nussinov algorithm. The Nussinov algorithm is one of the earliest secondary structure prediction programs. Given the constraints on base pairing the Nussinov algorithm finds the structure with the maximum number of base pairs. This structure is probably not biologically realistic because it ignores energy considerations so the structure may be unstable.

Here's the basic counting routine. It's passed a sequence and a minimum loop size. The is_complement() routine checks to see if a given pair of nucleotides can form a legitimate pair (A-U, C-G, G-U). The routine recursively builds a table of the count of structures for each substring in the sequence. The table c[i][j] contains the count of structures in the substring i-j. c[0][sequence_length-1] is the count of all the possible structures for the full sequence.

def count_structures(seq, m):
    '''
    count_structures
    count the number of valid RNA structures
    based on a modified Nussinov algorithm
    see math.mit.edu/classes/18.417/Slides/rna-ensembles.pdf

    input:
      seq - a sequence string, uppercase, RNA format
      m - the minimum loop size

    returns:
      the number of valid (i.e. no non-valid pairs) non-crossing structures
    '''

    c = [[0]*(len(seq)) for x in xrange(len(seq))]

    for i in xrange(len(seq)):
        for j in xrange(len(seq)):
            if j - i <= m:
                c[i][j] = 1

    for i in xrange(len(seq)-1, -1, -1):
        for j in xrange(i+1, len(seq)):
            s = 0
            for k in xrange(i, j-m):
                if is_complement(seq[k], seq[j]):
                    if k - 1 < 0:
                        s += c[k+1][j-1]
                    else:
                        s += c[i][k-1] * c[k+1][j-1]
            c[i][j] = c[i][j-1] + s

    return c[0][len(seq)-1]

Here's the full program: download

#! /usr/bin/python

import sys
from Bio import SeqIO
from Bio.SeqRecord import SeqRecord

def print_usage():
    print "usage: " + sys.argv[0] + " fasta_file "
    sys.exit(1)

def read_fasta(fasta_file):
    '''
    read_fasta
       reads a single Fasta record from fasta_file
       It dies an ugly death if the file is not there or is not a Fasta file

       returns:
          a sequence string in RNA format
    '''
    rec = SeqIO.read(fasta_file, 'fasta')
    seq = str(rec.seq)
    seq = seq.upper()
    seq = seq.replace('T', 'U')

    return seq

def is_complement(c1, c2):
    comps = {'A':['U'],
             'C':['G'],
             'G':['C','U'],
             'U':['A','G']}
    if c2 in comps[c1]:
        return True
    else:
        return False


def count_structures(seq, m):
    '''
    count_structures
    count the number of valid RNA structures
    based on a modified Nussinov algorithm
    see math.mit.edu/classes/18.417/Slides/rna-ensembles.pdf

    input:
      seq - a sequence string, uppercase, RNA format
      m - the minimum loop size

    returns:
      the number of valid (i.e. no non-valid pairs) non-crossing structures
    '''

    c = [[0]*(len(seq)) for x in xrange(len(seq))]

    for i in xrange(len(seq)):
        for j in xrange(len(seq)):
            if j - i <= m:
                c[i][j] = 1

    for i in xrange(len(seq)-1, -1, -1):
        for j in xrange(i+1, len(seq)):
            s = 0
            for k in xrange(i, j-m):
                if is_complement(seq[k], seq[j]):
                    if k - 1 < 0:
                        s += c[k+1][j-1]
                    else:
                        s += c[i][k-1] * c[k+1][j-1]
            c[i][j] = c[i][j-1] + s

    return c[0][len(seq)-1]


def main(fasta_file, m):
    seq = read_fasta(fasta_file)

    count = count_structures(seq, m)
    print 'count = ' + str(count)

if __name__ == '__main__':
    if len(sys.argv) < 2:
        print_usage()
    elif len(sys.argv) == 2:
        main(sys.argv[1], 3)
    else:
        main(sys.argv[1], int(sys.argv[2]))

The Analytic Garden

Counting yet again...This time RNA secondary structures

No comments:

Post a Comment

Labels

Contributors

wfmu