Mmmm Pie

We have a new algorithm for SNP prediction to be published soon. One thing we are interested in is what genomic features do the SNP predictions overlap and what is the distribution of the feature types. To analyze the SNPs we will use a process similar to the one we used to annotate genes with GO annotation. We will build an interval tree of gene, ncRNA, and EST genomic ranges. We will then search the tree to find the genomic object that contains the SNP location. If it the position is within a gene, we will determine if it lies in an exon, intron, a UTR, or within 1 kb of the 3' UTR.

In order to deal with finding if a particular SNP lies with 1000 bases of the 3' UTR, we will extend the gene location by 1 kb at the 3' end. To describe the extended gene region, we will use  a modification of the Gene class from the GO annotation program. The new class is derived from the Gene class and adds a couple of features to the basic class.

A particular position may overlap multiple types of genomic objects. For example, a position might be contained in the exon of a gene and be overlapped by EST as well. To keep track of the various combinations, we will use a bit mask, where each bit positions indicates a different genomic feature.

class HitType(object):
    """
    HitType - bitmap constants for SNP locatiosn types
    """

    no_annotation = 0x00
    in_5UTR       = 0x01
    in_3UTR       = 0x02
    in_CDS        = 0x04
    in_intron     = 0x08
    in_est        = 0x10
    in_ncRNA      = 0x20
    in_1k3UTR     = 0x40   # within 1 kb of 3' UTR


The new extended gene class will use HitType to indicate the gene features at a position.

class GeneExtended(Gene):
    """
    GeneExtended - an extension of the Gene class
    We want to extent the sequence area that the gene covers to include
    1 kb from the 3' UTR
    """

    def __init__(self, gene_str):
        """
        We add a new start and end location and an extended region to the
        3' UTR
        """

        Gene.__init__(self, gene_str)
        if self.strand == '+':
            self.utr3_extended = (self.txEnd + 1, self.txEnd + 1000)
            self.extStart = self.txStart
            self.extEnd = self.txEnd + 1000
        else:
            self.utr3_extended = (self.txStart - 1000, self.txStart + 1)
            self.extStart = self.txStart - 1000
            self.extEnd = self.txEnd

    def HitKind(self, pos):
        """
        HitKind - method for returning the type of gene feature at pos.
        There may be multiple features at a particular position, so we
        return a HitType bit mask to indicate all of the features.
        """

        hit = 0   # hit is a bit mask

        # look for obvious cases
        if self.type == 'est':
            hit |= HitType.in_est

        if self.type == 'non_coding':
            hit |= HitType.in_ncRNA

        exons = self.exons
        out2 = []
        # now search the exons for hits
        exons.intersect(tr.interval(self.name, pos, pos),
                        lambda y: out2.append(y))
        if out2 != []:
            for x2 in out2:   # in an exon - check UTRs first
                if self.strand == '+':
                    if pos < self.cdsStart:
                        hit |= HitType.in_5UTR
                    elif pos > self.cdsEnd:
                        hit |= HitType.in_3UTR
                    else:
                        hit |= HitType.in_CDS  # in an exon
                else:
                    if pos > self.cdsEnd:
                        hit |= HitType.in_5UTR
                    elif pos < self.cdsStart:
                        hit |= HitType.in_3UTR
                    else:
                        hit |= HitType.in_CDS  # in an exon
        else:
            # not in exon - in extended region?
            if pos >= self.utr3_extended[0] and pos <= self.utr3_extended[1]:
                hit |= HitType.in_1k3UTR
            else:
                # not in exon or extended region, must be in an intron
                hit |= HitType.in_intron

        return hit

The initialization for this class extends the gene range with new start and end locations. We will insert these locations in the interval tree rather than the transcript range. Note that we have to be careful about the coding strand of the gene. The class also adds a method for determining the type of genomic features that the position overlaps. The HitKind method returns a bit mask indicating the features at that position.

The class for  ESTs is similar to the class for ncRNAs introduced with the GO program. We read the EST table from UCSC and add the EST ranges to the interval tree.

Counting SNPs Instead of Sheep

Counting can be a little tricky because we only want to count each SNP once, so we establish a hierarchy for counting. If a position overlaps an exon and an EST, we only count it as an exon. We read the SNP file line by line and search the gene tree for a match to the SNP position. The positions in the SNP file are 1-based and the UCSC genome gene positions are 0-based. HitType2Str is a helper function that decodes the bit map. This function creates the hierarchy for counting SNP positions by the order in which it returns the strings. Along with counting the genome object types, we also write the SNP information to a CSV file. The routine returns a dictionary of counts of the various genome objects.

def ReadSNPFile(snp_file, out_file, gene_tree):
    """
    Read a SNP prediction file and count the features at the SNP positions
    The SNP information is written to out_file
    snp_file - name of the SNP file
    out_file - output file for list of snp types
    returns
    a dictionary of SNP_prediction counts
    """

    def HitType2Str(hit_mask):
        """
        HitType2Str - return a string from the bit mask
        This function establishes a priority for counting features.
        We only return one string even though a position might cover multiple
        features. If a position lies in an exon and an EST, we only count it
        as an exon.
        """
        if hit_mask & HitType.in_CDS:
            return 'CDS'  # this means exon
        if hit_mask & HitType.in_5UTR:
            return 'UTR_5'
        if hit_mask & HitType.in_3UTR:
            return 'UTR_3'
        if hit_mask & HitType.in_intron:
            return 'intron'
        if hit_mask & HitType.in_ncRNA:
            return 'ncRNA'
        if hit_mask & HitType.in_est:
            return 'EST'
        if hit_mask & HitType.in_1k3UTR:
            return 'within_1k_3UTR'
        return 'no_annotation'

    counts = dict(UTR_5=0, CDS=0, UTR_3=0, within_1k_3UTR=0,
                  intron=0, EST=0, no_annotation=0, ncRNA=0)

    output = open(out_file, 'w')
    output.write(','.join(['chrom', 'position', 'ref_nucleotide',
                        'snp_nucleotide', 'coverage', 'posterior_prob',
                           'snp_prob', 'mask', 'hit_types', 'genes']) + '\n')

    with open(snp_file, 'r') as input:
        for line in input:
            snp = SNP_prediction(line.strip())

            # search the gene tree to see if the SNP position overlaps the
            # extended gene
            out = []
            gene_tree.intersect(tr.interval(snp.chrom, snp.position-1,
                                            snp.position-1),
                                lambda x: out.append(x))
            if out == []:
                counts['no_annotation'] += 1
            else:
                pos_type = 0
                for x in out:   # out is a list of overlaps
                    for gene in x.chain:    # loop through each gene in the chain
                             pos_type |= gene.HitKind(snp.position-1)

                             snp.gene_list.append(gene.name)
                snp.classification = pos_type
                hit_str = HitType2Str(pos_type)
                counts[hit_str] += 1  # count the feature

            snp.Print(output)
                       
    output.close()

    return counts

Slicing Pie

The final task we want this program to perform is to draw a pie chart of the percentages of each of object type. This is easy using Matplotlib. counts is the dictionary returned by ReadSNPFile().

       fracs = []
        for cl in counts.keys():
            frac = float(counts[cl])/total
            fracs.append(frac)


        # draw a pie chart and save it
        pylab.figure(figsize = (6,6))
        explode = [0.025] * len(fracs)
        patches, txt, txt2 = pylab.pie(fracs,
                                       labels=counts.keys(),
                                       explode=explode,
                                       autopct='%1.1f%%',
                                       shadow=True)
        pylab.legend(patches, counts.keys(), loc='best')
        pylab.title(typ, bbox={'facecolor':'0.8', 'pad':5})
        pylab.savefig(data_files[typ]['plot_file'], bbox_inches='tight')


Download the whole program.

No comments:

Post a Comment