Let's GO for it

One of the things we would like to know from our sequencing experiments is the number of reads mapped to the various genes and what those genes do. Conceptually, this process is simple; for each read, check to see what gene it maps to; count reads overlapping each gene; for each gene with coverage, look up what the gene does. The latter step involves what is called Gene Ontology (the GO in the title). The Gene Ontology project is an initiative with the aim of standardizing the representation of gene and gene product attributes across species and databases. Genes are assigned various Gene Ontology identifiers, a GO ID. Determining  a particular gene's products is then a matter of looking up the GO IDs in a gene ontology database.

Although conceptually simple, actually finding the gene products from a set of reads can be tricky. However, there is software available that seems like it should make the process if not easy, at least doable. In python, there is some GO analysis software, some rudimentary BioPython GO software, and a parser for the GO database. For perl, there is an extensive set of GO modules in the go-perl library. The R BioConductor project has probably the most extensive GO analysis collection.

Initially, I figured that with all of that software, it shouldn't be too difficult to put something together. After spending some time with the limited documentation available, I decided that BioConductor had the most to offer. This looked like a possible bet. Unfortunately, much BioConductor documentation consists of vignettes.  These are extensive examples of using the particular package. Although, I usually prefer learning from examples, the examples depended too much on the particulars of the data being presented. When I tried to extend the methods to my own data, I ran into problems. The R documentation on the various methods for reading bam files, GO terms, gene annotations etc. was complicated. I soon realized that I would have to spend a long time digging through the docs and in the end this package might not be the right one anyway.

According to Larry Wall, the original developer of perl, the three programmer virtues are:  laziness, impatience, and hubris. If impatience is a virtue, impatience with programming docs is one of my greatest virtues. After spending too many hours with the BioConductor docs and considering that hubris might also be a virtue, I decided to roll my own GO analysis.

What's in a name?

One of the first problems encountered when dealing with genomics involves gene names. Genes typically have multiple aliases. Consider the region from 332,801 to 339,016 on D. melanogaster chromosome 3L. It is variously identified as mthl10, methuselah-like 10, NM_001274276.1CG17061, and FBgn0035132.

Aligning reads with Bowtie2 or BWA will give us genomic coordinates for each mapped read. However, alignment doesn't tell us what features the reads might cover. In order to see that, we need to compare the read locations to the locations of features like genes and exons.

One convenient web location to get information about genes and other genomic features is the UCSC Table Browser. From the Table Browser, you can download the data for any of the UCSC supported browser tracks. The tables are tab delimited text files and are easy to parse with python or perl. It sometimes takes a little digging on the Table Browser web site to find what you need, but if it's displayed on the browser tracks, you should be able to find a table or collection of tables with the data you need.

The first collection of data we need are the gene and exon locations. We can find that information in the flybaseGene table. The first few lines of this table look like:

binnamechromstrandtxStarttxEndcdsStartcdsEndexonCountexonStartsexonEnds
585CG11023-RAchr2L+752894917679927637528,8228,8667,8116,8589,9491,
585CG2671-RCchr2L-983518583112141713699835,11409,11778,12285,13519,13682,14932,17052,18260,11344,11518,12221,12928,13625,14874,15711,17212,18583,
585CG2671-RBchr2L-983521372112141994499835,11409,11778,12285,13519,13682,14932,19884,21135,11344,11518,12221,12928,13625,14874,15711,20020,21372,
585CG2671-RFchr2L-9835213721121415648109835,11409,11778,12285,13519,13682,14932,19879,21065,21348,11344,11518,12221,12928,13625,14874,15711,20020,21200,21372,
585CG2671-REchr2L-983521372112141564899835,11409,11778,12285,13519,13682,14932,19879,21065,11344,11518,12221,12928,13625,14874,15711,20015,21372,
585CG2671-RAchr2L-9835185831121417136109835,11409,11778,12285,13519,13682,14932,17052,18025,18260,11344,11518,12221,12928,13625,14874,15711,17212,18168,18583,
585CG2671-RDchr2L-983521372112141564899835,11409,11778,12285,13519,13682,14932,19879,21065,11344,11518,12221,12928,13625,14874,15711,20020,21372,
585CG2657-RBchr2L-21918251512191825151521918,22742,22993,23928,24746,22687,22935,23873,24210,25151,
585CG31973-REchr2L-254015924226520581501325401,26765,27052,28410,28732,28981,33844,34557,34719,38534,39300,58080,59189,26688,26964,27490,28639,28926,29068,34288,34604,35212,38731,39857,58182,59242,
585CG31973-RBchr2L-254015924226520581501425401,26765,27052,28014,28732,28981,30393,33844,34557,34719,38534,39300,58080,59189,26688,26964,27490,28240,28926,29068,33270,34288,34604,35212,38731,39857,58182,59242,
Note: all start coordinates in our database are 0-based, not 1-based. See explanation here.

txStart and txEnd are the transcript start and end locations. cdsStart and cdsEnd are the coding region locations.

We will also need a description of GO terms and their definitions. We get that from the Gene Ontology web site. Ontology files come in various formats. We will use an .obo file, since it's an easy to parse text file. A typical GO record in an obo file look like:

[Term] id: GO:0000001
name: mitochondrion inheritance 
namespace: biological_process
def: "The distribution of mitochondria, including the mitochondrial genome, into daughter cells after mitosis or meiosis, mediated by interactions between mitochondria and the cytoskeleton." [GOC:mcc, PMID:10873824, PMID:11389764] 
synonym: "mitochondrial inheritance" EXACT []
is_a: GO:0048308 ! organelle inheritance
is_a: GO:0048311 ! mitochondrion distribution 

Next, we need some way to map the GO terms to genes. The UCSC table browser allows us to download the fbGO table. The entries for this table look like:



geneIdgoIdaspect
FBgn0025724GO:0006888P
FBgn0025724GO:0048194P
FBgn0025724GO:0007166P
FBgn0025724GO:0006897P
FBgn0025724GO:0006887P
FBgn0025724GO:0006886P
FBgn0025724GO:0006890P
FBgn0025724GO:0030126C
FBgn0025724GO:0008565F
FBgn0043467GO:0048149P

Here's where the gene aliases become important. The fbGO table is indexed by identifiers starting with FBgq. These are Flybase/Ensembl identifiers. The gene table is indexed by Flybase annotation symbols beginning with CG. We will need a way to translate between the two sets of identifiers. This requires another table, the flyBase2004Xref  table from UCSC. The xref table looks like:

namesymbolsynonymsfbtrfbgnfbppfbantype
CG4696-RBMp20
FBtr0005009FBgn0002789FBpp0099513FBan0004696
CG17291-RAPp2A-29B
FBtr0005088FBgn0005776FBpp0099974FBan0013383
CG10890-RAmus201
FBtr0005673FBgn0002887FBpp0099904FBan0032956
CG10890-RBmus201
FBtr0005674FBgn0002887FBpp0099905FBan0032956
CG3484-RAAdhr
FBtr0006151FBgn0000056FBpp0099544FBan0032954
CG9565-RANep3
FBtr0070000FBgn0031081FBpp0070000FBan0009565
CR32826-RAtRNA:CR32826-RA
FBtr0070001FBgn0052826

tRNA
CG9570-RACG9570
FBtr0070002FBgn0031085FBpp0070001FBan0009570
CG32825-RAOr19b
FBtr0070003FBgn0062565FBpp0070002FBan0032825
CG15322-RACG15322
FBtr0070005FBgn0031088FBpp0070004FBan0015322


This table show the relationship between the CG identifiers and the Fbgn identifiers in the GO table.

The relationship among all of these files is shown here.




We will actually add another table for non-coding genes. It has a format similar to the gene table.


The map is not the territory

With these tables in mind, we will need some objects to contain the data. To hold the gene information, and eventually, the read counts and GO annotation, we will create a Gene class:

class Gene(object):
    """
    FlyBaseGeneClass: A class for coding genes
    UCSC gene tables are tab delimited. Instances of this class are initialized
    with a text string from the  gene table.
    Note: All coordinates in these tables are half-open zero-based.
          This means that the first 100 bases of a chromosome are
          represented as [0,100), i.e. 0-99. The second 100 bases are
          represented as [100,200), i.e. 100-199, and so forth.
          see http://www.encodeproject.org/cgi-bin/hgTables?hgsid=366612875
    """

    def __init__(self, gene_str):
    self.type = 'coding'

    data_list = gene_str.split('\t')
    self.name = data_list[1]
    self.chrom = data_list[2]
    self.strand = data_list[3]
    self.txStart = int(data_list[4])
    self.txEnd = int(data_list[5])
    self.cdsStart = int(data_list[6])
    self.cdsEnd = int(data_list[7])
    self.exonCount = int(data_list[8])
       
    exonStarts = data_list[9].split(',')
    exonStarts.pop()   # remove empty item at end
    exonEnds = data_list[10].split(',')
    exonEnds.pop()

    # use an IntervalTree for fast searching
    self.exons = tr.IntervalTree()
    for start, end in zip(exonStarts, exonEnds):
         self.exons.insert(tr.interval(self.name, int(start), int(end)))

    self.symbol = None
    self.flybase_gene = None
    self.go_terms = []
    self.read_count = 0
    self.reads_in_exons = 0

The initialization routine for this class  is passed a line from the gene table. It splits the line and fills out the various fields. Objects of this class have slots for a  list of GO terms, read counts, and something called an IntervalTree to store exon locations. More about that structure in a moment. We read the gene file line by line and create a gene object for each line. We will store the genes in a dictionary keyed on the gene name (the CG identifier).

GO annotation records are stored in a GO object, defined as:

class GO(object):
    """
    GO: a class for GO annotation
    The structure is pretty self-expanatory
    is_a is a list of GO terms
    """

    def __init__(self, go_id, name, namespace, definition, is_a):
        self.id = go_id
        self.name = name
        self.namespace = namespace
        self.definition = definition
        self.is_a = is_a
 

Parsing the GO records is straight forward.

def ReadGOFile(go_file):
    """
    ReadGOFile: read a Gene ontology obo file and create a dictionary of
                GO terms. This is a simple parser for obo files
                Parsing is kind of weirds because Python's limited flow control
                structures, i.e no assignment in conditions, no switch etc.

    go_file - GO obo file see http://geneontology.org/ for details

    return: a dictionary of GO objects keyed on GO id
    """

    go = dict()
    with open(go_file, 'r') as f:
        go_id = None
        for line in f:
            if re.search('^\\[Term\\]', line):
                if go_id:
                    g = GO(go_id, name, namespace, definition, is_a)
                    go[g.id] = g
                    for alt_id in alt_ids:
                        go[alt_id] = g
                go_id = None
                is_a = []
                name = None
                namespace = None
                definition = None
                alt_ids = []
                continue

            m = re.search('^id: (GO:\d+)', line)
            if m:
                go_id = m.group(1)
                continue

            m = re.search('^name: (.+?)\n', line)
            if m:
                name = m.group(1)
                continue

            m = re.search('^namespace: (.+?)\n', line)
            if m:
                namespace = m.group(1)
                continue

            m = re.search('^def: (.+?)\n', line)
            if m:
                definition = m.group(1)
                continue

            m = re.search('^alt_id: (GO:\d+)', line)
            if m:
                alt_ids.append(m.group(1))

            m = re.search('^is_a: (GO:\d+)', line)
            if m:
                is_a.append(m.group(1))
               
        return go
 

GO objects are stored in a dictionary, keyed by GO identifiers.

Reading the xref file allows us to add Flybase identifiers to the genes in the gene dictionary. 

def ReadXrefFile(xref_file, genes):
    """
    ReadXrefFile: Read the UCSC gene-flybase xref table file and
                  add gene symbol and FlyBase Id to gene object
                  It also builds a dictionary of FB ids => UCSC gene names

    xref_file - the name of a UCSC gene/FlyBase xref table file
    genes - a dictionary of Gene objects

    return:
    fb_genes - a dictionary keyed on FBgn ids. Each item is a list of genes
    """
    fb_genes = dict()
    with open(xref_file, 'r') as f:
        f.readline()      # throw away header line
        for line in f:
            data_list = line.split('\t')
            if data_list[0] in genes:
                genes[data_list[0]].symbol = data_list[1]
                genes[data_list[0]].flybase_gene = data_list[4]
                if data_list[4] not in fb_genes:
                    fb_genes[data_list[4]] = []
                fb_genes[data_list[4]].append(genes[data_list[0]])

    return fb_genes

The fb_gene dictionary that is returned is keyed on  Flybase identifiers. It's values are a list of gene objects that Flybase identifier applies to.

The FBgo table contains the GO identifiers associated with the various genes. It's indexed by FBgn identifiers. Reading the FBgo table, along with the fb_genes dictionary will give us the links needed to add GO terms to the genes.

def ReadFBGOFile(fbgo_file, fb_genes):
    """
    ReadFBGOFile: Read the Flybase GO annotation file and add GO terms to genes
    fbgo_file - UCSC file mapping FB ids to GO terms
    fb_genes - dictionary of Gene objects keyed on FBgn id
    """
   
    with open(fbgo_file, 'r') as f:
        f.readline()      # throw away header line
        for line in f:
            data_list = line.strip().split('\t')
            if data_list[0] in fb_genes:
                for gene in fb_genes[data_list[0]]:
                    gene.go_terms.append(data_list[1])

Searching the forest for a tree

Let's look at how we read the gene file:

def ReadGeneFile(gene_file, chrom, gene_tree):
    """
    ReadGeneFile: Read the UCSC gene table file
    gene_file - the name of a UCSC gene table file
    gene_tree - an IntervalTree of gene locations

    returns: a dictionary of gene objects keys on gene names (CG names)
    """

    genes = dict()
    with open(gene_file, 'r') as f:
        f.readline()      # throw away header line
        for line in f:
            gene = Gene(line.strip())
            if gene.chrom == chrom:
                genes[gene.name] = gene
                gene_tree.insert(tr.interval(chrom, gene.txStart, gene.txEnd),
                                 other = gene)
   
    return genes
This routine is passed the file name, chromosome label, and something called gene_tree. gene_tree is an interval tree. An interval tree is a binary tree structure that allows efficient searching to locate all intervals that overlap with any given interval or point. We have a series intervals describing gene locations on the chromosome. In bam files, we have thousands, maybe millions of reads that map to a chromosome. We want to know which, if any, gene and exon that the read overlaps. The naive approach would be to search the gene dictionary and look for a possible overlap with each gene. Fly chromosome 2L has 3769 protein coding genes. This isn't really a lot, but with thousands of reads, and multiple experiments to go through, that's a lot of searching. It requires O(N) for each search, where N is the number of genes. Interval trees gives us a more efficient method of searching, reducing the search time to O(log(N)).

If you gooogle  python interval tree, you find a number of software packages for interval trees. Each has slightly different features. We need to store the transcript start and end locations for each gene. A complication arises because of alternate splicing of genes. We often see several different genes that have the same transcript coordinates, but different exon locations. For each transcript interval, we need to maintain a list of gene products from that transcript. 

After looking at a couple of different python interval tree modules, I decided to try quicksect from bx-python. The quicksect tree node initialization has an other field, that allows us to pass a gene to be associated with the transcript interval. To deal with the issue of alternate splicing of transcripts, I modified the interval tree code to maintain a list of genes with the same transcript at each tree node. The original code is here and the modified version can be downloaded here.

With all this in mind, we can see how we read the gene file and build the tree.

def ReadGeneFile(gene_file, chrom, gene_tree):
    """
    ReadGeneFile: Read the UCSC gene table file
    gene_file - the name of a UCSC gene table file
    gene_tree - an IntervalTree of gene locations

    returns: a dictionary of gene objects keys on gene names (CG names)
    """

    genes = dict()
    with open(gene_file, 'r') as f:
        f.readline()      # throw away header line
        for line in f:
            gene = Gene(line.strip()) # create a new Gene object
            if gene.chrom == chrom:
                genes[gene.name] = gene
                gene_tree.insert(tr.interval(chrom, gene.txStart, gene.txEnd),
                                 other = gene)
   
    return genes

gene_tree.insert is passed an interval (transcript start and end locations) and the gene object. The interval will be inserted in the tree along with the gene object. If the interval already exists, the gene object will be added to the list of genes with those transcript locations.

Finally, we read the bam file and search the tree to find which transcript and exon that read maps to. Exons are stored in a tree associated with the particular gene.

def ReadBAMfile(bam_file, chrom, gene_tree):
    """
    ReadBAMfile: read a bam file and count reads overlapping genes
    bam_file - the bam_file name
    chrom - chromosome
    gene_tree - an IntervalTree. Gene search tree, the intervals are Tx ranges
                We look for overlaps with genes and increment the read count for
                the gene.
    """

    f = pysam.Samfile(bam_file, 'rb')
    for read in f.fetch():
        out = []
        gene_tree.intersect(tr.interval(chrom, read.positions[0], read.positions[-1]),
                            lambda x: out.append(x))
        for x in out:   # out is a list of overlaps
            for gene in x.chain:    # loop through each gene in the chain
                gene.read_count += 1
                exons = gene.exons
                out2 = []
                # now search the exons for hits
                exons.intersect(tr.interval(gene.name, read.positions[0], read.positions[-1]),
                                lambda y: out2.append(y))
                for x2 in out2:
                    gene.reads_in_exons += 1
    

The whole program to read the various files and print the counts of reads overlapping the transcript is available for download along with the interval tree module.

No comments:

Post a Comment