The Analytic Garden: Let's GO for it

One of the things we would like to know from our sequencing experiments is the number of reads mapped to the various genes and what those genes do. Conceptually, this process is simple; for each read, check to see what gene it maps to; count reads overlapping each gene; for each gene with coverage, look up what the gene does. The latter step involves what is called Gene Ontology (the GO in the title). The Gene Ontology project is an initiative with the aim of standardizing the representation of gene and gene product attributes across species and databases. Genes are assigned various Gene Ontology identifiers, a GO ID. Determining a particular gene's products is then a matter of looking up the GO IDs in a gene ontology database.

Although conceptually simple, actually finding the gene products from a set of reads can be tricky. However, there is software available that seems like it should make the process if not easy, at least doable. In python, there is some GO analysis software, some rudimentary BioPython GO software, and a parser for the GO database. For perl, there is an extensive set of GO modules in the go-perl library. The R BioConductor project has probably the most extensive GO analysis collection.

Initially, I figured that with all of that software, it shouldn't be too difficult to put something together. After spending some time with the limited documentation available, I decided that BioConductor had the most to offer. This looked like a possible bet. Unfortunately, much BioConductor documentation consists of vignettes. These are extensive examples of using the particular package. Although, I usually prefer learning from examples, the examples depended too much on the particulars of the data being presented. When I tried to extend the methods to my own data, I ran into problems. The R documentation on the various methods for reading bam files, GO terms, gene annotations etc. was complicated. I soon realized that I would have to spend a long time digging through the docs and in the end this package might not be the right one anyway.

According to Larry Wall, the original developer of perl, the three programmer virtues are: laziness, impatience, and hubris. If impatience is a virtue, impatience with programming docs is one of my greatest virtues. After spending too many hours with the BioConductor docs and considering that hubris might also be a virtue, I decided to roll my own GO analysis.

What's in a name?

One of the first problems encountered when dealing with genomics involves gene names. Genes typically have multiple aliases. Consider the region from 332,801 to 339,016 on D. melanogaster chromosome 3L. It is variously identified as mthl10, methuselah-like 10, NM_001274276.1, CG17061, and FBgn0035132.

Aligning reads with Bowtie2 or BWA will give us genomic coordinates for each mapped read. However, alignment doesn't tell us what features the reads might cover. In order to see that, we need to compare the read locations to the locations of features like genes and exons.

One convenient web location to get information about genes and other genomic features is the UCSC Table Browser. From the Table Browser, you can download the data for any of the UCSC supported browser tracks. The tables are tab delimited text files and are easy to parse with python or perl. It sometimes takes a little digging on the Table Browser web site to find what you need, but if it's displayed on the browser tracks, you should be able to find a table or collection of tables with the data you need.

The first collection of data we need are the gene and exon locations. We can find that information in the flybaseGene table. The first few lines of this table look like:

bin	name	chrom	strand	txStart	txEnd	cdsStart	cdsEnd	exonCount	exonStarts	exonEnds
585	CG11023-RA	chr2L	+	7528	9491	7679	9276	3	7528,8228,8667,	8116,8589,9491,
585	CG2671-RC	chr2L	-	9835	18583	11214	17136	9	9835,11409,11778,12285,13519,13682,14932,17052,18260,	11344,11518,12221,12928,13625,14874,15711,17212,18583,
585	CG2671-RB	chr2L	-	9835	21372	11214	19944	9	9835,11409,11778,12285,13519,13682,14932,19884,21135,	11344,11518,12221,12928,13625,14874,15711,20020,21372,
585	CG2671-RF	chr2L	-	9835	21372	11214	15648	10	9835,11409,11778,12285,13519,13682,14932,19879,21065,21348,	11344,11518,12221,12928,13625,14874,15711,20020,21200,21372,
585	CG2671-RE	chr2L	-	9835	21372	11214	15648	9	9835,11409,11778,12285,13519,13682,14932,19879,21065,	11344,11518,12221,12928,13625,14874,15711,20015,21372,
585	CG2671-RA	chr2L	-	9835	18583	11214	17136	10	9835,11409,11778,12285,13519,13682,14932,17052,18025,18260,	11344,11518,12221,12928,13625,14874,15711,17212,18168,18583,
585	CG2671-RD	chr2L	-	9835	21372	11214	15648	9	9835,11409,11778,12285,13519,13682,14932,19879,21065,	11344,11518,12221,12928,13625,14874,15711,20020,21372,
585	CG2657-RB	chr2L	-	21918	25151	21918	25151	5	21918,22742,22993,23928,24746,	22687,22935,23873,24210,25151,
585	CG31973-RE	chr2L	-	25401	59242	26520	58150	13	25401,26765,27052,28410,28732,28981,33844,34557,34719,38534,39300,58080,59189,	26688,26964,27490,28639,28926,29068,34288,34604,35212,38731,39857,58182,59242,
585	CG31973-RB	chr2L	-	25401	59242	26520	58150	14	25401,26765,27052,28014,28732,28981,30393,33844,34557,34719,38534,39300,58080,59189,	26688,26964,27490,28240,28926,29068,33270,34288,34604,35212,38731,39857,58182,59242,

Note: all start coordinates in our database are 0-based, not 1-based. See explanation here.

txStart and txEnd are the transcript start and end locations. cdsStart and cdsEnd are the coding region locations.

We will also need a description of GO terms and their definitions. We get that from the Gene Ontology web site. Ontology files come in various formats. We will use an .obo file, since it's an easy to parse text file. A typical GO record in an obo file look like:

[Term] id: GO:0000001
name: mitochondrion inheritance
namespace: biological_process
def: "The distribution of mitochondria, including the mitochondrial genome, into daughter cells after mitosis or meiosis, mediated by interactions between mitochondria and the cytoskeleton." [GOC:mcc, PMID:10873824, PMID:11389764]
synonym: "mitochondrial inheritance" EXACT []
is_a: GO:0048308 ! organelle inheritance
is_a: GO:0048311 ! mitochondrion distribution

Next, we need some way to map the GO terms to genes. The UCSC table browser allows us to download the fbGO table. The entries for this table look like:

geneId	goId	aspect
FBgn0025724	GO:0006888	P
FBgn0025724	GO:0048194	P
FBgn0025724	GO:0007166	P
FBgn0025724	GO:0006897	P
FBgn0025724	GO:0006887	P
FBgn0025724	GO:0006886	P
FBgn0025724	GO:0006890	P
FBgn0025724	GO:0030126	C
FBgn0025724	GO:0008565	F
FBgn0043467	GO:0048149	P

Here's where the gene aliases become important. The fbGO table is indexed by identifiers starting with FBgq. These are Flybase/Ensembl identifiers. The gene table is indexed by Flybase annotation symbols beginning with CG. We will need a way to translate between the two sets of identifiers. This requires another table, the flyBase2004Xref table from UCSC. The xref table looks like:

name	symbol	fbtr	fbgn	fbpp	fban	type
CG4696-RB	Mp20	FBtr0005009	FBgn0002789	FBpp0099513	FBan0004696
CG17291-RA	Pp2A-29B	FBtr0005088	FBgn0005776	FBpp0099974	FBan0013383
CG10890-RA	mus201	FBtr0005673	FBgn0002887	FBpp0099904	FBan0032956
CG10890-RB	mus201	FBtr0005674	FBgn0002887	FBpp0099905	FBan0032956
CG3484-RA	Adhr	FBtr0006151	FBgn0000056	FBpp0099544	FBan0032954
CG9565-RA	Nep3	FBtr0070000	FBgn0031081	FBpp0070000	FBan0009565
CR32826-RA	tRNA:CR32826-RA	FBtr0070001	FBgn0052826			tRNA
CG9570-RA	CG9570	FBtr0070002	FBgn0031085	FBpp0070001	FBan0009570
CG32825-RA	Or19b	FBtr0070003	FBgn0062565	FBpp0070002	FBan0032825
CG15322-RA	CG15322	FBtr0070005	FBgn0031088	FBpp0070004	FBan0015322

This table show the relationship between the CG identifiers and the Fbgn identifiers in the GO table.

The relationship among all of these files is shown here.

We will actually add another table for non-coding genes. It has a format similar to the gene table.

The map is not the territory

With these tables in mind, we will need some objects to contain the data. To hold the gene information, and eventually, the read counts and GO annotation, we will create a Gene class:

class Gene(object):
    """
    FlyBaseGeneClass: A class for coding genes
    UCSC gene tables are tab delimited. Instances of this class are initialized
    with a text string from the gene table.
    Note: All coordinates in these tables are half-open zero-based.
          This means that the first 100 bases of a chromosome are
          represented as [0,100), i.e. 0-99. The second 100 bases are
          represented as [100,200), i.e. 100-199, and so forth.
          see http://www.encodeproject.org/cgi-bin/hgTables?hgsid=366612875
    """

    def __init__(self, gene_str):
    self.type = 'coding'

    data_list = gene_str.split('\t')
    self.name = data_list[1]
    self.chrom = data_list[2]
    self.strand = data_list[3]
    self.txStart = int(data_list[4])
    self.txEnd = int(data_list[5])
    self.cdsStart = int(data_list[6])
    self.cdsEnd = int(data_list[7])
    self.exonCount = int(data_list[8])

    exonStarts = data_list[9].split(',')
    exonStarts.pop()   # remove empty item at end
    exonEnds = data_list[10].split(',')
    exonEnds.pop()

    # use an IntervalTree for fast searching
    self.exons = tr.IntervalTree()
    for start, end in zip(exonStarts, exonEnds):
         self.exons.insert(tr.interval(self.name, int(start), int(end)))

    self.symbol = None
    self.flybase_gene = None
    self.go_terms = []
    self.read_count = 0
    self.reads_in_exons = 0

The initialization routine for this class is passed a line from the gene table. It splits the line and fills out the various fields. Objects of this class have slots for a list of GO terms, read counts, and something called an IntervalTree to store exon locations. More about that structure in a moment. We read the gene file line by line and create a gene object for each line. We will store the genes in a dictionary keyed on the gene name (the CG identifier).

GO annotation records are stored in a GO object, defined as:

class GO(object):
    """
    GO: a class for GO annotation
    The structure is pretty self-expanatory
    is_a is a list of GO terms
    """

    def __init__(self, go_id, name, namespace, definition, is_a):
        self.id = go_id
        self.name = name
        self.namespace = namespace
        self.definition = definition
        self.is_a = is_a

Parsing the GO records is straight forward.

def ReadGOFile(go_file):
    """
    ReadGOFile: read a Gene ontology obo file and create a dictionary of
                GO terms. This is a simple parser for obo files
                Parsing is kind of weirds because Python's limited flow control
                structures, i.e no assignment in conditions, no switch etc.

    go_file - GO obo file see http://geneontology.org/ for details

    return: a dictionary of GO objects keyed on GO id
    """

    go = dict()
    with open(go_file, 'r') as f:
        go_id = None
        for line in f:
            if re.search('^\\[Term\\]', line):
                if go_id:
                    g = GO(go_id, name, namespace, definition, is_a)
                    go[g.id] = g
                    for alt_id in alt_ids:
                        go[alt_id] = g
                go_id = None
                is_a = []
                name = None
                namespace = None
                definition = None
                alt_ids = []
                continue

            m = re.search('^id: (GO:\d+)', line)
            if m:
                go_id = m.group(1)
                continue

            m = re.search('^name: (.+?)\n', line)
            if m:
                name = m.group(1)
                continue

            m = re.search('^namespace: (.+?)\n', line)
            if m:
                namespace = m.group(1)
                continue

            m = re.search('^def: (.+?)\n', line)
            if m:
                definition = m.group(1)
                continue

            m = re.search('^alt_id: (GO:\d+)', line)
            if m:
                alt_ids.append(m.group(1))

            m = re.search('^is_a: (GO:\d+)', line)
            if m:
                is_a.append(m.group(1))

        return go

GO objects are stored in a dictionary, keyed by GO identifiers.

Reading the xref file allows us to add Flybase identifiers to the genes in the gene dictionary.

def ReadXrefFile(xref_file, genes):
    """
    ReadXrefFile: Read the UCSC gene-flybase xref table file and
                  add gene symbol and FlyBase Id to gene object
                  It also builds a dictionary of FB ids => UCSC gene names

    xref_file - the name of a UCSC gene/FlyBase xref table file
    genes - a dictionary of Gene objects

    return:
    fb_genes - a dictionary keyed on FBgn ids. Each item is a list of genes
    """
    fb_genes = dict()
    with open(xref_file, 'r') as f:
        f.readline()      # throw away header line
        for line in f:
            data_list = line.split('\t')
            if data_list[0] in genes:
                genes[data_list[0]].symbol = data_list[1]
                genes[data_list[0]].flybase_gene = data_list[4]
                if data_list[4] not in fb_genes:
                    fb_genes[data_list[4]] = []
                fb_genes[data_list[4]].append(genes[data_list[0]])

    return fb_genes

The fb_gene dictionary that is returned is keyed on Flybase identifiers. It's values are a list of gene objects that Flybase identifier applies to.

The FBgo table contains the GO identifiers associated with the various genes. It's indexed by FBgn identifiers. Reading the FBgo table, along with the fb_genes dictionary will give us the links needed to add GO terms to the genes.

def ReadFBGOFile(fbgo_file, fb_genes):
    """
    ReadFBGOFile: Read the Flybase GO annotation file and add GO terms to genes
    fbgo_file - UCSC file mapping FB ids to GO terms
    fb_genes - dictionary of Gene objects keyed on FBgn id
    """

    with open(fbgo_file, 'r') as f:
        f.readline()      # throw away header line
        for line in f:
            data_list = line.strip().split('\t')
            if data_list[0] in fb_genes:
                for gene in fb_genes[data_list[0]]:
                    gene.go_terms.append(data_list[1])

Searching the forest for a tree

Let's look at how we read the gene file:

def ReadGeneFile(gene_file, chrom, gene_tree):
    """
    ReadGeneFile: Read the UCSC gene table file
    gene_file - the name of a UCSC gene table file
    gene_tree - an IntervalTree of gene locations

    returns: a dictionary of gene objects keys on gene names (CG names)
    """

    genes = dict()
    with open(gene_file, 'r') as f:
        f.readline()      # throw away header line
        for line in f:
            gene = Gene(line.strip())
            if gene.chrom == chrom:
                genes[gene.name] = gene
                gene_tree.insert(tr.interval(chrom, gene.txStart, gene.txEnd),
                                 other = gene)

    return genes

This routine is passed the file name, chromosome label, and something called gene_tree. gene_tree is an interval tree. An interval tree is a binary tree structure that allows efficient searching to locate all intervals that overlap with any given interval or point. We have a series intervals describing gene locations on the chromosome. In bam files, we have thousands, maybe millions of reads that map to a chromosome. We want to know which, if any, gene and exon that the read overlaps. The naive approach would be to search the gene dictionary and look for a possible overlap with each gene. Fly chromosome 2L has 3769 protein coding genes. This isn't really a lot, but with thousands of reads, and multiple experiments to go through, that's a lot of searching. It requires O(N) for each search, where N is the number of genes. Interval trees gives us a more efficient method of searching, reducing the search time to O(log(N)).

If you gooogle python interval tree, you find a number of software packages for interval trees. Each has slightly different features. We need to store the transcript start and end locations for each gene. A complication arises because of alternate splicing of genes. We often see several different genes that have the same transcript coordinates, but different exon locations. For each transcript interval, we need to maintain a list of gene products from that transcript.

After looking at a couple of different python interval tree modules, I decided to try quicksect from bx-python. The quicksect tree node initialization has an other field, that allows us to pass a gene to be associated with the transcript interval. To deal with the issue of alternate splicing of transcripts, I modified the interval tree code to maintain a list of genes with the same transcript at each tree node. The original code is here and the modified version can be downloaded here.

With all this in mind, we can see how we read the gene file and build the tree.

def ReadGeneFile(gene_file, chrom, gene_tree):
    """
    ReadGeneFile: Read the UCSC gene table file
    gene_file - the name of a UCSC gene table file
    gene_tree - an IntervalTree of gene locations

    returns: a dictionary of gene objects keys on gene names (CG names)
    """

    genes = dict()
    with open(gene_file, 'r') as f:
        f.readline()      # throw away header line
        for line in f:
            gene = Gene(line.strip()) # create a new Gene object
            if gene.chrom == chrom:
                genes[gene.name] = gene
                gene_tree.insert(tr.interval(chrom, gene.txStart, gene.txEnd),
                                 other = gene)

    return genes

gene_tree.insert is passed an interval (transcript start and end locations) and the gene object. The interval will be inserted in the tree along with the gene object. If the interval already exists, the gene object will be added to the list of genes with those transcript locations.

Finally, we read the bam file and search the tree to find which transcript and exon that read maps to. Exons are stored in a tree associated with the particular gene.

def ReadBAMfile(bam_file, chrom, gene_tree):
    """
    ReadBAMfile: read a bam file and count reads overlapping genes
    bam_file - the bam_file name
    chrom - chromosome
    gene_tree - an IntervalTree. Gene search tree, the intervals are Tx ranges
                We look for overlaps with genes and increment the read count for
                the gene.
    """

    f = pysam.Samfile(bam_file, 'rb')
    for read in f.fetch():
        out = []
        gene_tree.intersect(tr.interval(chrom, read.positions[0], read.positions[-1]),
                            lambda x: out.append(x))
        for x in out:   # out is a list of overlaps
            for gene in x.chain:    # loop through each gene in the chain
                gene.read_count += 1
                exons = gene.exons
                out2 = []
                # now search the exons for hits
                exons.intersect(tr.interval(gene.name, read.positions[0], read.positions[-1]),
                                lambda y: out2.append(y))
                for x2 in out2:
                    gene.reads_in_exons += 1

The whole program to read the various files and print the counts of reads overlapping the transcript is available for download along with the interval tree module.

The Analytic Garden

Let's GO for it

What's in a name?

The map is not the territory

Searching the forest for a tree

No comments:

Post a Comment

Labels

Contributors

wfmu