Although conceptually simple, actually finding the gene products from a set of reads can be tricky. However, there is software available that seems like it should make the process if not easy, at least doable. In python, there is some GO analysis software, some rudimentary BioPython GO software, and a parser for the GO database. For perl, there is an extensive set of GO modules in the go-perl library. The R BioConductor project has probably the most extensive GO analysis collection.
Initially, I figured that with all of that software, it shouldn't be too difficult to put something together. After spending some time with the limited documentation available, I decided that BioConductor had the most to offer. This looked like a possible bet. Unfortunately, much BioConductor documentation consists of vignettes. These are extensive examples of using the particular package. Although, I usually prefer learning from examples, the examples depended too much on the particulars of the data being presented. When I tried to extend the methods to my own data, I ran into problems. The R documentation on the various methods for reading bam files, GO terms, gene annotations etc. was complicated. I soon realized that I would have to spend a long time digging through the docs and in the end this package might not be the right one anyway.
According to Larry Wall, the original developer of perl, the three programmer virtues are: laziness, impatience, and hubris. If impatience is a virtue, impatience with programming docs is one of my greatest virtues. After spending too many hours with the BioConductor docs and considering that hubris might also be a virtue, I decided to roll my own GO analysis.
What's in a name?
One of the first problems encountered when dealing with genomics involves gene names. Genes typically have multiple aliases. Consider the region from 332,801 to 339,016 on D. melanogaster chromosome 3L. It is variously identified as mthl10, methuselah-like 10, NM_001274276.1, CG17061, and FBgn0035132.Aligning reads with Bowtie2 or BWA will give us genomic coordinates for each mapped read. However, alignment doesn't tell us what features the reads might cover. In order to see that, we need to compare the read locations to the locations of features like genes and exons.
One convenient web location to get information about genes and other genomic features is the UCSC Table Browser. From the Table Browser, you can download the data for any of the UCSC supported browser tracks. The tables are tab delimited text files and are easy to parse with python or perl. It sometimes takes a little digging on the Table Browser web site to find what you need, but if it's displayed on the browser tracks, you should be able to find a table or collection of tables with the data you need.
The first collection of data we need are the gene and exon locations. We can find that information in the flybaseGene table. The first few lines of this table look like:
bin | name | chrom | strand | txStart | txEnd | cdsStart | cdsEnd | exonCount | exonStarts | exonEnds |
---|---|---|---|---|---|---|---|---|---|---|
585 | CG11023-RA | chr2L | + | 7528 | 9491 | 7679 | 9276 | 3 | 7528,8228,8667, | 8116,8589,9491, |
585 | CG2671-RC | chr2L | - | 9835 | 18583 | 11214 | 17136 | 9 | 9835,11409,11778,12285,13519,13682,14932,17052,18260, | 11344,11518,12221,12928,13625,14874,15711,17212,18583, |
585 | CG2671-RB | chr2L | - | 9835 | 21372 | 11214 | 19944 | 9 | 9835,11409,11778,12285,13519,13682,14932,19884,21135, | 11344,11518,12221,12928,13625,14874,15711,20020,21372, |
585 | CG2671-RF | chr2L | - | 9835 | 21372 | 11214 | 15648 | 10 | 9835,11409,11778,12285,13519,13682,14932,19879,21065,21348, | 11344,11518,12221,12928,13625,14874,15711,20020,21200,21372, |
585 | CG2671-RE | chr2L | - | 9835 | 21372 | 11214 | 15648 | 9 | 9835,11409,11778,12285,13519,13682,14932,19879,21065, | 11344,11518,12221,12928,13625,14874,15711,20015,21372, |
585 | CG2671-RA | chr2L | - | 9835 | 18583 | 11214 | 17136 | 10 | 9835,11409,11778,12285,13519,13682,14932,17052,18025,18260, | 11344,11518,12221,12928,13625,14874,15711,17212,18168,18583, |
585 | CG2671-RD | chr2L | - | 9835 | 21372 | 11214 | 15648 | 9 | 9835,11409,11778,12285,13519,13682,14932,19879,21065, | 11344,11518,12221,12928,13625,14874,15711,20020,21372, |
585 | CG2657-RB | chr2L | - | 21918 | 25151 | 21918 | 25151 | 5 | 21918,22742,22993,23928,24746, | 22687,22935,23873,24210,25151, |
585 | CG31973-RE | chr2L | - | 25401 | 59242 | 26520 | 58150 | 13 | 25401,26765,27052,28410,28732,28981,33844,34557,34719,38534,39300,58080,59189, | 26688,26964,27490,28639,28926,29068,34288,34604,35212,38731,39857,58182,59242, |
585 | CG31973-RB | chr2L | - | 25401 | 59242 | 26520 | 58150 | 14 | 25401,26765,27052,28014,28732,28981,30393,33844,34557,34719,38534,39300,58080,59189, | 26688,26964,27490,28240,28926,29068,33270,34288,34604,35212,38731,39857,58182,59242, |
txStart and txEnd are the transcript start and end locations. cdsStart and cdsEnd are the coding region locations.
We will also need a description of GO terms and their definitions. We get that from the Gene Ontology web site. Ontology files come in various formats. We will use an .obo file, since it's an easy to parse text file. A typical GO record in an obo file look like:
[Term] id: GO:0000001
name: mitochondrion inheritance
namespace: biological_process
def: "The distribution of mitochondria, including the mitochondrial genome, into daughter cells after mitosis or meiosis, mediated by interactions between mitochondria and the cytoskeleton." [GOC:mcc, PMID:10873824, PMID:11389764]
synonym: "mitochondrial inheritance" EXACT []
is_a: GO:0048308 ! organelle inheritance
is_a: GO:0048311 ! mitochondrion distribution
Next, we need some way to map the GO terms to genes. The UCSC table browser allows us to download the fbGO table. The entries for this table look like:
|
Here's where the gene aliases become important. The fbGO table is indexed by identifiers starting with FBgq. These are Flybase/Ensembl identifiers. The gene table is indexed by Flybase annotation symbols beginning with CG. We will need a way to translate between the two sets of identifiers. This requires another table, the flyBase2004Xref table from UCSC. The xref table looks like:
name | symbol | synonyms | fbtr | fbgn | fbpp | fban | type |
---|---|---|---|---|---|---|---|
CG4696-RB | Mp20 | FBtr0005009 | FBgn0002789 | FBpp0099513 | FBan0004696 | ||
CG17291-RA | Pp2A-29B | FBtr0005088 | FBgn0005776 | FBpp0099974 | FBan0013383 | ||
CG10890-RA | mus201 | FBtr0005673 | FBgn0002887 | FBpp0099904 | FBan0032956 | ||
CG10890-RB | mus201 | FBtr0005674 | FBgn0002887 | FBpp0099905 | FBan0032956 | ||
CG3484-RA | Adhr | FBtr0006151 | FBgn0000056 | FBpp0099544 | FBan0032954 | ||
CG9565-RA | Nep3 | FBtr0070000 | FBgn0031081 | FBpp0070000 | FBan0009565 | ||
CR32826-RA | tRNA:CR32826-RA | FBtr0070001 | FBgn0052826 | tRNA | |||
CG9570-RA | CG9570 | FBtr0070002 | FBgn0031085 | FBpp0070001 | FBan0009570 | ||
CG32825-RA | Or19b | FBtr0070003 | FBgn0062565 | FBpp0070002 | FBan0032825 | ||
CG15322-RA | CG15322 | FBtr0070005 | FBgn0031088 | FBpp0070004 | FBan0015322 |
This table show the relationship between the CG identifiers and the Fbgn identifiers in the GO table.
The relationship among all of these files is shown here.
We will actually add another table for non-coding genes. It has a format similar to the gene table.
The map is not the territory
With these tables in mind, we will need some objects to contain the data. To hold the gene information, and eventually, the read counts and GO annotation, we will create a Gene class:
class Gene(object):
"""
FlyBaseGeneClass: A class for coding genes
UCSC gene tables are tab delimited. Instances of this class are initialized
with a text string from the gene table.
Note: All coordinates in these tables are half-open zero-based.
This means that the first 100 bases of a chromosome are
represented as [0,100), i.e. 0-99. The second 100 bases are
represented as [100,200), i.e. 100-199, and so forth.
see http://www.encodeproject.org/cgi-bin/hgTables?hgsid=366612875
"""
def __init__(self, gene_str):
self.type = 'coding'
data_list = gene_str.split('\t')
self.name = data_list[1]
self.chrom = data_list[2]
self.strand = data_list[3]
self.txStart = int(data_list[4])
self.txEnd = int(data_list[5])
self.cdsStart = int(data_list[6])
self.cdsEnd = int(data_list[7])
self.exonCount = int(data_list[8])
exonStarts = data_list[9].split(',')
exonStarts.pop() # remove empty item at end
exonEnds = data_list[10].split(',')
exonEnds.pop()
# use an IntervalTree for fast searching
self.exons = tr.IntervalTree()
for start, end in zip(exonStarts, exonEnds):
self.exons.insert(tr.interval(self.name, int(start), int(end)))
self.symbol = None
self.flybase_gene = None
self.go_terms = []
self.read_count = 0
self.reads_in_exons = 0
"""
FlyBaseGeneClass: A class for coding genes
UCSC gene tables are tab delimited. Instances of this class are initialized
with a text string from the gene table.
Note: All coordinates in these tables are half-open zero-based.
This means that the first 100 bases of a chromosome are
represented as [0,100), i.e. 0-99. The second 100 bases are
represented as [100,200), i.e. 100-199, and so forth.
see http://www.encodeproject.org/cgi-bin/hgTables?hgsid=366612875
"""
def __init__(self, gene_str):
self.type = 'coding'
data_list = gene_str.split('\t')
self.name = data_list[1]
self.chrom = data_list[2]
self.strand = data_list[3]
self.txStart = int(data_list[4])
self.txEnd = int(data_list[5])
self.cdsStart = int(data_list[6])
self.cdsEnd = int(data_list[7])
self.exonCount = int(data_list[8])
exonStarts = data_list[9].split(',')
exonStarts.pop() # remove empty item at end
exonEnds = data_list[10].split(',')
exonEnds.pop()
# use an IntervalTree for fast searching
self.exons = tr.IntervalTree()
for start, end in zip(exonStarts, exonEnds):
self.exons.insert(tr.interval(self.name, int(start), int(end)))
self.symbol = None
self.flybase_gene = None
self.go_terms = []
self.read_count = 0
self.reads_in_exons = 0
The initialization routine for this class is passed a line from the gene table. It splits the line and fills out the various fields. Objects of this class have slots for a list of GO terms, read counts, and something called an IntervalTree to store exon locations. More about that structure in a moment. We read the gene file line by line and create a gene object for each line. We will store the genes in a dictionary keyed on the gene name (the CG identifier).
GO annotation records are stored in a GO object, defined as:
class GO(object):
"""
GO: a class for GO annotation
The structure is pretty self-expanatory
is_a is a list of GO terms
"""
def __init__(self, go_id, name, namespace, definition, is_a):
self.id = go_id
self.name = name
self.namespace = namespace
self.definition = definition
self.is_a = is_a
"""
GO: a class for GO annotation
The structure is pretty self-expanatory
is_a is a list of GO terms
"""
def __init__(self, go_id, name, namespace, definition, is_a):
self.id = go_id
self.name = name
self.namespace = namespace
self.definition = definition
self.is_a = is_a
Parsing the GO records is straight forward.
def ReadGOFile(go_file):
"""
ReadGOFile: read a Gene ontology obo file and create a dictionary of
GO terms. This is a simple parser for obo files
Parsing is kind of weirds because Python's limited flow control
structures, i.e no assignment in conditions, no switch etc.
go_file - GO obo file see http://geneontology.org/ for details
return: a dictionary of GO objects keyed on GO id
"""
go = dict()
with open(go_file, 'r') as f:
go_id = None
for line in f:
if re.search('^\\[Term\\]', line):
if go_id:
g = GO(go_id, name, namespace, definition, is_a)
go[g.id] = g
for alt_id in alt_ids:
go[alt_id] = g
go_id = None
is_a = []
name = None
namespace = None
definition = None
alt_ids = []
continue
m = re.search('^id: (GO:\d+)', line)
if m:
go_id = m.group(1)
continue
m = re.search('^name: (.+?)\n', line)
if m:
name = m.group(1)
continue
m = re.search('^namespace: (.+?)\n', line)
if m:
namespace = m.group(1)
continue
m = re.search('^def: (.+?)\n', line)
if m:
definition = m.group(1)
continue
m = re.search('^alt_id: (GO:\d+)', line)
if m:
alt_ids.append(m.group(1))
m = re.search('^is_a: (GO:\d+)', line)
if m:
is_a.append(m.group(1))
return go
"""
ReadGOFile: read a Gene ontology obo file and create a dictionary of
GO terms. This is a simple parser for obo files
Parsing is kind of weirds because Python's limited flow control
structures, i.e no assignment in conditions, no switch etc.
go_file - GO obo file see http://geneontology.org/ for details
return: a dictionary of GO objects keyed on GO id
"""
go = dict()
with open(go_file, 'r') as f:
go_id = None
for line in f:
if re.search('^\\[Term\\]', line):
if go_id:
g = GO(go_id, name, namespace, definition, is_a)
go[g.id] = g
for alt_id in alt_ids:
go[alt_id] = g
go_id = None
is_a = []
name = None
namespace = None
definition = None
alt_ids = []
continue
m = re.search('^id: (GO:\d+)', line)
if m:
go_id = m.group(1)
continue
m = re.search('^name: (.+?)\n', line)
if m:
name = m.group(1)
continue
m = re.search('^namespace: (.+?)\n', line)
if m:
namespace = m.group(1)
continue
m = re.search('^def: (.+?)\n', line)
if m:
definition = m.group(1)
continue
m = re.search('^alt_id: (GO:\d+)', line)
if m:
alt_ids.append(m.group(1))
m = re.search('^is_a: (GO:\d+)', line)
if m:
is_a.append(m.group(1))
return go
GO objects are stored in a dictionary, keyed by GO identifiers.
Reading the xref file allows us to add Flybase identifiers to the genes in the gene dictionary.
def ReadXrefFile(xref_file, genes):
"""
ReadXrefFile: Read the UCSC gene-flybase xref table file and
add gene symbol and FlyBase Id to gene object
It also builds a dictionary of FB ids => UCSC gene names
xref_file - the name of a UCSC gene/FlyBase xref table file
genes - a dictionary of Gene objects
return:
fb_genes - a dictionary keyed on FBgn ids. Each item is a list of genes
"""
fb_genes = dict()
with open(xref_file, 'r') as f:
f.readline() # throw away header line
for line in f:
data_list = line.split('\t')
if data_list[0] in genes:
genes[data_list[0]].symbol = data_list[1]
genes[data_list[0]].flybase_gene = data_list[4]
if data_list[4] not in fb_genes:
fb_genes[data_list[4]] = []
fb_genes[data_list[4]].append(genes[data_list[0]])
return fb_genes
"""
ReadXrefFile: Read the UCSC gene-flybase xref table file and
add gene symbol and FlyBase Id to gene object
It also builds a dictionary of FB ids => UCSC gene names
xref_file - the name of a UCSC gene/FlyBase xref table file
genes - a dictionary of Gene objects
return:
fb_genes - a dictionary keyed on FBgn ids. Each item is a list of genes
"""
fb_genes = dict()
with open(xref_file, 'r') as f:
f.readline() # throw away header line
for line in f:
data_list = line.split('\t')
if data_list[0] in genes:
genes[data_list[0]].symbol = data_list[1]
genes[data_list[0]].flybase_gene = data_list[4]
if data_list[4] not in fb_genes:
fb_genes[data_list[4]] = []
fb_genes[data_list[4]].append(genes[data_list[0]])
return fb_genes
The FBgo table contains the GO identifiers associated with the various genes. It's indexed by FBgn identifiers. Reading the FBgo table, along with the fb_genes dictionary will give us the links needed to add GO terms to the genes.
def ReadFBGOFile(fbgo_file, fb_genes):
"""
ReadFBGOFile: Read the Flybase GO annotation file and add GO terms to genes
fbgo_file - UCSC file mapping FB ids to GO terms
fb_genes - dictionary of Gene objects keyed on FBgn id
"""
with open(fbgo_file, 'r') as f:
f.readline() # throw away header line
for line in f:
data_list = line.strip().split('\t')
if data_list[0] in fb_genes:
for gene in fb_genes[data_list[0]]:
gene.go_terms.append(data_list[1])
"""
ReadFBGOFile: Read the Flybase GO annotation file and add GO terms to genes
fbgo_file - UCSC file mapping FB ids to GO terms
fb_genes - dictionary of Gene objects keyed on FBgn id
"""
with open(fbgo_file, 'r') as f:
f.readline() # throw away header line
for line in f:
data_list = line.strip().split('\t')
if data_list[0] in fb_genes:
for gene in fb_genes[data_list[0]]:
gene.go_terms.append(data_list[1])
Searching the forest for a tree
Let's look at how we read the gene file:
def ReadGeneFile(gene_file, chrom, gene_tree):
"""
ReadGeneFile: Read the UCSC gene table file
gene_file - the name of a UCSC gene table file
gene_tree - an IntervalTree of gene locations
returns: a dictionary of gene objects keys on gene names (CG names)
"""
genes = dict()
with open(gene_file, 'r') as f:
f.readline() # throw away header line
for line in f:
gene = Gene(line.strip())
if gene.chrom == chrom:
genes[gene.name] = gene
gene_tree.insert(tr.interval(chrom, gene.txStart, gene.txEnd),
other = gene)
return genes
"""
ReadGeneFile: Read the UCSC gene table file
gene_file - the name of a UCSC gene table file
gene_tree - an IntervalTree of gene locations
returns: a dictionary of gene objects keys on gene names (CG names)
"""
genes = dict()
with open(gene_file, 'r') as f:
f.readline() # throw away header line
for line in f:
gene = Gene(line.strip())
if gene.chrom == chrom:
genes[gene.name] = gene
gene_tree.insert(tr.interval(chrom, gene.txStart, gene.txEnd),
other = gene)
return genes
This routine is passed the file name, chromosome label, and something called gene_tree. gene_tree is an interval tree. An interval tree is a binary tree structure that allows efficient searching to locate all intervals that overlap with any given interval or point. We have a series intervals describing gene locations on the chromosome. In bam files, we have thousands, maybe millions of reads that map to a chromosome. We want to know which, if any, gene and exon that the read overlaps. The naive approach would be to search the gene dictionary and look for a possible overlap with each gene. Fly chromosome 2L has 3769 protein coding genes. This isn't really a lot, but with thousands of reads, and multiple experiments to go through, that's a lot of searching. It requires O(N) for each search, where N is the number of genes. Interval trees gives us a more efficient method of searching, reducing the search time to O(log(N)).
If you gooogle python interval tree, you find a number of software packages for interval trees. Each has slightly different features. We need to store the transcript start and end locations for each gene. A complication arises because of alternate splicing of genes. We often see several different genes that have the same transcript coordinates, but different exon locations. For each transcript interval, we need to maintain a list of gene products from that transcript.
After looking at a couple of different python interval tree modules, I decided to try quicksect from bx-python. The quicksect tree node initialization has an other field, that allows us to pass a gene to be associated with the transcript interval. To deal with the issue of alternate splicing of transcripts, I modified the interval tree code to maintain a list of genes with the same transcript at each tree node. The original code is here and the modified version can be downloaded here.
With all this in mind, we can see how we read the gene file and build the tree.
def ReadGeneFile(gene_file, chrom, gene_tree):
"""
ReadGeneFile: Read the UCSC gene table file
gene_file - the name of a UCSC gene table file
gene_tree - an IntervalTree of gene locations
returns: a dictionary of gene objects keys on gene names (CG names)
"""
genes = dict()
with open(gene_file, 'r') as f:
f.readline() # throw away header line
for line in f:
gene = Gene(line.strip()) # create a new Gene object
if gene.chrom == chrom:
genes[gene.name] = gene
gene_tree.insert(tr.interval(chrom, gene.txStart, gene.txEnd),
other = gene)
return genes
"""
ReadGeneFile: Read the UCSC gene table file
gene_file - the name of a UCSC gene table file
gene_tree - an IntervalTree of gene locations
returns: a dictionary of gene objects keys on gene names (CG names)
"""
genes = dict()
with open(gene_file, 'r') as f:
f.readline() # throw away header line
for line in f:
gene = Gene(line.strip()) # create a new Gene object
if gene.chrom == chrom:
genes[gene.name] = gene
gene_tree.insert(tr.interval(chrom, gene.txStart, gene.txEnd),
other = gene)
return genes
gene_tree.insert is passed an interval (transcript start and end locations) and the gene object. The interval will be inserted in the tree along with the gene object. If the interval already exists, the gene object will be added to the list of genes with those transcript locations.
Finally, we read the bam file and search the tree to find which transcript and exon that read maps to. Exons are stored in a tree associated with the particular gene.
def ReadBAMfile(bam_file, chrom, gene_tree):
"""
ReadBAMfile: read a bam file and count reads overlapping genes
bam_file - the bam_file name
chrom - chromosome
gene_tree - an IntervalTree. Gene search tree, the intervals are Tx ranges
We look for overlaps with genes and increment the read count for
the gene.
"""
f = pysam.Samfile(bam_file, 'rb')
for read in f.fetch():
out = []
gene_tree.intersect(tr.interval(chrom, read.positions[0], read.positions[-1]),
lambda x: out.append(x))
for x in out: # out is a list of overlaps
for gene in x.chain: # loop through each gene in the chain
gene.read_count += 1
exons = gene.exons
out2 = []
# now search the exons for hits
exons.intersect(tr.interval(gene.name, read.positions[0], read.positions[-1]),
lambda y: out2.append(y))
for x2 in out2:
gene.reads_in_exons += 1
"""
ReadBAMfile: read a bam file and count reads overlapping genes
bam_file - the bam_file name
chrom - chromosome
gene_tree - an IntervalTree. Gene search tree, the intervals are Tx ranges
We look for overlaps with genes and increment the read count for
the gene.
"""
f = pysam.Samfile(bam_file, 'rb')
for read in f.fetch():
out = []
gene_tree.intersect(tr.interval(chrom, read.positions[0], read.positions[-1]),
lambda x: out.append(x))
for x in out: # out is a list of overlaps
for gene in x.chain: # loop through each gene in the chain
gene.read_count += 1
exons = gene.exons
out2 = []
# now search the exons for hits
exons.intersect(tr.interval(gene.name, read.positions[0], read.positions[-1]),
lambda y: out2.append(y))
for x2 in out2:
gene.reads_in_exons += 1
No comments:
Post a Comment