In order to deal with finding if a particular SNP lies with 1000 bases of the 3' UTR, we will extend the gene location by 1 kb at the 3' end. To describe the extended gene region, we will use a modification of the Gene class from the GO annotation program. The new class is derived from the Gene class and adds a couple of features to the basic class.
A particular position may overlap multiple types of genomic objects. For example, a position might be contained in the exon of a gene and be overlapped by EST as well. To keep track of the various combinations, we will use a bit mask, where each bit positions indicates a different genomic feature.
class HitType(object):
"""
HitType - bitmap constants for SNP locatiosn types
"""
no_annotation = 0x00
in_5UTR = 0x01
in_3UTR = 0x02
in_CDS = 0x04
in_intron = 0x08
in_est = 0x10
in_ncRNA = 0x20
in_1k3UTR = 0x40 # within 1 kb of 3' UTR
The new extended gene class will use HitType to indicate the gene features at a position.
class GeneExtended(Gene):
"""
GeneExtended - an extension of the Gene class
We want to extent the sequence area that the gene covers to include
1 kb from the 3' UTR
"""
def __init__(self, gene_str):
"""
We add a new start and end location and an extended region to the
3' UTR
"""
Gene.__init__(self, gene_str)
if self.strand == '+':
self.utr3_extended = (self.txEnd + 1, self.txEnd + 1000)
self.extStart = self.txStart
self.extEnd = self.txEnd + 1000
else:
self.utr3_extended = (self.txStart - 1000, self.txStart + 1)
self.extStart = self.txStart - 1000
self.extEnd = self.txEnd
def HitKind(self, pos):
"""
HitKind - method for returning the type of gene feature at pos.
There may be multiple features at a particular position, so we
return a HitType bit mask to indicate all of the features.
"""
hit = 0 # hit is a bit mask
# look for obvious cases
if self.type == 'est':
hit |= HitType.in_est
if self.type == 'non_coding':
hit |= HitType.in_ncRNA
exons = self.exons
out2 = []
# now search the exons for hits
exons.intersect(tr.interval(self.name, pos, pos),
lambda y: out2.append(y))
if out2 != []:
for x2 in out2: # in an exon - check UTRs first
if self.strand == '+':
if pos < self.cdsStart:
hit |= HitType.in_5UTR
elif pos > self.cdsEnd:
hit |= HitType.in_3UTR
else:
hit |= HitType.in_CDS # in an exon
else:
if pos > self.cdsEnd:
hit |= HitType.in_5UTR
elif pos < self.cdsStart:
hit |= HitType.in_3UTR
else:
hit |= HitType.in_CDS # in an exon
else:
# not in exon - in extended region?
if pos >= self.utr3_extended[0] and pos <= self.utr3_extended[1]:
hit |= HitType.in_1k3UTR
else:
# not in exon or extended region, must be in an intron
hit |= HitType.in_intron
return hit
The initialization for this class extends the gene range with new start and end locations. We will insert these locations in the interval tree rather than the transcript range. Note that we have to be careful about the coding strand of the gene. The class also adds a method for determining the type of genomic features that the position overlaps. The HitKind method returns a bit mask indicating the features at that position.
The class for ESTs is similar to the class for ncRNAs introduced with the GO program. We read the EST table from UCSC and add the EST ranges to the interval tree.
Counting SNPs Instead of Sheep
Counting can be a little tricky because we only want to count each SNP once, so we establish a hierarchy for counting. If a position overlaps an exon and an EST, we only count it as an exon. We read the SNP file line by line and search the gene tree for a match to the SNP position. The positions in the SNP file are 1-based and the UCSC genome gene positions are 0-based. HitType2Str is a helper function that decodes the bit map. This function creates the hierarchy for counting SNP positions by the order in which it returns the strings. Along with counting the genome object types, we also write the SNP information to a CSV file. The routine returns a dictionary of counts of the various genome objects.def ReadSNPFile(snp_file, out_file, gene_tree):
"""
Read a SNP prediction file and count the features at the SNP positions
The SNP information is written to out_file
snp_file - name of the SNP file
out_file - output file for list of snp types
returns
a dictionary of SNP_prediction counts
"""
def HitType2Str(hit_mask):
"""
HitType2Str - return a string from the bit mask
This function establishes a priority for counting features.
We only return one string even though a position might cover multiple
features. If a position lies in an exon and an EST, we only count it
as an exon.
"""
if hit_mask & HitType.in_CDS:
return 'CDS' # this means exon
if hit_mask & HitType.in_5UTR:
return 'UTR_5'
if hit_mask & HitType.in_3UTR:
return 'UTR_3'
if hit_mask & HitType.in_intron:
return 'intron'
if hit_mask & HitType.in_ncRNA:
return 'ncRNA'
if hit_mask & HitType.in_est:
return 'EST'
if hit_mask & HitType.in_1k3UTR:
return 'within_1k_3UTR'
return 'no_annotation'
counts = dict(UTR_5=0, CDS=0, UTR_3=0, within_1k_3UTR=0,
intron=0, EST=0, no_annotation=0, ncRNA=0)
output = open(out_file, 'w')
output.write(','.join(['chrom', 'position', 'ref_nucleotide',
'snp_nucleotide', 'coverage', 'posterior_prob',
'snp_prob', 'mask', 'hit_types', 'genes']) + '\n')
with open(snp_file, 'r') as input:
for line in input:
snp = SNP_prediction(line.strip())
# search the gene tree to see if the SNP position overlaps the
# extended gene
out = []
gene_tree.intersect(tr.interval(snp.chrom, snp.position-1,
snp.position-1),
lambda x: out.append(x))
if out == []:
counts['no_annotation'] += 1
else:
pos_type = 0
for x in out: # out is a list of overlaps
for gene in x.chain: # loop through each gene in the chain
pos_type |= gene.HitKind(snp.position-1)
snp.gene_list.append(gene.name)
snp.classification = pos_type
hit_str = HitType2Str(pos_type)
counts[hit_str] += 1 # count the feature
snp.Print(output)
output.close()
return counts
Slicing Pie
The final task we want this program to perform is to draw a pie chart of the percentages of each of object type. This is easy using Matplotlib. counts is the dictionary returned by ReadSNPFile().fracs = []
for cl in counts.keys():
frac = float(counts[cl])/total
fracs.append(frac)
# draw a pie chart and save it
pylab.figure(figsize = (6,6))
explode = [0.025] * len(fracs)
patches, txt, txt2 = pylab.pie(fracs,
labels=counts.keys(),
explode=explode,
autopct='%1.1f%%',
shadow=True)
pylab.legend(patches, counts.keys(), loc='best')
pylab.title(typ, bbox={'facecolor':'0.8', 'pad':5})
pylab.savefig(data_files[typ]['plot_file'], bbox_inches='tight')
Download the whole program.
No comments:
Post a Comment