Posts

Showing posts from May, 2014

Sense and Ant-sense

In genomics sense refers to the direction of transcription of translation. A strand of DNA is called the sense strand if an RNA version of the same sequence is translated or translatable into protein. In the genome reference sequence genes are labeled as being on the '+' or '-' strand to indicate their sense. We want to count the number of reads mapped to genes and exons in the sense and anti-sense directions. As we did, previously, we will use an interval tree to store the gene and exon locations. Given the location, we search the gene interval tree for an overlap. If found and the direction of the read mapping matches the gene direction, we count it as being in the sense direction. We perform a similar search for the exons. One complication is that because of alternate splicing, a read may be indicated as overlapping several genes. We will need to be careful to count each read only once. def ReadBamFile(chrom, chrom_path, bam_file, gene_tree): ""...

The Masked Genome

Often, a genome is masked to replace repeat structures in order to map reads to the genome sequences and avoid problems caused by reads being mapped multiple times to repeats. However, we're interested in what happens in transposons and repeat elements, so we're going to take the opposite approach. We will create a set of genome sequences with everything but the sequences of certain repeat elements masked.The particular type of transposable element that we are interested in is Hoppel , also know as Proto-P. There are a little over 3000 such elements listed in the drosophila melanogaster genome. The common method of masking a genome is to replace regions that you do not wish reads to map to with N's. We create a series of sequences of the same length as the genomic sequences, but consisting of nothing but N characters. We will then insert the Hoppel sequences in their locations in the genome sequences. The first step is to download the RepeatMasker database from UCSC ....

Searching for SNPs and Edit Sites

Image
ModEncode is a big project. Our interest is in a small part of that, predicted edits in flies. The supplemental files to (St. Laurent et al. 2013) contain a spreadsheet. Table 7 of this spread sheet contains three tables of predicted edits from the ModEncode project. The tables look like this We want to see if our data supports edits or SNPs at the reported positions. We can use Pandas to read the spreadsheet table and PySam to iterate over the reads. As we iterate through the reads, we'll count coverage and look for possible mismatches to the reference genome. We will store the sites from  spreadsheet in a site object class Site(object): """ Site - ModEncode predicted SNP init is passed a 7 element list from the supplemental spreadsheet it also contains dictionaries for counting the coverage at the site position. """ def __init__(self, data_series, what): self.what = what self.name = data_series[0]...

More pie, please

This always happens in computation biology and bioinformatics, analysis of a set of data opens up more questions. Recently, we looked at analyzing and plotting the distribution of predicted snps . Now, we would like to see how the snps within coding regions cause synonymous or non-synonymous changes to codons. The process of determining whether a snp causes a synonymous or non-synonymous change is, in principle, straight forward. Start at the beginning of the CDS and count off by threes. You can then map each three-mer to its amino acid. The snp will modify the three-mer and represent a possible synonymous substitution (no change in the amino acid) or a non-synonymous substitution (an change to a different amino acid). Each three-mer is a codon. Problems arise, however, because there are occasionally mismatches between the annotated exon positions and the annotated protein sequence. Luckily, in relation to the total number of annotated genes these problems are few. However, we ha...

Read the Directions

This time, we'll look at a simple task, counting reads mapped to the genome in forward and reverse direction. Once we have the counts, we'll plot histograms of the distribution of coverage for the forward and reverse reads. To do the counting, once again we'll use PySam . We can use pileup to give us the reads for each genomic position and we use a Numpy integer array to hold the counts. As we did previously, we will eliminate any read with a base call quality below 30. Here's the counting code. It's similar to what we did before. We use pileup() to loop through each genomic position. Pileup allows us to iterate over each read. From the read, we can get the direction and increment the appropriate array position. def ReadBamFile(chrom, chrom_path, bam_file): """ ReadBamFile - read a bam file and count coverage chrom - chromosome name e.g. chr2L chrom_path - path to the genome files bam_file - input bam file Return: fo...