The Analytic Garden

Posts

Showing posts from March, 2014

Tracking Edits

- March 31, 2014

Previously, we looked at one way to map reads to genes to annotate expressed transcripts with GO annotation. Now, I'd like to look at a collection of sites known to be edited under certain circumstances and see if our sequence reads confirm that A-to-I RNA editing may have taken place. Supplementary Table 29 from this paper (St Laurent et al.)* contains a list of previously known editing sites. The top few lines of the table look like this:

Supplementary Table 29: List of previously known editing sites in Drosophila

| Gene Name | Genome Chr | Genome Strand | Genome Pos | Previous Annotation | Identified by modENCODE |
| --- | --- | --- | --- | --- | --- |
| unc-13-RB | 4 | T | 895009 | Reenan_unc13_A | YES |
| unc-13-RB | 4 | T | 899596 | | YES |
| Caps-RA | 4 | A | 1268833 | Reenan_CAPS_A | YES |
| nAcRbeta-21C-RB | 2L | A | 546751 | | YES |
| syt-RA | 2L | T | 2785542 | Reenan_Syt_... |
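As a rough illustration of what "confirming" an editing site could look like: inosine is read as guanosine during sequencing, so A-to-I editing shows up in reads as A-to-G (or T-to-C when the transcript comes from the opposite strand). The sketch below computes the fraction of edited-looking bases at one site. The function name, the base-count dictionary, and the reading of the "Genome Strand" column as the genomic base at the site are all assumptions made for illustration, not the post's actual method.

```python
# A-to-I editing is read as A-to-G; seen from the opposite strand it is T-to-C.
EDITED_BASE = {"A": "G", "T": "C"}


def editing_fraction(base_counts, ref_base):
    """Fraction of reads at a site showing the edited base.

    base_counts: dict mapping a base to the number of reads showing it
                 (hypothetical pileup counts at one genome position).
    ref_base:    genomic base at the site, "A" or "T" (assumed to be what
                 the table's "Genome Strand" column records).
    Returns None when no read shows either the reference or the edited base.
    """
    ref_base = ref_base.upper()
    if ref_base not in EDITED_BASE:
        raise ValueError("expected a reference base of A or T")
    edited = base_counts.get(EDITED_BASE[ref_base], 0)
    unedited = base_counts.get(ref_base, 0)
    total = edited + unedited
    return edited / total if total else None


# Made-up counts for illustration only.
print(editing_fraction({"A": 30, "G": 10}, "A"))
```

A fraction well above the sequencing error rate would be consistent with editing, although a genomic SNP at the same position would look the same in the reads.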

Let's GO for it

- March 26, 2014

One of the things we would like to know from our sequencing experiments is the number of reads mapped to the various genes and what those genes do. Conceptually, this process is simple: for each read, check which gene it maps to; count the reads overlapping each gene; and, for each gene with coverage, look up what the gene does. The last step involves what is called Gene Ontology (the GO in the title). The Gene Ontology project is an initiative that aims to standardize the representation of gene and gene product attributes across species and databases. Genes are assigned various Gene Ontology identifiers, or GO IDs. Determining a particular gene's products is then a matter of looking up the GO IDs in a gene ontology database. Although conceptually simple, actually finding the gene products from a set of reads can be tricky. However, there is software available that seems like it should make the process, if not easy, at least doable. In Python, there is some GO analysis softwar...
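The three conceptual steps can be sketched in a few lines of plain Python. This is only a toy version under stated assumptions: reads and genes are given as half-open (chromosome, start, end) intervals, the gene names and GO identifiers are made-up placeholders, and the brute-force overlap test stands in for whatever the real software does.

```python
from collections import Counter


def count_reads_per_gene(reads, genes):
    """Count the reads overlapping each gene.

    reads: iterable of (chrom, start, end) tuples, half-open coordinates.
    genes: dict mapping gene name -> (chrom, start, end), half-open.
    A read overlapping several genes is counted once for each of them.
    """
    counts = Counter()
    for chrom, start, end in reads:
        for name, (g_chrom, g_start, g_end) in genes.items():
            # Two half-open intervals overlap when each starts before the other ends.
            if chrom == g_chrom and start < g_end and g_start < end:
                counts[name] += 1
    return counts


def annotate(counts, gene_to_go):
    """Attach the GO IDs (possibly none) to every gene that has coverage."""
    return {gene: (n, gene_to_go.get(gene, [])) for gene, n in counts.items()}


# Hypothetical genes, reads, and GO IDs, for illustration only.
genes = {"geneA": ("2L", 100, 200), "geneB": ("2L", 500, 600)}
reads = [("2L", 90, 110), ("2L", 150, 180), ("2L", 300, 330), ("4", 150, 180)]
gene_to_go = {"geneA": ["GO:EXAMPLE1", "GO:EXAMPLE2"]}

print(annotate(count_reads_per_gene(reads, genes), gene_to_go))
```

The nested loop is quadratic in the worst case, so a real pipeline would use an interval index. The tricky parts in practice are the ones this sketch skips, such as reads spanning splice junctions and genes with overlapping annotations.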

Mismatches and Quality

- March 17, 2014

How does the mismatch rate of Illumina reads vary with quality? We can use reads from the PhiX control runs to examine that question. Our genome center runs a lane of PhiX Control v3 each time they run a set of reads. The PhiX run serves as a quality control for the sequencing experiments. We can use it to examine basic error rates arising from the sequencing process. To do this, we'll use a BAM file that contains reads aligned to the PhiX genome. As mentioned in an earlier post, we found some positions where our reads differed from the official genome by almost 100%. We replaced those positions with a fixed value and created a new genome. In addition, we have one position (position 1301) with a well-known, approximately 50-50 SNP. In our mismatch counting, we will ignore reads that overlap this position as well as reads overlapping the genome end positions. We will also ignore any read with a base call quality below 30. With this in mind, the count procedure consists of usin...
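To make the idea of a per-quality mismatch rate concrete, here is a minimal sketch that works on plain strings rather than a BAM file. It assumes gapless alignments supplied as (read, reference, qualities) triples, and it reads "ignore any read with a base call quality below 30" as dropping a whole read if any of its bases falls below the threshold. Both the function names and that interpretation are assumptions. The Phred relation itself is standard: a quality Q corresponds to an error probability of 10^(-Q/10), so Q30 means about 1 error in 1000 calls.

```python
from collections import defaultdict


def phred_to_error_prob(q):
    """Expected error probability for a Phred quality score q."""
    return 10 ** (-q / 10)


def mismatch_rate_by_quality(alignments, min_quality=30):
    """Observed mismatch rate for each base quality value.

    alignments: iterable of (read_seq, ref_seq, qualities) for gapless
                alignments, where the three items have equal length.
    Reads containing any base below min_quality are skipped entirely.
    """
    counts = defaultdict(lambda: [0, 0])  # quality -> [mismatches, bases seen]
    for read_seq, ref_seq, quals in alignments:
        if not quals or min(quals) < min_quality:
            continue
        for base, ref, q in zip(read_seq, ref_seq, quals):
            counts[q][1] += 1
            if base != ref:
                counts[q][0] += 1
    return {q: mm / total for q, (mm, total) in sorted(counts.items())}


# Toy alignments, for illustration only; the second read is filtered out.
toy = [
    ("ACGT", "ACGA", [35, 35, 40, 40]),
    ("ACGT", "ACGT", [35, 20, 40, 40]),
]
print(mismatch_rate_by_quality(toy))
print(phred_to_error_prob(30))
```

Comparing the observed rate at each quality value with `phred_to_error_prob` of that value shows how well the reported qualities are calibrated.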

Counting yet again...This time RNA secondary structures

- March 16, 2014

The secondary structure of a sequence of nucleotides refers to the base-pairing interactions within a single molecule or between interacting molecules. In an RNA molecule, nucleotides can base-pair with other nucleotides on the same sequence to form C-G, A-U, or G-U pairs. This pairing can yield a complicated 2D structure. The prediction of RNA secondary structure from the primary sequence is a well-developed art, at least in the absence of pseudoknots and if the sequence is not too long. Biologically relevant prediction models depend on energy models of the interactions among the nucleotides. For details of predicting RNA structures, see the Mathews Lab RNAStructure programs or the Vienna RNA Package. We will look at a simpler task: getting a ballpark estimate of the total number of secondary structures available to a sequence. This estimate is likely physically wrong in that it doesn't take the interaction energies into account, so probably allows for non-biol...
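One common way to get such a count is a dynamic program over subsequences: the last base of a stretch is either unpaired, or paired with some earlier compatible base, which splits the stretch into two independent parts. The sketch below counts pseudoknot-free structures this way, using the C-G, A-U, and G-U pairs named above. The function name and the minimum hairpin loop size of 3 unpaired bases are assumptions, and this may not be the exact recurrence used in the full post.

```python
# Pairs allowed in the text: C-G, A-U, and G-U, in either orientation.
ALLOWED_PAIRS = {
    ("C", "G"), ("G", "C"),
    ("A", "U"), ("U", "A"),
    ("G", "U"), ("U", "G"),
}


def count_structures(seq, min_loop=3):
    """Count pseudoknot-free secondary structures of seq, ignoring energies.

    The fully unpaired structure is included in the count.
    min_loop is the minimum number of unpaired bases enclosed by a pair
    (an assumed hairpin constraint; set it to 0 to allow adjacent pairs).
    """
    seq = seq.upper().replace("T", "U")
    n = len(seq)
    # table[i][j] = number of structures on the half-open slice seq[i:j].
    # Slices of length 0 or 1 have exactly one structure (nothing paired).
    table = [[1] * (n + 1) for _ in range(n + 1)]
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length
            last = j - 1
            total = table[i][j - 1]  # case 1: the last base is unpaired
            # case 2: the last base pairs with base k, enclosing >= min_loop bases
            for k in range(i, last - min_loop):
                if (seq[k], seq[last]) in ALLOWED_PAIRS:
                    total += table[i][k] * table[k + 1][last]
            table[i][j] = total
    return table[0][n]


print(count_structures("GAAAC"))
```

The table has O(n^2) entries and each takes O(n) work, so the count costs O(n^3) time. Because Python integers are unbounded, the rapidly growing counts for longer sequences do not overflow.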