Posts

Showing posts from July, 2014

GC content

The GC content of a sequence is the percentage of G's and C's in a fragment or an entire genome. Genes are often characterized by a higher GC content than background.Variations in GC content can lead to bias in read coverage in ChIP-seq and RNA-seq experiments. See Benjamini and Speed . Calculating GC content is straight forward. Select a segment of DNA or RNA sequence and count A's, C's etc. and calculate \[gc = \frac{{G + C}}{{A + C + G + T}}\]. Often GC content is calculated for an entire chromosome by calculating the GC content in a series of either overlapping or sliding windows of fixed size.  Our problem is slightly more complicated. We want to calculate the number of reads mapped to various values of GC content ranging from 0 to 100%. The first step is to calculate the GC content for a chromosome using a series of non-overlapping, fixed width windows. We can use Python string functions to do this. def GC_pct(seq_rec, window): """ ...

From FASTQ to FASTA

Probably the most common bioinformatics task is conversion between file formats. Recently, we needed to extract sequence data from a FASTQ file and write the sequences to a FASTA file. If all you want to do is extract the sequences, it's very easy in BioPython . Our task was slightly more difficult, but not by much. We also want to filter reads with low basecall quality. Our usual criteria is to reject any read with one or more bases with a quality score lower than thirty. A second, minor problem is that some of our FASQ files are in zip format, others are gzipped, and some are just text. We will use BioPython's SeqIO module to handle reading and writing the sequences. We use Python's ZipFile module to handle zipped files and the gzip module for gzipped files. The code is relatively simple. We open the FASTQ file, after first checking for the appropriate file type; read it one record at a time; check for low quality bases; if the record passes teh check, write it to ...

Duplicates and what to do with them

Ever since I started dealing with ChIP-Seq and RNA-Seq data, PCR duplicates have been a bit of a mystery to me. Googling for PCR duplicates provides a variety of opinions about how they arise, how to handle them and so forth. The best introduction I have encountered is the CureFFI.org blog. I going to swipe their description of the library prep process because it's clear and concise (they actually got it from here ): Shatter genomic DNA, e.g. with a sonicator or DNAase. Ligate adapters to both ends of the fragments. Not every molecule successfully ligates. PCR amplify the fragments with adapters Spread DNA molecules across flowcells.  The goal is to get exactly one DNA molecule per flowcell lawn of primers.  This depends purely on probability, based on the concentration of DNA. Use bridge PCR to amplify the single molecule on each bead or each lawn so that you can get a strong enough signal (whether light or pH) to detect.  Usually this requires several hundre...