GC content

The GC content of a sequence is the percentage of G's and C's in a fragment or an entire genome. Genes are often characterized by a higher GC content than the background. Variations in GC content can lead to bias in read coverage in ChIP-seq and RNA-seq experiments; see Benjamini and Speed.

Calculating GC content is straightforward. Select a segment of DNA or RNA sequence, count the A's, C's, G's, and T's, and calculate \[gc = \frac{G + C}{A + C + G + T}\].
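For example, the 10-base fragment ACGTTACGGA contains 3 G's and 2 C's, so \[gc = \frac{3 + 2}{10} = 0.5\], or 50% GC.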

Often GC content is calculated for an entire chromosome using a series of either overlapping (sliding) or non-overlapping windows of fixed size. Our problem is slightly more complicated: we want to count the reads mapped to segments at each value of GC content, ranging from 0 to 100%.

The first step is to calculate the GC content for a chromosome using a series of non-overlapping, fixed width windows. We can use Python string functions to do this.

def GC_pct(seq_rec, window):
    """
    GC_pct - calculate the GC content of a sequence record in non-overlapping windows
    seq_rec - a SeqRecord
    window - non-overlapping window size
    returns:
    gc - a list of the GC content of each window size segment
    """
    
    gc = []
    
    seq = seq_rec.seq
    print SeqUtils.GC(seq)
    
    for i in xrange(0, len(seq), window):
        s = seq[i:i+window].upper()
        a = s.count('A')
        c = s.count('C')
        g = s.count('G')
        t = s.count('T')
        if a+c+g+t > 0:
            gc.append((g+c)/float(a+c+g+t))
        else:
            gc.append(0.0)

    return gc
 
SeqUtils.GC is a BioPython function. We use it to get the overall GC content of the chromosome.
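For example, a quick check on a short string in the interpreter:

from Bio import SeqUtils

print SeqUtils.GC('ACGGT')   # 60.0 -- three of the five bases are G or C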

Next, we read a bam file and count the reads mapped to each window size segment of the chromosome. There are many ways to do this, but I took the lazy way and let NumPy do the work. The NumPy function digitize takes a list of positions and a list of segment start locations and returns, for each position, the index of the segment it falls into. We then increment the count for the appropriate segment.
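For example, here is digitize on toy numbers (not from any real data), with windows of size 100:

import numpy as np

segments = np.array([0, 100, 200, 300])

# digitize returns 1-based indices: a read starting at position 250 falls in
# the segment beginning at segments[3-1] = 200, i.e. the third window
print np.digitize([50, 250], segments)   # [1 3]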


def ReadBamFile(bam_file, chrom, seq_rec, window):
    """
    ReadBamFile - read a bam file and count the reads starting in each window size segment
    bam_file - input bam file
    chrom - a chromosome id
    seq_rec - the chromosome sequence record
    window - the window size
    returns:
    counts - a numpy array of the counts of reads in each window size segment
    """

    # get the starting location of window size segments
    segments = np.array([x for x in xrange(0, len(seq_rec.seq), window)])
    counts = np.zeros(len(segments))
    
    # iterate through reads and get read starting positions and add to counts
    f = pysam.Samfile(bam_file, 'rb')
    for read in f.fetch(reference=chrom, start=0, end=len(seq_rec.seq)-1):
        ind = np.digitize([read.pos], segments)  # which segment is the read in
        counts[ind-1] += 1

    return counts

The last step is to accumulate the counts into 101 bins, one for each 1% between 0 and 100.

def BinGC(counts, gc):
    """
    BinGC - put counts into bins 0 to 100
    counts - counts of reads in segments returned from ReadBamFile
    gc - gc content of each segment
    returns:
    gc_counts - a numpy array. gc_counts[i] = number of reads in segments with i% GC content
    """
    
    gc_counts = np.zeros(101)
    for i in xrange(len(gc)):
        gc_bin = int(round(gc[i]*100.0))
        gc_counts[gc_bin] += counts[i]

    return gc_counts

Finally, the main routine simply prints gc_counts.
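Here is a rough sketch of how the pieces fit together; the file names and window size are hypothetical, and the functions above are assumed to be in the same module (the downloadable program has the real main routine):

from Bio import SeqIO

def main():
    window = 1000                # hypothetical window size
    fasta_file = 'genome.fa'     # hypothetical input file names
    bam_file = 'sample.bam'

    for seq_rec in SeqIO.parse(fasta_file, 'fasta'):
        gc = GC_pct(seq_rec, window)                                  # GC content of each window
        counts = ReadBamFile(bam_file, seq_rec.id, seq_rec, window)   # read counts per window
        gc_counts = BinGC(counts, gc)                                 # read counts per 1% GC bin
        for pct in xrange(101):
            print pct, gc_counts[pct]

if __name__ == '__main__':
    main()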

Download the full program.

I tried a second approach to this problem, using an array that maps each chromosome position to its GC percent. In the second version, the GC_pct function returns a list of pairs: the GC percent at each position and a slot for the count of reads starting at that position.

def GC_pct(seq_rec, window):
    """
    GC_pct - calculate the GC content at each position of a sequence record
    seq_rec - a SeqRecord
    window - non-overlapping window size
    returns:
    gc_seq - a list of [GC%, counts of reads starting at this position]
    """
    
    seq = seq_rec.seq
    print SeqUtils.GC(seq.upper())

    gc_seq = [[0, 0] for _ in xrange(len(seq))]
    
    for i in xrange(0, len(seq), window):
        s = seq[i:i+window].upper()
        a = s.count('A')
        c = s.count('C')
        g = s.count('G')
        t = s.count('T')
        if a+c+g+t > 0:
            gc = int(round(((g+c)/float(a+c+g+t)) * 100.0))
            # assign the window's GC% to every position in the window
            gc_seq[i:i+window] = [[gc, 0] for _ in xrange(len(s))]

    return gc_seq

The ReadBamFile function is now simpler.

def ReadBamFile(bam_file, gc_seq, chrom, window):
    """
    ReadBamFile - read a bam file and count the reads starting at each position (updates gc_seq in place)
    bam_file - input bam file
    gc_seq - a list of [GC%, counts of reads starting at this position]
    chrom - a chromosome id
    window - the window size
    """

    # iterate through reads and get read starting positions and add to counts
    f = pysam.Samfile(bam_file, 'rb')
    for read in f.fetch(reference=chrom, start=0, end=len(gc_seq)-1):
        gc_seq[read.pos][1] += 1

The binning routine iterates through the gc_seq list, accumulating counts.

def BinGC(gc_seq):
    """
    BinGC - put counts into bins 0 to 100
    gc_seq - a list of [GC%, counts of reads starting at this position]
    returns:
    gc_counts - a numpy array. gc_counts[i] = number of reads at positions with i% GC content
    """
    
    gc_counts = np.zeros(101)
    for i in xrange(len(gc_seq)):
        gc_counts[gc_seq[i][0]] += gc_seq[i][1]

    return gc_counts 
 
Download the second version.

From FASTQ to FASTA

Probably the most common bioinformatics task is conversion between file formats. Recently, we needed to extract sequence data from a FASTQ file and write the sequences to a FASTA file. If all you want to do is extract the sequences, it's very easy in BioPython. Our task was slightly more difficult, but not by much: we also want to filter out reads with low basecall quality. Our usual criterion is to reject any read with one or more bases with a quality score lower than thirty. A second, minor problem is that some of our FASTQ files are in zip format, others are gzipped, and some are just text.
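If no quality filtering were needed, the whole conversion would be a single BioPython call; a minimal sketch with hypothetical file names:

from Bio import SeqIO

# convert every record, no quality filtering
SeqIO.convert('reads.fastq', 'fastq', 'reads.fasta', 'fasta')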

We will use BioPython's SeqIO module to handle reading and writing the sequences. We use Python's zipfile module to handle zipped files and the gzip module for gzipped files. The code is relatively simple: open the FASTQ file, after first checking for the appropriate file type; read it one record at a time; check each record for low quality bases; and if the record passes the check, write it to the FASTA file. Here's the code:
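The snippet below assumes roughly these imports from the full program (the exact list in the downloadable code may differ slightly):

import os
import re
import sys
import gzip
import zipfile

from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio.Alphabet import generic_dna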

def ProcessFastq(fastq_file, fasta_file):
    """
    ProcessFastq - write a fasta file from a fastq file

    fastq_file - fastq file name, may be a plain fastq file
                 or a zipped or gzipped fastq file
    fasta_file - output fasta file name
                 id is the fastq id.
                 description is the quality string
    """
    
    def read_ok(qual):
        """
        read_ok - reject reads with one or more low quality (< _BASE_QUAL_CUTOFF) basecalls
        qual - a list of integer basecall qualities
        returns: True if the read is ok
        """
        if any([q < _BASE_QUAL_CUTOFF for q in qual]):
            return False
        else:
            return True

    _BASE_QUAL_CUTOFF = 30

    if re.search('\.gz$', fastq_file):
        inp = gzip.open(fastq_file, 'rb')
    elif re.search('\.zip$', fastq_file):
        z = zipfile.ZipFile(fastq_file)
        (dirName, fileName) = os.path.split(fastq_file)
        m = re.search('(.+?\.fastq)\.zip$', fileName)
        inp = z.open(m.group(1), 'r')
    else:
        inp = open(fastq_file, 'r')

    if fasta_file is not None:
        out = open(fasta_file, 'w')
    else:
        out = sys.stdout

    for rec in SeqIO.parse(inp, 'fastq'):
        if read_ok(rec.letter_annotations['phred_quality']):
            out_rec = SeqRecord(seq = Seq(str(rec.seq), generic_dna),
                                id = rec.id,
                                description = ''.join([chr(c+33) for c in rec.letter_annotations['phred_quality']]))
            SeqIO.write(out_rec, out, 'fasta')

    out.close()
    inp.close()

That's it. The rest of the program just handles getting the file names from the command line.
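For what it's worth, a minimal command line wrapper might look something like this (the argument names are hypothetical; the downloadable code has the real one):

import argparse

def main():
    parser = argparse.ArgumentParser(description='Filter a FASTQ file and write FASTA')
    parser.add_argument('fastq_file', help='input fastq, fastq.gz, or fastq.zip file')
    parser.add_argument('fasta_file', nargs='?', default=None,
                        help='output fasta file (default: stdout)')
    args = parser.parse_args()
    ProcessFastq(args.fastq_file, args.fasta_file)

if __name__ == '__main__':
    main()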
Download the code.

Duplicates and what to do with them

Ever since I started dealing with ChIP-Seq and RNA-Seq data, PCR duplicates have been a bit of a mystery to me. Googling for PCR duplicates turns up a variety of opinions about how they arise, how to handle them, and so forth. The best introduction I have encountered is the CureFFI.org blog. I'm going to swipe their description of the library prep process because it's clear and concise (they actually got it from here):
  • Shatter genomic DNA, e.g. with a sonicator or DNase.
  • Ligate adapters to both ends of the fragments. Not every molecule successfully ligates.
  • PCR amplify the fragments with adapters
  • Spread DNA molecules across flowcells.  The goal is to get exactly one DNA molecule per flowcell lawn of primers.  This depends purely on probability, based on the concentration of DNA.
  • Use bridge PCR to amplify the single molecule on each bead or each lawn so that you can get a strong enough signal (whether light or pH) to detect.  Usually this requires several hundred or low thousands of molecules.
  • Sequence by synthesis of complementary strand
The third step is where PCR duplicates arise. A duplicate occurs when more than one copy of the same original molecule ends up on a different lawn in the flowcell. With adequate starting amounts of DNA, the duplicate level should be low. With smaller amounts of DNA, the number of PCR rounds is increased, and copies of the same original molecule are more likely to hybridize to different lawns. Ideally, we would like to have the concentration of DNA just right so that duplicates are kept to a minimum.

The CureFFI.org blog has a nice back-of-the-envelope calculation showing how the initial amount of DNA affects the amount of duplication you see in the data. Even if you don't care about the rest of this post, I encourage you to read their description of how the duplicate level increases.

Given that we have some level of PCR duplicates, how do we remove them? There are two popular software tools for removing duplicates from bam or sam files: samtools and Picard. They perform similar processes, and while discussions on SeqAnswers seem to prefer Picard, I haven't seen much difference between the two on our data.

Both perform basically the same operations. They take a bam file sorted by target id and sequence location. Reads that are mapped to the same starting location and in the same orientation are considered duplicates. The read with the highest map quality is retained.
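In effect, both tools group reads by a key like the one below and keep the best read in each group. This is only an illustration of the criterion, not how either tool is actually implemented:

def dup_key(read):
    # reads with the same key are treated as duplicates by the standard approach:
    # same target, same start position, same orientation -- the sequence is ignored
    return (read.tid, read.pos, read.is_reverse)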

Notice that the sequence of the read is not considered. In our work, we're interested in SNPs, edits, and sequencing errors, so we're concerned that we may be throwing out some of the very data that we need. In order to understand the process of duplicate removal and how we might retain more data, I wrote my own duplicate removal program, sort of a reverse Xerox machine. The idea is basically the same: start with a sorted bam file and examine the reads mapping to each position, but instead of just considering orientation, save a copy of each different sequence. The result is that reads mapping to identical positions but with differing sequences will be retained.

The code is pretty simple:

 
    # this assumes the bam file is sorted.
    # it does minimal error checking
    count = 0
    out_count = 0
    prev_read_pos = -1
    prev_tid = -1
    reads = dict()

    # iterate through reads and accumulate the reads with the same starting address
    f_in = pysam.Samfile(bam_file, 'rb')
    f_out = pysam.Samfile(out_file, 'wb', template=f_in)
    for read in f_in.fetch(until_eof = True):
        count += 1
        if read_ok(read):
            if read.tid == prev_tid and read.positions[0] < prev_read_pos:
                sys.stderr.write("It looks like this bam file is not sorted" + '\n')
                sys.exit(1)

            # reads from the same chromosome with the same start position accumulate
            if (not reads) or (read.tid == prev_tid and read.positions[0] == prev_read_pos):
                if read.seq not in reads:
                    reads[read.seq] = []
                reads[read.seq].append(read)
            else:
                out_count += ProcessReads(reads, f_out)
                if out_count % 10000 == 0:
                    print count, out_count
                    sys.stdout.flush()
                reads = dict()
                reads[read.seq] = [read]
            prev_read_pos = read.positions[0]
            prev_tid = read.tid

    # flush the reads collected at the last position
    if reads:
        out_count += ProcessReads(reads, f_out)

    return count, out_count

We iterate through the reads, saving each read in a dictionary of lists keyed on the read sequence. We could have just sorted the reads by sequence later, but a dictionary is more convenient. When a read mapping to a different location is encountered, we process the reads mapping to the current location and start over with the new read.

To process the reads, we handle each orientation separately. I'm not really sure this is a good idea, but it's what samtools and Picard do, so we will also. We also process each sequence key separately, so we end up with a copy of each different sequence being saved.

def ProcessReads(reads, out):
    """
    ProcessReads - remove duplicate reads
                   we consider a read a duplicate if it is mapped to the same position
                   and has the same sequence.
                   Note: we handle forward and reverse reads separately because that's what
                   samtools does.
    reads - dictionary of reads keyed on sequence
    out - file handle of output file
    returns:
    reads_written - a count of the number of reads written
    """

    reads_written = 0
    for seq in reads:
        # keep the highest map quality forward read for this sequence
        for_reads = [x for x in reads[seq] if not x.is_reverse]
        if for_reads:
            r = sorted(for_reads, key=lambda x: x.mapq, reverse=True)
            out.write(r[0])
            reads_written += 1
        # and the highest map quality reverse read
        rev_reads = [x for x in reads[seq] if x.is_reverse]
        if rev_reads:
            r = sorted(rev_reads, key=lambda x: x.mapq, reverse=True)
            out.write(r[0])
            reads_written += 1

    return reads_written

In the end we have a bam file with slightly more reads than we see with samtools or Picard. The code is slower than samtools (written in C) or Picard (written in Java), but it's not so slow that speed is a major problem.

Download the code.