GC content

The GC content of a sequence is the percentage of G's and C's in a fragment or an entire genome. Genes are often characterized by a higher GC content than background.Variations in GC content can lead to bias in read coverage in ChIP-seq and RNA-seq experiments. See Benjamini and Speed.

Calculating GC content is straight forward. Select a segment of DNA or RNA sequence and count A's, C's etc. and calculate \[gc = \frac{{G + C}}{{A + C + G + T}}\].

Often GC content is calculated for an entire chromosome by calculating the GC content in a series of either overlapping or sliding windows of fixed size.  Our problem is slightly more complicated. We want to calculate the number of reads mapped to various values of GC content ranging from 0 to 100%.

The first step is to calculate the GC content for a chromosome using a series of non-overlapping, fixed width windows. We can use Python string functions to do this.

def GC_pct(seq_rec, window):
    """
    GC_pct - calculate GC content for a series of sequence records
    seq_rec - a SeqRecord
    window - non-overlapping window size
    returns:
    gc - a lists of gc content for each window size segment
    """
    
    gc = []
    
    seq = seq_rec.seq
    print SeqUtils.GC(seq)
    
    for i in xrange(0, len(seq), window):
        s = seq[i:i+window].upper()
        a = s.count('A')
        c = s.count('C')
        g = s.count('G')
        t = s.count('T')
        if a+c+g+t > 0:
            gc.append((g+c)/float(a+c+g+t))
        else:
            gc.append(0.0)

    return gc
 
SeqUtils.GC is a BioPython function. We use it to get the overall GC of the chromosome.

Next, we read a bam file and count the reads mapped to each window size segment of the chromosome. There are may ways to do this process, but I took the lazy way and let Numpy do the work. The Numpy function digitize will take a list of positions and a list of segments and return the index of the segment that each position falls into. We then increment the count for the appropriate segment.


def ReadBamFile(bam_file, chrom, seq_rec, window):
    """
    ReadBamFile - read a bam file and return the read starting positions                  
    bam_file - input bam file
    chrom - a chromosome ids
    seq_rec - the chromosome sequence record
    window - the window size
    returns:
    counts - a numpy array of the counts of reads in each window size segment
    """

    # get the starting location of window size segments
    segments = np.array([x for x in xrange(0, len(seq_rec.seq), window)])
    counts = np.zeros(len(segments))
    
    # iterate through reads and get read starting positions and add to counts
    f = pysam.Samfile(bam_file, 'rb')
    for read in f.fetch(reference=chrom, start=0, end=len(seq_rec.seq)-1):
        ind = np.digitize([read.pos], segments)  # which segment is the read in
        counts[ind-1] += 1

    return counts

The last step is to accumulate the counts into 101 bins, one for each 1% between 0 and 100.

def BinGC(counts, gc):
    """
    BinGC - put counts into bins 0 to 100
    counts - counts of reads in segments returned from ReadBam
    gc - gc content of each segment
    returns:
    gc_counts - a numpy array. gc_counts[i] = number or reads in segments with gc[i] GC content
    """
    
    gc_counts = np.zeros(101)
    for i in xrange(len(gc)):
        gc_bin = int(round(gc[i]*100.0))
        gc_counts[gc_bin] += counts[i]

    return gc_counts

Finaly, the main routine simply prints gc_counts.

Download the full program.

I tried a second approach to this problem, using an array to map the GC percent at each chromosome position. In the second version, GC_pct function now returns a list of pairs of GC percent and a slot for the counts of reads at that position.

def GC_pct(seq_rec, window):
    """
    GC_pct - calculate GC content for a series of sequence records
    seq_rec - a SeqRecord
    window - non-overlapping window size
    returns:
    gc_seq - a list of [GC%, counts of reads starting at this position]
    """
    
    seq = seq_rec.seq
    print SeqUtils.GC(seq.upper())

    gc_seq = [[0, 0] for _ in xrange(len(seq))]
    
    for i in xrange(0, len(seq), window):
        s = seq[i:i+window].upper()
        a = s.count('A')
        c = s.count('C')
        g = s.count('G')
        t = s.count('T')
        if a+c+g+t > 0:
            gc = int(round(((g+c)/float(a+c+g+t)) * 100.0));
            gc_seq[i:i+window] = [[gc, 0] for _ in xrange(i, i+window)]

    return gc_seq

The ReadBam function is now simplified.

def ReadBamFile(bam_file, gc_seq, chrom, window):
    """
    ReadBamFile - read a bam file and return the read starting positions                  
    bam_file - input bam file
    gc_seq - a list of [GC%, counts of reads starting at this position]
    chrom - a chromosome ids
    window - the window size
    """

    # iterate through reads and get read starting positions and add to counts
    f = pysam.Samfile(bam_file, 'rb')
    for read in f.fetch(reference=chrom, start=0, end=len(gc_seq)-1):
        gc_seq[read.pos][1] += 1

The binning routine iterates through the gc_sec array accumulating counts.

def BinGC(gc_seq):
    """
    BinGC - put counts into bins 0 to 100
    gc_seq - a list of [GC%, counts of reads starting at this position]
    returns:
    gc_counts - a numpy array. gc_counts[i] = number or reads in segments with GC content
    """
    
    gc_counts = np.zeros(101)
    for i in xrange(len(gc_seq)):
        gc_counts[gc_seq[i][0]] += gc_seq[i][1]

    return gc_counts 
 
Download the second version.

1 comment:

  1. I really enjoyed reading this post, big fan. Keep up the good work andplease tell me when can you publish more articles or where can I read more on the subject?
    Θωρακισμένες πόρτες

    ReplyDelete