Calculating GC content is straight forward. Select a segment of DNA or RNA sequence and count A's, C's etc. and calculate \[gc = \frac{{G + C}}{{A + C + G + T}}\].
Often GC content is calculated for an entire chromosome by calculating the GC content in a series of either overlapping or sliding windows of fixed size. Our problem is slightly more complicated. We want to calculate the number of reads mapped to various values of GC content ranging from 0 to 100%.
The first step is to calculate the GC content for a chromosome using a series of non-overlapping, fixed width windows. We can use Python string functions to do this.
def GC_pct(seq_rec, window): """ GC_pct - calculate GC content for a series of sequence records seq_rec - a SeqRecord window - non-overlapping window size returns: gc - a lists of gc content for each window size segment """ gc = [] seq = seq_rec.seq print SeqUtils.GC(seq) for i in xrange(0, len(seq), window): s = seq[i:i+window].upper() a = s.count('A') c = s.count('C') g = s.count('G') t = s.count('T') if a+c+g+t > 0: gc.append((g+c)/float(a+c+g+t)) else: gc.append(0.0) return gc
SeqUtils.GC is a BioPython function. We use it to get the overall GC of the chromosome.
Next, we read a bam file and count the reads mapped to each window size segment of the chromosome. There are may ways to do this process, but I took the lazy way and let Numpy do the work. The Numpy function digitize will take a list of positions and a list of segments and return the index of the segment that each position falls into. We then increment the count for the appropriate segment.
def ReadBamFile(bam_file, chrom, seq_rec, window): """ ReadBamFile - read a bam file and return the read starting positions bam_file - input bam file chrom - a chromosome ids seq_rec - the chromosome sequence record window - the window size returns: counts - a numpy array of the counts of reads in each window size segment """ # get the starting location of window size segments segments = np.array([x for x in xrange(0, len(seq_rec.seq), window)]) counts = np.zeros(len(segments)) # iterate through reads and get read starting positions and add to counts f = pysam.Samfile(bam_file, 'rb') for read in f.fetch(reference=chrom, start=0, end=len(seq_rec.seq)-1): ind = np.digitize([read.pos], segments) # which segment is the read in counts[ind-1] += 1 return counts
The last step is to accumulate the counts into 101 bins, one for each 1% between 0 and 100.
def BinGC(counts, gc): """ BinGC - put counts into bins 0 to 100 counts - counts of reads in segments returned from ReadBam gc - gc content of each segment returns: gc_counts - a numpy array. gc_counts[i] = number or reads in segments with gc[i] GC content """ gc_counts = np.zeros(101) for i in xrange(len(gc)): gc_bin = int(round(gc[i]*100.0)) gc_counts[gc_bin] += counts[i] return gc_counts
Finaly, the main routine simply prints gc_counts.
Download the full program.
I tried a second approach to this problem, using an array to map the GC percent at each chromosome position. In the second version, GC_pct function now returns a list of pairs of GC percent and a slot for the counts of reads at that position.
def GC_pct(seq_rec, window): """ GC_pct - calculate GC content for a series of sequence records seq_rec - a SeqRecord window - non-overlapping window size returns: gc_seq - a list of [GC%, counts of reads starting at this position] """ seq = seq_rec.seq print SeqUtils.GC(seq.upper()) gc_seq = [[0, 0] for _ in xrange(len(seq))] for i in xrange(0, len(seq), window): s = seq[i:i+window].upper() a = s.count('A') c = s.count('C') g = s.count('G') t = s.count('T') if a+c+g+t > 0: gc = int(round(((g+c)/float(a+c+g+t)) * 100.0)); gc_seq[i:i+window] = [[gc, 0] for _ in xrange(i, i+window)] return gc_seq
The ReadBam function is now simplified.
def ReadBamFile(bam_file, gc_seq, chrom, window): """ ReadBamFile - read a bam file and return the read starting positions bam_file - input bam file gc_seq - a list of [GC%, counts of reads starting at this position] chrom - a chromosome ids window - the window size """ # iterate through reads and get read starting positions and add to counts f = pysam.Samfile(bam_file, 'rb') for read in f.fetch(reference=chrom, start=0, end=len(gc_seq)-1): gc_seq[read.pos][1] += 1
The binning routine iterates through the gc_sec array accumulating counts.
def BinGC(gc_seq): """ BinGC - put counts into bins 0 to 100 gc_seq - a list of [GC%, counts of reads starting at this position] returns: gc_counts - a numpy array. gc_counts[i] = number or reads in segments with GC content """ gc_counts = np.zeros(101) for i in xrange(len(gc_seq)): gc_counts[gc_seq[i][0]] += gc_seq[i][1] return gc_counts
Download the second version.
I really enjoyed reading this post, big fan. Keep up the good work andplease tell me when can you publish more articles or where can I read more on the subject?
ReplyDeleteΘωρακισμένες πόρτες