Calculating GC content is straight forward. Select a segment of DNA or RNA sequence and count A's, C's etc. and calculate \[gc = \frac{{G + C}}{{A + C + G + T}}\].
Often GC content is calculated for an entire chromosome by calculating the GC content in a series of either overlapping or sliding windows of fixed size. Our problem is slightly more complicated. We want to calculate the number of reads mapped to various values of GC content ranging from 0 to 100%.
The first step is to calculate the GC content for a chromosome using a series of non-overlapping, fixed width windows. We can use Python string functions to do this.
def GC_pct(seq_rec, window):
"""
GC_pct - calculate GC content for a series of sequence records
seq_rec - a SeqRecord
window - non-overlapping window size
returns:
gc - a lists of gc content for each window size segment
"""
gc = []
seq = seq_rec.seq
print SeqUtils.GC(seq)
for i in xrange(0, len(seq), window):
s = seq[i:i+window].upper()
a = s.count('A')
c = s.count('C')
g = s.count('G')
t = s.count('T')
if a+c+g+t > 0:
gc.append((g+c)/float(a+c+g+t))
else:
gc.append(0.0)
return gc
SeqUtils.GC is a BioPython function. We use it to get the overall GC of the chromosome.
Next, we read a bam file and count the reads mapped to each window size segment of the chromosome. There are may ways to do this process, but I took the lazy way and let Numpy do the work. The Numpy function digitize will take a list of positions and a list of segments and return the index of the segment that each position falls into. We then increment the count for the appropriate segment.
def ReadBamFile(bam_file, chrom, seq_rec, window):
"""
ReadBamFile - read a bam file and return the read starting positions
bam_file - input bam file
chrom - a chromosome ids
seq_rec - the chromosome sequence record
window - the window size
returns:
counts - a numpy array of the counts of reads in each window size segment
"""
# get the starting location of window size segments
segments = np.array([x for x in xrange(0, len(seq_rec.seq), window)])
counts = np.zeros(len(segments))
# iterate through reads and get read starting positions and add to counts
f = pysam.Samfile(bam_file, 'rb')
for read in f.fetch(reference=chrom, start=0, end=len(seq_rec.seq)-1):
ind = np.digitize([read.pos], segments) # which segment is the read in
counts[ind-1] += 1
return counts
The last step is to accumulate the counts into 101 bins, one for each 1% between 0 and 100.
def BinGC(counts, gc):
"""
BinGC - put counts into bins 0 to 100
counts - counts of reads in segments returned from ReadBam
gc - gc content of each segment
returns:
gc_counts - a numpy array. gc_counts[i] = number or reads in segments with gc[i] GC content
"""
gc_counts = np.zeros(101)
for i in xrange(len(gc)):
gc_bin = int(round(gc[i]*100.0))
gc_counts[gc_bin] += counts[i]
return gc_counts
Finaly, the main routine simply prints gc_counts.
Download the full program.
I tried a second approach to this problem, using an array to map the GC percent at each chromosome position. In the second version, GC_pct function now returns a list of pairs of GC percent and a slot for the counts of reads at that position.
def GC_pct(seq_rec, window):
"""
GC_pct - calculate GC content for a series of sequence records
seq_rec - a SeqRecord
window - non-overlapping window size
returns:
gc_seq - a list of [GC%, counts of reads starting at this position]
"""
seq = seq_rec.seq
print SeqUtils.GC(seq.upper())
gc_seq = [[0, 0] for _ in xrange(len(seq))]
for i in xrange(0, len(seq), window):
s = seq[i:i+window].upper()
a = s.count('A')
c = s.count('C')
g = s.count('G')
t = s.count('T')
if a+c+g+t > 0:
gc = int(round(((g+c)/float(a+c+g+t)) * 100.0));
gc_seq[i:i+window] = [[gc, 0] for _ in xrange(i, i+window)]
return gc_seq
The ReadBam function is now simplified.
def ReadBamFile(bam_file, gc_seq, chrom, window):
"""
ReadBamFile - read a bam file and return the read starting positions
bam_file - input bam file
gc_seq - a list of [GC%, counts of reads starting at this position]
chrom - a chromosome ids
window - the window size
"""
# iterate through reads and get read starting positions and add to counts
f = pysam.Samfile(bam_file, 'rb')
for read in f.fetch(reference=chrom, start=0, end=len(gc_seq)-1):
gc_seq[read.pos][1] += 1
The binning routine iterates through the gc_sec array accumulating counts.
def BinGC(gc_seq):
"""
BinGC - put counts into bins 0 to 100
gc_seq - a list of [GC%, counts of reads starting at this position]
returns:
gc_counts - a numpy array. gc_counts[i] = number or reads in segments with GC content
"""
gc_counts = np.zeros(101)
for i in xrange(len(gc_seq)):
gc_counts[gc_seq[i][0]] += gc_seq[i][1]
return gc_counts
Download the second version.
I really enjoyed reading this post, big fan. Keep up the good work andplease tell me when can you publish more articles or where can I read more on the subject?
ReplyDeleteΘωρακισμένες πόρτες