To do the counting, once again we'll use PySam. We can use pileup to give us the reads for each genomic position and we use a Numpy integer array to hold the counts. As we did previously, we will eliminate any read with a base call quality below 30.
Here's the counting code. It's similar to what we did before. We use pileup() to loop through each genomic position. Pileup allows us to iterate over each read. From the read, we can get the direction and increment the appropriate array position.
def ReadBamFile(chrom, chrom_path, bam_file):
"""
ReadBamFile - read a bam file and count coverage
chrom - chromosome name e.g. chr2L
chrom_path - path to the genome files
bam_file - input bam file
Return:
forward_counts, rev_counts - numpy arrays of integers, containing number of reads overlapping each position
forward_counts[i]=j means there were j in the '+' direction overlapping position i
"""
def read_ok(aln, bad_reads):
"""
read_ok - reject reads with a low quality (lt 30) base call
aln - a PySam AlignedRead object
bad_reads - a dictionary of read ids with low quality
returns: True if the read is ok
We maintain a dictionary of rejected reads, so once a read is flagged
we don't seach it's quality list again
"""
if aln.qname in bad_reads:
return False
if any([ord(c)-33 < _BASE_QUAL_CUTOFF for c in list(aln.qual)]):
bad_reads[aln.qname] = 1
return False
else:
return True
_BASE_QUAL_CUTOFF = 30
chr_len = ChromosomeLength(chrom_path + chrom + '.fa')
forward_counts = np.zeros(chr_len, dtype='int')
rev_counts = np.zeros(chr_len, dtype='int')
bad_reads = dict()
in_file = pysam.Samfile(bam_file, 'rb')
for pileup_col in in_file.pileup(chrom, max_depth=100000):
for pileup_read in pileup_col.pileups:
if pileup_read.indel == 0 and pileup_read.is_del == 0:
aln = pileup_read.alignment
if read_ok(aln, bad_reads):
if aln.is_reverse:
rev_counts[pileup_col.pos] += 1
else:
forward_counts[pileup_col.pos] += 1
return forward_counts, rev_counts
We can use Matplotlib to plot the histograms for the forward and reverse directions. We will eliminate positions with 0 coverage to avoid a big peak at 0 in the histogram.
forward_counts, rev_counts = ReadBamFile(chrom, chrom_path, bam_file)
hist_file = out_path + 'hist.' + chrom + '.png'
fig = plt.figure(1)
hist, bins = np.histogram(forward_counts[forward_counts > 0], bins=50)
width = 0.7 * (bins[1] - bins[0])
center = (bins[:-1] + bins[1:]) / 2
plt.subplot(211)
plt.bar(center, hist, align='center', width=width)
plt.title('Forward')
hist, bins = np.histogram(rev_counts[rev_counts > 0], bins=50)
width = 0.7 * (bins[1] - bins[0])
center = (bins[:-1] + bins[1:]) / 2
plt.subplot(212)
plt.title('Reverse')
plt.bar(center, hist, align='center', width=width)
fig.savefig(hist_file)
output_file = out_path + 'counts.' + chrom + '.csv'
f = open(output_file, 'w')
f.write(','.join(['position', 'forward', 'reverse']) + '\n')
for pos in xrange(forward_counts.shape[0]):
if forward_counts[pos] > 0 or rev_counts[pos] > 0:
f.write(','.join([str(pos), str(forward_counts[pos]), str(rev_counts[pos])]) + '\n')
f.close()
np.histogram returns the counts and bins for the histogram. We can then use Matplotlib's bar plot top draw the histogram.Download the full code.
No comments:
Post a Comment