To do the counting, once again we'll use PySam. We can use pileup to give us the reads for each genomic position and we use a Numpy integer array to hold the counts. As we did previously, we will eliminate any read with a base call quality below 30.
Here's the counting code. It's similar to what we did before. We use pileup() to loop through each genomic position. Pileup allows us to iterate over each read. From the read, we can get the direction and increment the appropriate array position.
def ReadBamFile(chrom, chrom_path, bam_file): """ ReadBamFile - read a bam file and count coverage chrom - chromosome name e.g. chr2L chrom_path - path to the genome files bam_file - input bam file Return: forward_counts, rev_counts - numpy arrays of integers, containing number of reads overlapping each position forward_counts[i]=j means there were j in the '+' direction overlapping position i """ def read_ok(aln, bad_reads): """ read_ok - reject reads with a low quality (lt 30) base call aln - a PySam AlignedRead object bad_reads - a dictionary of read ids with low quality returns: True if the read is ok We maintain a dictionary of rejected reads, so once a read is flagged we don't seach it's quality list again """ if aln.qname in bad_reads: return False if any([ord(c)-33 < _BASE_QUAL_CUTOFF for c in list(aln.qual)]): bad_reads[aln.qname] = 1 return False else: return True _BASE_QUAL_CUTOFF = 30 chr_len = ChromosomeLength(chrom_path + chrom + '.fa') forward_counts = np.zeros(chr_len, dtype='int') rev_counts = np.zeros(chr_len, dtype='int') bad_reads = dict() in_file = pysam.Samfile(bam_file, 'rb') for pileup_col in in_file.pileup(chrom, max_depth=100000): for pileup_read in pileup_col.pileups: if pileup_read.indel == 0 and pileup_read.is_del == 0: aln = pileup_read.alignment if read_ok(aln, bad_reads): if aln.is_reverse: rev_counts[pileup_col.pos] += 1 else: forward_counts[pileup_col.pos] += 1 return forward_counts, rev_countsWe can use Matplotlib to plot the histograms for the forward and reverse directions. We will eliminate positions with 0 coverage to avoid a big peak at 0 in the histogram.
forward_counts, rev_counts = ReadBamFile(chrom, chrom_path, bam_file) hist_file = out_path + 'hist.' + chrom + '.png' fig = plt.figure(1) hist, bins = np.histogram(forward_counts[forward_counts > 0], bins=50) width = 0.7 * (bins[1] - bins[0]) center = (bins[:-1] + bins[1:]) / 2 plt.subplot(211) plt.bar(center, hist, align='center', width=width) plt.title('Forward') hist, bins = np.histogram(rev_counts[rev_counts > 0], bins=50) width = 0.7 * (bins[1] - bins[0]) center = (bins[:-1] + bins[1:]) / 2 plt.subplot(212) plt.title('Reverse') plt.bar(center, hist, align='center', width=width) fig.savefig(hist_file) output_file = out_path + 'counts.' + chrom + '.csv' f = open(output_file, 'w') f.write(','.join(['position', 'forward', 'reverse']) + '\n') for pos in xrange(forward_counts.shape[0]): if forward_counts[pos] > 0 or rev_counts[pos] > 0: f.write(','.join([str(pos), str(forward_counts[pos]), str(rev_counts[pos])]) + '\n') f.close()np.histogram returns the counts and bins for the histogram. We can then use Matplotlib's bar plot top draw the histogram.
Download the full code.
No comments:
Post a Comment