Read the Directions

This time, we'll look at a simple task, counting reads mapped to the genome in forward and reverse direction. Once we have the counts, we'll plot histograms of the distribution of coverage for the forward and reverse reads.

To do the counting, once again we'll use PySam. We can use pileup to give us the reads for each genomic position and we use a Numpy integer array to hold the counts. As we did previously, we will eliminate any read with a base call quality below 30.

Here's the counting code. It's similar to what we did before. We use pileup() to loop through each genomic position. Pileup allows us to iterate over each read. From the read, we can get the direction and increment the appropriate array position.

def ReadBamFile(chrom, chrom_path, bam_file):
    """
    ReadBamFile - read a bam file and count coverage
    chrom - chromosome name e.g. chr2L
    chrom_path - path to the genome files
    bam_file - input bam file
    Return:
    forward_counts, rev_counts - numpy arrays of integers, containing number of reads overlapping each position
             forward_counts[i]=j means there were j in the '+' direction overlapping position i
    """

    def read_ok(aln, bad_reads):
        """
        read_ok - reject reads with a low quality (lt 30) base call
        aln - a PySam AlignedRead object
        bad_reads - a dictionary of read ids with low quality
        returns: True if the read is ok

        We maintain a dictionary of rejected reads, so once a read is flagged
        we don't seach it's quality list again
        """
        if aln.qname in bad_reads:
            return False

        if any([ord(c)-33 < _BASE_QUAL_CUTOFF for c in list(aln.qual)]):
            bad_reads[aln.qname] = 1
            return False
        else:
            return True

    _BASE_QUAL_CUTOFF = 30

    chr_len = ChromosomeLength(chrom_path + chrom + '.fa')
    forward_counts = np.zeros(chr_len, dtype='int')
    rev_counts = np.zeros(chr_len, dtype='int')

    bad_reads = dict()
    in_file = pysam.Samfile(bam_file, 'rb')
    for pileup_col in in_file.pileup(chrom, max_depth=100000):
        for pileup_read in pileup_col.pileups:
            if pileup_read.indel == 0 and pileup_read.is_del == 0:
                aln = pileup_read.alignment
                if read_ok(aln, bad_reads):
                    if aln.is_reverse:
                        rev_counts[pileup_col.pos] += 1
                    else:
                        forward_counts[pileup_col.pos] += 1

    return forward_counts, rev_counts
We can use Matplotlib to plot the histograms for the forward and reverse directions. We will eliminate positions with 0 coverage to avoid a big peak at 0 in the histogram.
    forward_counts, rev_counts = ReadBamFile(chrom, chrom_path, bam_file)

    hist_file = out_path + 'hist.' + chrom + '.png'
    fig = plt.figure(1)

    hist, bins = np.histogram(forward_counts[forward_counts > 0], bins=50) 
    width = 0.7 * (bins[1] - bins[0])
    center = (bins[:-1] + bins[1:]) / 2
    plt.subplot(211)
    plt.bar(center, hist, align='center', width=width)
    plt.title('Forward')

    hist, bins = np.histogram(rev_counts[rev_counts > 0], bins=50) 
    width = 0.7 * (bins[1] - bins[0])
    center = (bins[:-1] + bins[1:]) / 2
    plt.subplot(212)
    plt.title('Reverse')
    plt.bar(center, hist, align='center', width=width)
    
    fig.savefig(hist_file)
    
    output_file = out_path + 'counts.' + chrom + '.csv'
    f = open(output_file, 'w')
    f.write(','.join(['position', 'forward', 'reverse']) + '\n')
    for pos in xrange(forward_counts.shape[0]):
        if forward_counts[pos] > 0 or rev_counts[pos] > 0:
            f.write(','.join([str(pos), str(forward_counts[pos]), str(rev_counts[pos])]) + '\n')
    f.close()
np.histogram returns the counts and bins for the histogram. We can then use Matplotlib's bar plot top draw the histogram.

Download the full code.

No comments:

Post a Comment