Mismatches and Quality

How does the mismatch rate of Illumina reads vary with quality? We can use reads from the PhiX control runs to examine that question. Our genome center runs a lane of phiX Control v3 each time they run a set of reads. The phiX run serves as a quality control for the sequencing experiments. We can use it to examine basic error rates arising from the sequencing process.

To do this we'll use a bam file that contains reads aligned to the phiX genome. As mentioned in an earlier post, we found some positions where our reads differed from the official genome by almost 100%. We replaced those positions with a fixed value and created a new genome. In addition, we have one position (position 1301) with a well know, approximately 50-50, SNP. In our mismatch counting, we will ignore reads that overlap this position as well as reads overlapping the genome end positions. We will also ignore any read with a base call quality below 30.

With this in mind, the count procedure consists of using PySam's pileup routine to loop through the genome positions and count the number of mismatches and the total reads at each quality value.

Here's the counting routine:

    f = pysam.Samfile(bam_file, 'rb')
    for pileup_col in f.pileup(max_depth=1000000):
        for read in pileup_col.pileups:
            aln = read.alignment
            if not read.indel and not read.is_del and read_ok(aln):
                q = QualChar2Score(aln.qual[read.qpos])
                totals[q] += 1
                if aln.seq[read.qpos] != seq_str[pileup_col.pos]:
                    mismatches[q] += 1


mismatches and totals are numpy integer arrays, initialized to zero. We ignore indels, deletions, and any read that has a quality below our cutoff of 30.

read_ok() checks for reads overlapping positions that we want to exclude and for reads that contain low quality bases. Since pileup will return an individual read multiple times at different position, we keep a dictionary of reads that have been flagged so we don't have to go through the process  of checking a bad read every time.

   def QualChar2Score(q_char):
        return ord(q_char) - 33

    def read_ok(read):
        """
        read_ok - reject reads with a low quality (below 30) base call
                  also reject reads overlapping the start and end of the genome
                  and the known snp position

        read - a PySam AlignedRead object
        we maintain a dictionary of reads that have failed
        returns: True if the read is ok
        """
        if read.qname in bad_reads:
            return False

        snps = set([0, 1, 2, 3, 4, 1301, 5381, 5382, 5383, 5384, 5385])
        if len(snps.intersection(set(read.positions))) > 0:
            bad_reads[read.qname] = 1
            return False

        if any([QualChar2Score(c) < _BASE_QUAL_CUTOFF for c in list(read.qual)]):
            bad_reads[read.qname] = 1
            return False
        else:
            return True


Once we have the mismatches counted, we just save them to a conma separated variable (CSV) file for later processing.

Here's the full program:  download

import sys
import argparse
import pysam
import numpy as np
from Bio import SeqIO

def ReadFasta(file):
    """
    ReadFasta - read a sequence from a fasta formatted file.

    file - fasta file name
    returns: a BioPython sequence record
    """
    handle = open(file, 'rU')
    iter = SeqIO.parse(handle, 'fasta')
    record = iter.next()
    handle.close()

    return record


def GetArgs():
    """
    GetArgs - read the command line
    returns - a bam_file and output file, genome file names

    typing python phyX.qual.mismatch.py will show the options
    """

    def ParseArgs(parser):
        class Parser(argparse.ArgumentParser):
            def error(self, message):
                sys.stderr.write('error: %s\n' % message)
                self.print_help()
                sys.exit(2)

        parser = Parser(description='Calculate mismatch rates for each quality value.')
        parser.add_argument('-b', '--bam_file',
                            type=str,
                            required=True,
                            help='Bam file (required).')
        parser.add_argument('-o', '--output_file',
                            type=str,
                            help='output csv file.')
        parser.add_argument('-g', '--genome_file',
                            required=False,
                            type=str,
                            default='/users/thompson/data/phiX2/from_illumina/data/phiX.illumina.2.fa',
                            help='File containing phiX genome.')

        return parser.parse_args()

    parser = argparse.ArgumentParser()
    args = ParseArgs(parser)

    return args.bam_file, args.output_file, args.genome_file


def CountMismatches(bam_file, genome_file):
    """
    CountMismatches:
    count the total reads with each quality value (quality >= 30) and
    mismatches

    bam_file - a bam_file of reads
    genome_file - genome sequence file in fasta format
    returns:
    totals and mismatch counts at each quality
    """

    def QualChar2Score(q_char):
        return ord(q_char) - 33

    def read_ok(read):
        """
        read_ok - reject reads with a low quality (<30 base="" br="" call="">                  also reject reads overlapping the start and end of the genome
                  and the known snp position

        read - a PySam AlignedRead object
        we maintain a dictionary of reads that have failed
        returns: True if the read is ok
        """
        if read.qname in bad_reads:
            return False

        snps = set([0, 1, 2, 3, 4, 1301, 5381, 5382, 5383, 5384, 5385])
        if len(snps.intersection(set(read.positions))) > 0:
            bad_reads[read.qname] = 1
            return False

        if any([QualChar2Score(c) < _BASE_QUAL_CUTOFF for c in list(read.qual)]):
            bad_reads[read.qname] = 1
            return False
        else:
            return True

    _BASE_QUAL_CUTOFF = 30
    bad_reads = dict()

    totals = np.zeros(61, dtype=int)  # this size is overkill, we'll trim later
    mismatches = np.zeros(61, dtype=int)

    seq = ReadFasta(genome_file)
    seq_str = str(seq.seq)

    # open the file and run pileup to count
    f = pysam.Samfile(bam_file, 'rb')
    for pileup_col in f.pileup(max_depth=1000000):
        for read in pileup_col.pileups:
            aln = read.alignment
            if not read.indel and not read.is_del and read_ok(aln):
                q = QualChar2Score(aln.qual[read.qpos])
                totals[q] += 1
                if aln.seq[read.qpos] != seq_str[pileup_col.pos]:
                    mismatches[q] += 1

    return mismatches, totals


def Main():
    bam_file, output_file, genome_file = GetArgs()

    mismatches, totals = CountMismatches(bam_file, genome_file)

    if output_file is not None:
        f = open(output_file, 'w')
    else:
        f = sys.stdout

    f.write(','.join(['quality', 'mismatches', 'total']) + '\n')
    for q in xrange(61):
        f.write(','.join([str(q), str(mismatches[q]), str(totals[q])]) + '\n')
    f.close()
   

if __name__ == '__main__':
    Main()

No comments:

Post a Comment