How does the mismatch rate of Illumina reads vary with base call quality? We can use reads from the PhiX control runs to examine that question. Our genome center runs a lane of PhiX Control v3 with each sequencing run; the phiX lane serves as a quality control for the sequencing experiments, and we can use it to examine the basic error rates arising from the sequencing process.
To do this we'll use a BAM file containing reads aligned to the phiX genome. As mentioned in an earlier post, we found some positions where our reads differed from the official genome in almost 100% of cases. We replaced those positions with a fixed value and created a corrected genome. In addition, we have one position (position 1301) with a well-known, approximately 50-50, SNP. In our mismatch counting, we will ignore reads overlapping this position, as well as reads overlapping the first and last five genome positions. We will also ignore any read containing a base call with quality below 30.
With this in mind, the counting procedure uses pysam's pileup routine to loop over the genome positions and count, at each quality value, the number of mismatches and the total number of base calls.
Here's the counting routine:
f = pysam.Samfile(bam_file, 'rb')
for pileup_col in f.pileup(max_depth=1000000):
    for read in pileup_col.pileups:
        aln = read.alignment
        if not read.indel and not read.is_del and read_ok(aln):
            q = QualChar2Score(aln.qual[read.qpos])
            totals[q] += 1
            if aln.seq[read.qpos] != seq_str[pileup_col.pos]:
                mismatches[q] += 1
mismatches and totals are numpy integer arrays initialized to zero. We ignore indels, deletions, and any read containing a base call below our quality cutoff of 30.
read_ok() checks for reads overlapping the positions we want to exclude and for reads that contain low quality bases. Since pileup returns an individual read multiple times, once at each position it covers, we keep a dictionary of reads that have already been flagged so we don't recheck a bad read every time it appears.
def QualChar2Score(q_char):
    return ord(q_char) - 33

def read_ok(read):
    """
    read_ok - reject reads with a low quality (below 30) base call
              also reject reads overlapping the start and end of the genome
              and the known snp position
    read - a PySam AlignedRead object
    we maintain a dictionary of reads that have failed
    returns: True if the read is ok
    """
    if read.qname in bad_reads:
        return False
    snps = set([0, 1, 2, 3, 4, 1301, 5381, 5382, 5383, 5384, 5385])
    if len(snps.intersection(set(read.positions))) > 0:
        bad_reads[read.qname] = 1
        return False
    if any([QualChar2Score(c) < _BASE_QUAL_CUTOFF for c in list(read.qual)]):
        bad_reads[read.qname] = 1
        return False
    else:
        return True
Once the mismatches are counted, we simply save them to a comma-separated value (CSV) file for later processing.
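As a quick sanity check on that CSV, the observed mismatch rate at each quality can be compared with the error rate the Phred score implies, 10^(-Q/10). The snippet below is a minimal sketch, not part of the original program; the file name mismatch_counts.csv is just a placeholder for whatever was passed as the output file.
import sys
import numpy as np

# Minimal sketch: compare observed mismatch rates with the Phred prediction 10^(-Q/10).
# 'mismatch_counts.csv' is a placeholder for the CSV written by the program below.
data = np.genfromtxt('mismatch_counts.csv', delimiter=',', names=True, dtype=int)

sys.stdout.write('quality,observed_rate,phred_rate\n')
for row in data:
    q, mm, total = int(row['quality']), int(row['mismatches']), int(row['total'])
    if total == 0:
        continue                          # no base calls observed at this quality
    observed = float(mm) / total          # empirical mismatch rate at quality q
    expected = 10.0 ** (-q / 10.0)        # error rate implied by the Phred score
    sys.stdout.write('%d,%g,%g\n' % (q, observed, expected))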
Here's the full program: download
import sys
import argparse
import pysam
import numpy as np
from Bio import SeqIO
def ReadFasta(file):
    """
    ReadFasta - read a sequence from a fasta formatted file.
    file - fasta file name
    returns: a BioPython sequence record
    """
    handle = open(file, 'rU')
    iter = SeqIO.parse(handle, 'fasta')
    record = iter.next()
    handle.close()
    return record
def GetArgs():
    """
    GetArgs - read the command line
    returns - a bam_file and output file, genome file names
    typing python phyX.qual.mismatch.py will show the options
    """

    def ParseArgs(parser):
        class Parser(argparse.ArgumentParser):
            def error(self, message):
                sys.stderr.write('error: %s\n' % message)
                self.print_help()
                sys.exit(2)

        parser = Parser(description='Calculate mismatch rates for each quality value.')
        parser.add_argument('-b', '--bam_file',
                            type=str,
                            required=True,
                            help='Bam file (required).')
        parser.add_argument('-o', '--output_file',
                            type=str,
                            help='output csv file.')
        parser.add_argument('-g', '--genome_file',
                            required=False,
                            type=str,
                            default='/users/thompson/data/phiX2/from_illumina/data/phiX.illumina.2.fa',
                            help='File containing phiX genome.')
        return parser.parse_args()

    parser = argparse.ArgumentParser()
    args = ParseArgs(parser)

    return args.bam_file, args.output_file, args.genome_file
def CountMismatches(bam_file, genome_file):
    """
    CountMismatches:
    count the total reads with each quality value (quality >= 30) and
    mismatches
    bam_file - a bam_file of reads
    genome_file - genome sequence file in fasta format
    returns:
    totals and mismatch counts at each quality
    """

    def QualChar2Score(q_char):
        return ord(q_char) - 33

    def read_ok(read):
        """
        read_ok - reject reads with a low quality (below 30) base call
                  also reject reads overlapping the start and end of the genome
                  and the known snp position
        read - a PySam AlignedRead object
        we maintain a dictionary of reads that have failed
        returns: True if the read is ok
        """
        if read.qname in bad_reads:
            return False
        snps = set([0, 1, 2, 3, 4, 1301, 5381, 5382, 5383, 5384, 5385])
        if len(snps.intersection(set(read.positions))) > 0:
            bad_reads[read.qname] = 1
            return False
        if any([QualChar2Score(c) < _BASE_QUAL_CUTOFF for c in list(read.qual)]):
            bad_reads[read.qname] = 1
            return False
        else:
            return True

    _BASE_QUAL_CUTOFF = 30

    bad_reads = dict()
    totals = np.zeros(61, dtype=int)  # this size is overkill, we'll trim later
    mismatches = np.zeros(61, dtype=int)

    seq = ReadFasta(genome_file)
    seq_str = str(seq.seq)

    # open the file and run pileup to count
    f = pysam.Samfile(bam_file, 'rb')
    for pileup_col in f.pileup(max_depth=1000000):
        for read in pileup_col.pileups:
            aln = read.alignment
            if not read.indel and not read.is_del and read_ok(aln):
                q = QualChar2Score(aln.qual[read.qpos])
                totals[q] += 1
                if aln.seq[read.qpos] != seq_str[pileup_col.pos]:
                    mismatches[q] += 1

    return mismatches, totals
def Main():
    bam_file, output_file, genome_file = GetArgs()

    mismatches, totals = CountMismatches(bam_file, genome_file)

    if output_file is not None:
        f = open(output_file, 'w')
    else:
        f = sys.stdout

    f.write(','.join(['quality', 'mismatches', 'total']) + '\n')
    for q in xrange(61):
        f.write(','.join([str(q), str(mismatches[q]), str(totals[q])]) + '\n')
    f.close()

if __name__ == '__main__':
    Main()
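The program is run against a coordinate-sorted, indexed BAM of phiX-aligned reads (pysam's pileup needs the index). For example, using placeholder input and output file names, python phyX.qual.mismatch.py -b phix_control.bam -o mismatch_counts.csv will write the counts to mismatch_counts.csv; the -g option can point to the corrected phiX genome if it is not at the default path.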