Recently, we wanted to remove low quality and duplicate reads from .bam alignment files to make viewing them in the UCSC genome browser easier. Our definition of low quality is stringent. We consider a read as low quality if it has any base with a base call quality below 30. To filter low quality reads and duplicates, we use a combination of samtools and the python PySam library.
samtools has an option to remove PCR and optical duplicates and we use PySam to filter low quality reads. PySam allows us to read and write bam files. To filter out low quality reads, we loop through the input bam file, check the read quality to see if any base call falls below 30. If all of the base call qualities are 30 or above, we write the read to a new file.
Here's one way to filter:
def FilterReads(chrom, in_file, out_file):
def read_ok(read):
"""
read_ok - reject reads with a low quality (below 30) base call
read - a PySam AlignedRead object
returns: True if the read is ok
"""
if any([ord(c)-33 < _BASE_QUAL_CUTOFF for c in list(read.qual)]):
return False
else:
return True
_BASE_QUAL_CUTOFF = 30
bam_in = pysam.Samfile(in_file, 'rb')
bam_out = pysam.Samfile(out_file, 'wb', template=bam_in)
out_count = 0
for read in bam_in.fetch(chrom):
if read_ok(read):
bam_out.write(read)
out_count += 1
print 'reads_written =', out_count
bam_out.close()
bam_in.close()
This routine is passed a chrmosome label and the input and output file names. It uses PySam's fetch routine to get each read. Read_ok() checks to see if the read has any low quality bases.
Here's the entire program: download
import argparse
import pysam
import re
def FilterReads(chrom, in_file, out_file):
def read_ok(read):
"""
read_ok - reject reads with a low quality (below 30) base call
read - a PySam AlignedRead object
returns: True if the read is ok
"""
if any([ord(c)-33 < _BASE_QUAL_CUTOFF for c in list(read.qual)]):
return False
else:
return True
_BASE_QUAL_CUTOFF = 30
bam_in = pysam.Samfile(in_file, 'rb')
bam_out = pysam.Samfile(out_file, 'wb', template=bam_in)
out_count = 0
for read in bam_in.fetch(chrom):
if read_ok(read):
bam_out.write(read)
out_count += 1
print 'reads_written =', out_count
bam_out.close()
bam_in.close()
def GetArgs():
"""
GetArgs - read the command line
returns - an input bam file name and the output filtered bam file
"""
def ParseArgs(parser):
class Parser(argparse.ArgumentParser):
def error(self, message):
sys.stderr.write('error: %s\n' % message)
self.print_help()
sys.exit(2)
parser = Parser(description='Filter low quality reads.')
parser.add_argument('-b', '--bam_file',
type=str,
required=True,
help='Input Bam file.')
parser.add_argument('-o', '--output_file',
type=str,
required=True,
help='Output Bam file.')
return parser.parse_args()
parser = argparse.ArgumentParser()
args = ParseArgs(parser)
return args.bam_file, args.output_file
def Main():
bam_file, output_file = GetArgs()
m = re.search('(chr.+?)\.fa', bam_file)
chrom = m.group(1)
FilterReads(chrom, bam_file, output_file)
if __name__ == '__main__':
Main()
After filtering the low quality reads, we use samtools to sort the bam file, remove duplicates, and index teh new bam file.
Here's an example of the calls to filter a bam file.
## filter out low quality reads
python /users/thompson/python/filter.lowquality.reads.py -b /users/thompson/data/fly/bwa_output/YAS9_AAGCCT_L002_R1_001/chrX.fa.masked.bam -o /users/thompson/data/fly/bwa_filtered_output/YAS9_AAGCCT_L002_R1_001/chrX.fa.masked.bam
## sort the new bam file - PySam fetch may not keep the reads in the same order
/gpfs/runtime/opt/samtools/0.1.18/bin/samtools sort /users/thompson/data/fly/bwa_filtered_output/YAS9_AAGCCT_L002_R1_001/chrX.fa.masked.bam /users/thompson/data/fly/bwa_filtered_output/YAS9_AAGCCT_L002_R1_001/chrX.fa.masked-sorted
## Use rmdup to remove PCR and optical duplicates
/gpfs/runtime/opt/samtools/0.1.18/bin/samtools rmdup -s /users/thompson/data/fly/bwa_filtered_output/YAS9_AAGCCT_L002_R1_001/chrX.fa.masked-sorted.bam /users/thompson/data/fly/bwa_output_nodups/YAS9_AAGCCT_L002_R1_001/chrX.fa.masked.bam
## Index the final bam file
/gpfs/runtime/opt/samtools/0.1.18/bin/samtools index /users/thompson/data/fly/bwa_output_nodups/YAS9_AAGCCT_L002_R1_001/chrX.fa.masked.bam /users/thompson/data/fly/bwa_output_nodups/YAS9_AAGCCT_L002_R1_001/chrX.fa.masked.bam.bai
Please excuse the formatting!
I have tried it but it's not working
ReplyDelete