Filter Reads in Bam Files

Recently, we wanted to remove low quality and duplicate reads from .bam alignment files to make viewing them in the UCSC genome browser easier. Our definition of low quality is stringent. We consider a read as low quality if it has any base with a base call quality below 30. To filter low quality reads and duplicates, we use a combination of samtools and the python PySam library.

samtools has an option to remove PCR and optical duplicates and we use PySam to filter low quality reads. PySam allows us to read and write bam files. To filter out low quality reads, we loop through the input bam file, check the read quality to see if any base call falls below 30. If all of the base call qualities are 30 or above, we write the read to a new file.

Here's one way to filter:

def FilterReads(chrom, in_file, out_file):

    def read_ok(read):
        """
        read_ok - reject reads with a low quality (below 30) base call
        read - a PySam AlignedRead object
        returns: True if the read is ok
        """
        if any([ord(c)-33 < _BASE_QUAL_CUTOFF for c in list(read.qual)]):
            return False
        else:
            return True

    _BASE_QUAL_CUTOFF = 30

    bam_in = pysam.Samfile(in_file, 'rb')
    bam_out = pysam.Samfile(out_file, 'wb', template=bam_in)

    out_count = 0
    for read in bam_in.fetch(chrom):
        if read_ok(read):
            bam_out.write(read)
            out_count += 1

    print 'reads_written =', out_count

    bam_out.close()
    bam_in.close()



This routine is passed a chrmosome label and the input and output file names. It uses PySam's fetch routine to get each read. Read_ok() checks to see if the read has any low quality bases.

Here's the entire program:  download

import argparse
import pysam
import re

def FilterReads(chrom, in_file, out_file):

    def read_ok(read):
        """
        read_ok - reject reads with a low quality (below 30) base call
        read - a PySam AlignedRead object
        returns: True if the read is ok
        """
        if any([ord(c)-33 < _BASE_QUAL_CUTOFF for c in list(read.qual)]):
            return False
        else:
            return True

    _BASE_QUAL_CUTOFF = 30

    bam_in = pysam.Samfile(in_file, 'rb')
    bam_out = pysam.Samfile(out_file, 'wb', template=bam_in)

    out_count = 0
    for read in bam_in.fetch(chrom):
        if read_ok(read):
            bam_out.write(read)
            out_count += 1

    print 'reads_written =', out_count

    bam_out.close()
    bam_in.close()


def GetArgs():
    """
    GetArgs - read the command line
    returns - an input bam file name and the output filtered bam file
    """

    def ParseArgs(parser):
        class Parser(argparse.ArgumentParser):
            def error(self, message):
                sys.stderr.write('error: %s\n' % message)
                self.print_help()
                sys.exit(2)

        parser = Parser(description='Filter low quality reads.')
        parser.add_argument('-b', '--bam_file',
                            type=str,
                            required=True,
                            help='Input Bam file.')
        parser.add_argument('-o', '--output_file',
                            type=str,
                            required=True,
                            help='Output Bam file.')
        return parser.parse_args()

    parser = argparse.ArgumentParser()
    args = ParseArgs(parser)

    return args.bam_file, args.output_file


def Main():
    bam_file, output_file = GetArgs()
   
    m = re.search('(chr.+?)\.fa', bam_file)
    chrom = m.group(1)
    FilterReads(chrom, bam_file, output_file)

if __name__ == '__main__':
    Main()

After filtering the low quality reads, we use samtools to sort the bam file, remove duplicates, and index teh new bam file.

Here's an example of the calls to filter a bam file.

## filter out low quality reads
python /users/thompson/python/filter.lowquality.reads.py -b /users/thompson/data/fly/bwa_output/YAS9_AAGCCT_L002_R1_001/chrX.fa.masked.bam -o /users/thompson/data/fly/bwa_filtered_output/YAS9_AAGCCT_L002_R1_001/chrX.fa.masked.bam

## sort the new bam file - PySam fetch may not keep the reads in the same order
/gpfs/runtime/opt/samtools/0.1.18/bin/samtools sort /users/thompson/data/fly/bwa_filtered_output/YAS9_AAGCCT_L002_R1_001/chrX.fa.masked.bam /users/thompson/data/fly/bwa_filtered_output/YAS9_AAGCCT_L002_R1_001/chrX.fa.masked-sorted
 

## Use rmdup to remove PCR and optical duplicates
/gpfs/runtime/opt/samtools/0.1.18/bin/samtools rmdup -s /users/thompson/data/fly/bwa_filtered_output/YAS9_AAGCCT_L002_R1_001/chrX.fa.masked-sorted.bam /users/thompson/data/fly/bwa_output_nodups/YAS9_AAGCCT_L002_R1_001/chrX.fa.masked.bam
 

## Index the final bam file
/gpfs/runtime/opt/samtools/0.1.18/bin/samtools index /users/thompson/data/fly/bwa_output_nodups/YAS9_AAGCCT_L002_R1_001/chrX.fa.masked.bam /users/thompson/data/fly/bwa_output_nodups/YAS9_AAGCCT_L002_R1_001/chrX.fa.masked.bam.bai

Please excuse the formatting!

1 comment: