How to Count

Counting is hard. It seems like it shouldn't be. After all, we all learned to count as part of our first experiences in school, or maybe even before. In computational biology, we often have to count to determine basic properties of a collection of sequence or to estimate probabilities and expected values. The problem we frequently run into is how do we determine what to count and how do we go about counting it.

Sequencing Errors

One of the questions we have been interested in answering is; what are the basic error rates in the base calls coming from the genome center's Illumina sequencing machine. A brief overview of the sequencing and alignment processes can be downloaded here. If you looked at https://www.youtube.com/watch?v=l99aKKHcxC4 you can see how reads are built one nucleotide at a time. The base calling process determines the nucleotide type by image processing of  emitted fluorescence. For each nucleotide, intensity values are calculated for each type: A, C, G, and T. Based on these intensities, the nucleotide is identified. This process is base calling.

The base calling process is generally accurate, but errors can occur. Possible sources of error are noise in the 4 image channels and phasing, where molecules fall one base behind in sequencing. Based on the intensity values for each cycle (base), the nucleotide is assigned a label, A, C, G, T, or N. N's indicate that due to errors, the base could not be accurately identified. We typically receive this data in the fastq format. The fastq files contain the base calls along with a quality score for each base. The quality score for a base is an estimation of the probability of that base call being in error. The factors going into the quality score calculation are described in this document.

In the fastq format, quality scores are encode as ascii characters. The quality score, Q, is defined

\[Q =  - 10{\log _{10}}P\]

where P is the probability of the base call being in error. A Q score of 40 is an error rate of 1 in 10,000. In the fastq file, the Q values are encoded to ascii by adding 33 to the Q score and converting the numerical value to an ascii letter. For example, Q=40 becomes 40+33=73 or an ascii I.

The encoding is actually slightly more complicated because the factor added to Q varies depends on the version of Illumina software being used. See this page for details.

Here's an example of a fastq record:


@MYSEQ_1_1_1_1001_499
GGACTTTGTAGGATACCCTCGCTTTCCTTCTCCTGT
+MYSEQ_1_1_1_1001_499
]]]]]]]]]]]]Y]Y]]]]]]]]]]]]VCHVMPLAS


The first row is an ID. Each record has a unique ID. The second row contains the base calls. The third is a repeat of the ID. The forth row contains the quality scores. We have seen that we can align the reads to a genome with Bowtie2 or BWA. Maybe, we'll go into the details of aligning reads to a genome in a later post.

Context

There area  number of factors that influence error rates. See Abnizova et al. for some details. One important factor influencing error rates is the genomic context. As part of the RNA editing project, we wanted to determine how genomic context influenced error rates in the data coming from our genome center.

Each time the genome center runs a set of sequences, it runs a lane of phiX174 as quality control. PhiX is a bacterial phage. It has a small genome, ~5K. Typically, a lane of phiX will produce well over 100,000,000 reads, so that with such a small genome, that will yield very high coverage at each position. We use the phiX reads to determine genomic context.

In our case, we define the genomic context of a base as the genomic nucleotide at an aligned position and the nucleotides at the two positions 5' of the aligned position. With this definition in mind, we can estimate context dependent error rates by examining each base of the aligned reads, determining the genomic context and counting the number of A's, C's etc for each context.

Counting

As you might expect, there are a few details to consider. First, we will only deal with high quality reads. We define a high quality read as one which has no base call below 30 (P=0.001). We won't count any bases from a read with even one base call below 30.

Second, about half the reads will align to the genome in reverse complement. The "official" genome was sequenced in a certain order, but DNA is double stranded. Some reads come from the same strand as the official genome sequence (Watson strand) and others come from the opposite (Crick) strand. Because of this, we have to be sure we apply the context from the Crick strand to the reads mapped in reverse.

Finally, genome are a moving target. For example, the fly genome that you downloadUCSC or NCBI was sequenced in 2006 and the annotations are from 2008. However even if a laboratory strain exactly matched the reported genome at teh time teh genome was completed, flies in laboratories keep breeding, so over many generations variations from the reported genome will arise.

In the case of phiX174, we found several positions that differed from reported genome in almost 100% of the reads and a well know SNP (Single-nucleotide polymorphism). To deal with these, we created our own corrected genome, where we replaced the variant positions.

The Program

There are a number of ways to perform the counting. In the accompanying code, I used a Python library called PySam. PySam is a wrapper for samtools. It provides methods for reading sam and sam files. In this example, we will read a bam file of reads aligned to the phiX genome.

The basic idea behind this program is to use the PySam routine pileup to process the reads. pileup uses what is know as a callback routine. The callback routine is a subroutine that is called for each genomic position. The subroutine is passed a PileupColumn object. This object contains among other things, the genomic position, and a list of PilupRead objects, one object for each read overlapping the position. The PileupRead object contains information about the position in the read and a PileupAlignment object. The alignment object contains the sam/bam information about the aligned read. The alignment information is returned in a format that's convenient for Python.

I won't go through all of the details of this code. Take a look at the comments if you are interested. If anything is puzzling or seems wrong, please ask me.

The code: download

#!/gpfs/runtime/opt/python/2.7.3/bin/python

import sys
import argparse
from Bio import SeqIO
import pysam
import resource


def ReadFasta(file):
    """
    ReadFasta - read a sequence from a fasta formatted file.

    file - fasta file name
    returns: a BioPython sequence record
    """
    handle = open(file, 'rU')
    iter = SeqIO.parse(handle, 'fasta')
    record = iter.next()
    handle.close()

    return record


def ReadBamFile(bam_file, genome_file):
    """
    ReadBamFile - read a bam file and count nucleotides by context
    bam_file - bam file containing reads
    genome_file - file containing the genome in fasta format
    returns: a dictionary keyed by context. For each contex, there is a list of A,C,G, T counts
    """

    def RevComp(c):
        """
        RevComp - reverse complement a characater
        """
        tr = dict(A='T', C = 'G', G = 'C', T = 'A', N = 'N')
        return tr[c.upper()]

    def read_ok(aln):
        """
        read_ok - reject reads with a low quality (less than 30) base call
                  also reject reads overlapping the start and end of the genome
                  and the known snp position
        aln - a PySam AlignedRead object
        returns: True if the read is ok

        We maintain a dictionary of rejected reads, so once a read is flagged
        we don't seach it's quality list again
        """
        snps = set([1, 2, 3, 4, 1301, 5381, 5382, 5383, 5384, 5385])
        if len(snps.intersection(set(aln.positions))) > 0:
            rejected_reads[aln.qname] = 1
            return False
       
        for c in list(aln.qual):
            if (ord(c)-33) < _BASE_QUAL_CUTOFF:
                rejected_reads[aln.qname] = 1
                return False
        return True

    def pileup_callback(pileup_col):
        """
        pilup_callback - this is counting workhorse.
        pileup_col - a PySam PilupColumn object
       
        this function is called for each position in the genome
        From the pileup_col object, we get a list of pilup_read objects, one
        for each read overlapping a position. From the pileup_read, we get an
        PySam alignment object representing the bam/sam record for the read.
        We check to see if the read is ok. If it is, we extract the context and
        add 1 to the approriate slot in list of counts.
        """
        for pileup_read in pileup_col.pileups:
            aln = pileup_read.alignment
            read_char = aln.seq[pileup_read.qpos]
            if aln.qname not in rejected_reads and read_char != 'N':
                if read_ok(aln):
                    if not aln.is_reverse:
                        if not pileup_read.is_del:                           
                            context = seq_str[(pileup_col.pos - 2):(pileup_col.pos+1)]
                            context_counts[context][nuc_map[read_char]] += 1
                    else:
                        if not pileup_read.is_del:
                            rev_char = RevComp(read_char)
                            offset = len(rev_seq_str) - pileup_col.pos - 1
                            context = rev_seq_str[(offset-2):(offset+1)]
                            context_counts[context][nuc_map[rev_char]] += 1

        # this is just some debugging stupp to print memory usage
        if _DEBUG == 1 and pileup_col.pos % 1000 == 0:
            print pileup_col.pos, resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
            sys.stdout.flush()
           
    _DEBUG = 1
    _BASE_QUAL_CUTOFF = 30
    nuc_map = dict(A=0, C=1, G=2, T=3)  # map a letter to an index position
    rejected_reads = dict() # keep track of reads with base quality < 30 or snps

    # context_counts is a dictionary. Each entry is a list of counts.
    # nuc_map is used to map the nucleotide letter to the correct index
    # we initalize all counts to zero
    context_counts = dict()
    for i in ['A', 'C', 'G', 'T']:
        for j in ['A', 'C', 'G', 'T']:
            for k in ['A', 'C', 'G', 'T']:
                context_counts[i + j + k] = [0, 0, 0, 0]

    # read the genome sequence
    # reverse commplement it so we can easily handle reverse reads
    seq = ReadFasta(genome_file)
    seq_str = str(seq.seq)
    rev_seq = seq.reverse_complement()
    rev_seq_str = str(rev_seq.seq)

    # open the file and have pileup call the callback routine for each position
    f = pysam.Samfile(bam_file, 'rb')
    f.pileup('phix', start=5, end=len(seq_str)-6, callback=pileup_callback)

    return context_counts

def GetArgs():
    """
    GetArgs - read the command line
    returns - a bam file name, a genome file name, and an output file name
    bam file is required
    genome file is optional and has a default
    output file is option, if not given, output is written to stdout

    typing python phiX.count.py will show the options
    """

    def ParseArgs(parser):
        class Parser(argparse.ArgumentParser):
            def error(self, message):
                sys.stderr.write('error: %s\n' % message)
                self.print_help()
                sys.exit(2)

        parser = Parser(description='Calculate PhiX Context Specific Error Rates.')
        parser.add_argument('-b', '--bam_file',
                            type=str,
                            required=True,
                            help='Bam file of PhiXRead (required).')
        parser.add_argument('-g', '--genome_file',
                            required=False,
                            type=str,
                            default='/users/thompson/data/phiX2/from_illumina/data/phiX.illumina.2.fa',
                            help='File containing phiX genome.')
        parser.add_argument('-o', '--output_file',
                            type=str,
                            help='output file context counts.')
        return parser.parse_args()

    parser = argparse.ArgumentParser()
    args = ParseArgs(parser)

    return args.bam_file, args.genome_file, args.output_file


def Main():
    """
    main routine - get the input and output files and print the output
    """

    bam_file, genome_file, output_file = GetArgs()

    context_counts = ReadBamFile(bam_file, genome_file)

    if output_file is not None:
        f = open(output_file, 'w')
    else:
        f = sys.stdout

    # write the output in csv fomat
    header = ','.join(['context',
                       'A', 'A_freq',
                       'C', 'C_freq',
                       'G', 'G_freq',
                       'T', 'T_freq',
                       'total'])
    f.write(header + '\n')
   
    # calculate the frequency of each nucleotide in each context and print
    for context in sorted(context_counts.iterkeys()):
        total = sum(context_counts[context])
        out_list = [context]
        for i in xrange(0, 4):
            freq = float(context_counts[context][i])/total
            out_list.extend([str(context_counts[context][i]), str(freq)])
        out_list.extend([str(total)])
        out_str = ','.join(out_list)
        f.write(out_str + '\n')
    f.close()

if __name__ == '__main__':
    Main()



A few words about programming style

I prefer to use library routines whenever I can. There are two reasons. First, I'm lazy. If I can use something written by someone else, that means I don't have to write it. More importantly, a popular library used by thousand of programmers is probably well debugged. Code is almost never guaranteed to be bug free, but having lots of people banging away at is reduces the likelihood of severe bugs.

I like to keep the code in small pieces. One of the things I don't like about Python is that by default variables have scope at the function level. Scope defines where the variable exists. I prefer to keep any variables' scope at lowest possible level. For example, I think a variable used inside a loop should only exist inside that loop. Some languages C++ or Perl, for example, give better control of scope. Keeping Python code in small well defined function helps avoid scoping problems. No matter style you use, avoid global variables unless that's the only possible option.



No comments:

Post a Comment