- Shatter genomic DNA, e.g. with a sonicator or DNase.
- Ligate adapters to both ends of the fragments. Not every molecule successfully ligates.
- PCR-amplify the fragments that carry adapters.
- Spread the DNA molecules across the flowcell. The goal is to get exactly one DNA molecule per flowcell lawn of primers; whether that happens is purely a matter of probability, driven by the concentration of the DNA (there's a quick sketch of this loading probability right after this list).
- Use bridge PCR to amplify the single molecule on each bead or each lawn so that you can get a strong enough signal (whether light or pH) to detect. Usually this requires several hundred or low thousands of molecules.
- Sequence by synthesis of the complementary strand.
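As an aside on the loading step above: one common simplification (my own illustration, not taken from any particular sequencing protocol) is to model the number of molecules that land on a single lawn as Poisson-distributed, with a mean set by the DNA concentration. A quick sketch:

```python
import math

def poisson_pmf(k, lam):
    """Probability of exactly k molecules landing on one lawn, assuming the
    count is Poisson distributed with mean lam (set by DNA concentration)."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

# Illustrative loading levels (hypothetical values, not real protocol numbers).
for lam in (0.1, 0.5, 1.0, 2.0):
    p_empty = poisson_pmf(0, lam)
    p_single = poisson_pmf(1, lam)
    p_multiple = 1.0 - p_empty - p_single
    print("mean %.1f: empty %.2f, single %.2f, multiple %.2f"
          % (lam, p_empty, p_single, p_multiple))
```

Even at a mean of one molecule per lawn, only about 37% of lawns end up with exactly one molecule, which is why libraries are quantified and diluted carefully before loading.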
The CureFFI.org blog has a nice back-of-the-envelope calculation showing how the initial amount of DNA affects the level of duplication you see in the data. Even if you don't care about the rest of this post, I encourage you to read their description of how the duplicate level increases.
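I won't reproduce their numbers, but the shape of the calculation is easy to sketch. Under the simplifying assumption that reads are drawn uniformly at random (with replacement) from a library of M unique fragments, the expected number of distinct fragments seen in N reads is M(1 - (1 - 1/M)^N), and everything beyond that is duplication. A rough sketch (my own, not the CureFFI.org code):

```python
def expected_duplicate_fraction(unique_fragments, reads):
    """Expected fraction of duplicate reads, assuming reads are drawn
    uniformly at random (with replacement) from the unique fragments."""
    m = float(unique_fragments)
    n = float(reads)
    distinct = m * (1.0 - (1.0 - 1.0 / m) ** n)
    return 1.0 - distinct / n

# Hypothetical library sizes at a fixed depth of 10 million reads: the fewer
# unique molecules you start with, the higher the duplicate rate.
for m in (1e5, 1e6, 1e7, 1e8):
    print("%8.0e unique fragments -> %.1f%% duplicates"
          % (m, 100 * expected_duplicate_fraction(m, 1e7)))
```

The point is the same one the CureFFI.org post makes: duplication is driven by how little unique material goes into the library relative to how deeply you sequence it.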
Given that we have some level of PCR duplicates, how do we remove them? There are two popular software tools for removing duplicates from BAM or SAM files: samtools and Picard. They perform similar operations, and while discussions on SeqAnswers seem to prefer Picard, I haven't seen much difference between the two on our data.
Both perform basically the same operation. They take a BAM file sorted by target ID and position. Reads that map to the same starting location in the same orientation are considered duplicates, and the read with the highest mapping quality is retained.
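As a rough sketch of that criterion (my own illustration, not how samtools or Picard is actually implemented; among other things it ignores read pairs and clipping, and holds everything in memory), the grouping amounts to something like this:

```python
import pysam

def keep_best_per_position(bam_path, out_path):
    """Simplified position/orientation-based duplicate removal: key each read
    on (reference, start, strand) and keep the highest mapping quality."""
    best = {}  # (tid, start position, is_reverse) -> best read seen so far
    f_in = pysam.Samfile(bam_path, 'rb')
    f_out = pysam.Samfile(out_path, 'wb', template=f_in)
    for read in f_in.fetch(until_eof=True):
        if read.is_unmapped:
            continue
        key = (read.tid, read.positions[0], read.is_reverse)
        if key not in best or read.mapq > best[key].mapq:
            best[key] = read
    # note: output order is arbitrary here; the real tools stream the sorted
    # input and so preserve coordinate order
    for read in best.values():
        f_out.write(read)
    f_out.close()
    f_in.close()
```

Note that the read sequence never enters the key, which is the point the next paragraph picks up.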
Notice that the sequence of the read is not considered. In our work we're interested in SNPs, edits, and sequencing errors, so we're concerned that we may be throwing out some of the very data we need. To understand the duplicate removal process and how we might retain more data, I wrote my own duplicate removal program, a sort of reverse Xerox machine. The idea is basically the same: start with a sorted BAM file and examine the reads mapping to each position, but instead of considering only orientation, keep a copy of each distinct sequence. The result is that reads mapping to identical positions but with differing sequences are all retained.
The code is pretty simple:
```python
# Excerpt from the full script (see the download link below).  The wrapper
# function name here is mine; read_ok() is a per-read check defined in the
# full script and not shown here.
import sys
import pysam

def remove_duplicates(bam_file, out_file):
    # this assumes the bam file is sorted
    # it does minimal error checking
    count = 0
    out_count = 0
    prev_read_pos = -1
    prev_tid = -1
    reads = dict()

    # iterate through reads and accumulate the reads with the same starting address
    f_in = pysam.Samfile(bam_file, 'rb')
    f_out = pysam.Samfile(out_file, 'wb', template=f_in)
    for read in f_in.fetch(until_eof=True):
        count += 1
        if read_ok(read):
            if read.tid == prev_tid and read.positions[0] < prev_read_pos:
                sys.stderr.write("It looks like this bam file is not sorted\n")
                sys.exit(1)
            if (not reads) or read.positions[0] == prev_read_pos:
                # same start position as the previous read: file it under its sequence
                if read.seq not in reads:
                    reads[read.seq] = []
                reads[read.seq].append(read)
            else:
                # new start position: flush the reads accumulated at the previous one
                out_count += ProcessReads(reads, f_out)
                if out_count % 10000 == 0:
                    print count, out_count
                    sys.stdout.flush()
                reads = dict()
                reads[read.seq] = [read]
            prev_read_pos = read.positions[0]
            prev_tid = read.tid

    # flush the reads accumulated at the final position
    if reads:
        out_count += ProcessReads(reads, f_out)

    return count, out_count
```
We iterate through the reads, saving each read in a list in a dictionary keyed on its sequence. We could have sorted the reads by sequence later, but a dictionary is more convenient. When a read mapping to a different location is encountered, we process the reads accumulated at the current location and start over with the new read.
To process the reads, we handle each orientation separately. I'm not really sure this is a good idea, but it's what samtools and Picard do, so we do it too. We also process each sequence key separately, so we end up with one copy of each distinct sequence.
```python
def ProcessReads(reads, out):
    """
    ProcessReads - remove duplicate reads

    We consider a read a duplicate if it is mapped to the same position
    and has the same sequence.
    Note: we handle forward and reverse reads separately because that's
    what samtools does.

    reads - dictionary of reads keyed on sequence
    out - file handle of output file
    returns reads_written - a count of the number of reads written
    """
    reads_written = 0
    for seq in reads.keys():
        # keep the highest-mapq forward read for this sequence
        for_reads = [x for x in reads[seq] if not x.is_reverse]
        if for_reads:
            r = sorted(for_reads, key=lambda x: x.mapq, reverse=True)
            out.write(r[0])
            reads_written += 1
        # and the highest-mapq reverse read
        rev_reads = [x for x in reads[seq] if x.is_reverse]
        if rev_reads:
            r = sorted(rev_reads, key=lambda x: x.mapq, reverse=True)
            out.write(r[0])
            reads_written += 1
    return reads_written
```
In the end we have a BAM file with slightly more reads than we get from samtools or Picard. The code is slower than samtools (written in C) or Picard (written in Java), but not so slow that speed is a real problem.
Download the code.