- Shatter genomic DNA, e.g. with a sonicator or DNase.
- Ligate adapters to both ends of the fragments. Not every molecule successfully ligates.
- PCR-amplify the fragments that carry adapters.
- Spread the DNA molecules across the flowcell. The goal is to get exactly one DNA molecule per flowcell lawn of primers; whether that happens is purely a matter of probability, driven by the concentration of the DNA (there's a quick sketch of this loading probability right after this list).
- Use bridge PCR to amplify the single molecule on each bead or each lawn so that you can get a strong enough signal (whether light or pH) to detect. Usually this requires several hundred or low thousands of molecules.
- Sequence by synthesis of the complementary strand.
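As an aside on the loading step above: one common simplification (my own illustration, not taken from any particular sequencing protocol) is to model the number of molecules that land on a single lawn as Poisson-distributed, with a mean set by the DNA concentration. A quick sketch:

```python
import math

def poisson_pmf(k, lam):
    """Probability of exactly k molecules landing on one lawn, assuming the
    count is Poisson distributed with mean lam (set by DNA concentration)."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

# Illustrative loading levels (hypothetical values, not real protocol numbers).
for lam in (0.1, 0.5, 1.0, 2.0):
    p_empty = poisson_pmf(0, lam)
    p_single = poisson_pmf(1, lam)
    p_multiple = 1.0 - p_empty - p_single
    print("mean %.1f: empty %.2f, single %.2f, multiple %.2f"
          % (lam, p_empty, p_single, p_multiple))
```

Even at a mean of one molecule per lawn, only about 37% of lawns end up with exactly one molecule, which is why libraries are quantified and diluted carefully before loading.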
The CureFFI.org blog has a nice back-of-the-envelope calculation showing how the initial amount of DNA affects the level of duplication you see in the data. Even if you don't care about the rest of this post, I encourage you to read their description of how the duplicate level increases.
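I won't reproduce their numbers, but the shape of the calculation is easy to sketch. Under the simplifying assumption that reads are drawn uniformly at random (with replacement) from a library of M unique fragments, the expected number of distinct fragments seen in N reads is M(1 - (1 - 1/M)^N), and everything beyond that is duplication. A rough sketch (my own, not the CureFFI.org code):

```python
def expected_duplicate_fraction(unique_fragments, reads):
    """Expected fraction of duplicate reads, assuming reads are drawn
    uniformly at random (with replacement) from the unique fragments."""
    m = float(unique_fragments)
    n = float(reads)
    distinct = m * (1.0 - (1.0 - 1.0 / m) ** n)
    return 1.0 - distinct / n

# Hypothetical library sizes at a fixed depth of 10 million reads: the fewer
# unique molecules you start with, the higher the duplicate rate.
for m in (1e5, 1e6, 1e7, 1e8):
    print("%8.0e unique fragments -> %.1f%% duplicates"
          % (m, 100 * expected_duplicate_fraction(m, 1e7)))
```

The point is the same one the CureFFI.org post makes: duplication is driven by how little unique material goes into the library relative to how deeply you sequence it.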
Given that we have some level of PCR duplicates, how do we remove them? There are two popular software tools for removing duplicates from BAM or SAM files: samtools and Picard. They perform similar operations, and while discussions on SeqAnswers seem to prefer Picard, I haven't seen much difference between the two on our data.
Both perform basically the same operation. They take a BAM file sorted by target ID and position. Reads that map to the same starting location in the same orientation are considered duplicates, and the read with the highest mapping quality is retained.
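As a rough sketch of that criterion (my own illustration, not how samtools or Picard is actually implemented; among other things it ignores read pairs and clipping, and holds everything in memory), the grouping amounts to something like this:

```python
import pysam

def keep_best_per_position(bam_path, out_path):
    """Simplified position/orientation-based duplicate removal: key each read
    on (reference, start, strand) and keep the highest mapping quality."""
    best = {}  # (tid, start position, is_reverse) -> best read seen so far
    f_in = pysam.Samfile(bam_path, 'rb')
    f_out = pysam.Samfile(out_path, 'wb', template=f_in)
    for read in f_in.fetch(until_eof=True):
        if read.is_unmapped:
            continue
        key = (read.tid, read.positions[0], read.is_reverse)
        if key not in best or read.mapq > best[key].mapq:
            best[key] = read
    # note: output order is arbitrary here; the real tools stream the sorted
    # input and so preserve coordinate order
    for read in best.values():
        f_out.write(read)
    f_out.close()
    f_in.close()
```

Note that the read sequence never enters the key, which is the point the next paragraph picks up.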
Notice that the sequence of the read is not considered. In our work we're interested in SNPs, edits, and sequencing errors, so we're concerned that we may be throwing out some of the very data we need. To understand the duplicate removal process and how we might retain more data, I wrote my own duplicate removal program, a sort of reverse Xerox machine. The idea is basically the same: start with a sorted BAM file and examine the reads mapping to each position, but instead of considering only orientation, keep a copy of each distinct sequence. The result is that reads mapping to identical positions but with differing sequences are all retained.
The code is pretty simple:
```python
# Excerpt from the full script (see the download link below).  The wrapper
# function name here is mine; read_ok() is a per-read check defined in the
# full script and not shown here.
import sys
import pysam

def remove_duplicates(bam_file, out_file):
    # this assumes the bam file is sorted
    # it does minimal error checking
    count = 0
    out_count = 0
    prev_read_pos = -1
    prev_tid = -1
    reads = dict()

    # iterate through reads and accumulate the reads with the same starting address
    f_in = pysam.Samfile(bam_file, 'rb')
    f_out = pysam.Samfile(out_file, 'wb', template=f_in)
    for read in f_in.fetch(until_eof=True):
        count += 1
        if read_ok(read):
            if read.tid == prev_tid and read.positions[0] < prev_read_pos:
                sys.stderr.write("It looks like this bam file is not sorted\n")
                sys.exit(1)
            if (not reads) or read.positions[0] == prev_read_pos:
                # same start position as the previous read: file it under its sequence
                if read.seq not in reads:
                    reads[read.seq] = []
                reads[read.seq].append(read)
            else:
                # new start position: flush the reads accumulated at the previous one
                out_count += ProcessReads(reads, f_out)
                if out_count % 10000 == 0:
                    print count, out_count
                    sys.stdout.flush()
                reads = dict()
                reads[read.seq] = [read]
            prev_read_pos = read.positions[0]
            prev_tid = read.tid

    # flush the reads accumulated at the final position
    if reads:
        out_count += ProcessReads(reads, f_out)

    return count, out_count
```
We iterate through the reads, saving each read in a list in a dictionary keyed on its sequence. We could have sorted the reads by sequence later, but a dictionary is more convenient. When a read mapping to a different location is encountered, we process the reads accumulated at the current location and start over with the new read.
To process the reads, we handle each orientation separately. I'm not really sure this is a good idea, but it's what samtools and Picard do, so we do it too. We also process each sequence key separately, so we end up with one copy of each distinct sequence.
```python
def ProcessReads(reads, out):
    """
    ProcessReads - remove duplicate reads

    We consider a read a duplicate if it is mapped to the same position
    and has the same sequence.
    Note: we handle forward and reverse reads separately because that's
    what samtools does.

    reads - dictionary of reads keyed on sequence
    out - file handle of output file
    returns reads_written - a count of the number of reads written
    """
    reads_written = 0
    for seq in reads.keys():
        # keep the highest-mapq forward read for this sequence
        for_reads = [x for x in reads[seq] if not x.is_reverse]
        if for_reads:
            r = sorted(for_reads, key=lambda x: x.mapq, reverse=True)
            out.write(r[0])
            reads_written += 1
        # and the highest-mapq reverse read
        rev_reads = [x for x in reads[seq] if x.is_reverse]
        if rev_reads:
            r = sorted(rev_reads, key=lambda x: x.mapq, reverse=True)
            out.write(r[0])
            reads_written += 1
    return reads_written
```
In the end we have a BAM file with slightly more reads than we get from samtools or Picard. The code is slower than samtools (written in C) or Picard (written in Java), but not so slow that speed is a real problem.
Download the code.