There are a number of packages for RNA-Seq differential expression analysis. This Wikipedia article lists some of them. How do you decide which to use? I use the post-modern approach. There's no way to test all of them, so I Google for "RNA-Seq differential expression software" and get a list of candidates. Next, I ask colleagues what they use and look at the intersection of the two sets. From there, I just pick a couple of popular packages and try them out. Using this totally unscientific approach I settled on DESeq2 and edgeR. Both are well established Bioconductor packages for R.
I tend to be an impatient programmer. I don't like to spend much time reading docs or scientific papers before I start coding. I'll figure out if the software does what I want after I try it. This perhaps isn't an optimal approach, but in my experience much academic and scientific software breaks easily. I don't want to invest a lot time in trying to get something to work if it doesn't fit my needs or doesn't even work as advertised. Luckily both DESeq2 and edgeR have quick start sections in the documentation for the impatient. One problem I ran into is that the DESeq2 examples didn't work quite as docs said, but it was easy enough to get them working. I have to admit, however, I don't like R as a programming tool. I know it has a package for every conceivable statistical method and Bioconductor has a package for every possible bioinformatics task, but I find the docs opaque and R's object model bordering on incomprehensible. There are actually two different object models, S3 and S4. Object methods aren't part of the class definition of an R object, so know what the object does as opposed to what data it holds is difficult. Combining that problem with opaque documentation makes life difficult for the chronically impatient.(This is the minor rant part.)
There is an updated D. melanogaster assembly, dm6, available at UCSC. I downloaded the full assembly and since we're interested in TEs, the RepeatMasker table from the UCSC table browser. I ran BWA to map a collection RNA-Seq runs against the new genome. A problem arises here. We're interested in TEs and TEs are difficult to map because there are multiple copies of a particular TE and the copies tend to be similar at the sequence level. It's difficult to know which copy the read actually came from. We could take a probabilistic approach and calculate the probability that a read comes from each of the copies. However to measure differential expression, we only need to know that a read came from one of the TE types, so we will simple take the top alignment for each read. To do this, we can use the BWA mem algorithm and eliminate supplementary reads.
We have six RNA-seq data sets: four from loxP lab strain flies and two from Adar Null flies. There are a number of packages to count reads from .bam files. Both DESeq2 and edgeR documentation mention the Python library HTSeq as containing possible methods for counting. HTSeq is a fairly complete package for sequence manipulation, but it only runs under Python 2.x. I have mainly been using Python3 for my Python work, so I decided to stick with it and use PySam 0.8.3 to process the files and count reads for each repeat name. I use methods similar to code used in the ChIP-Seq examples on this blog. One thing to watch out for PySam 0.8.x renames the methods and attributes. The old names are deprecated.
To count reads overlapping TE regions, we first build a genome interval tree of TE locations. The interval tree is a balanced binary tree than can be searched by genomic range. It's important that the tree is balanced because the .bam files are sorted and if the tree wasn't rebalanced, by reading them in order we would end up with a linear tree. Rebalancing adds some work to the program, but it still runs in a reasonable time.
Here's the code for creating RepeatMasker records and adding them to the interval tree. The dm6 assembly contains a number of short sequences that could not be fully integrated to the assembly. We only want to concentrate on the major chromosomes 2L, 2R, 3L, 3R, 4, X, and Y, so we filter out any other segments. The RepeatMasker objects contain a dictionary for counting the number of reads overlapping the repeat region for each bam file.
class RepeatMasker(object): """ A class to hold info from the UCSC RepMasker table. For details of the fields see https://genome.ucsc.edu/cgi-bin/hgTables """ def __init__(self, rep_str): """ rep_str - a tab delimited line form the RepeatMasker table """ rep_list = rep_str.split('\t') self.swScore = int(rep_list[1]) self.milliDiv = int(rep_list[2]) self.milliDel = int(rep_list[3]) self.milliIns = int(rep_list[4]) self.genoName = rep_list[5] self.genoStart = int(rep_list[6]) # genome region start location self.genoEnd = int(rep_list[7]) # genome end location self.genoLeft = int(rep_list[8]) self.strand = rep_list[9] self.repName = rep_list[10] self.repClass = rep_list[11] self.repFamily = rep_list[12] self.repStart = int(rep_list[13]) self.repEnd = int(rep_list[14]) self.repLeft = int(rep_list[15]) self.id = int(rep_list[16]) self.extend_start = max(self.genoStart - 1000, 0) self.extend_end = self.genoStart + 1000 self.counts = defaultdict(int) # store counts of each repName def ReadRepeatmasker(repeatmasker_file, chroms): """ ReadRepeatmasker Read the UCSC RepMasker table. We build a set interval tree for the file. GenomeIntervalTree is a dictionary of interval trees, one for each chromosome we want to analyze. repeatmasker_file - the local file name for the RepeatMaskerTable chroms - a list of valid chromosome names """ tree = GenomeIntervalTree() with open(repeatmasker_file, 'r') as f: f.readline() # skip header for line in f: rep = RepeatMasker(line.strip()) if rep.genoName in chroms: tree[rep.genoName].addi(rep.extend_start, rep.extend_end, rep) return tree
Here's the code for reading the .bam files and updating the counts in the tree.
def ReadBamFile(bam_file, repeat_tree, id): """ ReadBamFile - read a bam file and count coverage bam_file - input bam file repeat_tree - an interval tree of TE locations, new reads are added to the tree. The tree is self balancing. This is good because the bam is usually sorted. id - an identifier for the file - a short name Return: nothing """ def read_ok(aln, bad_reads): """ read_ok - reject reads with a low quality (lt 30) base call aln - a PySam AlignedRead object bad_reads - a dictionary of read ids with low quality returns: True if the read is ok We maintain a dictionary of rejected reads, so once a read is flagged we don't seach it's quality list again This isn't really necessary in this case. There should not be multiple reads for a sdingle ID """ _BASE_QUAL_CUTOFF = 30 if aln.is_unmapped: bad_reads[aln.query_name] = 1 return FALSE if aln.query_name in bad_reads.keys(): return False if any([c < _BASE_QUAL_CUTOFF for c in aln.query_alignment_qualities]): bad_reads[aln.query_name] = 1 return False else: return True bad_reads = dict() in_file = pysam.AlignmentFile(bam_file, 'rb') for read in in_file.fetch(): if (not read.is_supplementary) and read_ok(read, bad_reads): # filter supplemnetary reads ref_id = in_file.getrname(read.reference_id) if ref_id in repeat_tree.keys(): rep = repeat_tree[ref_id].search(read.reference_start) if rep: for r in rep: r.data.counts[id] += 1
The read code filters out supplementary reads and any read with a basecall quality less than 30.
Once we have counts for each region, we accumulate the total counts for each repeat region and .bam file and then print the counts as a CSV file for processing by DESeq2 and edgeR.
def AccumulateCounts(repeat_tree, bam_files): """ AccumulateCounts Iterate through the tree and count the reads overlapping each repName for each bam_file repeat_tree - a GenomeIntervalTree object bam_files - dictionary of bam files - see Main returns: counts - a dictionary of counts overlapping each repName - counts['repName']['id'] """ counts = defaultdict(lambda : defaultdict(int)) for chrom in sorted(repeat_tree.keys()): for interval in repeat_tree[chrom]: rep = interval.data for p in bam_files: counts[rep.repName][p['id']] += rep.counts[p['id']] return counts def PrintCounts(counts, bam_files, out_file): """ PrintCounts print a csv table of counts counts - a dictionary returned by AccumulateCounts bam_files - dictionary of bam files - see Main out_file - csv file """ f = open(out_file, 'w') header = ['repName'] for p in bam_files: header.append(p['id']) print(','.join(header), file=f) for rep_id in sorted(counts.keys()): row = [rep_id] for p in bam_files: row.append(str(counts[rep_id][p['id']])) print(','.join(row), file=f) f.close() def Main(): repeat_file = '/users/thompson/data/genomes/fly/dm6/ucsc.rep.table.txt' out_file = '/users/thompson/data/fly/repeats/expression/repeat_counts.csv' bam_files = [{'type': 'LoxP RNA', 'id': 'YAS0003_AAGACG_L003_R1_001', 'file': '/users/thompson/data/fly/repeats/homer/bwa/output/YAS0003_AAGACG_L003_R1_001/dm6.chrom.fa.nodups.bam'}, {'type': 'LoxP RNA', 'id': 'YAS0003_AAGACG_L007_R1_001', 'file': '/users/thompson/data/fly/repeats/homer/bwa/output/YAS0003_AAGACG_L007_R1_001/dm6.chrom.fa.nodups.bam'}, {'type': 'LoxP RNA', 'id': 'YAS0004_CCTCGG_L003_R1_001', 'file': '/users/thompson/data/fly/repeats/homer/bwa/output/YAS0004_CCTCGG_L003_R1_001/dm6.chrom.fa.nodups.bam'}, {'type': 'LoxP RNA', 'id': 'YAS0004_CCTCGG_L007_R1_001', 'file': '/users/thompson/data/fly/repeats/homer/bwa/output/YAS0004_CCTCGG_L007_R1_001/dm6.chrom.fa.nodups.bam'}, {'type': 'ADAR0 RNA', 'id': 'YAS9_AAGCCT_L002_R1_001', 'file': '/users/thompson/data/fly/repeats/homer/bwa/output/YAS9_AAGCCT_L002_R1_001/dm6.chrom.fa.nodups.bam'}, {'type': 'ADAR0 RNA', 'id': 'YAS10_GTCGTA_L008_R1_001', 'file': '/users/thompson/data/fly/repeats/homer/bwa/output/YAS10_GTCGTA_L008_R1_001/dm6.chrom.fa.nodups.bam'}] chroms = ['chr2L', 'chr2R', 'chr3L', 'chr3R', 'chr4', 'chrX', 'chrY'] # chromosome we want to examine repeat_tree = ReadRepeatmasker(repeat_file, chroms) # read the repmasker and return an interval tree for p in bam_files: ReadBamFile(p['file'], repeat_tree, p['id']) counts = AccumulateCounts(repeat_tree, bam_files) PrintCounts(counts, bam_files, out_file)
The counting code can be downloaded here.
Once we have a table of counts in a file, we can process it with DESeq2 or edgeR. Calculating the log differential expression is straight forward with each.
Here's the code for DESeq2
repeats = read.table('repeat_counts.csv', sep=',', header=TRUE, row.names=1) repCols = data.frame(row.names=colnames(repeats), type=c('LoxP', 'LoxP', 'LoxP', 'LoxP', 'ADAR0', 'ADAR0')) rep_dds = DESeqDataSetFromMatrix(countData = repeats, colData=repCols, design = ~type) colData(rep_dds)$type = factor(colData(rep_dds)$type, levels=c("LoxP", "ADAR0")) rep_dds=DESeq(rep_dds) res = results(rep_dds) write.csv(res, file='repeats_expression.csv')
Here's the code for edgeR,
group = c(1,1,1,1,2,2) y=DGEList(repeats, group=group) y=calcNormFactors(y) y=estimateCommonDisp(y) y = estimateTagwiseDisp(y) et = exactTest(y) design = model.matrix(~group) y = estimateGLMCommonDisp(y,design) y = estimateGLMTrendedDisp(y,design) y = estimateGLMTagwiseDisp(y,design) fit = glmFit(y,design) lrt = glmLRT(fit,coef=2) write.csv(lrt$table, file='repeats_edgeR.csv')
The R code was adapted directly from the examples in the docs for each system.
No comments:
Post a Comment