Differential Expression and a Mild Rant

It's been a while since I posted anything here. I have been working on a stock trading bot on a proprietary platform, so there hasn't been much that I can report in public. Over the last week, though, I have had a chance to dig into some RNA-Seq data to look at differential expression in fly transposable elements (TEs), also sometimes called repeat elements.

There are a number of packages for RNA-Seq differential expression analysis. This Wikipedia article lists some of them. How do you decide which to use? I use the post-modern approach. There's no way to test all of them, so I Google for "RNA-Seq differential expression software" and get a list of candidates. Next, I ask colleagues what they use and look at the intersection of the two sets. From there, I just pick a couple of popular packages and try them out. Using this totally unscientific approach, I settled on DESeq2 and edgeR. Both are well-established Bioconductor packages for R.

I tend to be an impatient programmer. I don't like to spend much time reading docs or scientific papers before I start coding; I'll figure out whether the software does what I want after I try it. This perhaps isn't an optimal approach, but in my experience much academic and scientific software breaks easily, and I don't want to invest a lot of time getting something to work if it doesn't fit my needs or doesn't even work as advertised. Luckily, both DESeq2 and edgeR have quick-start sections in their documentation for the impatient. One problem I ran into is that the DESeq2 examples didn't work quite as the docs said, but it was easy enough to get them working. I have to admit, however, that I don't like R as a programming tool. I know it has a package for every conceivable statistical method and Bioconductor has a package for every possible bioinformatics task, but I find the docs opaque and R's object model bordering on incomprehensible. There are actually two different object models, S3 and S4, and object methods aren't part of the class definition of an R object, so knowing what an object does, as opposed to what data it holds, is difficult. Combining that problem with opaque documentation makes life difficult for the chronically impatient. (This is the minor rant part.)

There is an updated D. melanogaster assembly, dm6, available at UCSC. I downloaded the full assembly and, since we're interested in TEs, the RepeatMasker table from the UCSC Table Browser. I ran BWA to map a collection of RNA-Seq runs against the new genome. A problem arises here. We're interested in TEs, and TEs are difficult to map because there are multiple copies of a particular TE and the copies tend to be similar at the sequence level, so it's difficult to know which copy a read actually came from. We could take a probabilistic approach and calculate the probability that a read comes from each of the copies. However, to measure differential expression we only need to know that a read came from one of the TE types, so we will simply take the top alignment for each read. To do this, we can use the BWA-MEM algorithm and eliminate supplementary alignments.
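To make that concrete, here's a minimal sketch (with hypothetical file names) of dropping non-primary alignments with pysam; supplementary and secondary alignments are flagged on each BAM record, so they're easy to skip. In the counting code below I simply skip them on the fly instead of writing a filtered file.

import pysam

# A sketch only: input and output file names are hypothetical.
# Keep the primary alignment for each read; drop supplementary (chimeric)
# and secondary (multi-mapping) alignments.
in_bam = pysam.AlignmentFile('sample.bam', 'rb')
out_bam = pysam.AlignmentFile('sample.primary.bam', 'wb', template=in_bam)
for aln in in_bam.fetch(until_eof=True):
    if not aln.is_supplementary and not aln.is_secondary:
        out_bam.write(aln)
out_bam.close()
in_bam.close()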

We have six RNA-Seq data sets: four from loxP lab-strain flies and two from Adar null flies. There are a number of packages to count reads from .bam files. Both the DESeq2 and edgeR documentation mention the Python library HTSeq as containing possible methods for counting. HTSeq is a fairly complete package for sequence manipulation, but it only runs under Python 2.x. I have mainly been using Python 3 for my Python work, so I decided to stick with it and use PySam 0.8.3 to process the files and count reads for each repeat name. I use methods similar to the code used in the ChIP-Seq examples on this blog. One thing to watch out for: PySam 0.8.x renames many of the methods and attributes, and the old names are deprecated.
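As an example of the renaming, here's a quick sketch (with a hypothetical BAM path) of a few of the new attribute names used in the code below; the old names still work for now but raise deprecation warnings:

import pysam

bam = pysam.AlignmentFile('sample.bam', 'rb')    # formerly pysam.Samfile
for aln in bam.fetch():
    name = aln.query_name                        # formerly aln.qname
    start = aln.reference_start                  # formerly aln.pos
    chrom = bam.getrname(aln.reference_id)       # formerly bam.getrname(aln.tid)
    break
bam.close()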

To count reads overlapping TE regions, we first build a genome interval tree of TE locations. The interval tree is a balanced binary tree that can be searched by genomic range. It's important that the tree stays balanced because the RepeatMasker records (like the .bam files) are sorted by position, so inserting them in order into a tree that never rebalances would leave us with an effectively linear tree. Rebalancing adds some work to the program, but it still runs in a reasonable time.
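To make the structure concrete, here's a tiny sketch with made-up coordinates and repeat names, assuming GenomeIntervalTree comes from the intervaltree-bio package (a dictionary of per-chromosome interval trees):

from intervaltree_bio import GenomeIntervalTree

tree = GenomeIntervalTree()
tree['chr2L'].addi(1000, 2000, 'roo')    # interval [1000, 2000) carrying a repeat name
tree['chr2L'].addi(1500, 2500, 'copia')
hits = tree['chr2L'].search(1800)        # all intervals overlapping position 1800
print(sorted(iv.data for iv in hits))    # ['copia', 'roo']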

Here's the code for creating RepeatMasker records and adding them to the interval tree. The dm6 assembly contains a number of short sequences that could not be fully integrated into the assembly. We only want to concentrate on the major chromosomes 2L, 2R, 3L, 3R, 4, X, and Y, so we filter out any other segments. Each RepeatMasker object contains a dictionary for counting the number of reads overlapping the repeat region in each bam file.
 
from collections import defaultdict
# GenomeIntervalTree is assumed to come from the intervaltree-bio package:
# a dictionary of interval trees keyed by chromosome name.
from intervaltree_bio import GenomeIntervalTree


class RepeatMasker(object):
    """
    A class to hold info from the UCSC RepMasker table. For details of the fields see
    https://genome.ucsc.edu/cgi-bin/hgTables
    """
    
    def __init__(self, rep_str):
        """
        rep_str - a tab delimited line from the RepeatMasker table
        """
        
        rep_list = rep_str.split('\t')

        self.swScore = int(rep_list[1])
        self.milliDiv = int(rep_list[2])
        self.milliDel = int(rep_list[3])
        self.milliIns = int(rep_list[4]) 
        self.genoName = rep_list[5]
        self.genoStart = int(rep_list[6]) # genome region start location
        self.genoEnd = int(rep_list[7])  # genome end location
        self.genoLeft = int(rep_list[8])
        self.strand = rep_list[9]
        self.repName = rep_list[10]
        self.repClass = rep_list[11]
        self.repFamily = rep_list[12]
        self.repStart = int(rep_list[13])
        self.repEnd = int(rep_list[14])
        self.repLeft = int(rep_list[15])
        self.id = int(rep_list[16])
        self.extend_start = max(self.genoStart - 1000, 0)  # extend the repeat region 1 kb upstream
        self.extend_end = self.genoEnd + 1000               # and 1 kb downstream
        self.counts = defaultdict(int)  # read counts overlapping this repeat, keyed by bam file id
          
def ReadRepeatmasker(repeatmasker_file, chroms):
    """
    ReadRepeatmasker
    Read the UCSC RepeatMasker table. We build a genome interval tree from the file.
    GenomeIntervalTree is a dictionary of interval trees, one for each chromosome we want to analyze.
    repeatmasker_file - the local file name for the RepeatMaskerTable
    chroms - a list of valid chromosome names
    """
    tree = GenomeIntervalTree()
    
    with open(repeatmasker_file, 'r') as f:
        f.readline()   # skip header
        for line in f:
            rep = RepeatMasker(line.strip())
            if rep.genoName in chroms:
                tree[rep.genoName].addi(rep.extend_start, rep.extend_end, rep)

    return tree

Here's the code for reading the .bam files and updating the counts in the tree.
 
import pysam


def ReadBamFile(bam_file, repeat_tree, id):
    """
    ReadBamFile - read a bam file and count coverage
    bam_file - input bam file
    repeat_tree - an interval tree of TE locations; the counts stored on each
                  overlapping repeat record are updated in place as reads are processed.
    id - an identifier for the file - a short name
    
    Return: nothing
    """

    def read_ok(aln, bad_reads):
        """
        read_ok - reject reads with a low quality (< 30) base call
        aln - a PySam AlignedRead object
        bad_reads - a dictionary of read ids with low quality
        returns: True if the read is ok

        We maintain a dictionary of rejected reads, so once a read is flagged
        we don't search its quality list again.
        This isn't really necessary in this case; there should not be multiple reads for a single ID.
        """
        _BASE_QUAL_CUTOFF = 30
        
        if aln.is_unmapped:
            bad_reads[aln.query_name] = 1
            return False
        
        if aln.query_name in bad_reads:
            return False

        if any(c < _BASE_QUAL_CUTOFF for c in aln.query_alignment_qualities):
            bad_reads[aln.query_name] = 1
            return False
        else:
            return True

    bad_reads = dict()
    
    in_file = pysam.AlignmentFile(bam_file, 'rb')
    for read in in_file.fetch():
        if (not read.is_supplementary) and read_ok(read, bad_reads):  # filter supplementary reads
            ref_id = in_file.getrname(read.reference_id)
            if ref_id in repeat_tree.keys():
                rep = repeat_tree[ref_id].search(read.reference_start)
                if rep:
                    for r in rep:
                        r.data.counts[id] += 1

    in_file.close()

The read code filters out supplementary alignments and any read with a base-call quality less than 30.

Once we have counts for each region, we accumulate the total counts for each repeat name in each .bam file and then print the counts as a CSV file for processing by DESeq2 and edgeR.


def AccumulateCounts(repeat_tree, bam_files):
    """
    AccumulateCounts
    Iterate through the tree and count the reads overlapping each repName for each bam_file
    repeat_tree - a GenomeIntervalTree object
    bam_files - dictionary of bam files - see Main

    returns:
    counts - a dictionary of counts overlapping each repName - counts['repName']['id']
    """
    
    counts = defaultdict(lambda : defaultdict(int))
    
    for chrom in sorted(repeat_tree.keys()):
        for interval in repeat_tree[chrom]:
            rep = interval.data
            for p in bam_files:
                counts[rep.repName][p['id']] += rep.counts[p['id']]

    return counts 

def PrintCounts(counts, bam_files, out_file):
    """
    PrintCounts
    print a csv table of counts
    counts - a dictionary returned by AccumulateCounts
    bam_files - dictionary of bam files - see Main
    out_file - csv file
    """
    
    with open(out_file, 'w') as f:
        header = ['repName']
        for p in bam_files:
            header.append(p['id'])
        print(','.join(header), file=f)

        for rep_id in sorted(counts.keys()):
            row = [rep_id]
            for p in bam_files:
                row.append(str(counts[rep_id][p['id']]))
            print(','.join(row), file=f)
    
                                    
def Main():
    repeat_file = '/users/thompson/data/genomes/fly/dm6/ucsc.rep.table.txt'
    out_file = '/users/thompson/data/fly/repeats/expression/repeat_counts.csv'
    bam_files = [{'type': 'LoxP RNA',
                  'id': 'YAS0003_AAGACG_L003_R1_001',
                  'file': '/users/thompson/data/fly/repeats/homer/bwa/output/YAS0003_AAGACG_L003_R1_001/dm6.chrom.fa.nodups.bam'},
                 {'type': 'LoxP RNA',
                  'id': 'YAS0003_AAGACG_L007_R1_001',
                  'file': '/users/thompson/data/fly/repeats/homer/bwa/output/YAS0003_AAGACG_L007_R1_001/dm6.chrom.fa.nodups.bam'},                  
                 {'type': 'LoxP RNA',
                  'id': 'YAS0004_CCTCGG_L003_R1_001',
                  'file': '/users/thompson/data/fly/repeats/homer/bwa/output/YAS0004_CCTCGG_L003_R1_001/dm6.chrom.fa.nodups.bam'},                  
                 {'type': 'LoxP RNA',
                  'id': 'YAS0004_CCTCGG_L007_R1_001',
                  'file': '/users/thompson/data/fly/repeats/homer/bwa/output/YAS0004_CCTCGG_L007_R1_001/dm6.chrom.fa.nodups.bam'},
                 {'type': 'ADAR0 RNA',
                  'id': 'YAS9_AAGCCT_L002_R1_001',
                  'file': '/users/thompson/data/fly/repeats/homer/bwa/output/YAS9_AAGCCT_L002_R1_001/dm6.chrom.fa.nodups.bam'},
                 {'type': 'ADAR0 RNA',
                  'id': 'YAS10_GTCGTA_L008_R1_001',
                  'file': '/users/thompson/data/fly/repeats/homer/bwa/output/YAS10_GTCGTA_L008_R1_001/dm6.chrom.fa.nodups.bam'}]
                                    
    chroms = ['chr2L', 'chr2R', 'chr3L', 'chr3R', 'chr4', 'chrX', 'chrY']  # chromosomes we want to examine
    
    repeat_tree = ReadRepeatmasker(repeat_file, chroms)   # read the RepeatMasker table and return an interval tree
    
    for p in bam_files:
        ReadBamFile(p['file'], repeat_tree, p['id']) 

    counts = AccumulateCounts(repeat_tree, bam_files)
    PrintCounts(counts, bam_files, out_file)


if __name__ == '__main__':
    Main()

The counting code can be downloaded here.

Once we have a table of counts in a file, we can process it with DESeq2 or edgeR. Calculating the log differential expression is straightforward with each.

Here's the code for DESeq2:

library(DESeq2)

repeats = read.table('repeat_counts.csv', sep=',', header=TRUE, row.names=1)
repCols = data.frame(row.names=colnames(repeats), type=c('LoxP', 'LoxP', 'LoxP', 'LoxP', 'ADAR0', 'ADAR0'))
rep_dds = DESeqDataSetFromMatrix(countData = repeats, colData=repCols, design = ~type)
colData(rep_dds)$type = factor(colData(rep_dds)$type, levels=c("LoxP", "ADAR0"))
rep_dds=DESeq(rep_dds)
res = results(rep_dds)
write.csv(res, file='repeats_expression.csv')


Here's the code for edgeR:

library(edgeR)

group = c(1,1,1,1,2,2)
y=DGEList(repeats, group=group)
y=calcNormFactors(y)
y=estimateCommonDisp(y)
y = estimateTagwiseDisp(y)
et = exactTest(y)
design = model.matrix(~group)
y = estimateGLMCommonDisp(y,design)
y = estimateGLMTrendedDisp(y,design)
y = estimateGLMTagwiseDisp(y,design)
fit = glmFit(y,design)
lrt = glmLRT(fit,coef=2)
write.csv(lrt$table, file='repeats_edgeR.csv')

The R code was adapted directly from the examples in the docs for each system.
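
As a quick sanity check on the two result sets, it's handy to look at them side by side. Here's a minimal sketch, assuming pandas and the output file names written above, that joins the tables on repeat name and sorts by DESeq2's adjusted p-value:

import pandas as pd

# A sketch only: assumes the CSV files written by the R code above.
deseq = pd.read_csv('repeats_expression.csv', index_col=0)  # DESeq2 results table
edger = pd.read_csv('repeats_edgeR.csv', index_col=0)       # edgeR glmLRT table

# Join on repeat name and keep the columns of interest from each package.
merged = deseq[['log2FoldChange', 'padj']].join(edger[['logFC', 'PValue']], how='inner')
print(merged.sort_values('padj').head(10))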