Tracking Edits

Previously, we looked at one way to map reads to genes to annotate expressed transcripts with GO annotation. Now, I'd like to look at a collection of sites known to be edited under certain circumstances and see if our sequence reads confirm that A to I RNA editing may have taken place. Supplementary table 29 from this paper (St Laurent et al.)* contains a list of previously know editing sites. The top few lines of the table look like this:


Supplementary Table 29: List of previously known editing sites in Drosophila







Gene Name Genome Chr Genome Strand Genome Pos Previous Annotation Identified by modENCODE
unc-13-RB 4 T 895009  Reenan_unc13_A YES
unc-13-RB 4 T 899596   YES
Caps-RA 4 A 1268833  Reenan_CAPS_A YES
nAcRbeta-21C-RB 2L A 546751   YES
syt-RA 2L T 2785542  Reenan_Syt_D YES
syt-RA 2L T 2785610  Reenan_Syt_C YES
syt-RA 2L T 2785621  Reenan_Syt_B YES
....







 We will read the table and create a collection of objects containing the site information, along with a few counters to keep track of how many reads contain the site and how many reads contain possible evidence of editing.

The definition of the edit site objects mimics the rows of the table.

class EditSite(object):
    """
    EditSite - edit site class for rows of spreadsheet
    The __init__ routine is passed a Pandas data frame row.
    Pandas converts integer in the spreadsheet to floats by default
    """

    def __init__(self, data):
        self.gene = data[0]
        if type(data[1]) == float:
            self.chrom = 'chr' + str(int(data[1]))
        else:
            self.chrom = 'chr' +  data[1]
        self.strand = data[2]
        self.position = int(data[3])
        self.annotation = data[4]
        self.mod_encode = data[5] == 'YES'

        self.edit_count = 0   # count of possible edit sites
        self.reads = 0
        self.A_G_hits = 0
        self.T_C_hits = 0

The table is stored in an Excel worksheet. We will read the data from the spreadsheet file with Pandas. The inititialization routine for the EditSite objects is passed the row as read by Pandas and it simply fills in the values of the object.

There are many python libraries for reading Excel-like spreadsheets. The easy way to deal with data in .xlsx or .xls files is to not deal with them. Open the spreadsheet in Excel or LibreOffice Calc and save it as a comma separated variable (csv) or tab delimited data file. Reading these type of files is trivially easy in python or perl. Instead, I decided to be a tiny bit more adventurous and use Pandas to read the worksheet.

From the Pandas website: pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive. Pandas is useful for reading tabular data into data frame structures similar to R's data frame tables. It's also useful for handling time series and HDF5 data.

 Here's how we read a table from a spreadsheet with Pandas:

def ReadExcel(file):
    """
    ReadExcel - read the supplementary table 29 from the supplement spreadsheet
                of the SMS paper
    Uses Pandas to read the table into a data frame
   
    return:
    sites - a dictionary of EditSite objects keyed by chromosome
    """
    with warnings.catch_warnings():   # ignore useless warnings on oscar
        warnings.simplefilter("ignore")
        xl = pd.ExcelFile(file)

    sites = dict()

    data = xl.parse('Table 29', index_col = None, header = None)
    (rows, cols) = data.shape
    for row in xrange(3, rows):
        site = EditSite(data.ix[row])
        if site.chrom not in sites:
            sites[site.chrom] = []
        sites[site.chrom].append(site)

    return sites



This routine is passed the name of the spreadsheet file containing Table 29. It opens the file and reads Table 29 one row at a time.

This bit of business:

     with warnings.catch_warnings():   # ignore useless warnings on oscar
        warnings.simplefilter("ignore")
        xl = pd.ExcelFile(file)



is to avoid a warning caused by the python module readline already having been imported by default on our system.

The routine creates a dictionary, keyed by chromosome name. The dictionary entries are lists of EditSite objects.

Given a .bam file, we can then loop through the reads overlapping the dcoumented edit sites and see if the read contains an A->G (T->C in reverse complement) change, i.e. does the genome contain an A and the read shows a G at the documented position. This doesn't prove that this is an edit site, it might still be a SNP or a read error, but a collection of differences from the genome in reads at a position can contribute evidence of editing and make the position worthwhile for further investigation.

We can use PySam's pileup routine to limit our search to positions in our list of possible edit sites. pileup allows you to specify a region to search. It will return all reads overlapping the region. After a bit of experimenting, I found that the current version requires a region of at least 20 bases. We compare the base at the current position in the read to the genome. If it contains a possible edit, we count it.

def SearchEditSites(bam, sites, chrom, genome_file):
    """
    SearchEditSites - read a bam file and seach the list of sites for

                      possible edits
    bam - a handle to a bam file
    sites - the dictionary of documented edit sites from the spreadsheet
    chrom - a chromosome name
    genome_file - the name of a fasta file containing the chromosome sequence
    """

    # read the genome sequence
    seq = ReadFasta(genome_file)
    seq_str = str(seq.seq)

    # we use loop through the sites for this chromosome and call pileup
    # we look for reads in the region of the site +/- 10 bases
    # this is due to a feature/bug in pysam that requires at least 20 bases

    # in a  region,
    # we check for the site position and then look for mismatches
    for site in sites[chrom]:
        for pileup_col in bam.pileup(chrom, site.position-10, site.position+10):
            if pileup_col.pos == site.position-1: # the sites are 1 based
                for pileup_read in pileup_col.pileups:
                    if not pileup_read.indel and not pileup_read.is_del:
                        site.reads += 1
                        aln = pileup_read.alignment
                        if not aln.is_reverse:
                            if seq_str[pileup_col.pos] == 'A' and

                               aln.seq[pileup_read.qpos] == 'G':
                                site.edit_count += 1
                                site.A_G_hits += 1
                        else:
                            if seq_str[pileup_col.pos] == 'T' and  

                               aln.seq[pileup_read.qpos] == 'C':
                                site.edit_count += 1
                                site.T_C_hits += 1
                       

               
Once we have the counts for each site, we just print them.

Download the entire program.


* St Laurent, G., M. R. Tackett, S. Nechkin, D. Shtokalo, D. Antonets, Y. A. Savva, R. Maloney, P. Kapranov, C. E. Lawrence and R. A. Reenan (2013). "Genome-wide analysis of A-to-I RNA editing by single-molecule sequencing in Drosophila." Nat Struct Mol Biol 20(11): 1333-1339.

No comments:

Post a Comment