Supplementary Table 29: List of previously known editing sites in Drosophila | ||||||
Gene Name | Genome Chr | Genome Strand | Genome Pos | Previous Annotation | Identified by modENCODE | |
unc-13-RB | 4 | T | 895009 | Reenan_unc13_A | YES | |
unc-13-RB | 4 | T | 899596 | YES | ||
Caps-RA | 4 | A | 1268833 | Reenan_CAPS_A | YES | |
nAcRbeta-21C-RB | 2L | A | 546751 | YES | ||
syt-RA | 2L | T | 2785542 | Reenan_Syt_D | YES | |
syt-RA | 2L | T | 2785610 | Reenan_Syt_C | YES | |
syt-RA | 2L | T | 2785621 | Reenan_Syt_B | YES | |
.... |
We will read the table and create a collection of objects containing the site information, along with a few counters to keep track of how many reads contain the site and how many reads contain possible evidence of editing.
The definition of the edit site objects mimics the rows of the table.
class EditSite(object):
"""
EditSite - edit site class for rows of spreadsheet
The __init__ routine is passed a Pandas data frame row.
Pandas converts integer in the spreadsheet to floats by default
"""
def __init__(self, data):
self.gene = data[0]
if type(data[1]) == float:
self.chrom = 'chr' + str(int(data[1]))
else:
self.chrom = 'chr' + data[1]
self.strand = data[2]
self.position = int(data[3])
self.annotation = data[4]
self.mod_encode = data[5] == 'YES'
self.edit_count = 0 # count of possible edit sites
self.reads = 0
self.A_G_hits = 0
self.T_C_hits = 0
The table is stored in an Excel worksheet. We will read the data from the spreadsheet file with Pandas. The inititialization routine for the EditSite objects is passed the row as read by Pandas and it simply fills in the values of the object.
There are many python libraries for reading Excel-like spreadsheets. The easy way to deal with data in .xlsx or .xls files is to not deal with them. Open the spreadsheet in Excel or LibreOffice Calc and save it as a comma separated variable (csv) or tab delimited data file. Reading these type of files is trivially easy in python or perl. Instead, I decided to be a tiny bit more adventurous and use Pandas to read the worksheet.
From the Pandas website: pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive. Pandas is useful for reading tabular data into data frame structures similar to R's data frame tables. It's also useful for handling time series and HDF5 data.
Here's how we read a table from a spreadsheet with Pandas:
def ReadExcel(file):
"""
ReadExcel - read the supplementary table 29 from the supplement spreadsheet
of the SMS paper
Uses Pandas to read the table into a data frame
return:
sites - a dictionary of EditSite objects keyed by chromosome
"""
with warnings.catch_warnings(): # ignore useless warnings on oscar
warnings.simplefilter("ignore")
xl = pd.ExcelFile(file)
sites = dict()
data = xl.parse('Table 29', index_col = None, header = None)
(rows, cols) = data.shape
for row in xrange(3, rows):
site = EditSite(data.ix[row])
if site.chrom not in sites:
sites[site.chrom] = []
sites[site.chrom].append(site)
return sites
This routine is passed the name of the spreadsheet file containing Table 29. It opens the file and reads Table 29 one row at a time.
This bit of business:
with warnings.catch_warnings(): # ignore useless warnings on oscar
warnings.simplefilter("ignore")
xl = pd.ExcelFile(file)
is to avoid a warning caused by the python module readline already having been imported by default on our system.
The routine creates a dictionary, keyed by chromosome name. The dictionary entries are lists of EditSite objects.
Given a .bam file, we can then loop through the reads overlapping the dcoumented edit sites and see if the read contains an A->G (T->C in reverse complement) change, i.e. does the genome contain an A and the read shows a G at the documented position. This doesn't prove that this is an edit site, it might still be a SNP or a read error, but a collection of differences from the genome in reads at a position can contribute evidence of editing and make the position worthwhile for further investigation.
We can use PySam's pileup routine to limit our search to positions in our list of possible edit sites. pileup allows you to specify a region to search. It will return all reads overlapping the region. After a bit of experimenting, I found that the current version requires a region of at least 20 bases. We compare the base at the current position in the read to the genome. If it contains a possible edit, we count it.
def SearchEditSites(bam, sites, chrom, genome_file):
"""
SearchEditSites - read a bam file and seach the list of sites for
possible edits
bam - a handle to a bam file
sites - the dictionary of documented edit sites from the spreadsheet
chrom - a chromosome name
genome_file - the name of a fasta file containing the chromosome sequence
"""
# read the genome sequence
seq = ReadFasta(genome_file)
seq_str = str(seq.seq)
# we use loop through the sites for this chromosome and call pileup
# we look for reads in the region of the site +/- 10 bases
# this is due to a feature/bug in pysam that requires at least 20 bases
# in a region,
# we check for the site position and then look for mismatches
for site in sites[chrom]:
for pileup_col in bam.pileup(chrom, site.position-10, site.position+10):
if pileup_col.pos == site.position-1: # the sites are 1 based
for pileup_read in pileup_col.pileups:
if not pileup_read.indel and not pileup_read.is_del:
site.reads += 1
aln = pileup_read.alignment
if not aln.is_reverse:
if seq_str[pileup_col.pos] == 'A' and
aln.seq[pileup_read.qpos] == 'G':
site.edit_count += 1
site.A_G_hits += 1
else:
if seq_str[pileup_col.pos] == 'T' and
aln.seq[pileup_read.qpos] == 'C':
site.edit_count += 1
site.T_C_hits += 1
Once we have the counts for each site, we just print them.
Download the entire program.
* St Laurent, G., M. R. Tackett, S. Nechkin, D. Shtokalo, D. Antonets, Y. A. Savva, R. Maloney, P. Kapranov, C. E. Lawrence and R. A. Reenan (2013). "Genome-wide analysis of A-to-I RNA editing by single-molecule sequencing in Drosophila." Nat Struct Mol Biol 20(11): 1333-1339.
No comments:
Post a Comment