The common method of masking a genome is to replace regions that you do not wish reads to map to with N's. We create a series of sequences of the same length as the genomic sequences, but consisting of nothing but N characters. We will then insert the Hoppel sequences in their locations in the genome sequences.
The first step is to download the RepeatMasker database from UCSC. We read the RepeatMasker file and create a dictionary of RepeatMasker objects.
class RepeatMasker(object): """ RepeatMasker - class for holding UCSC Repeatmasker records """ def __init__(self, rep_str): rep_list = rep_str.split('\t') self.swScore = int(rep_list[1]) self.milliDiv = int(rep_list[2]) self.milliDel = int(rep_list[3]) self.milliIns = int(rep_list[4]) self.genoName = rep_list[5] self.genoStart = int(rep_list[6]) self.genoEnd = int(rep_list[7]) self.genoLeft = int(rep_list[8]) self.strand = rep_list[9] self.repName = rep_list[10] self.repClass = rep_list[11] self.repFamily = rep_list[12] self.repStart = int(rep_list[13]) self.repEnd = int(rep_list[14]) self.repLeft = int(rep_list[15]) self.id = int(rep_list[16])
The RepeatMasker file is read one line at a time and a dictionary is created.
def ReadRepeatMaskerFile(repeat_file): """ ReadRepeatMaskerFile - read the UCSC Repeatmasker file returns repeats - a dictionary of dictionaries of RepeatMasker objects, the primary key is the chromosome, the secondary is the repeat name the values are a list of RepeatMasker objects """ repeats = dict() with open(repeat_file, 'r') as f: f.readline() # throw away header for line in f: rep = RepeatMasker(line.strip()) if rep.genoName not in repeats.keys(): repeats[rep.genoName] = dict() if rep.repName not in repeats[rep.genoName].keys(): repeats[rep.genoName][rep.repName] = [] repeats[rep.genoName][rep.repName].append(rep) return repeats
We also read the genome sequences and use them as a model for our masked genome, inserting the Hoppel sequences in the correct locations.
def InsertHoppelSeqs(seq_recs, repeats): """ InsertHoppelSeqs - create a masked genome with everything but the Hoppel sequences masked PROTOP is an alias for Hoppel seq_rec - a dictionary of SeqRecord objects containing the genome sequences repeats - a dictioary of RepeatMasker objects returns masked_seqs - a dictionary of SeqRecordObjects with Hoppel sequences inserted """ # keep track of memory usage count = 0 print 'start', resource.getrusage(resource.RUSAGE_SELF).ru_maxrss sys.stdout.flush() # create the masked genome as a list of N's seq_list = dict() for chrom in seq_recs.keys(): seq_list[chrom] = ['N'] * len(seq_recs[chrom].seq) print 'repeats', resource.getrusage(resource.RUSAGE_SELF).ru_maxrss sys.stdout.flush() # iterate through the sequences, then through the repeats # for each Hoppel repeat, insert its sequence in the correct location for chrom in seq_recs.keys(): for id in ['PROTOP', 'PROTOP_A', 'PROTOP_B']: if id in repeats[chrom].keys(): for rep in repeats[chrom][id]: if count % 100 == 0: print count, resource.getrusage(resource.RUSAGE_SELF).ru_maxrss sys.stdout.flush() seq_list[chrom][rep.genoStart:rep.genoEnd-1] = seq_recs[chrom].seq[rep.genoStart:rep.genoEnd-1] count += 1 # create a dictionary of SeqRecord objects masked_seqs = dict() for chrom in seq_recs.keys(): masked_seqs[chrom] = SeqRecord(Seq.Seq(''.join(seq_list[chrom]), generic_dna), id = chrom, description='masked') print chrom, resource.getrusage(resource.RUSAGE_SELF).ru_maxrss sys.stdout.flush() return masked_seqs
Download the programs
No comments:
Post a Comment