The Masked Genome

Often, a genome is masked to replace repeat structures in order to map reads to the genome sequences and avoid problems caused by reads being mapped multiple times to repeats. However, we're interested in what happens in transposons and repeat elements, so we're going to take the opposite approach. We will create a set of genome sequences with everything but the sequences of certain repeat elements masked.The particular type of transposable element that we are interested in is Hoppel, also know as Proto-P. There are a little over 3000 such elements listed in the drosophila melanogaster genome.

The common method of masking a genome is to replace regions that you do not wish reads to map to with N's. We create a series of sequences of the same length as the genomic sequences, but consisting of nothing but N characters. We will then insert the Hoppel sequences in their locations in the genome sequences.

The first step is to download the RepeatMasker database from UCSC. We read the RepeatMasker file and create a dictionary of RepeatMasker objects.

class RepeatMasker(object):
    """
    RepeatMasker - class for holding UCSC Repeatmasker records
    """
    def __init__(self, rep_str):
        rep_list = rep_str.split('\t')

        self.swScore = int(rep_list[1])
        self.milliDiv = int(rep_list[2])
        self.milliDel = int(rep_list[3])
        self.milliIns = int(rep_list[4]) 
        self.genoName = rep_list[5]
        self.genoStart = int(rep_list[6])
        self.genoEnd = int(rep_list[7])
        self.genoLeft = int(rep_list[8])
        self.strand = rep_list[9]
        self.repName = rep_list[10]
        self.repClass = rep_list[11]
        self.repFamily = rep_list[12]
        self.repStart = int(rep_list[13])
        self.repEnd = int(rep_list[14])
        self.repLeft = int(rep_list[15])
        self.id = int(rep_list[16]) 

The RepeatMasker file is read one line at a time and a dictionary is created.
def ReadRepeatMaskerFile(repeat_file):
    """
    ReadRepeatMaskerFile - read the UCSC Repeatmasker file
    returns
    repeats - a dictionary of dictionaries of RepeatMasker objects,
              the primary key is the chromosome, the secondary is the repeat name
              the values are a list of RepeatMasker objects
    """

    repeats = dict()

    with open(repeat_file, 'r') as f:
        f.readline()  # throw away header
        for line in f:
            rep = RepeatMasker(line.strip())
            if rep.genoName not in repeats.keys():
                repeats[rep.genoName] = dict()

            if rep.repName not in repeats[rep.genoName].keys():
                repeats[rep.genoName][rep.repName] = [] 
            repeats[rep.genoName][rep.repName].append(rep)

    return repeats

We also read the genome sequences and use them as a model for our masked genome, inserting the Hoppel sequences in the correct locations.


def InsertHoppelSeqs(seq_recs, repeats):
    """
    InsertHoppelSeqs - create a masked genome with everything but the Hoppel sequences masked
                       PROTOP is an alias for Hoppel
    seq_rec - a dictionary of SeqRecord objects containing the genome sequences
    repeats - a dictioary of RepeatMasker objects
    returns
    masked_seqs - a dictionary of SeqRecordObjects with Hoppel sequences inserted
    """

    # keep track of memory usage
    count = 0
    print 'start', resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    sys.stdout.flush()

    # create the masked genome as a list of N's
    seq_list = dict()
    for chrom in seq_recs.keys():
        seq_list[chrom] = ['N'] * len(seq_recs[chrom].seq)
    
    print 'repeats', resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    sys.stdout.flush()

    # iterate through the sequences, then through the repeats
    # for each Hoppel repeat, insert its sequence in the correct location
    for chrom in seq_recs.keys():
        for id in ['PROTOP', 'PROTOP_A', 'PROTOP_B']:
            if id in repeats[chrom].keys():
                for rep in repeats[chrom][id]:
                    if count % 100 == 0:
                        print count, resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
                        sys.stdout.flush()

                    seq_list[chrom][rep.genoStart:rep.genoEnd-1] = seq_recs[chrom].seq[rep.genoStart:rep.genoEnd-1]
                    count += 1

    # create a dictionary of SeqRecord objects
    masked_seqs = dict()
    for chrom in seq_recs.keys():                
        masked_seqs[chrom] = SeqRecord(Seq.Seq(''.join(seq_list[chrom]), generic_dna),
                                       id = chrom,
                                       description='masked')
        print chrom, resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
        sys.stdout.flush()
                    
    return masked_seqs

Download the programs

No comments:

Post a Comment