Sequence data is often stored in Fasta format. Fasta files are text files, so to access them, you can use standard I/O routines to read and write fasta files. However, libraries like BioPerl and BioPython provide a convenient way to manipulate these files.
Fasta files often contain multiple sequences. Here's a simple example of using BioPython to read a Fasta file and return a list of SeqRecord objects. SeqRecord objects contain sequence data, the sequence ID, possibly annotation and other information. It's a uniform object for storing sequence data.
def ReadFasta(file):
    """
    ReadFasta - read a sequence from a fasta formatted file.
    file - fasta file name
    returns: a list of BioPython sequence record
    """
    seq_list = []
    handle = open(file, 'rU')
    for record in  SeqIO.parse(handle, 'fasta'):
        seq_list.append(record)
    handle.close()
    return seq_list
That's it. SeqIO is an interface object for reading all kinds of  sequence types. The parse routine is an iterator that retrieves a sequence from a  file and returns a SeqRecord.
Here's a complete program for getting the lengths of a collection of sequences in a Fasta file. It prints the sequence ID and the sequence length.
download
import sys
import argparse
from Bio import SeqIO
def ReadFasta(file):
    """
    ReadFasta - read a sequence from a fasta formatted file.
    file - fasta file name
    returns: a list of BioPython sequence record
    """
    seq_list = []
    handle = open(file, 'rU')
    for record in  SeqIO.parse(handle, 'fasta'):
        seq_list.append(record)
    handle.close()
    return seq_list
def GetArgs():
    """
    GetArgs - read the command line
    returns - a bam file name, a genome file name, and an output file name
    bam file is required
    genome file is optional and has a default
    output file is option, if not given, output is written to stdout
    typing python phiX.count.py will show the options
    """
    def ParseArgs(parser):
        class Parser(argparse.ArgumentParser):
            def error(self, message):
                sys.stderr.write('error: %s\n' % message)
                self.print_help()
                sys.exit(2)
        parser = Parser(description='Get the length of a fasta sequence.')
        parser.add_argument('-f', '--fasta_file',
                            type=str,
                            required=True,
                            help='Fasta File (required).')
        return parser.parse_args()
    parser = argparse.ArgumentParser()
    args = ParseArgs(parser)
    return args.fasta_file
def Main():
    fasta_file = GetArgs()
    seqs = ReadFasta(fasta_file)
    for seq in seqs:
        print seq.id, len(seq)
    
if __name__ == '__main__':
    Main()
 
No comments:
Post a Comment