Getting the Sequence Length

Sequence data is often stored in Fasta format. Fasta files are text files, so to access them, you can use standard I/O routines to read and write fasta files. However, libraries like BioPerl and BioPython provide a convenient way to manipulate these files.

Fasta files often contain multiple sequences. Here's a simple example of using BioPython to read a Fasta file and return a list of SeqRecord objects. SeqRecord objects contain sequence data, the sequence ID, possibly annotation and other information. It's a uniform object for storing sequence data.

def ReadFasta(file):
    """
    ReadFasta - read a sequence from a fasta formatted file.

    file - fasta file name
    returns: a list of BioPython sequence record
    """

    seq_list = []
    handle = open(file, 'rU')
    for record in  SeqIO.parse(handle, 'fasta'):
        seq_list.append(record)
    handle.close()

    return seq_list




That's it. SeqIO is an interface object for reading all kinds of  sequence types. The parse routine is an iterator that retrieves a sequence from a  file and returns a SeqRecord.

Here's a complete program for getting the lengths of a collection of sequences in a Fasta file. It prints the sequence ID and the sequence length.

download

import sys
import argparse
from Bio import SeqIO


def ReadFasta(file):
    """
    ReadFasta - read a sequence from a fasta formatted file.

    file - fasta file name
    returns: a list of BioPython sequence record
    """

    seq_list = []
    handle = open(file, 'rU')
    for record in  SeqIO.parse(handle, 'fasta'):
        seq_list.append(record)
    handle.close()

    return seq_list



def GetArgs():
    """
    GetArgs - read the command line
    returns - a bam file name, a genome file name, and an output file name
    bam file is required
    genome file is optional and has a default
    output file is option, if not given, output is written to stdout

    typing python phiX.count.py will show the options
    """

    def ParseArgs(parser):
        class Parser(argparse.ArgumentParser):
            def error(self, message):
                sys.stderr.write('error: %s\n' % message)
                self.print_help()
                sys.exit(2)

        parser = Parser(description='Get the length of a fasta sequence.')
        parser.add_argument('-f', '--fasta_file',
                            type=str,
                            required=True,
                            help='Fasta File (required).')
        return parser.parse_args()

    parser = argparse.ArgumentParser()
    args = ParseArgs(parser)

    return args.fasta_file


def Main():
    fasta_file = GetArgs()

    seqs = ReadFasta(fasta_file)
    for seq in seqs:
        print seq.id, len(seq)
   

if __name__ == '__main__':
    Main()
 

No comments:

Post a Comment