From FASTQ to FASTA

Probably the most common bioinformatics task is conversion between file formats. Recently, we needed to extract sequence data from a FASTQ file and write the sequences to a FASTA file. If all you want to do is extract the sequences, it's very easy in BioPython. Our task was slightly more difficult, but not by much. We also want to filter reads with low basecall quality. Our usual criteria is to reject any read with one or more bases with a quality score lower than thirty. A second, minor problem is that some of our FASQ files are in zip format, others are gzipped, and some are just text.

We will use BioPython's SeqIO module to handle reading and writing the sequences. We use Python's ZipFile module to handle zipped files and the gzip module for gzipped files. The code is relatively simple. We open the FASTQ file, after first checking for the appropriate file type; read it one record at a time; check for low quality bases; if the record passes teh check, write it to the FASTA file. Here's the code:

def ProcessFastq(fastq_file, fasta_file):
    """
    ProcessFastq - output a fast file from a fastq file

    fastq_file - fastq file name, may be a fastq file
                 or zipped or gzipped fastq file
    fasta_file - output fasta file name
                 id is the fastq id.
                 description is quality string
    """
    
    def read_ok(qual):
        """
        read_ok - reject reads with a low quality (lt 30) basecalls
        qual - a list of integer basecall qualities
        returns: True if the read is ok
        """
        if any([q < _BASE_QUAL_CUTOFF for q in qual]):
            return False
        else:
            return True

    _BASE_QUAL_CUTOFF = 30

    if re.search('\.gz$', fastq_file):
        inp = gzip.open(fastq_file, 'rb')
    elif re.search('\.zip$', fastq_file):
        z = zipfile.ZipFile(fastq_file)
        (dirName, fileName) = os.path.split(fastq_file)
        m = re.search('(.+?\.fastq)\.zip$', fileName)
        inp = z.open(m.group(1), 'r')
    else:
        inp = open(fastq_file, 'r')

    if fasta_file is not None:
        out = open(fasta_file, 'w')
    else:
        out = sys.stdout

    for rec in SeqIO.parse(inp, 'fastq'):
        if read_ok(rec.letter_annotations['phred_quality']):
            out_rec = SeqRecord(seq = Seq(str(rec.seq), generic_dna),
                                id = rec.id,
                                description = ''.join([chr(c+33) for c in rec.letter_annotations['phred_quality']]))
            SeqIO.write(out_rec, out, 'fasta')

    out.close()
    inp.close()

That's it. The rest of the program just handles getting the file names from the command line.
Download the code.

No comments:

Post a Comment