We will use BioPython's SeqIO module to handle reading and writing the sequences. We use Python's ZipFile module to handle zipped files and the gzip module for gzipped files. The code is relatively simple. We open the FASTQ file, after first checking for the appropriate file type; read it one record at a time; check for low quality bases; if the record passes teh check, write it to the FASTA file. Here's the code:
def ProcessFastq(fastq_file, fasta_file): """ ProcessFastq - output a fast file from a fastq file fastq_file - fastq file name, may be a fastq file or zipped or gzipped fastq file fasta_file - output fasta file name id is the fastq id. description is quality string """ def read_ok(qual): """ read_ok - reject reads with a low quality (lt 30) basecalls qual - a list of integer basecall qualities returns: True if the read is ok """ if any([q < _BASE_QUAL_CUTOFF for q in qual]): return False else: return True _BASE_QUAL_CUTOFF = 30 if re.search('\.gz$', fastq_file): inp = gzip.open(fastq_file, 'rb') elif re.search('\.zip$', fastq_file): z = zipfile.ZipFile(fastq_file) (dirName, fileName) = os.path.split(fastq_file) m = re.search('(.+?\.fastq)\.zip$', fileName) inp = z.open(m.group(1), 'r') else: inp = open(fastq_file, 'r') if fasta_file is not None: out = open(fasta_file, 'w') else: out = sys.stdout for rec in SeqIO.parse(inp, 'fastq'): if read_ok(rec.letter_annotations['phred_quality']): out_rec = SeqRecord(seq = Seq(str(rec.seq), generic_dna), id = rec.id, description = ''.join([chr(c+33) for c in rec.letter_annotations['phred_quality']])) SeqIO.write(out_rec, out, 'fasta') out.close() inp.close()
That's it. The rest of the program just handles getting the file names from the command line.
Download the code.
No comments:
Post a Comment