Julia is dynamically typed, feels like a scripting language, and has good support for interactive use.
~ The Julia Programming Language
~ The Julia Programming Language
Did we mention it should be as fast as C?
~ Jeff Bezanson, Stefan Karpinski, Viral B. Shah, Alan Edelman
I have been interested in the Julia language for sometime, but haven't had a chance to do much with it. The recent release of version 1.6 piqued my interest. I decided to try a simple exercise to get back into Julia programming. In particular, I wanted to try BioJulia.
In a previous post, I discussed searching for mutations in SARS-CoV-2 sequence data. The original code was written in Python and used BioPython and pandas. I wanted to compare the Python implementation to a Julia version.
One of the first tasks in the mutation pipeline was to reformat the FASTA headers in the genome sequences downloaded from GISAID. This is a simple exercise: read the sequence file, one record at a time; reformat the FASTA header ID to match the the strain ID in the GISAID metafile; add the PANGO lineage, country of exposure, and gisaid_epi_isl ID to the description; and finally write the sequence to a new file. In addition the program ignores records that are not included in the GISAID metafile or have more than a certain number of N's in the sequence. It will also optionally convert U's in the sequence to T's.
Let's Get Some Arguments
The first action performed by the Python version was to read arguments for input and output file paths, the location of the metafile, the maximum number of allowed N's, and whether to convert U's to N's (nucleotides that could not be properly identified). It uses the argparse library to get the the arguments.
Here's the code.
def GetArgs(): def ParseArgs(parser): class Parser(argparse.ArgumentParser): def error(self, message): sys.stderr.write('error: %s\n' % message) self.print_help() sys.exit(2) parser = Parser(description='Reformat FASTA headers to include lineage, isl id, and country of exposure.') parser.add_argument('-i', '--input_file', required = True, help = 'Input file (required). Fasta file from GISAID.', type = str) parser.add_argument('-m', '--meta_file', required = True, help = 'Meta file (required). GISAID meta file corresponding to FASTA file.', type = str) parser.add_argument('-o', '--output_file', required = False, help = 'Output fasta file. Default: write to screen.', type = str) parser.add_argument('-r', '--replace', required = False, default = False, help = "Replace U's with T's in sequence.", action = 'store_true') parser.add_argument('-n', '--min_Ns', required = False, default = 20, help = "Minimum allowed number of N's in a row. (default = 20)", type = int) return parser.parse_args() parser = argparse.ArgumentParser() args = ParseArgs(parser) return args
Julia has a similar library, argparse.
function GetArgs() # from https://argparsejl.readthedocs.io/en/latest/argparse.html s = ArgParseSettings(description = "Reformat FASTA headers to include lineage, isl id, and country of exposure.", version = "1.0", add_version = true) @add_arg_table s begin "--input_file" required = true help = "Input file (required). Fasta file from GISAID." arg_type = String "--meta_file" required = true help = "Meta file (required). GISAID meta file corresponding to FASTA file." arg_type = String "--output_file" required = false help = "Output FASTA file. (default: write to screen)" arg_type = String "--replace" help = "Replace U's with T's in sequence." action = :store_true "--n" required = false default = 20 help = "Minimum allowed number of N's in a row. (default = 20)" arg_type = Int end return parse_args(s) end
Reformatting
The code for parsing the the headers is relatively simple. It reads the metafile into a dataframe and then reads the input file one sequence at a time. For each sequence record, it checks to see if the sequence contains too many N's. It does some simple string manipulations to reformat the header text and checks to see if the reformatted sequence ID is contained in the metafile. Finally, it write the sequence with the reformatted header to the output file.
In Python, I used Bio.SeqIO and Bio.Seq from BioPython to manipulate the sequences. BioJulia provides FASTX to read and manipulate FASTA and FASTQ files. The Julia equivalent of pandas is DataFrames.
Here's what it looks like in Python.
def main(): args = GetArgs() msa_file = args.input_file meta_file = args.meta_file out_file = args.output_file replace_u = args.replace min_Ns = args.min_Ns ns = 'N' * min_Ns if out_file is not None: f = open(out_file, 'w') else: f = sys.stdout df = pd.read_csv(meta_file, sep = '\t', dtype={'location': str}) for rec in SeqIO.parse(msa_file, 'fasta'): if str(rec.seq).find(ns) != -1: print("Too many N's:", rec.description, file = sys.stderr) continue items1 = rec.description.split('|') items2 = items1[0].split('/') new_id = '/'.join(items2[1:]).replace(' ', '') new_id = new_id.replace("'", '-') new_id = new_id.replace(',','') df2 = df[df['strain'] == new_id] if df2.size == 0: print('Missing ID in metafile:', rec.description, file = sys.stderr) continue new_desc = '|'.join([rec.description, str(df2['pango_lineage'].values[0]), str(df2['country_exposure'].values[0]), str(df2['gisaid_epi_isl'].values[0])]) if not replace_u: seq = rec.seq else: seq = Seq(str(rec.seq).replace('U', 'T')) new_rec = SeqRecord(seq = seq, id = new_id, description = new_desc) SeqIO.write(new_rec, f, 'fasta') if out_file is not None: f.close()
The Julia code is so similar that you could easily mistake it for Python.
function main() args = GetArgs() in_file = args["input_file"] meta_file = args["meta_file"] out_file = args["output_file"] replace_u = args["replace"] min_Ns = args["n"] ns = repeat("N", min_Ns) if ! isnothing(out_file) f = open(FASTA.Writer, out_file) else f = stdout end df = DataFrame(CSV.File(meta_file, delim = '\t')) reader = open(FASTA.Reader, in_file) for rec in reader seq = FASTA.sequence(rec) if ! isnothing(findfirst(ns, convert(String, seq))) println(stderr, "Too many N's: ", FASTA.identifier(rec)) continue end # BioJulia handes description differntly from BioPython ident = FASTA.identifier(rec) desc = FASTA.description(rec) if ! isnothing(desc) ident = ident * " " * desc end items1 = split(ident, "|") items2 = split(items1[1], "/") new_id = join(items2[2:end], "/") new_id = replace(new_id, " " => "") new_id = replace(new_id, "'" => "-") new_id = replace(new_id, "," => "") df2 = df[df.strain .== new_id, :] if size(df2)[1] == 0 println(stderr, "Missing ID in metafile: ", FASTA.identifier(rec)) continue end new_desc = join([FASTA.identifier(rec) df2[:, "pango_lineage"] df2[:, "country_exposure"] df2[:, "gisaid_epi_isl"]], "|") if ! replace_u new_seq = convert(String, seq) else new_seq = replace(convert(String, seq), "U" => "T") end new_rec = FASTA.Record(new_id, new_desc, new_seq) write(f, new_rec) end close(f)
Who Wins the Race?
Julia has a reputation for being fast, but this code doesn't really play to Julia's strengths. The code is really I/O-bound and not compute intensive, so I didn't expect a huge difference in performance. One thing that hampered Julia in the past was that it compiled libraries the first time they were loaded into the program. This added quite a bit of overhead, especially for short programs. With version 1.6, you can pre-compile the libraries into a project and activate the project at the start of the program. This reduces start time considerably. Load time is on a par with Python.
I ran the two version on the same data.
$ time python gisaid_reformat_fasta_headers.py -i gisaid_hcov_19_sequences.fasta -m metadata.tsv -o gisaid_hcov_19_reformat.fasta -r -n 10 2> missing_ids.txt real 236m21.529s user 197m10.257s sys 0m34.297s $ time julia gisaid_reformat_fasta_headers.jl --i gisaid_hcov_19_sequences.fasta --m metadata.tsv --o gisaid_hcov_19_reformat.fasta --n 10 --r 2> missing.txt real 66m40.699s user 50m20.572s sys 0m22.900s
The code produces the same output, but Julia is a factor of ~3.5 times faster. Further investigation showed that pandas loads the metafile more quickly (4.14 seconds) than DataFrames (32.69 seconds), but those times are inconsequential in the overall timing.
Code was formatted using http://hilite.me/.
No comments:
Post a Comment