The Analytic Garden: BioJulia and BioPython

Julia is dynamically typed, feels like a scripting language, and has good support for interactive use.
~ The Julia Programming Language

Did we mention it should be as fast as C?

~ Jeff Bezanson, Stefan Karpinski, Viral B. Shah, Alan Edelman

I have been interested in the Julia language for sometime, but haven't had a chance to do much with it. The recent release of version 1.6 piqued my interest. I decided to try a simple exercise to get back into Julia programming. In particular, I wanted to try BioJulia.

In a previous post, I discussed searching for mutations in SARS-CoV-2 sequence data. The original code was written in Python and used BioPython and pandas. I wanted to compare the Python implementation to a Julia version.

One of the first tasks in the mutation pipeline was to reformat the FASTA headers in the genome sequences downloaded from GISAID. This is a simple exercise: read the sequence file, one record at a time; reformat the FASTA header ID to match the the strain ID in the GISAID metafile; add the PANGO lineage, country of exposure, and gisaid_epi_isl ID to the description; and finally write the sequence to a new file. In addition the program ignores records that are not included in the GISAID metafile or have more than a certain number of N's in the sequence. It will also optionally convert U's in the sequence to T's.

Let's Get Some Arguments

The first action performed by the Python version was to read arguments for input and output file paths, the location of the metafile, the maximum number of allowed N's, and whether to convert U's to N's (nucleotides that could not be properly identified). It uses the argparse library to get the the arguments.

Here's the code.

def GetArgs():
    def ParseArgs(parser):
        class Parser(argparse.ArgumentParser):
            def error(self, message):
                sys.stderr.write('error: %s\n' % message)
                self.print_help()
                sys.exit(2)
                
        parser = Parser(description='Reformat FASTA headers to include lineage, isl id, and country of exposure.')
        parser.add_argument('-i', '--input_file',
                            required = True,
                            help = 'Input file (required). Fasta file from GISAID.',
                            type = str)
        parser.add_argument('-m', '--meta_file',
                            required = True,
                            help = 'Meta file (required). GISAID meta file corresponding to FASTA file.',
                            type = str)
        parser.add_argument('-o', '--output_file',
                            required = False,
                            help = 'Output fasta file. Default: write to screen.',
                            type = str)
        parser.add_argument('-r', '--replace',
                            required = False,
                            default = False,
                            help = "Replace U's with T's in sequence.",
                            action = 'store_true')
        parser.add_argument('-n', '--min_Ns',
                            required = False,
                            default = 20,
                            help = "Minimum allowed number of N's in a row. (default = 20)",
                            type = int)

        return parser.parse_args()

    parser = argparse.ArgumentParser()
    args = ParseArgs(parser)
    
    return args

Julia has a similar library, argparse.

function GetArgs()
    # from https://argparsejl.readthedocs.io/en/latest/argparse.html
    s = ArgParseSettings(description = "Reformat FASTA headers to include lineage, isl id, and country of exposure.",
                         version = "1.0",
                         add_version = true)

    @add_arg_table s begin
        "--input_file"
        required = true
        help = "Input file (required). Fasta file from GISAID."
        arg_type = String

        "--meta_file"
        required = true
        help = "Meta file (required). GISAID meta file corresponding to FASTA file."
        arg_type = String

        "--output_file"
        required = false
        help = "Output FASTA file. (default: write to screen)"
        arg_type = String

        "--replace"
        help = "Replace U's with T's in sequence."
        action = :store_true

        "--n"
        required = false
        default = 20
        help = "Minimum allowed number of N's in a row. (default = 20)"
        arg_type = Int
    end

    return parse_args(s)
end

Reformatting

The code for parsing the the headers is relatively simple. It reads the metafile into a dataframe and then reads the input file one sequence at a time. For each sequence record, it checks to see if the sequence contains too many N's. It does some simple string manipulations to reformat the header text and checks to see if the reformatted sequence ID is contained in the metafile. Finally, it write the sequence with the reformatted header to the output file.

In Python, I used Bio.SeqIO and Bio.Seq from BioPython to manipulate the sequences. BioJulia provides FASTX to read and manipulate FASTA and FASTQ files. The Julia equivalent of pandas is DataFrames.

Here's what it looks like in Python.

def main():
    args = GetArgs()
    msa_file = args.input_file
    meta_file = args.meta_file
    out_file = args.output_file
    replace_u = args.replace
    min_Ns = args.min_Ns
    
    ns = 'N' * min_Ns
    
    if out_file is not None:
        f = open(out_file, 'w')
    else:
        f = sys.stdout
    
    df = pd.read_csv(meta_file, 
                     sep = '\t', 
                     dtype={'location': str})
       
    for rec in SeqIO.parse(msa_file, 'fasta'):
        if str(rec.seq).find(ns) != -1:
            print("Too many N's:", rec.description, file = sys.stderr)
            continue
            
        items1 = rec.description.split('|')
        items2 = items1[0].split('/')
        new_id = '/'.join(items2[1:]).replace(' ', '')
        new_id = new_id.replace("'", '-')
        new_id = new_id.replace(',','')
        
        df2 = df[df['strain'] == new_id]
        
        if df2.size == 0:
            print('Missing ID in metafile:', rec.description, 
                  file = sys.stderr)
            continue
                
        new_desc = '|'.join([rec.description,
                             str(df2['pango_lineage'].values[0]),
                             str(df2['country_exposure'].values[0]),
                             str(df2['gisaid_epi_isl'].values[0])])
        
        if not replace_u:
            seq = rec.seq
        else:
            seq = Seq(str(rec.seq).replace('U', 'T'))

        new_rec = SeqRecord(seq = seq,
                            id = new_id,
                            description = new_desc)
        SeqIO.write(new_rec, f, 'fasta')
        
    if out_file is not None:            
        f.close()

The Julia code is so similar that you could easily mistake it for Python.

function main()
    args = GetArgs()
    in_file = args["input_file"]
    meta_file = args["meta_file"]
    out_file = args["output_file"]
    replace_u = args["replace"]
    min_Ns = args["n"]

    ns = repeat("N", min_Ns)

    if ! isnothing(out_file)
        f = open(FASTA.Writer, out_file)
    else
        f = stdout
    end

    df = DataFrame(CSV.File(meta_file, delim = '\t'))

    reader = open(FASTA.Reader, in_file)
    for rec in reader
        seq = FASTA.sequence(rec)
        if ! isnothing(findfirst(ns, convert(String, seq)))
            println(stderr, "Too many N's: ", FASTA.identifier(rec))
            continue
        end

        # BioJulia handes description differntly from BioPython
        ident = FASTA.identifier(rec)
        desc = FASTA.description(rec)
        if ! isnothing(desc)
            ident = ident * " " * desc
        end

        items1 = split(ident, "|")
        items2 = split(items1[1], "/")
        new_id = join(items2[2:end], "/")
        new_id = replace(new_id, " " => "")
        new_id = replace(new_id, "'" => "-")
        new_id = replace(new_id, "," => "")

        df2 = df[df.strain .== new_id, :]
        if size(df2)[1] == 0
            println(stderr, "Missing ID in metafile: ", FASTA.identifier(rec))
            continue
        end

        new_desc = join([FASTA.identifier(rec)
                         df2[:, "pango_lineage"]
                         df2[:, "country_exposure"]
                         df2[:, "gisaid_epi_isl"]], "|")

        if ! replace_u
            new_seq = convert(String, seq)
        else
            new_seq = replace(convert(String, seq), "U" => "T")
        end

        new_rec = FASTA.Record(new_id, new_desc, new_seq)
        write(f, new_rec)
    end
    close(f)

Who Wins the Race?

Julia has a reputation for being fast, but this code doesn't really play to Julia's strengths. The code is really I/O-bound and not compute intensive, so I didn't expect a huge difference in performance. One thing that hampered Julia in the past was that it compiled libraries the first time they were loaded into the program. This added quite a bit of overhead, especially for short programs. With version 1.6, you can pre-compile the libraries into a project and activate the project at the start of the program. This reduces start time considerably. Load time is on a par with Python.

I ran the two version on the same data.

$ time python gisaid_reformat_fasta_headers.py -i gisaid_hcov_19_sequences.fasta -m metadata.tsv -o gisaid_hcov_19_reformat.fasta -r -n 10 2> missing_ids.txt

real    236m21.529s
user    197m10.257s
sys     0m34.297s

$ time julia gisaid_reformat_fasta_headers.jl --i gisaid_hcov_19_sequences.fasta --m metadata.tsv --o gisaid_hcov_19_reformat.fasta --n 10 --r  2> missing.txt

real    66m40.699s
user    50m20.572s
sys     0m22.900s

The code produces the same output, but Julia is a factor of ~3.5 times faster. Further investigation showed that pandas loads the metafile more quickly (4.14 seconds) than DataFrames (32.69 seconds), but those times are inconsequential in the overall timing.

Code, but not data, is available on GitHub. Data can be downloaded from GISAID.

Code was formatted using http://hilite.me/.

The Analytic Garden

BioJulia and BioPython

Let's Get Some Arguments

Reformatting

Who Wins the Race?

No comments:

Post a Comment

Labels

Contributors

wfmu