Data Wrangling with Julia, R, and Python

What we have is a data glut. 
~ Vernor Vinge

It is a capital mistake to theorize before one has data.
~ Sherlock Holmes

Image: ChrisJen517, own work, CC BY-SA 4.0 (https://creativecommons.org/licenses/by-sa/4.0), via Wikimedia Commons (https://commons.wikimedia.org/w/index.php?curid=97533273).

GISAID provides a metafile describing the contents of its genome sequence files. The metafile is a tab-delimited table. The version I downloaded on April 15, 2021, contains 1,107,140 records, each with 22 columns.

The data columns, as printed by R, are:

 [1] "Virus.name"                      "Type"                            "Accession.ID"                   
 [4] "Collection.date"                 "Location"                        "Additional.location.information"
 [7] "Sequence.length"                 "Host"                            "Patient.age"                    
[10] "Gender"                          "Clade"                           "Pango.lineage"                  
[13] "Pangolin.version"                "Variant"                         "AA.Substitutions"               
[16] "Submission.date"                 "Is.reference."                   "Is.complete."                   
[19] "Is.high.coverage."               "Is.low.coverage."                "N.Content"                      
[22] "GC.Content"                     
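
As a quick sanity check, the shape and column names can be inspected in Julia with something like this sketch (the file name here is illustrative):

using CSV, DataFrames

df = DataFrame(CSV.File("metadata.tsv", delim = '\t'))   # illustrative file name
size(df)    # (1107140, 22) for the April 15, 2021 download
names(df)   # CSV.jl keeps the raw header names; R's read.csv prints spaces and punctuation as dots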

One column in particular is of interest to me. AA.Substitutions lists the amino acid substitutions found in the genome sequence. I want to be able to plot the histories of substitutions of interest. To do this, I need to pull out a subset of the columns, reformat several of them, and rearrange the resulting table so that each amino acid substitution is in its own row along with the relevant collection date, country of origin, virus name, accession ID, and Pango lineage. In addition, I want to throw out some of the data, such as records with mis-formatted dates.

This is typical data wrangling. Surveys say that data scientists spend about 45% of their time on tasks like this, so anything that makes this sort of work easier is worthwhile.

Three common languages for data wrangling are Python, R, and now Julia. At least, those are the three I turn to for this kind of task. All three provide libraries for manipulating data frames, a data frame being a two-dimensional table. Ideally we want the table in a tidy format. Tidy data has three important properties: each variable forms a column, each observation forms a row, and each type of observational unit forms a table.
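
As a toy illustration of tidying in this context, the sketch below uses Julia's DataFrames.jl (introduced later in this post) to explode a column holding several substitutions per record so that each substitution forms its own row. The accession IDs and substitutions are made up:

using DataFrames

# Made-up records, for illustration only.
df = DataFrame(Accession_ID = ["EPI_A", "EPI_B"],
               AA_subs = [["Spike_D614G", "NSP12_P323L"], ["Spike_D614G"]])

# flatten repeats the other columns so each substitution gets its own row.
tidy = flatten(df, :AA_subs)    # 3 rows: two for EPI_A, one for EPI_B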

Julia

My first attempt to reformat the metafile was a naïve, brute-force approach in Julia. An empty data frame is created. Next, the data file is read, and each row is examined. The record is checked for the correct host species. The Pango lineage is examined to see whether a lineage has been assigned. Dates are rejected if they don't have the yyyy-mm-dd format. The Location column contains items in the format "region / country / province", and the country is extracted from it. The AA Substitutions column is formatted like "(NSP15_A283V,NSP12_P323L,Spike_D614G)", with the number of substitutions varying from record to record. The substitutions are separated, and for each one a new row containing the substitution and the other data is appended to the new data frame.

Here's the relevant portion of the code.

using CSV, DataFrames, Dates

function check_date(date_str)
    #=
    Check that the date is formatted as yyyy-mm-dd.

    @arguments
    date_str - String
        A string representing a date.

    @returns
    A Date in yyyy-mm-dd format, or nothing if the string is not properly formatted.
    =#
    r = r"^\d\d\d\d\-\d\d\-\d\d$"
    if isnothing(match(r, date_str))
        return nothing
    end

    return Date(date_str, "yyyy-mm-dd")
end

function main()
    #=
    Check sequences for valid date and host. 
    Write a CSV file with a row for each AA substitution.
    =#
    args = GetArgs()
    meta_table = args["input_file"]
    aa_subs_table = args["output_file"]
    host = args["host"]
   
    # Read the metafile and then make a new empty data frame.
    df = DataFrame(CSV.File(meta_table, delim = '\t'))
    df2 = DataFrame(Date = Date[], 
                    Country = String[],
                    Virus_name = String[], 
                    Accession_ID = String[], 
                    Pango_lineage = String[],
                    AA_subs = String[])

    for i in 1:size(df)[1]
        if i % 10000 == 0
            println(i)
        end
        
        if df[i, "Host"] != host
            continue
        end

        if ismissing(df[i, "Pango lineage"])
            continue
        end

        date = check_date(df[i, "Collection date"])
        if isnothing(date)
            continue
        end
        
        # keepempty = false so that "()" yields an empty list and the record is skipped
        aas = split(df[i, "AA Substitutions"][2:end-1], ",", keepempty = false)
        if isempty(aas)
            continue
        end

        country = strip(split(df[i, "Location"], "/")[2], ' ')

        for aa in aas
            push!(df2, [date,
                        country, 
                        df[i, "Virus name"],
                        df[i, "Accession ID"],
                        df[i, "Pango lineage"],
                        aa])
        end
    end

    CSV.write(aa_subs_table, df2)
end

This code processes the million-plus records in 3 minutes and 41 seconds. That's not super speedy, but since it only has to be run once per dataset, a little less than four minutes isn't a big deal. The advantage is that the code is dead simple: any beginning programmer can see what it's doing. Julia's performance is more impressive when you realize how much memory management is involved in appending a row to the data frame over a million times.

Julia Attempt Two

Julia's data frame library, DataFrames.jl, provides a number of routines for manipulating data frames, and many of the Julia base functions also operate on them. The following code performs the same process as the code above, but rather than processing the metafile a line at a time, it performs a series of transformations on a data frame. Note that in Julia, a '.' attached to a function name broadcasts the function across the data.
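
For example, a couple of toy broadcasts (not from the script):

uppercase.(["alpha", "delta"])                  # ["ALPHA", "DELTA"]
chop.(["(A,B)", "(C)"], head = 1, tail = 1)     # ["A,B", "C"]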

The second attempt:

using CSV, DataFrames, Dates

# check_date is the same helper defined in the first version.

function country_from_location(location_str)
    #=
    Split and trim country from a GISAID Location. 
    Country is the second term in Location.

    @arguments
    location_str - String   
        A Location value from GISAID.
        
    @returns
    A String containing the country name.
    =#
    return strip(split(location_str, "/")[2], ' ')
end

function main()
    #=
    Check sequences for valid date and host. 
    Write a CSV file with a row for each AA substitution.
    =#
    args = GetArgs()
    meta_table = args["input_file"]
    aa_subs_table = args["output_file"]
    host = args["host"]
   
    # Read the metafile.
    df = DataFrame(CSV.File(meta_table, delim = '\t'))
    
    # select columns and filter the data
    df2 = df[!, ["Virus name", "Accession ID",
                 "Collection date", "Location",
                 "Pango lineage", "AA Substitutions",
                 "Host"]]

    
    # Filter host and drop rows with a missing Pango lineage.
    df2 = df2[df2[!, :Host] .== host, :]
    df2 = dropmissing(df2, "Pango lineage")

    # Throw out invalid dates
    df2[!, "Date"] = check_date.(df2[!, "Collection date"])
    df2 = df2[df2[!, "Date"] .!== nothing, :]

    df2[!, "Country"] = country_from_location.(df2[!, "Location"])

    # Clean up AA substitutions and explode AA into rows
    df2[!, "AA Substitutions"] = chop.(df2[!, "AA Substitutions"], head = 1, tail = 1)
    df2[!, "AA Substitutions"] = split.(df2[!, "AA Substitutions"], ",")
    df2 = flatten(df2, ["AA Substitutions"])

    select!(df2, Not(["Location", "Host", "Collection date"])) # We don't need these

    CSV.write(aa_subs_table, df2) 
end

This version is slightly more concise and noticeably faster, clocking in at 1 minute and 42 seconds, less than half the time of the first attempt.

Python

I attempted to reproduce my naïve Julia approach in Python. Translating the Julia code to Python was simple; the performance, however, wasn't comparable. The Python version took 102 minutes to append slightly more than 60,000 rows to the empty data frame. Appending a row to a pandas DataFrame effectively builds a new copy of the frame each time, so repeated appends get slower as the table grows. That's much too slow to be practical.

Python's data frame library, pandas, is a collection of high-performance routines for data table manipulation. It's one of the things that makes Python such a strong tool for data-intensive areas like machine learning. Using pandas, I was quickly able to write code to rearrange the metafile.

The code is somewhat similar to the second Julia version: it performs a series of manipulations on a data frame. It has a couple of tricky bits. I had to suppress warnings for two operations, and the final explode step required some searching on Stack Overflow.

import re
import warnings

import pandas as pd

def main():
    """Filter GISAID metadata and write a CSV with one row per AA substitution."""
    args = GetArgs()
    meta_file = args.input_file
    out_file = args.output_file
    host = args.host
    
    df = pd.read_csv(meta_file,
                     sep = '\t', 
                     dtype={'Location': str, "Variant": str, "Is reference?": str})
    
    # select columns and filter the data
    df2 = df.filter(["Virus name", "Accession ID", 
                     "Collection date", "Location",
                     "Pango lineage", "AA Substitutions",
                     "Host"])

    # country is the second part of location                      
    df2["Country"] = df2.Location.str.split("/", expand = True)[1] 

    df2 = df2[df2["Host"] == host]   # filter host and remove NAs from lineage
    df2 = df2.dropna(subset = ["Pango lineage"])

    df2 = df2.drop(["Location", "Host"], axis = 1) # We don't need these anymore

    # throw out malformed dates
    # regex for dates
    p = re.compile(r"^\d\d\d\d\-\d\d\-\d\d$")
    df2 = df2[df2["Collection date"].apply(lambda x: True if p.match(x) else False)]

    # clean up Country and AAs
    # these throw warnings, but they are actually OK.
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        df2["Country"] = df2["Country"].str.strip()
        df2["AA Substitutions"] = df2["AA Substitutions"].str.strip("()")
    
    # Explode AA substitutions so that each AA gets its own row.
    # from https://stackoverflow.com/questions/50731229/split-cell-into-multiple-rows-in-pandas-dataframe
    df2 = df2.set_index(["Virus name", "Accession ID", 
                     "Collection date", "Country",
                     "Pango lineage"]).apply(lambda x: x.str.split(',').explode()).reset_index()
    
    df2.to_csv(out_file, index = False)

This Python version runs through the metafile in 1 minute and 23 seconds. 

R

I tried to reproduce the naïve Julia version in R, but its performance was as dismal as the initial Python attempt. Instead, I turned to the tidyverse, dplyr in particular. The tidyverse approach favors a series of transformations on a data table, similar to those shown above, but in a much more concise form. %>% is a pipe operator, passing the data frame from one function to the next.


#' aa_substitutions
#' Create a CSV file of AA substitutions in human sequences by date and country from a
#' GISAID metafile.
#'
#' @param meta_file - path name to GISAID metafile 
#' @param outfile - path name of output file
#' @param host - host species. Default is human
#'
#' @return NULL
#'
aa_substitutions <- function(meta_file, outfile, host = "Human") {
  require(tidyverse)
  
  meta_table <- read.csv(meta_file, sep = "\t", header = TRUE)
  
  df <- meta_table %>% 
    filter(Host == {{ host }}) %>%
    filter(Pango.lineage != "") %>%
    filter(str_detect(Collection.date, "^\\d\\d\\d\\d-\\d\\d-\\d\\d$")) %>%
    select(Virus.name, Collection.date, Accession.ID, Location, Pango.lineage, AA.Substitutions) %>% 
    mutate(AA.Substitutions = str_remove_all(AA.Substitutions, pattern = "[()]")) %>%
    separate_rows(AA.Substitutions, sep = ",") %>% 
    separate(Location, c("Region", "Country", "State"), sep = "\\s*/\\s*", extra = "drop", fill = "right") %>%
    select(-c("Region", "State"))
  
  write.csv(df, outfile, row.names = FALSE)
}

This version is a bit slower than the Julia and Python versions, requiring 4 minutes and 36 seconds to process the metafile. Maybe with a bit of work I could have made it faster, but in practical terms, it's fast enough.

And the Winner Is...

They all win, even the R version, which was the slowest. Being able to process a file of over a million records in a few minutes is worthwhile. Julia's ability to process the data with a minimum of programming constructs is impressive. With some effort, I could probably have squeezed better performance out of the code. However, data wrangling is not the sort of task you want to spend a lot of time polishing. The idea is to get it done correctly, in a reasonable time, and get on with whatever you actually need the data for. Being able to write simple code and get reasonable processing times on these necessary tasks is a true bonus.

Code, but not the data, is available on GitHub. The data is available from GISAID.
