Speed Demons

This post describes a little experiment in loading a large file with Julia, Python, and R. On September 13, 2022, I downloaded a metadata file from GISAID. The file has 13,061,086 rows and 22 columns. For a PC, that's a big file. My usual approach to analyzing GISAID data is a combination of R for counting and plotting variables such as lineages over time, and Python, particularly BioPython, for manipulating sequence data.

When using R, I tend to use the tidyverse tools for manipulating tabular data. Transforming dataframes by piping them through functions seems like a natural approach. Julia, on the other hand, is often fast enough that writing simple loops to manipulate data is feasible and can lead to simpler, more readable code.

Loading a file with more than 13 million rows is slow in R. I wondered if Julia or Python/pandas could do better. What follows is an unscientific exercise in reading a large tab-delimited file. All the tests were run on a PC with an Intel i9-9900 CPU running at 3.10 GHz and 32 GB of memory. The OS was 64-bit Ubuntu 22.04 on WSL 2 under Windows 10 Pro. I ran each test three times.

Other people have done this sort of thing before and more thoroughly; for example, see The Great CSV Showdown: Julia vs Python vs R.

R


There are several methods for reading a tab-delimited data table into R. These tests were run with R version 4.2.1 in RStudio 2022.07.1.


Base R


Base R provides read.table to read delimited files. There are a number of wrappers for read.table, including read.csv. I can never remember the others, so I just use read.csv and set the delimiter to '\t' to read tab-delimited files.
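
For the record, read.delim is the wrapper that already has the tab delimiter built in, so a roughly equivalent call would look like the sketch below. I haven't timed this form separately, and it shares the quote = '' workaround discussed after the timings.

# Untimed sketch: read.delim defaults to sep = '\t'; quote = '' disables
# quote handling, which this file needs because of an embedded quote.
df <- read.delim('metadata.tsv', quote = '')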


> system.time(df <- read.csv('metadata.tsv', sep = '\t', quote = ''))
   user  system elapsed 
334.606  16.574 603.654 
> rm(df)
> system.time(df <- read.csv('metadata.tsv', sep = '\t', quote = ''))
   user  system elapsed 
258.097  13.918 465.070 
> rm(df)
> system.time(df <- read.csv('metadata.tsv', sep = '\t', quote = ''))
   user  system elapsed 
237.552  13.297 444.275 
>

quote = '' is necessary because R complains about an embedded quote in the data. After each run, I remove the data frame so that R can't be clever and decide that it already has the data, unmodified, and skip reading the file. Removing the data frame also forces garbage collection on a subsequent call to read.csv.
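
To be explicit about reclaiming memory between runs, rather than relying on the next allocation to trigger collection, the cleanup step can be written with an explicit gc() call. This is just a sketch of the pattern, not something I timed.

rm(df)   # drop the multi-GB data frame
gc()     # explicitly run the garbage collector and report memory freed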

Notice that the elapsed time decreases with each invocation of read.csv. I suspect, though I haven't investigated, that this is due to the operating system caching the file so it doesn't have to go to the disk as often.

Still, an elapsed time of 603 seconds is a long time to wait for the file to be available. Fortunately, R maintains the program environment between sessions, so it's not necessary to read the file every time. However, reloading an environment that contains a multi-GB variable is also slow.
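
One alternative I haven't timed here is to serialize the parsed data frame once with saveRDS and reload just that object with readRDS in later sessions, rather than re-parsing the TSV or reloading the whole environment; the .rds file name below is just illustrative.

# One-time cost after the initial parse:
saveRDS(df, 'metadata.rds')
# In later sessions, reload only this object; typically quicker than
# re-parsing the TSV, though I haven't benchmarked it for this file.
df <- readRDS('metadata.rds')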

Tidyverse


The tidyverse provides functions for reading delimited tables in the readr package. read_tsv is a wrapper for read_delim with the delimiter set to '\t'. These functions read the file into a tibble.


> system.time(df <- read_tsv('metadata.tsv'))
   user  system elapsed 
214.657  59.590 208.164 
> rm(df)

> system.time(df <- read_tsv('metadata.tsv'))
   user  system elapsed 
109.046  60.588 116.243
> rm(df)

> system.time(df <- read_tsv('metadata.tsv'))
   user  system elapsed 
139.013  58.560 143.086
> rm(df)

This is certainly an improvement over Base R's read.csv. Again, subsequent reads are considerably faster.

data.table


data.table uses multiple threads to provide high-performance operations on tables. It provides the fread function for reading delimited tables.
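
As an aside, the number of threads data.table uses can be checked and changed with getDTthreads and setDTthreads; the value of 8 below is only an illustration, and I left the default alone for the timings that follow.

library(data.table)
getDTthreads()   # threads data.table will use by default
setDTthreads(8)  # illustrative only; not what was used for the timings below
df <- fread('metadata.tsv')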


> system.time(df <- fread('metadata.tsv'))
   user  system elapsed 
144.269  47.483 150.760 
> rm(df)

> system.time(df <- fread('metadata.tsv'))
   user  system elapsed 
120.651  42.848 136.581 
> rm(df)

> system.time(df <- fread('metadata.tsv'))
   user  system elapsed 
122.877  43.066 139.185 

fread shows an improvement over read_tsv on the initial read, but similar performance on subsequent calls.

Python


The workhorse for data tables in Python is pandas. Reading the table and timing it is easy in Python with pandas' read_csv. I tested it on Python 3.10.2 with pandas version 1.3.5.


import pandas as pd
import time

def read_datafile():
    time1 = time.time()
    df = pd.read_csv('metadata.tsv', dtype = {'Is reference?': 'str'}, sep = '\t')
    print(time.time() - time1)
    return df

In [4]: df = read_datafile()
127.54160499572754
del(df)

In [6]: df = read_datafile()
109.62158036231995
del(df)

In [8]: df = read_datafile()
108.43680119514465

The pandas version appears to be a bit faster than the R versions. The dtype argument is required because read_csv complained about mixed types in column 16: some of the fields in that column were empty and some contained strings. Notice again how the runtime decreases with subsequent calls.

Julia


Julia has a reputation for being fast. The fly in the ointment is the compile time the first time a function is run. There are tricks to pre-compile functions to avoid that penalty; see the comments at https://www.reddit.com/r/Julia/comments/mjuhck/csvfile_read_extremely_slow/

These tests were run with Julia 1.8.1.

using CSV
using DataFrames

@time df = CSV.File("metadata.tsv", delim = "\t") |> DataFrame
454.727217 seconds (81.28 M allocations: 18.037 GiB, 0.55% gc time, 1.42% compilation time)
df = nothing

@time df = CSV.File("metadata.tsv", delim = "\t") |> DataFrame
 85.075249 seconds (77.53 M allocations: 17.844 GiB, 6.03% gc time, 0.06% compilation time)
df = nothing

@time df = CSV.File("metadata.tsv", delim = "\t") |> DataFrame
159.083665 seconds (77.51 M allocations: 17.843 GiB, 6.03% gc time)
df = nothing

Julia shows considerable improvement on later calls. Notice how the compilation time drops on the second and third runs. Even with considerable time spent compiling, Julia beats Base R's read.csv. However, the more advanced R functions and pandas show better initial timings.

Julia's programming model seems Lisp-like in that you are expected to work in the REPL: load your file once, operate on it, and debug, all in one session. Still, excessive load times can decrease productivity.

Conclusion


In the end, what comes next, i.e. what you do with the data, is what matters. I still tend to use R to handle tabular data, although I may switch to fread rather than read_csv.
