This post describes a little experiment in loading a large file with Julia, Python, and R. On September 13, 2022, I downloaded a metadata file from GISAID. The file has 13,061,086 rows and 22 columns. For a file processed on a PC, that's big. My usual approach to analyzing GISAID data is to use R for counting and plotting variables such as lineages over time, and Python, particularly BioPython, for manipulating sequence data.
When using R, I tend to use the tidyverse tools for manipulating tabular data. Transforming data frames by piping them through functions seems like a natural approach. Julia, on the other hand, is often fast enough that writing simple loops to manipulate data is feasible and can lead to simpler, more readable code.
Loading a file with more than 13 million rows is slow in R. I wondered if Julia or Python/pandas could do better. What follows is an unscientific exercise in reading a large tab-delimited file. All the tests were run on a PC with an Intel i9-9900 CPU running at 3.10 GHz and 32 GB of memory. The OS was 64-bit Ubuntu 22.04 on WSL 2 under Windows 10 Pro. I ran each test three times.
R
There are several methods for reading a tab-delimited data table into R. These tests were run with R version 4.2.1 on RStudio 2022.07.1.
Base R
Base R provides read.table to read delimited files. There are a number of wrappers for read.table, including read.csv. I can never remember the others, so I just use read.csv and set the delimiter to '\t' to read tab-delimited files.
> system.time(df <- read.csv('metadata.tsv', sep = '\t', quote = ''))
user system elapsed
334.606 16.574 603.654
> rm(df)
> system.time(df <- read.csv('metadata.tsv', sep = '\t', quote = ''))
user system elapsed
258.097 13.918 465.070
> rm(df)
> system.time(df <- read.csv('metadata.tsv', sep = '\t', quote = ''))
user system elapsed
237.552 13.297 444.275
quote = '' is necessary because R complains about an embedded quote in the data. After each run, I remove the data frame so that R can't be clever and decide that it already has the data, unmodified, and skip reading the file. Removing the data frame also lets the memory be garbage collected before the subsequent call to read.csv.
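For reference, here is a minimal sketch of the pattern behind each timing run; the helper function is mine, not from the original sessions. It reads the file, prints the timing, then drops the data frame so the next run has to parse the file again.
time_read <- function(read_fun, ...) {
  timing <- system.time(df <- read_fun(...))
  print(timing)   # user/system/elapsed, as in the transcripts in this post
  rm(df)          # drop the result so the next call re-reads the file
  gc()            # give the memory back before the next run
  invisible(timing)
}
# Three Base R runs, matching the transcript above
for (i in 1:3) time_read(read.csv, 'metadata.tsv', sep = '\t', quote = '')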
Notice that the elapsed time decreases with each invocation of read.csv. I suspect, but haven't investigated, that this is due to the operating system caching the file, so later reads don't have to go to the disk as often.
Still, an elapsed time of 603 seconds is a relatively long time to wait for the file to be available. Fortunately, R maintains the program environment between sessions so that it's not necessary to read the file each time. However, reloading the environment that contains a multi-GB variable is slow.
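One alternative, not tested in this post, is to cache the parsed data frame with R's binary serialization (saveRDS/readRDS), which avoids re-parsing the TSV in later sessions. The .rds file name below is just a placeholder.
saveRDS(df, 'metadata.rds')      # one-time cost after the first successful read
df <- readRDS('metadata.rds')    # later sessions reload the parsed object directly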
Tidyverse
The tidyverse provides functions for reading delimited tables in the readr package. read_tsv is a wrapper for read_delim with the delimiter set to '\t'. These functions read the file into a tibble.
> system.time(df <- read_tsv('metadata.tsv'))
user system elapsed
214.657 59.590 208.164
> rm(df)
> system.time(df <- read_tsv('metadata.tsv'))
user system elapsed
109.046 60.588 116.243
> rm(df)
> system.time(df <- read_tsv('metadata.tsv'))
user system elapsed
139.013 58.560 143.086
> rm(df)
This is certainly an improvement over Base R. Again, we see considerable improvement on subsequent reads.
data.table
data.table uses multiple threads to provide high-performance operations on tables. It provides an fread function for reading delimited tables.
> system.time(df <- fread('metadata.tsv'))
user system elapsed
144.269 47.483 150.760
> rm(df)
> system.time(df <- fread('metadata.tsv'))
user system elapsed
120.651 42.848 136.581
> rm(df)
> system.time(df <- fread('metadata.tsv'))
user system elapsed
122.877 43.066 139.185
fread shows an improvement over read_tsv on the initial read, but similar performance on subsequent calls.
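Since fread's speed comes from data.table's multithreading, it can be worth checking how many threads it is allowed to use. A small sketch follows; the thread count of 8 is illustrative, not what the benchmark machine was set to.
library(data.table)
getDTthreads()                   # threads data.table will currently use
setDTthreads(8)                  # e.g. set the count explicitly
df <- fread('metadata.tsv', nThread = getDTthreads())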
Python
The workhorse for data tables in Python is pandas. Reading the table and timing it is easy in Python with read_csv. I tested it on Python 3.10.2 with pandas version 1.3.5.
import pandas as pd
import time

def read_datafile():
    time1 = time.time()
    df = pd.read_csv('metadata.tsv', dtype = {'Is reference?': 'str'}, sep = '\t')
    print(time.time() - time1)
    return df
In [4]: df = read_datafile()
127.54160499572754
del(df)
In [6]: df = read_datafile()
109.62158036231995
del(df)
In [8]: df = read_datafile()
108.43680119514465
The pandas version appears to be a bit faster than the R versions. The dtype argument is required because read_csv complained about mixed types in column 16: some of the fields in that column were empty and some contained strings. Notice again how the runtime decreases with subsequent calls.
Julia
These tests were run with Julia 1.8.1.
using CSV
using DataFrames
@time df = CSV.File("metadata.tsv", delim = "\t") |> DataFrame
454.727217 seconds (81.28 M allocations: 18.037 GiB, 0.55% gc time, 1.42% compilation time)
df = nothing
@time df = CSV.File("metadata.tsv", delim = "\t") |> DataFrame
85.075249 seconds (77.53 M allocations: 17.844 GiB, 6.03% gc time, 0.06% compilation time)
df = nothing
@time df = CSV.File("metadata.tsv", delim = "\t") |> DataFrame
159.083665 seconds (77.51 M allocations: 17.843 GiB, 6.03% gc time)
df = nothing
Julia shows considerable improvement on later calls. Notice how the compilation time drops on the second and third runs. Even with considerable time spent compiling, Julia beats Base R's read.csv. However, the more advanced R functions and pandas show better initial timings.
Julia's programming model seems Lisp-like in that it expects you to work in the REPL: you load your file once, operate on it, and debug, all in one session. Still, excessive load times can decrease productivity.
Conclusion
In the end, what comes next, that is, what you do with the data, is what matters. I still tend to use R to handle tabular data, although I may switch to fread rather than read_tsv.