XBB.1.5 and More Chunking

This post is an addendum to a previous post. On January 9 2023, I downloaded a SAR-CoV-2 metadata file from GISAID. The file contained 14,499,984 observations of 24 variables. I was interested in the growth of the XBB.1.5 variant. XBB.1.5, also know as the Kraken subvariant, has been in the news lately because it appears to have greater transmissibility than the other Omicron variants.

Here's the plot of the top 5 emerging variants from GISAID: BQ.1.1.22, CH.1.1, XBB.1.5, BQ.1.1, BQ.1.23.


XBB.1.5 isn't the major variant yet, but it is growing in the US.

The plot above was created with the program get_selected_variants.R from GitHub. The code was described in the previous post.

When I ran get_selected_variants.R, on Linux R version 4.1.2 it produced a warning.

> system.time(df_variants_usa <- get_selected_variants('data/metadata.tsv', selected_variants = selected_variants, title = 'Selected Variants USA 2023-01-09', start_date = as.Date('2022-09-01')))
   user  system elapsed 
142.105   1.954 170.132 
Warning message:
In FUN(X[[i]], ...) :
  Unsupported type 'logical'; using default type 'string'
> 

I tracked the warning to the read_table_chunkwise function call. It appears to be caused by one of the columns such as Is reference? being assumed to contain logical values which the function doesn't handle. The warning doesn't affect the results.

The GISAID metadata files are getting so large that loading them is very slow and consumes a large amount of memory, hence the use of chunking. get_selected_variants.R uses the chunked library. Another approach is to use read_tsv_chunked from the readr library.

read_tsv_chunked uses a callback to process the chunks. The code is very similar to the previous version. The difference is in the handling of the chunked data.

#' get_selected_variants_ch
#'    Create dataframe and plot of selected SARS-CoV-2 variants
#'
#' @param metafile - a meta data file from GISAID
#' @param selected_variants - a vector of Pango lineage strings.
#' @param host - character, String indicating host species.
#' @param country - character, country of interest. If NULL, all countries are included.
#' @param title - character, optional plot title
#'
#' @return - a dataframe
#'    containing the columns
#'    `Collection date`, `Pango lineage`, Count
#' 
get_selected_variants_ch <- function(metafile,
                                     selected_variants,
                                     host = 'Human', 
                                     country = 'USA',
                                     title = NULL,
                                     start_date = as.Date("2020-01-01")) {
  require(tidyverse)

  # callback function for read_tsv_chunked
  # filter chunk by Collection.date Pango lineage, and variants
  filter_chunk <- function(df, pos) {
    df %>% filter(Host == {{ host }}) %>%
      select(`Collection date`, Location, `Pango lineage`) %>%
      mutate(`Collection date` = as.Date(`Collection date`, format = '%Y-%m-%d')) %>%
      filter(`Collection date` >= {{ start_date }}) %>%
      filter(`Pango lineage` %in% {{ selected_variants }})
  }
  
  df_BA <- read_tsv_chunked('data/metadata.tsv', DataFrameCallback$new(filter_chunk), show_col_types = FALSE)

  # extract country from location
  df_BA <- df_BA %>%
    separate(Location, c("Region", "Country", "State"), sep = "\\s*/\\s*", extra = "drop", fill = "right")

  # filter by country
  if(! is.null(country)) {
    df_BA <- df_BA %>%
      filter(Country == {{ country }})
  }

  # accumulate counts
  df_BA <- df_BA %>%
    select(`Collection date`, `Pango lineage`) %>%
    group_by(`Collection date`, `Pango lineage`) %>%
    count() %>%
    rename(Count = n) %>%
    drop_na() %>%
    arrange(`Collection date`)
  
  p <- ggplot(df_BA) +
    geom_bar(aes(x = `Collection date`, y = Count, color = `Pango lineage`, fill = `Pango lineage`), stat = 'identity')
  if(! is.null(title))
    p <- p + ggtitle(title)
  print(p)
  
  return(df_BA)
}


DataFrameCallback is a callback class. Documentation on R's callback classes is sparse. You can learn a bit by looking at the source. The callback classes are R6 classes. R6 is more like object oriented  programming in Python or Java. In order to create an instance, the new method is used. new is passed the callback function. When a chunk of data is read, the callback function receives a dataframe and an integer representing the current position in the file. In the case of DataFrameCallback, a dataframe containing a concatenation of  the dataframes is constructed by the function. Other callback functions can perform different operations on the data.

No warnings, but it's considerably slower than the previous version.

> system.time(df_variants_usa <- get_selected_variants_ch('data/metadata.tsv', selected_variants = selected_variants, title = 'Selected Variants USA 2023-01-09', start_date = as.Date('2022-09-01')))
|=================================================================================================================================| 100% 10577 MB
   user  system elapsed 
168.297  28.854 562.734 
> 

The code is available on GitHub.

No comments:

Post a Comment