The Analytic Garden: A Tale of Three Variants

Be fast, have no regrets... If you need to be right before you move, you will never win,
~ Mike Ryan, epidemiologist at WHO, in March

This is evolving science. You are seeing sausages being made — in front of the world’s eyes,

~ Yale vaccine researcher Saad Omer

The SARS-CoV-2 virus keeps evolving and the number of genome sequences deposited in various repositories keeps growing. The NCBI Virus database currently has 193,819 sequences. On May 5, 2021 I downloaded the GISAID metafile. It has 1,355,996 records. The plot below shows the counts of sequence records at GISAID for each day vs. collection date.

You can see the growth in sequence data in 2021.

I wanted to look at the growth in the USA of three variants of interest: B.1.1.7, called the United Kingdom variant; B.1.352, the South African variant; and P.1 the Brazil variant. Giving these variants country names doesn't necessarily mean that they developed in those countries. Rather, they have been extensively reported in those locations. For example, B.1.1.7 was first reported in the UK, but has spread worldwide.

These variants are of particular interest because there is some evidence that they are more transmissible and possibly more virulent. They also may be implicated in immune escape.

Variants in the USA

The plot below shows the growth of the three variants of interest along with all other variants in the GISAID database.

The B.1.1.7 variant started late in 2020 in the USA and proceed to grow exponentially. The exponential growth can be seen better by plotting the log of the counts.

The straight line increases in B.1.1.7, P.1 counts, and to a lesser extent B.1.351 and P.1 in the log plots is indicative an exponential growth rate up to April 1, 2021.

Plotting the percentage of the daily total count, shows how the variants, especially B.1.1.7, have become a larger portion of the sequence database up until April 1.

A couple of things to keep in mind: it's unclear whether the GISAID sequences represent a random sample of the virus population in the USA. Researchers may be more likely to sequence samples from patients that are suspected to be carrying specific variants., for example if other patients in the area have been seen to carry the variant. Starting at April 1, 2021, the total number of sequences for a particular collection day begins to drop rapidly. This is probably an artifact caused by the delay in depositing the sequences after collection. My guess is that once more data is deposited, the exponential growth of the variants will be confirmed past April 1.

Data Wrangling

Processing the metafile takes several steps similar to those described here. The first step is to filter the data by host and remove invalid dates and lineages.

  df <- meta_table %>% 
    filter(Host == {{ host }}) %>%
    filter(Pango.lineage != "") %>%
    filter(str_detect(Collection.date, "\\d\\d\\d\\d-\\d\\d-\\d\\d")) %>%
    select(Virus.name, Collection.date, Location, Pango.lineage) %>% 
    separate(Location, c("Region", "Country", "State"), sep = "\\s*/\\s*", extra = "drop", fill = "right") %>%
    select(-c("Region", "State")) %>%
    mutate(Collection.date = as.Date(Collection.date, format = "%Y-%m-%d"))

I use tidyverse and R frequently, but I'm not expert at tidyverse functions. I often pend some time searching the web for the correct function. I also break my code into small pieces with intermediate dataframes. This is somewhat inefficient and would probably annoy Hadley Wickham but it makes debugging easier. Debugging tidyverse code can be difficult because it's hard to view intermediate steps. The approach that I take is to start with the full data set and reduce and transform the data in a series of steps.

Next, the data is filtered by country and lineage. The data is counted by date and lineage. host and country are arguments passed to the function. variants is a vector of variants to be plotted.

  df2 <- 
    df %>%
    filter(Country == {{ country }}) %>%
    filter(Pango.lineage %in% {{ variants }}) %>%
    group_by(Collection.date, Pango.lineage) %>%
    count() %>%
    rename(Count = n)

Lineages not included in the vector of variants of interest are counted in the same way and labeled as Other. The dataframes containing the variants and Other are joined into a single dataframe. df4 is used to plot figures 2 and 3 above.

  df3 <- 
    df %>%
    filter(Country == {{ country }}) %>%
    filter(! (Pango.lineage %in% {{ variants }})) 
  df3$Pango.lineage <- "Other"
  df3 <- 
    df3 %>%
    group_by(Collection.date, Pango.lineage) %>%
    count()  %>%
    rename(Count = n)
  
  # Join the dataframes
  df4 <- full_join(df2, df3)

To produce the percent plot, it is necessary to calculate the total counts per day.

 # get totals by date and lineage
  dfp2 <- 
    df4 %>% 
    group_by(Collection.date) %>% 
    summarise(Total = sum(Count)) 
  
  dfp3 <- data.frame(Collection.date = as.Date(integer()),
                     Pango.lineag = character(),
                     Count = integer(),
                     Total = integer(),
                     Pct = double())
  
  # construct a data frame of the counts of lineages and their percents
  for(var in unique(df4$Pango.lineage)) {
    dfp4 <- 
      df4 %>% 
      filter(Pango.lineage == {{ var }}) %>% 
      full_join(dfp2) %>% 
      mutate(Pct = Count / Total) %>% 
      drop_na()
    
    dfp3 <- rbind(dfp3, dfp4)
  }

dfp3 is used to make the percent plot.

Here's the whole thing in one function.

#' plot_gisaid_variants
#' Draw plots of daily counts and percentage of counts for variants of interest.
#'
#' @param meta_table   A GISAID metatable read by read.csv.
#' @param variants     A vector of strings, the Pango lineages of interest.
#' @param host         A string indicating host species.
#' @param country      Country of interest.
#' @param plot_counts  A logical. Should daily counts be plotted?
#' @param plot_percent A logical. Should daily percent counts be plotted?
#' @param log          A logical. Should counts be plotted on log scale? Ignored if plot_counts = FALSE.
#' @param title        An optional title string for plots.
#'
#' @return A date frame containing columns: Collection.date, Pango.lineage, Count, Total, and Pct
#' 
plot_gisaid_variants <- function(meta_table, 
                                 variants, 
                                 host = "Human", 
                                 country = "USA", 
                                 plot_counts = TRUE,
                                 plot_percent = TRUE,
                                 log = FALSE, 
                                 title = NULL ) {
  library(tidyverse)
  
#' plot_pct
#' Plot daily percent of each variant vs date.
#' 
#' @param dfp A dataframe, see below.
#'
#' @return NULL
  plot_pct <- function(dfp) {
    p <- ggplot(dfp) + 
      geom_point(aes(x = Collection.date, y = Pct, color = Pango.lineage))

    if(! is.null(title)) {
      p <- p + ggtitle(paste(title, "Percent of Daily Counts", sep = " "))
    }
    
    print(p)
    
    return(NULL)
  }
  
#' plot_cnts
#' Plot daily counts of the variants.
#'
#' @param dfp A data frame, see below
#'
#' @return NULL
  plot_cnts <- function(dfp) {
    p <- ggplot(dfp) 
    if(! log) {
      p <- p +
        geom_point(aes(x = Collection.date, y = Count, color = Pango.lineage))
    } else {
      p <- p + 
        geom_point(aes(x = Collection.date, y = log(Count), color = Pango.lineage)) 
    }
    
    if(! is.null(title)) {
      p <- p + ggtitle(title)
    }
    
    print(p)
    
    return(NULL)
  }
  
  # Filter by host. Remove invalid dates and empty lineages.
  # Reformat Location into Country
  df <- meta_table %>% 
    filter(Host == {{ host }}) %>%
    filter(Pango.lineage != "") %>%
    filter(str_detect(Collection.date, "\\d\\d\\d\\d-\\d\\d-\\d\\d")) %>%
    select(Virus.name, Collection.date, Location, Pango.lineage) %>% 
    separate(Location, c("Region", "Country", "State"), sep = "\\s*/\\s*", extra = "drop", fill = "right") %>%
    select(-c("Region", "State")) %>%
    mutate(Collection.date = as.Date(Collection.date, format = "%Y-%m-%d"))
  
  # Filter by country and lineages
  # Count by date and lineage
  df2 <- 
    df %>%
    filter(Country == {{ country }}) %>%
    filter(Pango.lineage %in% {{ variants }}) %>%
    group_by(Collection.date, Pango.lineage) %>%
    count() %>%
    rename(Count = n)
  
  # Filter by country and lineages not in variant list.
  # Count by date label as Other.
  df3 <- 
    df %>%
    filter(Country == {{ country }}) %>%
    filter(! (Pango.lineage %in% {{ variants }})) 
  df3$Pango.lineage <- "Other"
  df3 <- 
    df3 %>%
    group_by(Collection.date, Pango.lineage) %>%
    count()  %>%
    rename(Count = n)
  
  # Join the dataframes
  df4 <- full_join(df2, df3)
  
  # get totals by date and lineage
  dfp2 <- 
    df4 %>% 
    group_by(Collection.date) %>% 
    summarise(Total = sum(Count)) 
  
  dfp3 <- data.frame(Collection.date = as.Date(integer()),
                     Pango.lineag = character(),
                     Count = integer(),
                     Total = integer(),
                     Pct = double())
  
  # construct a data frame of the counts of lineages and their percents
  for(var in unique(df4$Pango.lineage)) {
    dfp4 <- 
      df4 %>% 
      filter(Pango.lineage == {{ var }}) %>% 
      full_join(dfp2) %>% 
      mutate(Pct = Count / Total) %>% 
      drop_na()
    
    dfp3 <- rbind(dfp3, dfp4)
  }
  
  dfp3 <- ungroup(dfp3)
  
  if(plot_percent) {
    plot_pct(dfp3)
  }
  
  if(plot_counts) {
    plot_cnts(df4)  
  }
  
  return(dfp3)
}

The Analytic Garden

A Tale of Three Variants

Variants in the USA

Data Wrangling

No comments:

Post a Comment

Labels

Contributors

wfmu