The Analytic Garden: Variants of Concern for September

The secret of change is to focus all of your energy, not on fighting the old, but on building the new.
~ Socrates

On 2021-09-24, I downloaded the COVID-19 metadata file from GISIAD. The file contains a whopping 3,795,999 records. The records refer to individual SAR-COV-2 genome sequences uploaded. This is an unprecedented scientific effort and an amazing resource. To get a since of whir can be don with such a large collection of data I suggest checking out this YouTube video by Trevor Bedford from the Fred Hutchinson Cancer Research Center.

In this post, I want to continue exploring variants of concern (VOC) in the US. There are 1,085,764 sequences with Location labeled as "USA".

First, the log of the sequence counts.

We see a drop-off in the counts of the Delta variant starting around the end of August 2021. This is probably because of a lag in the time from collection to upload of the sequences. This is a log plot of sequence counts, so the linear increase in Delta counts indicates exponential growth in the number of sequences.

By plotting the percent of sequences for each variant we see how the delta variant has taken over in the US.

At this point approximately 100% of recorded sequences in the US are from the delta variant.

Here's the R code to make the plots.

#' plot_named_variants
#' Draw plots of daily counts and percentage of counts for variants of interest.
#'
#' @param meta_table   A GISAID variant table read by read.csv.
#' @param variants     A vector of strings, the Pango lineages of interest.
#' @param host         A string indicating host species.
#' @param country      Country of interest.
#' @param plot_counts  A logical. Should daily counts be plotted?
#' @param plot_percent A logical. Should daily percent counts be plotted?
#' @param log          A logical. Should counts be plotted on log scale? Ignored if plot_counts = FALSE.
#' @param title        An optional title string for plots.
#'
#' @return A date frame containing columns: Collection.date, Variant, Count, Total, and Pct
#' 
plot_named_variants <- function(meta_table, 
                                 host = "Human", 
                                 country = "USA", 
                                 plot_counts = TRUE,
                                 plot_percent = TRUE,
                                 log = FALSE, 
                                 title = NULL ) {
  library(tidyverse)

###################################### Helper Functions ####################################  
  
#' plot_pct
#' Plot daily percent of each variant vs date.
#' 
#' @param dfp A dataframe, see below.
#'
#' @return NULL
  plot_pct <- function(dfp) {
    p <- ggplot(dfp) + 
      geom_point(aes(x = Collection.date, y = Pct, color = Variant))

    if(! is.null(title)) {
      p <- p + ggtitle(paste(title, "Percent of Daily Counts", sep = " "))
    }
    
    print(p)
    
    return(NULL)
  }
  
#' plot_cnts
#' Plot daily counts of the variants.
#'
#' @param dfp A data frame, see below
#'
#' @return NULL
  plot_cnts <- function(dfp) {
    p <- ggplot(dfp) 
    if(! log) {
      p <- p +
        geom_point(aes(x = Collection.date, y = Count, color = Variant))
    } else {
      p <- p + 
        geom_point(aes(x = Collection.date, y = log(Count), color = Variant)) 
    }
    
    if(! is.null(title)) {
      p <- p + ggtitle(title)
    }
    
    print(p)
    
    return(NULL)
  }
  
###########################################################################################
  
  # Filter by host. Remove invalid dates and empty lineages.
  # Reformat Location into Country
  df <- meta_table %>% 
    filter(Host == {{ host }}) %>%
    filter(Variant != "") %>%
    # mutate(Variant = ifelse(Variant == "", "Wuhan", Variant)) %>%
    select(Collection.date, Location, Variant) %>% 
    separate(Location, c("Region", "Country", "State"), sep = "\\s*/\\s*", extra = "drop", fill = "right") %>%
    select(-c("Region", "State")) %>%
    mutate(Collection.date = as.Date(Collection.date, format = "%Y-%m-%d")) %>%
    separate(Variant, c("V", "Name"), extra = "drop", fill = "right") %>%
    unite(Variant, V:Name, sep = " ", remove = TRUE) %>%
    mutate(Variant = as.factor(Variant))
  
  # Filter by country and lineages
  # Count by date and lineage
  df2 <- 
    df %>%
    filter(Country == {{ country }}) %>%
    group_by(Collection.date, Variant) %>%
    count() %>%
    rename(Count = n)
  
  # get totals by date and lineage
  dfp2 <- 
    df2 %>% 
    group_by(Collection.date) %>% 
    summarise(Total = sum(Count)) 
  
  dfp3 <- data.frame(Collection.date = as.Date(integer()),
                     Pango.lineag = character(),
                     Count = integer(),
                     Total = integer(),
                     Pct = double())
  
  # construct a data frame of the counts of lineages and their percents
  for(var in unique(df2$Variant)) {
    dfp4 <- 
      df2 %>% 
      filter(Variant == {{ var }}) %>% 
      full_join(dfp2) %>% 
      mutate(Pct = Count / Total) %>% 
      drop_na()
    
    dfp3 <- rbind(dfp3, dfp4)
  }
  
  dfp3 <- ungroup(dfp3)
  
  if(plot_percent) {
    plot_pct(dfp3)
  }
  
  if(plot_counts) {
    plot_cnts(df2)  
  }
  
  return(dfp3)
}

Get vaccinated!

The Analytic Garden

Variants of Concern for September

No comments:

Post a Comment

Labels

Contributors

wfmu