Tracking the Variants

We have been informed today that, in addition to spreading more quickly, it also now appears that there is some evidence that the new variant - the variant that was first identified in London and the South East - may be associated with a higher degree of mortality.
~ Boris Johnson

It's all going to be about the [COVID-19] variants and the vaccine, and that will determine where we're going to be next year, the year after, and the year after that.
~ CBS, Mar 12, 2021

The Covid-19 variant B.1.1.7, aka the UK variant, has been making news for the last few months. B.1.1.7 is estimated to be 40%–80% more transmissible than wild-type SARS-CoV-2 and was first detected in the UK in November 2020 from a sample taken in September.

I wanted to take a look at its development, so I downloaded the latest data from GISAID on March 18, 2021. The data included a tab-delimited meta file containing 769,387 rows of 28 variables. The first thing I wanted to look at was the time course of the variant in the USA and the UK.
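As a rough illustration of the loading step, here is a minimal sketch of reading a tab-delimited meta file into R. The helper name read_meta and the file name in the usage comment are hypothetical, and the sketch assumes the file has a date column holding YYYY-MM-DD strings. Incomplete dates (for example a bare year and month) are assumed to be acceptable as NA.

```r
library(readr)

# Hypothetical helper: read a tab-delimited meta file with every column as text,
# then parse the date column. Dates that are not full YYYY-MM-DD become NA.
read_meta <- function(path) {
  meta <- read_tsv(path, col_types = cols(.default = col_character()))
  meta$date <- as.Date(meta$date, format = "%Y-%m-%d")
  meta
}

# Usage (file name is an assumption, not the actual download name):
# meta <- read_meta("metadata.tsv")
# dim(meta)
```

Reading everything as character first avoids surprises from type guessing on a large file; only the date column is converted here because it is the one used for the time course.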

B.1.1.7 is defined by a set of spike protein mutations that includes N501Y, and some B.1.1.7 sequences have also acquired the E484K mutation. E484K is concerning because it is postulated to reduce vaccine effectiveness.

I wrote a bit of R code to extract the rows for a specific country of exposure and pangolin lineage, then summarize and count the data. The code could probably be simplified to avoid the intermediate data frames, but they make debugging easier.

#' filter_meta
#' Filter a GISAID meta file by pango_lineage and country of exposure.
#' Data is summarized by date, and the lineage's share of each date's samples is calculated.
#' Optionally, a plot of the lineage's counts against the counts for all lineages can be displayed.
#'
#' @param df - a GISAID metafile
#' @param country - country of exposure
#' @param lineage - pango lineage
#' @param plot - a boolean, TRUE to plot
#' @param title - an optional title for the plot 
#'
#' @return a data frame with columns date, count (all lineages),
#'         lineage_count (chosen lineage), and pct (lineage_count / count, a fraction from 0 to 1)
#'
filter_meta <- function(df, country, lineage, plot = TRUE, title = NULL) {
  library(tidyverse)  # stops with an error if the tidyverse is not installed
  
  # filter by country, then count samples per date and lineage
  df2 <- df %>% 
    filter(country_exposure == {{ country }}) %>% 
    select(date, country_exposure, pango_lineage) %>% 
    group_by(date, pango_lineage ) %>% 
    count()
  
  # total samples (all lineages) per date
  df3 <- df2 %>%
    group_by(date) %>%
    summarize(count = sum(n))
  
  # filter by lineage
  df4 <- df2 %>%
    filter(pango_lineage == {{ lineage }}) %>%
    ungroup() %>%
    select(date, n) %>%
    rename(lineage_count = n)
  
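  # combine the tables; dates with no samples of the chosen lineage get a count of 0
  df5 <- left_join(df3, df4, by = "date") %>%
    replace_na(list(lineage_count = 0)) %>%
    mutate(pct = lineage_count / count) %>%
    na.omit()  # drops rows with a missing date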
  
  if(plot) {
    p <- ggplot(df5)  + 
      geom_area(aes(x = date, y = count, color='All Lineages'), alpha=0.6 , size=1) + 
      geom_area(aes(x = date, y = lineage_count, color = {{ lineage }}), alpha=0.6 , size=1) + 
      scale_color_manual("", breaks = c('All Lineages', lineage), values = c('black', 'red')) +
      ylab('Count') +
      xlab('Date')
    
    if(! is.null(title)) {
      p <- p + ggtitle(title)
    }
    
    print(p)
  }
  
  return(df5)
}
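For comparison, the intermediate data frames in filter_meta can be collapsed into a single pipeline. The following is only a sketch of that idea, not a replacement for the function above: the name lineage_share, its argument names, and the toy data frame are all made up for illustration, and plotting is left out.

```r
library(dplyr)

# Hypothetical compact variant: same counts and share as filter_meta(), no plot.
# Arguments are named target_* so they cannot clash with column names in df.
lineage_share <- function(df, target_country, target_lineage) {
  df %>%
    filter(country_exposure == target_country, !is.na(date)) %>%
    group_by(date) %>%
    summarize(
      count = n(),                                            # all lineages
      lineage_count = sum(pango_lineage == target_lineage,    # chosen lineage
                          na.rm = TRUE),
      .groups = "drop"
    ) %>%
    mutate(pct = lineage_count / count)
}

# Toy data (invented) with the three columns the function relies on
toy <- data.frame(
  date = as.Date(c("2021-01-01", "2021-01-01", "2021-01-02",
                   "2021-01-02", "2021-01-02", "2021-01-02")),
  country_exposure = c("USA", "USA", "USA", "USA", "USA", "United Kingdom"),
  pango_lineage = c("B.1.1.7", "B.1.2", "B.1.1.7", "B.1.1.7", "B.1.2", "B.1.1.7")
)

res <- lineage_share(toy, "USA", "B.1.1.7")
print(res)
```

The trade-off is the one mentioned earlier: the single pipeline is shorter, but the step-by-step version makes it easier to inspect each intermediate table when something looks off.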

We can see how B.1.1.7 took over the United Kingdom data in this plot.

You can see how B.1.1.7 began to show up in the data around November and quickly became dominant.

In the US, the story is different. The variant hasn't yet become dominant in the data.

The US has lagged the UK in sequencing. The GISAID data contains 287,309 United Kingdom sequences but only 180,571 US sequences. Because the UK has a much smaller population than the US, the UK data is a better representation of the status of the disease in its population. In terms of the B.1.1.7 variant, the UK data contains 120,141 sequences (42%). The US data contains 4,322 B.1.1.7 sequences (2%), but the counts will likely climb in the near future.
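The percentages quoted above follow directly from the sequence counts. A quick check in R, using only the numbers from the text (the variable names are just for illustration):

```r
# Sequence counts quoted in the text
uk_total <- 287309
uk_b117  <- 120141
us_total <- 180571
us_b117  <- 4322

# B.1.1.7 share of all sequences, as a percentage rounded to one decimal
uk_pct <- round(100 * uk_b117 / uk_total, 1)
us_pct <- round(100 * us_b117 / us_total, 1)

print(c(UK = uk_pct, US = us_pct))
```

Rounded to whole numbers these are the 42% and 2% given in the paragraph.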



