Brazil and P.1

Normal led to this.
~ Ed Young, write for The Atlantic

Be fast, have no regrets... If you need to be right before you move, you will never win.
~ Mike Ryan, epidemiologist at WHO

I had intended to continue exploring the use of the Julia Language for analyzing SAR-CoV-2 genomes, but first I wanted to take a quick look at the growth of the P.1 variant, a.k.a. the Brazilian variant. This lineage is considered a variant of concern. It has has 17 amino acid changes, ten of which are in its spike protein, including these three designated to be of particular concern: N501YE484K and K417T. We have seen these before. 

On April 15, 2021, I downloaded the metafile from GISAID. The file contains 1,107,140 records of SAR-CoV-2 observations, mostly drawn from humans. I read the file into R and tried to use the routine presented here to show a plot of the growth of the P.1 variant in Brazil. The code failed. 

It didn't take long to figure out why the program didn't work. GISAID had radically changed the table format. Columns were removed, combined, and new columns added. It didn't take long to fix, but I wish the GISAID folks would keep their file formats stable.

Here's the updated R code.


#' filter_meta_2021_04_15
#' filter GISAID meta file by pango_lineage and country of exposure.
#' Data is summarized by date and percent of linage for each date is calculated.
#' Optionally a plot of the lineage's per cent of the total samples can be displayed
#' 
#' This is an updated version of filter_meta because GISAID changed meta file format and
#' now I have a ton of code that no longer works!
#'
#' @param df - a GISAID metafile
#' @param country - country of exposure
#' @param lineage - pango lineage
#' @param plot - a boolean, TRUE to plot
#' @param title - an optional title for the plot 
#' @param host - optional host, default is Human
#'
#' @return df -a data frame
#'         with columns date, count of all lineage, count of chosen lineage, and pct of chosen lineage
#'         
filter_meta_2021_04_15 <- function(df, country, lineage, plot = TRUE, title = NULL, host = "Human") {
  require(tidyverse)
  
  # split the Location into separate columns
  temp <- df %>% 
    separate(Location, c("Continent", "Country", "Region", "City"), sep = "/") %>%
    mutate_at(c("Continent", "Country", "Region", "City"), str_trim)
  
  # filter by country and select columns
  df2 <- temp %>% 
    mutate(Collection.date = as.Date(Collection.date, "%Y-%m-%d")) %>%
    filter(Country == {{ country }} & Host == {{ host }}) %>% 
    select(Collection.date, Country, Pango.lineage) %>% 
    group_by(Collection.date, Pango.lineage) %>% 
    count()
  
  # count each linage by date
  df3 <- df2 %>%
    group_by(Collection.date) %>%
    summarize(count = sum(n))
  
  # filter by lineage
  df4 <- df2 %>%
    filter(Pango.lineage == {{ lineage }}) %>%
    ungroup() %>%
    select(Collection.date, n) %>%
    rename(lineage_count = n)
  
  # combine the tables
  df5 <- left_join(df3, df4) %>%
    replace_na(list(lineage_count = 0)) %>%
    mutate(pct = lineage_count / count) %>%
    na.omit()
  
  if(plot) {
    p <- ggplot(df5)  + 
      geom_area(aes(x = Collection.date, y = count, color='All Lineages'), alpha=0.6 , size=1) + 
      geom_area(aes(x = Collection.date, y = lineage_count, color = {{ lineage }}), alpha=0.6 , size=1) + 
      scale_color_manual("", breaks = c('All Lineages', {{ lineage }}), values = c('black', 'red')) +
      ylab('Count') +
      xlab('Date')
    
    if(! is.null(title)) {
      p <- p + ggtitle(title)
    }
    
    print(p)
  }
  
  return(df5)
}

This plot shows the rapid growth of P.1 in Brazil since the beginning of the year. Note that the total number of sequences from Brazil is small compared to the available sequences from the UK or USA.


By contrast, P.1 is not yet showing significant growth in the US or UK data. Note that the sequence data lags the clinical data, so this may change in the next few weeks.





No comments:

Post a Comment