The Analytic Garden: Popular Substitutions

As viral sequencing ramps up, more variants will be detected. Currently, three variants appear to be of concern: B.1.1.7, B.1.351, and P.1, first detected in the UK, South Africa, and Brazil, respectively. B.1.1.7 has an unusually large number of mutations, including many in the spike protein. Although this is the best established for B.1.1.7, all of these variants may transmit better from person to person. (Posted January 19, 2021)
- Kartik Chandran, PhD Professor of Microbiology and Immunology, Albert Einstein College of Medicine

So far, one mutation, D614G, that occurred early last year quickly became the predominant strain worldwide. It allows the virus to replicate more efficiently in the upper airway to enhance shedding and the efficiency of transmission. Fortunately, this substitution does not render the virus resistant to the antibodies generated in vaccinated persons. Likewise, another substitution in the spike, N501Y, which is shared by both the UK and S. African variants, does not allow the virus to escape neutralizing antibodies produced in response to vaccination. (Posted January 19, 2021)

- Scott C. Weaver, PhD Professor and Chair, Department of Microbiology and Immunology, Director, Institute for Human Infections and Immunity, Scientific Director, Galveston National Laboratory, University of Texas Medical Branch

In a previous post, I looked at the three most common SARS-CoV-2 variants circulating in the US: B.1.1.7, B.1.351, and P.1 as shown in sequence data from GISAID. In this post, I want to look at the most common amino-acid substitutions in the data affecting the SARS-CoV-2 Spike protein. These correspond to changes seen in the common variants.

The Top 10

The following plot shows the log count of spike protein amino-acid substitutions with the ten most total occurrences in the GISAID data for the US. The drop at around April 1 is probably an artifact of the lower number of submissions.

The table shows the top ten substitutions and their counts in the USA sequences in the data.

	AA Substitutions	Count
1	Spike_D614G	364905
2	Spike_P681H	96560
3	Spike_N501Y	81208
4	Spike_Y144del	80045
5	Spike_A570D	75367
6	Spike_V70del	75141
7	Spike_H69del	75129
8	Spike_T716I	73500
9	Spike_S982A	71491
10	Spike_D1118H	71407

Note that there are 3 deletion is the top ten changes. D614G means a D (Aspartic Acid) was replaced by a G (Glycine) at position 614 in the spike protein. D6124G is the most common substitution as seem by the green swath across the top of the plot. It arose in April last year.

How to Count

Counting the amino-acid substitutions starts with a bit of code similar to some discussed previously. Initially the GISAID metafile is read using read.csv(). Some filtering is done to select the host species and remove invalid lineages and collection dates. The AA.substitutions list is split and each substitution is inserted on it's own row along with Virus.name, Collection.date, Accession.ID, Country, and Pango.lineage.

Here's the code.

#' get_aa_substitution
#'     Count amino acid substitutions in GISAID metafile
#'
#' @param meta_table a dataframe read from a GISAID metafile by read.csv.
#' @param host a string, the host species.
#'
#' @return a dataframe with columns: Virus.name, Collection.date, Accession.ID, Country, Pango.lineage, AA.Substitutions.
#' 
get_aa_substitutions <- function(meta_table, host = "Human") {
  require(tidyverse)
  
  # Filter host species, remove empty lineages,a nd invalid dates.
  # Select columns.
  # Remove paranetheis from AA substitutions and split on commas and place each substituiion on its own row.
  # Split Location and save Country.
  df <- 
    meta_table %>% 
    filter(Host == {{ host }}) %>%
    filter(Pango.lineage != "") %>%
    filter(str_detect(Collection.date, "\\d\\d\\d\\d-\\d\\d-\\d\\d")) %>%
    select(Virus.name, Collection.date, Accession.ID, Location, Pango.lineage, AA.Substitutions) %>% 
    mutate(AA.Substitutions = str_replace_all(AA.Substitutions, "[()]", "")) %>%
    separate_rows(AA.Substitutions, sep = ",") %>% 
    separate(Location, c("Region", "Country", "State"), sep = "\\s*/\\s*", extra = "drop", fill = "right") %>%
    select(-c("Region", "State")) %>%
    mutate(Collection.date = as.Date(Collection.date, format = "%Y-%m-%d"))

  return(df)
}

Here's an example of the transformation.

> name_list
[1] "Virus.name"       "Collection.date"  "Accession.ID"     "Location"         "Pango.lineage"    "AA.Substitutions"
> metafile_2021_05_05 %>% select(all_of(name_list)) %>% filter(Accession.ID == "EPI_ISL_426905")
                     Virus.name Collection.date   Accession.ID                       Location Pango.lineage
1 hCoV-19/Australia/VIC390/2020      2020-03-23 EPI_ISL_426905 Oceania / Australia / Victoria         B.6.6
                                                   AA.Substitutions
1 (NSP3_S1197R,NSP12_A97V,NSP3_T1198K,NSP12_T739I,N_P13L,NSP6_L37F)
> aa_subs %>% filter(Accession.ID == "EPI_ISL_426905")
# A tibble: 6 x 6
  Virus.name                    Collection.date Accession.ID   Country   Pango.lineage AA.Substitutions
  <chr>                         <date>          <chr>          <chr>     <chr>         <chr>           
1 hCoV-19/Australia/VIC390/2020 2020-03-23      EPI_ISL_426905 Australia B.6.6         NSP3_S1197R     
2 hCoV-19/Australia/VIC390/2020 2020-03-23      EPI_ISL_426905 Australia B.6.6         NSP12_A97V      
3 hCoV-19/Australia/VIC390/2020 2020-03-23      EPI_ISL_426905 Australia B.6.6         NSP3_T1198K     
4 hCoV-19/Australia/VIC390/2020 2020-03-23      EPI_ISL_426905 Australia B.6.6         NSP12_T739I     
5 hCoV-19/Australia/VIC390/2020 2020-03-23      EPI_ISL_426905 Australia B.6.6         N_P13L          
6 hCoV-19/Australia/VIC390/2020 2020-03-23      EPI_ISL_426905 Australia B.6.6         NSP6_L37F       
>

Plotting the Substitutions

In order to make the plot shown above, it's necessary to find the ten most common substitutions in the US. The following code calculates the total counts of each substitution in the dataframe returned by get_aa_substitution() shown above.

  df <- 
    df_aa_subs %>% 
    filter(Country == {{ country }}) %>% 
    select(AA.Substitutions) %>% 
    filter(str_detect(AA.Substitutions, "^Spike")) %>% 
    group_by(AA.Substitutions) %>% 
    count() %>%
    rename(Count = n) %>%
    ungroup()
  
  # pull converts tibble column to vector
  top_subs <- pull(df[rev(order(df$Count)),][1:num_subs,1])

Next, the data from get_aa_substitution() is filtered by country, spike protein, and inclusion in the top_subs list. Then, the substitution are counted for each date.

  df2 <- 
    df_aa_subs %>% 
    filter(Country == {{ country }}) %>% 
    filter(str_detect(AA.Substitutions, "^Spike")) %>% 
    filter(AA.Substitutions %in% {{ top_subs }}) %>%
    select(Collection.date, AA.Substitutions) %>% 
    group_by(Collection.date, AA.Substitutions) %>% 
    count() %>%
    rename(Count = n) %>%
    ungroup()

The df2 dataframe is then used to plot.

Here's the entire function.

#' plot_spike_substitutions
#'     Plot most common amino-acid substitutions from GISAID metafile.
#'     
#' @param df_aa_subs A dataframe (tibble) returned from get_aa_substitution
#' @param country    A String, the country of interest.
#' @param num_subs   An integer, number of top substitutions to plot.
#' @param log        A logical, should counts be plotted on log scale.
#' @param title      A string, an optional title for the plot.
#'
#' @return a list of two dataframes (tibbles)
#'      AA_subs_date Counts by date and amino acid substitution.
#'      AA_subs_count Total counts by amino acid substitution.
#'      
plot_spike_substitutions <- function(df_aa_subs, 
                                     country = "USA",
                                     num_subs = 10,
                                     log = FALSE,
                                     title = NULL) {
  require(tidyverse)
  require(pals)
  
  # get counts of each substitution for country.
  df <- 
    df_aa_subs %>% 
    filter(Country == {{ country }}) %>% 
    select(AA.Substitutions) %>% 
    filter(str_detect(AA.Substitutions, "^Spike")) %>% 
    group_by(AA.Substitutions) %>% 
    count() %>%
    rename(Count = n) %>%
    ungroup()
  
  # pull converts tibble column to vector
  top_subs <- pull(df[rev(order(df$Count)),][1:num_subs,1])
  
  # get counts of each top substitution by date.
  df2 <- 
    df_aa_subs %>% 
    filter(Country == {{ country }}) %>% 
    filter(str_detect(AA.Substitutions, "^Spike")) %>% 
    filter(AA.Substitutions %in% {{ top_subs }}) %>%
    select(Collection.date, AA.Substitutions) %>% 
    group_by(Collection.date, AA.Substitutions) %>% 
    count() %>%
    rename(Count = n) %>%
    ungroup()
  
  # plot it
  if(! log) {
    p <- 
      ggplot(df2) +
        geom_point(aes(x = Collection.date, y = Count, color = AA.Substitutions))
  } 
  else {
    p <- 
      ggplot(df2) +
        geom_point(aes(x = Collection.date, y = log(Count), color = AA.Substitutions))
  }

  # getting distinct colors is hard.  
  p <- p + scale_color_manual(values = glasbey(num_subs))
  
  if(! is.null(title)) {
    p <- p + ggtitle(title)
  }
  
  print(p)
  
  return(list(AA_subs_date = df2, AA_subs_count = df))
}

Code formatting by http://hilite.me/.

The Analytic Garden

Popular Substitutions

The Top 10

How to Count

Plotting the Substitutions

No comments:

Post a Comment

Labels

Contributors

wfmu