As viral sequencing ramps up, more variants will be detected. Currently, three variants appear to be of concern: B.1.1.7, B.1.351, and P.1, first detected in the UK, South Africa, and Brazil, respectively. B.1.1.7 has an unusually large number of mutations, including many in the spike protein. Although this is the best established for B.1.1.7, all of these variants may transmit better from person to person. (Posted January 19, 2021)
- Kartik Chandran, PhD Professor of Microbiology and Immunology, Albert Einstein College of Medicine
- Kartik Chandran, PhD Professor of Microbiology and Immunology, Albert Einstein College of Medicine
So far, one mutation, D614G, that occurred early last year quickly became the predominant strain worldwide. It allows the virus to replicate more efficiently in the upper airway to enhance shedding and the efficiency of transmission. Fortunately, this substitution does not render the virus resistant to the antibodies generated in vaccinated persons. Likewise, another substitution in the spike, N501Y, which is shared by both the UK and S. African variants, does not allow the virus to escape neutralizing antibodies produced in response to vaccination. (Posted January 19, 2021)
- Scott C. Weaver, PhD Professor and Chair, Department of Microbiology and Immunology, Director, Institute for Human Infections and Immunity, Scientific Director, Galveston National Laboratory, University of Texas Medical Branch
In a previous post, I looked at the three most common SARS-CoV-2 variants circulating in the US: B.1.1.7, B.1.351, and P.1 as shown in sequence data from GISAID. In this post, I want to look at the most common amino-acid substitutions in the data affecting the SARS-CoV-2 Spike protein. These correspond to changes seen in the common variants.
The Top 10
The following plot shows the log count of spike protein amino-acid substitutions with the ten most total occurrences in the GISAID data for the US. The drop at around April 1 is probably an artifact of the lower number of submissions.
The table shows the top ten substitutions and their counts in the USA sequences in the data.
|
AA Substitutions |
Count |
1 |
Spike_D614G |
364905 |
2 |
Spike_P681H |
96560 |
3 |
Spike_N501Y |
81208 |
4 |
Spike_Y144del |
80045 |
5 |
Spike_A570D |
75367 |
6 |
Spike_V70del |
75141 |
7 |
Spike_H69del |
75129 |
8 |
Spike_T716I |
73500 |
9 |
Spike_S982A |
71491 |
10 |
Spike_D1118H |
71407 |
Note that there are 3 deletion is the top ten changes. D614G means a D (Aspartic Acid) was replaced by a G (Glycine) at position 614 in the spike protein. D6124G is the most common substitution as seem by the green swath across the top of the plot. It arose in April last year.
How to Count
Counting the amino-acid substitutions starts with a bit of code similar to some discussed previously. Initially the GISAID metafile is read using read.csv(). Some filtering is done to select the host species and remove invalid lineages and collection dates. The AA.substitutions list is split and each substitution is inserted on it's own row along with Virus.name, Collection.date, Accession.ID, Country, and Pango.lineage.
Here's the code.
#' get_aa_substitution #' Count amino acid substitutions in GISAID metafile #' #' @param meta_table a dataframe read from a GISAID metafile by read.csv. #' @param host a string, the host species. #' #' @return a dataframe with columns: Virus.name, Collection.date, Accession.ID, Country, Pango.lineage, AA.Substitutions. #' get_aa_substitutions <- function(meta_table, host = "Human") { require(tidyverse) # Filter host species, remove empty lineages,a nd invalid dates. # Select columns. # Remove paranetheis from AA substitutions and split on commas and place each substituiion on its own row. # Split Location and save Country. df <- meta_table %>% filter(Host == {{ host }}) %>% filter(Pango.lineage != "") %>% filter(str_detect(Collection.date, "\\d\\d\\d\\d-\\d\\d-\\d\\d")) %>% select(Virus.name, Collection.date, Accession.ID, Location, Pango.lineage, AA.Substitutions) %>% mutate(AA.Substitutions = str_replace_all(AA.Substitutions, "[()]", "")) %>% separate_rows(AA.Substitutions, sep = ",") %>% separate(Location, c("Region", "Country", "State"), sep = "\\s*/\\s*", extra = "drop", fill = "right") %>% select(-c("Region", "State")) %>% mutate(Collection.date = as.Date(Collection.date, format = "%Y-%m-%d")) return(df) }
Here's an example of the transformation.
> name_list [1] "Virus.name" "Collection.date" "Accession.ID" "Location" "Pango.lineage" "AA.Substitutions" > metafile_2021_05_05 %>% select(all_of(name_list)) %>% filter(Accession.ID == "EPI_ISL_426905") Virus.name Collection.date Accession.ID Location Pango.lineage 1 hCoV-19/Australia/VIC390/2020 2020-03-23 EPI_ISL_426905 Oceania / Australia / Victoria B.6.6 AA.Substitutions 1 (NSP3_S1197R,NSP12_A97V,NSP3_T1198K,NSP12_T739I,N_P13L,NSP6_L37F) > aa_subs %>% filter(Accession.ID == "EPI_ISL_426905") # A tibble: 6 x 6 Virus.name Collection.date Accession.ID Country Pango.lineage AA.Substitutions <chr> <date> <chr> <chr> <chr> <chr> 1 hCoV-19/Australia/VIC390/2020 2020-03-23 EPI_ISL_426905 Australia B.6.6 NSP3_S1197R 2 hCoV-19/Australia/VIC390/2020 2020-03-23 EPI_ISL_426905 Australia B.6.6 NSP12_A97V 3 hCoV-19/Australia/VIC390/2020 2020-03-23 EPI_ISL_426905 Australia B.6.6 NSP3_T1198K 4 hCoV-19/Australia/VIC390/2020 2020-03-23 EPI_ISL_426905 Australia B.6.6 NSP12_T739I 5 hCoV-19/Australia/VIC390/2020 2020-03-23 EPI_ISL_426905 Australia B.6.6 N_P13L 6 hCoV-19/Australia/VIC390/2020 2020-03-23 EPI_ISL_426905 Australia B.6.6 NSP6_L37F >
Plotting the Substitutions
In order to make the plot shown above, it's necessary to find the ten most common substitutions in the US. The following code calculates the total counts of each substitution in the dataframe returned by get_aa_substitution() shown above.
df <- df_aa_subs %>% filter(Country == {{ country }}) %>% select(AA.Substitutions) %>% filter(str_detect(AA.Substitutions, "^Spike")) %>% group_by(AA.Substitutions) %>% count() %>% rename(Count = n) %>% ungroup() # pull converts tibble column to vector top_subs <- pull(df[rev(order(df$Count)),][1:num_subs,1])
Next, the data from get_aa_substitution() is filtered by country, spike protein, and inclusion in the top_subs list. Then, the substitution are counted for each date.
df2 <- df_aa_subs %>% filter(Country == {{ country }}) %>% filter(str_detect(AA.Substitutions, "^Spike")) %>% filter(AA.Substitutions %in% {{ top_subs }}) %>% select(Collection.date, AA.Substitutions) %>% group_by(Collection.date, AA.Substitutions) %>% count() %>% rename(Count = n) %>% ungroup()
The df2 dataframe is then used to plot.
Here's the entire function.
#' plot_spike_substitutions #' Plot most common amino-acid substitutions from GISAID metafile. #' #' @param df_aa_subs A dataframe (tibble) returned from get_aa_substitution #' @param country A String, the country of interest. #' @param num_subs An integer, number of top substitutions to plot. #' @param log A logical, should counts be plotted on log scale. #' @param title A string, an optional title for the plot. #' #' @return a list of two dataframes (tibbles) #' AA_subs_date Counts by date and amino acid substitution. #' AA_subs_count Total counts by amino acid substitution. #' plot_spike_substitutions <- function(df_aa_subs, country = "USA", num_subs = 10, log = FALSE, title = NULL) { require(tidyverse) require(pals) # get counts of each substitution for country. df <- df_aa_subs %>% filter(Country == {{ country }}) %>% select(AA.Substitutions) %>% filter(str_detect(AA.Substitutions, "^Spike")) %>% group_by(AA.Substitutions) %>% count() %>% rename(Count = n) %>% ungroup() # pull converts tibble column to vector top_subs <- pull(df[rev(order(df$Count)),][1:num_subs,1]) # get counts of each top substitution by date. df2 <- df_aa_subs %>% filter(Country == {{ country }}) %>% filter(str_detect(AA.Substitutions, "^Spike")) %>% filter(AA.Substitutions %in% {{ top_subs }}) %>% select(Collection.date, AA.Substitutions) %>% group_by(Collection.date, AA.Substitutions) %>% count() %>% rename(Count = n) %>% ungroup() # plot it if(! log) { p <- ggplot(df2) + geom_point(aes(x = Collection.date, y = Count, color = AA.Substitutions)) } else { p <- ggplot(df2) + geom_point(aes(x = Collection.date, y = log(Count), color = AA.Substitutions)) } # getting distinct colors is hard. p <- p + scale_color_manual(values = glasbey(num_subs)) if(! is.null(title)) { p <- p + ggtitle(title) } print(p) return(list(AA_subs_date = df2, AA_subs_count = df)) }
Code formatting by http://hilite.me/.
No comments:
Post a Comment