Spring Variants

Spring has actually arrived in upstate NY and along with it a small COVID-19 surge. 

First, here is the 7-day moving average of positive cases for the surrounding five-county area.


You can see a slight uptick in cases and hospitalizations in recent days. Albany County has a larger population and shows a more dramatic increase. We have to keep in mind that the data is probably skewed because fewer tests are being reported. People with milder symptoms don't show up at testing centers or doctors' offices, and home tests aren't reported. Hospitalization data is more solid.
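For readers who want to reproduce this kind of smoothing, here is a minimal sketch of a trailing 7-day moving average in base R. The daily counts are made up for illustration and are not the county data shown in the chart.

```r
# Made-up daily positive-case counts for one county (illustration only).
daily_cases <- c(12, 15, 9, 20, 18, 22, 25, 30, 28, 35)

# Trailing 7-day moving average: each value is the mean of that day and the
# six days before it. stats::filter is called with its namespace because
# dplyr::filter masks it when the tidyverse is loaded.
ma7 <- as.numeric(stats::filter(daily_cases, rep(1 / 7, 7), sides = 1))

# The first six entries are NA because a full 7-day window is not yet available.
print(round(ma7, 2))
```

A trailing window (sides = 1) only looks backward, so the most recent day always has a value. A centered window would leave the last three days empty.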

Lineages

In a previous post, I commented on the difficulty of getting a clear picture of the state of the COVID-19 pandemic from publicly available data. Sequences and metadata from GISAID are a rich and valuable source of information about the progress of the pandemic. Unfortunately, sequence data tends to lag case data. In addition, as public testing declines, so do the opportunities to get viral material from infected individuals.

That said, I still wanted to see what kind of picture the GISAID data could provide for our local area. On April 20, 2022, I downloaded 10,298,748 records for 22 variables.

I searched the GISAID data collected from the five local counties for Pango lineages submitted so far during 2022. There were a total of 63 different lineages found, most with only a few counts.

Here's how I counted.

county_variants <- function(meta_data,
                            host = 'human',
                            state = 'New York',
                            counties = c('Albany', 'Columbia', 'Rensselaer', 'Saratoga', 'Schenectady'),
                            start = "2022-01-01",
                            end = NULL,
                            plot = TRUE,
                            title = NULL,
                            min_count = 10) {
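  # library() stops with an error if the tidyverse is missing;
  # require() would only warn and let the function fail later
  library(tidyverse)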
  
  if(is.null(end)) {
    end <- Sys.Date()
  }
  
  # filter date range and state
  df <- meta_data %>%
    filter(Host == {{ host }}) %>%
    filter(Collection.date >= as.Date({{ start }}) & Collection.date <= as.Date({{ end }})) %>%
    filter(str_detect(Location, {{ state }})) %>%
    select(Collection.date, Pango.lineage, Location)

  df_county_variants <- data.frame()    
  for(county in counties) {
    county_name <- paste(county, "County")
    
    # get county data and count Pango lineages for the county
    df_county <- df %>% 
      filter(str_detect(Location, county_name)) %>% 
      count(Pango.lineage, name = "Count") %>%
      mutate(County = county)
    
    df_county_variants <- bind_rows(df_county_variants, df_county)
  }
  
  if(plot) {
    # plot only lineages with more than min_count submissions
    p <- df_county_variants %>%
      filter(Count > min_count) %>%
      ggplot() + 
        geom_bar(aes(x = Pango.lineage, y = Count, fill = County), stat = 'identity', width = 0.5) +
        labs(caption = paste('Minimum Count =', min_count)) +
        xlab('Pango Lineage')
    
    if(! is.null(title)) {
      p <- p + ggtitle(title)
    }
    
    print(p)
  }
  
  return(df_county_variants)
}

The code can be downloaded from GitHub.
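The loop over counties could also be written as a single vectorized pass. The sketch below is an alternative, not the code used for this post. It runs on a made-up toy table, and the Location format ("Continent / Country / State / County") is an assumption about how GISAID writes locations.

```r
library(dplyr)
library(stringr)

# Toy stand-in for the GISAID metadata (made-up rows; the Location
# format is an assumption).
toy_meta <- data.frame(
  Pango.lineage = c("BA.1.1", "BA.1.1", "BA.1", "BA.1.1", "BA.1"),
  Location = c(
    "North America / USA / New York / Albany County",
    "North America / USA / New York / Albany County",
    "North America / USA / New York / Saratoga County",
    "North America / USA / New York / Saratoga County",
    "North America / USA / New York / Kings County"
  )
)

counties <- c("Albany", "Saratoga")

# Regular expression that matches "<name> County" for any requested county
pattern <- paste0("(", paste(counties, collapse = "|"), ") County")

toy_counts <- toy_meta %>%
  # pull out the matching county name, e.g. "Albany County" -> "Albany"
  mutate(County = str_remove(str_extract(Location, pattern), " County$")) %>%
  # rows from other counties have no match and are dropped
  filter(!is.na(County)) %>%
  count(County, Pango.lineage, name = "Count")

print(toy_counts)
```

This avoids growing a data frame inside a loop, at the cost of a slightly less obvious regular expression.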

There were thirty different lineages collected and submitted from the local five-county region. Most lineages had fewer than four submissions.

GISAID submissions lag by a few weeks. The most recent submission date in this collection was 2022-04-06.

We can't assume that the data is a random sample of the infected population. However, it does give us a possibly distorted picture of the local state of the pandemic during a certain time period.


It is interesting that although the BA.2 lineages have been in the news, they do not appear in the data yet. BA.1.1 accounts for slightly over 50% of the total samples. This situation has probably already changed: the most recent date in the five-county data was 2022-03-24, almost a month before I downloaded the data.
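A share like the BA.1.1 figure can be computed from the table that county_variants() returns. Here is a minimal sketch with made-up counts in the same shape (Pango.lineage, Count, County). The numbers are illustrative only and are not the actual five-county results.

```r
library(dplyr)

# Made-up counts in the shape returned by county_variants() (illustration only)
toy_counts <- data.frame(
  Pango.lineage = c("BA.1.1", "BA.1", "BA.1.1", "BA.1.15"),
  Count = c(30, 12, 25, 8),
  County = c("Albany", "Albany", "Saratoga", "Saratoga")
)

lineage_share <- toy_counts %>%
  # add up each lineage across counties
  group_by(Pango.lineage) %>%
  summarise(Count = sum(Count), .groups = "drop") %>%
  # fraction of all samples that belong to each lineage
  mutate(Share = Count / sum(Count)) %>%
  arrange(desc(Share))

print(lineage_share)
```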

