The Analytic Garden: H5N1

The H5N1 strain of avian influenza (bird flu) is a significant health concern. Although human infections of H5N1 are relative rare, the effect on bird populations has been devastating in some areas. In addition to birds, H5N1 infections have been found in domestic cattle, cats, goats, and other mammals. So far, Human cases appear to come from contact with infected animals. Although there are cases of undetermined origin, effective human-to-human transmission hasn't been determined.

H5N1 belongs to the A subtype of the influenza virus. It's a negative sense RNA virus.

I wanted to get a picture of H5N1 genomics. In particular, I am interested in what is happening in the HA protein. The viral genome consists of eight proteins: PB1, PB2, PA, HA, NP, NA, M, and NS. See H5N1 genetic structure for more details and references.

I'm particularly interested in changes in the hemagglutinin (HA) protein. HA is found on the surface of viral envelope. It is responsible for binding the viron to the host cell surface, so it's a key element in infection. In particular, a single change of a Gln to Leu in the HA protein at position 226 in amino acid sequence switches HA's binding specificity to bind to human cell receptors. See https://www.science.org/doi/10.1126/science.adt0180.

https://en.wikipedia.org/wiki/Hemagglutinin_(influenza)#/media/File:PDB_1hgd_EBI.jpg

On December 12 2024, I downloaded 3,432 HA genomic sequences covering the time period from 2024-01-01 to 2024-12-12 from GISAID's EpiFlu database. EpiFlu doesn't provide a metadate file for the sequences, but you can specify much of the same information to be included in the FASTA header. For example,

>A/environment/India/240005/2024|EPI_ISL_19521827|A_/_H5N1|2.3.4.4b|2024-06-15|HA||EPI3625882

The elements are separated by "|" marks.

A/environment/India/240005/2024 indicates the A subtype, the host or source of the sample, the country of origin, sequence number, and year.

EPI_ISL_19521827 is the GISAID ID.

A_/_H5N1 indicates influenza subtype and virus type.

2.3.4.4b is the virus clade.

2024-06-15 is the collection date.

HA is the protein type.

EPI3625882 is the accession number.

For a first step, I reformatted the headers into a table so I could see where the data came from and which host the sample was drawn from. One problem is that submissions were not formatted uniformly. Let's examine hosts and country of origin to see the problem. For example, sometimes country is entered as USA and sometimes by state. Hosts are reported in a variety of ways for the same species. For example, CHICKEN and HEN seem to be the same species, or maybe not. I may have missed a few items in converting the host and country names.

Here's the R code for reading the FASTA headers and generating a dataframe.

#' fasta2dataframe - read a GISAID EpiFlu fasta file to a dataframe
#'
#' @param fasta_file - a string containg that fasta file name
#' @param remove_ref - boolean. Should reference sequence be removed from dataframe
#' @param ref_row - the row number of teh reference sequence, default = 1. Ignored if remove_ref = FALSE
#' @param min_seq_length - sequences with few elements are removed
#'
#' @returns - a dataframe with 16 columns
#' "Influenza_type"   "Host"             "Country"          "ID"               "Year"             "Isolate_name"     "Isolate_ID"      
#' "Type"             "Clade"            "Collection_date"  "Segment"          "Lineage"          "DNA_Accession_no" "seq.name"        
#' "seq.text"         "seq_length"
#'
fasta2dataframe <- function(fasta_file, 
                            remove_ref = FALSE, 
                            ref_row = 1,
                            min_seq_length = 1700) {
  require(phylotools)  # for reading the FASTA file
  require(tidyverse)
  
  # a list of US states for converting atates to USA
  states <- c("Alabama", "Alaska", "Arizona",  "Arkansas", "California", "Colorado",       
              "Connecticut", "Delaware", "Florida","Georgia", "Hawaii",
              "Idaho","Illinois", "Indiana", "Iowa", "Kansas", "Kentucky", "Louisiana",
              "Maine", "Maryland", "Massachusetts", "Michigan", "Minnesota",      
              "Mississippi", "Missouri", "Montana",  "Nebraska", "Nevada",  
              "New_Hampshire", "New_Jersey", "New_Mexico", "New_York", "North_Carolina", "North_Dakota",
              "Ohio", "Oklahoma", "Oregon", "Pennsylvania", "Rhode_Island", 
              "South_Carolina", "South_Dakota", "Tennessee", "Texas", "Utah", 
              "Vermont", "Virginia", "Washington", "West,Virginia", "Wisconsin", "Wyoming")
  
  df <- read.fasta(fasta_file)
  if(remove_ref) {
    df <- df %>% slice(-ref_row)
  }
  
  # split sequence  IDs and get rid of any sequence that doesn't have full information
  df <- df %>% separate_wider_delim(seq.name, 
                                    delim = "|", 
                                    names = c("Isolate_name", "Isolate_ID", "Type", "Clade", 
                                              "Collection_date", "Segment", "Lineage", "DNA_Accession_no"),
                                    too_many = "debug", too_few = "debug")
  df <- df %>% filter(`seq.name_ok`)
  
  df <- df %>% 
    separate_wider_delim(`Isolate_name`, 
                         delim = "/", 
                         names = c("Influenza_type", "Host", "Country", "ID", "Year"), 
                         too_many = "debug", too_few = "debug")
  df <- df %>% filter(`Isolate_name_ok`)
  
  # map states to USA and map the German entries to Germany. Foramt a few others.
  df <- df %>% mutate(Country = if_else(Country %in% states, "USA", Country))
  df <- df %>% mutate(Country = if_else(grepl("Germany", Country), "Germany", Country))
  df <- df %>% mutate(Country = if_else(grepl("Nordrhein-Westfalen", Country), "Germany", Country))
  df <- df %>% mutate(Country = if_else(Country %in% c("Sao_Paulo", "Rio_de_Janeiro"), "Brazil", Country))
  df <- df %>% mutate(Country = if_else(Country %in% c("Kagoshima", "Ishikawa", "Iwate", "Hokkaido", "Hiroshima"), "Japan", Country))
  
  # Combine common hosts
  df <- df %>% mutate(Host = str_to_upper(Host))
  df <- df %>% mutate(Host = if_else(grepl("DUCK", Host), "DUCK", Host))
  df <- df %>% mutate(Host = if_else(grepl("MALLARD", Host), "DUCK", Host))
  df <- df %>% mutate(Host = if_else(grepl("HEN", Host), "CHICKEN", Host))
  df <- df %>% mutate(Host = if_else(grepl("CAT", Host), "CAT", Host))
  df <- df %>% mutate(Host = if_else(grepl("FELINE", Host), "CAT", Host))
  df <- df %>% mutate(Host = if_else(grepl("GOOSE", Host), "GOOSE", Host))
  df <- df %>% mutate(Host = if_else(grepl("CROW", Host), "CROW", Host))
  df <- df %>% mutate(Host = if_else(grepl("FOX", Host), "FOX", Host))
  
  # filter sequence columns
  df <- df %>% filter(str_length(seq.text) >= min_seq_length) 
  df <- df %>% mutate(seq.text = str_to_upper(seq.text))
  df <- df %>% mutate(seq_length = str_length(seq.text))
  
  # Since we removed the rows with missing values, remove the debug columns
  df <- df %>% select(-c("Isolate_name_ok", "Isolate_name_pieces", "Isolate_name_remainder"))
  df <- df %>% select(-c("seq.name_ok", "seq.name_pieces", "seq.name_remainder"))
  
  return(df)
}

There's a lot of detail there, but the transformations are pretty simple. The results are a dataframe with 16 columns. Here are the first 10 rows. I left out the column with the sequences.

Influenza_type	Host	Country	ID	Year	Isolate_name	Isolate_ID	Type	Clade	Collection_date	Segment	DNA_Accession_no	seq.name
A	ENVIRONMENT	India	240005	2024	A/environment/India/240005/2024	EPI_ISL_19521827	A_/_H5N1	2.3.4.4b	6/15/2024	HA	EPI3625882	A/environment/India/240005/2024\|EPI_ISL_19521827\|A_/_H5N1\|2.3.4.4b\|2024-06-15\|HA\|\|EPI3625882
A	CHICKEN	India	240008	2024	A/chicken/India/240008/2024	EPI_ISL_19521829	A_/_H5N1	2.3.4.4b	6/15/2024	HA	EPI3625898	A/chicken/India/240008/2024\|EPI_ISL_19521829\|A_/_H5N1\|2.3.4.4b\|2024-06-15\|HA\|\|EPI3625898
A	CROW	India	240013	2024	A/crow/India/240013/2024	EPI_ISL_19521828	A_/_H5N1	2.3.4.4b	6/15/2024	HA	EPI3625890	A/crow/India/240013/2024\|EPI_ISL_19521828\|A_/_H5N1\|2.3.4.4b\|2024-06-15\|HA\|\|EPI3625890
A	MUTE_SWAN	Mangystau	02	2024	A/Mute_Swan/Mangystau/02/2024	EPI_ISL_19259709	A_/_H5N1	2.3.4.4b	1/9/2024	HA	EPI3433322	A/Mute_Swan/Mangystau/02/2024\|EPI_ISL_19259709\|A_/_H5N1\|2.3.4.4b\|2024-01-09\|HA\|\|EPI3433322
A	CROW	Japan	B080	2024	A/large-billed_crow/Hokkaido/B080/2024	EPI_ISL_18932042	A_/_H5N1	2.3.4.4b	1/22/2024	HA	EPI3067350	A/large-billed_crow/Hokkaido/B080/2024\|EPI_ISL_18932042\|A_/_H5N1\|2.3.4.4b\|2024-01-22\|HA\|\|EPI3067350
A	DAIRY_COW	USA	24-008766-001	2024	A/dairy_cow/Kansas/24-008766-001/2024	EPI_ISL_19145035	A_/_H5N1	2.3.4.4b	3/21/2024	HA	EPI3297748	A/dairy_cow/Kansas/24-008766-001/2024\|EPI_ISL_19145035\|A_/_H5N1\|2.3.4.4b\|2024-03-21\|HA\|\|EPI3297748
A	GOOSE	Germany	2024AI01291	2024	A/Barnacle_Goose/Germany-SH/2024AI01291/2024	EPI_ISL_19014078	A_/_H5N1	2.3.4.4b	2/9/2024	HA	EPI3158594	A/Barnacle_Goose/Germany-SH/2024AI01291/2024\|EPI_ISL_19014078\|A_/_H5N1\|2.3.4.4b\|2024-02-09\|HA\|\|EPI3158594
A	CHICKEN	Germany	2024AI00860	2024	A/chicken/Germany-MV/2024AI00860/2024	EPI_ISL_19014079	A_/_H5N1	2.3.4.4b	2/2/2024	HA	EPI3158602	A/chicken/Germany-MV/2024AI00860/2024\|EPI_ISL_19014079\|A_/_H5N1\|2.3.4.4b\|2024-02-02\|HA\|\|EPI3158602
A	DAIRY_COW	USA	24-010303-002	2024	A/dairy_cow/Michigan/24-010303-002/2024	EPI_ISL_19194255	A_/_H5N1	2.3.4.4b	4/3/2024	HA	EPI3364348	A/dairy_cow/Michigan/24-010303-002/2024\|EPI_ISL_19194255\|A_/_H5N1\|2.3.4.4b\|2024-04-03\|HA\|\|EPI3364348
A	CHICKEN	Italy	24VIR9865-7	2024	A/laying_hen/Italy/24VIR9865-7/2024	EPI_ISL_19554808	A_/_H5N1	2.3.4.4b	11/9/2024	HA	EPI3651272	A/laying_hen/Italy/24VIR9865-7/2024\|EPI_ISL_19554808\|A_/_H5N1\|2.3.4.4b\|2024-11-09\|HA\|\|EPI3651272

We can generate a list of the top 10 countries submitting sequences,

> df_H5N1 %>% group_by(Country) %>% summarise(n = n()) %>% arrange(desc(n)) %>% head(n = 10)
# A tibble: 10 × 2
   Country            n
   <chr>          <int>
 1 USA             2627
 2 Czech_Republic   231
 3 Japan             64
 4 Poland            62
 5 BC                45
 6 Italy             43
 7 Germany           38
 8 Austria           26
 9 Bangladesh        26
10 Vietnam           18
>

We can also get a list of the most common hosts in the data.

> df_H5N1 %>% group_by(Host) %>% summarise(n = n()) %>% arrange(desc(n)) %>% head(n = 10)
# A tibble: 10 × 2
   Host           n
   <chr>      <int>
 1 DAIRY_COW   1928
 2 CHICKEN      539
 3 TURKEY       167
 4 DUCK         158
 5 GOOSE         81
 6 CROW          62
 7 CAT           54
 8 MUTE_SWAN     54
 9 GOAT          15
10 SANDERLING    12
>

Next post, I'll look at the common mutations in the data set.

You can get the source code for creating the table on GitHub.

The Analytic Garden

H5N1

No comments:

Post a Comment

Labels

Contributors

wfmu