The H5N1 strain of avian influenza (bird flu) is a significant health concern. Although human infections of H5N1 are relative rare, the effect on bird populations has been devastating in some areas. In addition to birds, H5N1 infections have been found in domestic cattle, cats, goats, and other mammals. So far, Human cases appear to come from contact with infected animals. Although there are cases of undetermined origin, effective human-to-human transmission hasn't been determined.
H5N1 belongs to the A subtype of the influenza virus. It's a negative sense RNA virus.
I wanted to get a picture of H5N1 genomics. In particular, I am interested in what is happening in the HA protein. The viral genome consists of eight proteins: PB1, PB2, PA, HA, NP, NA, M, and NS. See H5N1 genetic structure for more details and references.
I'm particularly interested in changes in the hemagglutinin (HA) protein. HA is found on the surface of viral envelope. It is responsible for binding the viron to the host cell surface, so it's a key element in infection. In particular, a single change of a Gln to Leu in the HA protein at position 226 in amino acid sequence switches HA's binding specificity to bind to human cell receptors. See https://www.science.org/doi/10.1126/science.adt0180.
https://en.wikipedia.org/wiki/Hemagglutinin_(influenza)#/media/File:PDB_1hgd_EBI.jpg
On December 12 2024, I downloaded 3,432 HA genomic sequences covering the time period from 2024-01-01 to 2024-12-12 from GISAID's EpiFlu database. EpiFlu doesn't provide a metadate file for the sequences, but you can specify much of the same information to be included in the FASTA header. For example,
>A/environment/India/240005/2024|EPI_ISL_19521827|A_/_H5N1|2.3.4.4b|2024-06-15|HA||EPI3625882
#' fasta2dataframe - read a GISAID EpiFlu fasta file to a dataframe #' #' @param fasta_file - a string containg that fasta file name #' @param remove_ref - boolean. Should reference sequence be removed from dataframe #' @param ref_row - the row number of teh reference sequence, default = 1. Ignored if remove_ref = FALSE #' @param min_seq_length - sequences with few elements are removed #' #' @returns - a dataframe with 16 columns #' "Influenza_type" "Host" "Country" "ID" "Year" "Isolate_name" "Isolate_ID" #' "Type" "Clade" "Collection_date" "Segment" "Lineage" "DNA_Accession_no" "seq.name" #' "seq.text" "seq_length" #' fasta2dataframe <- function(fasta_file, remove_ref = FALSE, ref_row = 1, min_seq_length = 1700) { require(phylotools) # for reading the FASTA file require(tidyverse) # a list of US states for converting atates to USA states <- c("Alabama", "Alaska", "Arizona", "Arkansas", "California", "Colorado", "Connecticut", "Delaware", "Florida","Georgia", "Hawaii", "Idaho","Illinois", "Indiana", "Iowa", "Kansas", "Kentucky", "Louisiana", "Maine", "Maryland", "Massachusetts", "Michigan", "Minnesota", "Mississippi", "Missouri", "Montana", "Nebraska", "Nevada", "New_Hampshire", "New_Jersey", "New_Mexico", "New_York", "North_Carolina", "North_Dakota", "Ohio", "Oklahoma", "Oregon", "Pennsylvania", "Rhode_Island", "South_Carolina", "South_Dakota", "Tennessee", "Texas", "Utah", "Vermont", "Virginia", "Washington", "West,Virginia", "Wisconsin", "Wyoming") df <- read.fasta(fasta_file) if(remove_ref) { df <- df %>% slice(-ref_row) } # split sequence IDs and get rid of any sequence that doesn't have full information df <- df %>% separate_wider_delim(seq.name, delim = "|", names = c("Isolate_name", "Isolate_ID", "Type", "Clade", "Collection_date", "Segment", "Lineage", "DNA_Accession_no"), too_many = "debug", too_few = "debug") df <- df %>% filter(`seq.name_ok`) df <- df %>% separate_wider_delim(`Isolate_name`, delim = "/", names = c("Influenza_type", "Host", "Country", "ID", "Year"), too_many = "debug", too_few = "debug") df <- df %>% filter(`Isolate_name_ok`) # map states to USA and map the German entries to Germany. Foramt a few others. df <- df %>% mutate(Country = if_else(Country %in% states, "USA", Country)) df <- df %>% mutate(Country = if_else(grepl("Germany", Country), "Germany", Country)) df <- df %>% mutate(Country = if_else(grepl("Nordrhein-Westfalen", Country), "Germany", Country)) df <- df %>% mutate(Country = if_else(Country %in% c("Sao_Paulo", "Rio_de_Janeiro"), "Brazil", Country)) df <- df %>% mutate(Country = if_else(Country %in% c("Kagoshima", "Ishikawa", "Iwate", "Hokkaido", "Hiroshima"), "Japan", Country)) # Combine common hosts df <- df %>% mutate(Host = str_to_upper(Host)) df <- df %>% mutate(Host = if_else(grepl("DUCK", Host), "DUCK", Host)) df <- df %>% mutate(Host = if_else(grepl("MALLARD", Host), "DUCK", Host)) df <- df %>% mutate(Host = if_else(grepl("HEN", Host), "CHICKEN", Host)) df <- df %>% mutate(Host = if_else(grepl("CAT", Host), "CAT", Host)) df <- df %>% mutate(Host = if_else(grepl("FELINE", Host), "CAT", Host)) df <- df %>% mutate(Host = if_else(grepl("GOOSE", Host), "GOOSE", Host)) df <- df %>% mutate(Host = if_else(grepl("CROW", Host), "CROW", Host)) df <- df %>% mutate(Host = if_else(grepl("FOX", Host), "FOX", Host)) # filter sequence columns df <- df %>% filter(str_length(seq.text) >= min_seq_length) df <- df %>% mutate(seq.text = str_to_upper(seq.text)) df <- df %>% mutate(seq_length = str_length(seq.text)) # Since we removed the rows with missing values, remove the debug columns df <- df %>% select(-c("Isolate_name_ok", "Isolate_name_pieces", "Isolate_name_remainder")) df <- df %>% select(-c("seq.name_ok", "seq.name_pieces", "seq.name_remainder")) return(df) }
Influenza_type | Host | Country | ID | Year | Isolate_name | Isolate_ID | Type | Clade | Collection_date | Segment | Lineage | DNA_Accession_no | seq.name |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
A | ENVIRONMENT | India | 240005 | 2024 | A/environment/India/240005/2024 | EPI_ISL_19521827 | A_/_H5N1 | 2.3.4.4b | 6/15/2024 | HA | EPI3625882 | A/environment/India/240005/2024|EPI_ISL_19521827|A_/_H5N1|2.3.4.4b|2024-06-15|HA||EPI3625882 | |
A | CHICKEN | India | 240008 | 2024 | A/chicken/India/240008/2024 | EPI_ISL_19521829 | A_/_H5N1 | 2.3.4.4b | 6/15/2024 | HA | EPI3625898 | A/chicken/India/240008/2024|EPI_ISL_19521829|A_/_H5N1|2.3.4.4b|2024-06-15|HA||EPI3625898 | |
A | CROW | India | 240013 | 2024 | A/crow/India/240013/2024 | EPI_ISL_19521828 | A_/_H5N1 | 2.3.4.4b | 6/15/2024 | HA | EPI3625890 | A/crow/India/240013/2024|EPI_ISL_19521828|A_/_H5N1|2.3.4.4b|2024-06-15|HA||EPI3625890 | |
A | MUTE_SWAN | Mangystau | 02 | 2024 | A/Mute_Swan/Mangystau/02/2024 | EPI_ISL_19259709 | A_/_H5N1 | 2.3.4.4b | 1/9/2024 | HA | EPI3433322 | A/Mute_Swan/Mangystau/02/2024|EPI_ISL_19259709|A_/_H5N1|2.3.4.4b|2024-01-09|HA||EPI3433322 | |
A | CROW | Japan | B080 | 2024 | A/large-billed_crow/Hokkaido/B080/2024 | EPI_ISL_18932042 | A_/_H5N1 | 2.3.4.4b | 1/22/2024 | HA | EPI3067350 | A/large-billed_crow/Hokkaido/B080/2024|EPI_ISL_18932042|A_/_H5N1|2.3.4.4b|2024-01-22|HA||EPI3067350 | |
A | DAIRY_COW | USA | 24-008766-001 | 2024 | A/dairy_cow/Kansas/24-008766-001/2024 | EPI_ISL_19145035 | A_/_H5N1 | 2.3.4.4b | 3/21/2024 | HA | EPI3297748 | A/dairy_cow/Kansas/24-008766-001/2024|EPI_ISL_19145035|A_/_H5N1|2.3.4.4b|2024-03-21|HA||EPI3297748 | |
A | GOOSE | Germany | 2024AI01291 | 2024 | A/Barnacle_Goose/Germany-SH/2024AI01291/2024 | EPI_ISL_19014078 | A_/_H5N1 | 2.3.4.4b | 2/9/2024 | HA | EPI3158594 | A/Barnacle_Goose/Germany-SH/2024AI01291/2024|EPI_ISL_19014078|A_/_H5N1|2.3.4.4b|2024-02-09|HA||EPI3158594 | |
A | CHICKEN | Germany | 2024AI00860 | 2024 | A/chicken/Germany-MV/2024AI00860/2024 | EPI_ISL_19014079 | A_/_H5N1 | 2.3.4.4b | 2/2/2024 | HA | EPI3158602 | A/chicken/Germany-MV/2024AI00860/2024|EPI_ISL_19014079|A_/_H5N1|2.3.4.4b|2024-02-02|HA||EPI3158602 | |
A | DAIRY_COW | USA | 24-010303-002 | 2024 | A/dairy_cow/Michigan/24-010303-002/2024 | EPI_ISL_19194255 | A_/_H5N1 | 2.3.4.4b | 4/3/2024 | HA | EPI3364348 | A/dairy_cow/Michigan/24-010303-002/2024|EPI_ISL_19194255|A_/_H5N1|2.3.4.4b|2024-04-03|HA||EPI3364348 | |
A | CHICKEN | Italy | 24VIR9865-7 | 2024 | A/laying_hen/Italy/24VIR9865-7/2024 | EPI_ISL_19554808 | A_/_H5N1 | 2.3.4.4b | 11/9/2024 | HA | EPI3651272 | A/laying_hen/Italy/24VIR9865-7/2024|EPI_ISL_19554808|A_/_H5N1|2.3.4.4b|2024-11-09|HA||EPI3651272 |
> df_H5N1 %>% group_by(Country) %>% summarise(n = n()) %>% arrange(desc(n)) %>% head(n = 10) # A tibble: 10 × 2 Country n <chr> <int> 1 USA 2627 2 Czech_Republic 231 3 Japan 64 4 Poland 62 5 BC 45 6 Italy 43 7 Germany 38 8 Austria 26 9 Bangladesh 26 10 Vietnam 18 >
> df_H5N1 %>% group_by(Host) %>% summarise(n = n()) %>% arrange(desc(n)) %>% head(n = 10) # A tibble: 10 × 2 Host n <chr> <int> 1 DAIRY_COW 1928 2 CHICKEN 539 3 TURKEY 167 4 DUCK 158 5 GOOSE 81 6 CROW 62 7 CAT 54 8 MUTE_SWAN 54 9 GOAT 15 10 SANDERLING 12 >
No comments:
Post a Comment