The H5N1 strain of avian influenza (bird flu) is a significant health concern. Although human infections of H5N1 are relative rare, the effect on bird populations has been devastating in some areas. In addition to birds, H5N1 infections have been found in domestic cattle, cats, goats, and other mammals. So far, Human cases appear to come from contact with infected animals. Although there are cases of undetermined origin, effective human-to-human transmission hasn't been determined.
H5N1 belongs to the A subtype of the influenza virus. It's a negative sense RNA virus.
I wanted to get a picture of H5N1 genomics. In particular, I am interested in what is happening in the HA protein. The viral genome consists of eight proteins: PB1, PB2, PA, HA, NP, NA, M, and NS. See H5N1 genetic structure for more details and references.
I'm particularly interested in changes in the hemagglutinin (HA) protein. HA is found on the surface of viral envelope. It is responsible for binding the viron to the host cell surface, so it's a key element in infection. In particular, a single change of a Gln to Leu in the HA protein at position 226 in amino acid sequence switches HA's binding specificity to bind to human cell receptors. See https://www.science.org/doi/10.1126/science.adt0180.
https://en.wikipedia.org/wiki/Hemagglutinin_(influenza)#/media/File:PDB_1hgd_EBI.jpg
On December 12 2024, I downloaded 3,432 HA genomic sequences covering the time period from 2024-01-01 to 2024-12-12 from GISAID's EpiFlu database. EpiFlu doesn't provide a metadate file for the sequences, but you can specify much of the same information to be included in the FASTA header. For example,
>A/environment/India/240005/2024|EPI_ISL_19521827|A_/_H5N1|2.3.4.4b|2024-06-15|HA||EPI3625882
The elements are separated by "|" marks.
A/environment/India/240005/2024 indicates the A subtype, the host or source of the sample, the country of origin, sequence number, and year.
EPI_ISL_19521827 is the GISAID ID.
A_/_H5N1 indicates influenza subtype and virus type.
2.3.4.4b is the virus clade.
2024-06-15 is the collection date.
HA is the protein type.
EPI3625882 is the accession number.
For a first step, I reformatted the headers into a table so I could see where the data came from and which host the sample was drawn from. One problem is that submissions were not formatted uniformly. Let's examine hosts and country of origin to see the problem. For example, sometimes country is entered as USA and sometimes by state. Hosts are reported in a variety of ways for the same species. For example, CHICKEN and HEN seem to be the same species, or maybe not. I may have missed a few items in converting the host and country names.
Here's the R code for reading the FASTA headers and generating a dataframe.
#' fasta2dataframe - read a GISAID EpiFlu fasta file to a dataframe
#'
#' @param fasta_file - a string containg that fasta file name
#' @param remove_ref - boolean. Should reference sequence be removed from dataframe
#' @param ref_row - the row number of teh reference sequence, default = 1. Ignored if remove_ref = FALSE
#' @param min_seq_length - sequences with few elements are removed
#'
#' @returns - a dataframe with 16 columns
#' "Influenza_type" "Host" "Country" "ID" "Year" "Isolate_name" "Isolate_ID"
#' "Type" "Clade" "Collection_date" "Segment" "Lineage" "DNA_Accession_no" "seq.name"
#' "seq.text" "seq_length"
#'
fasta2dataframe <- function(fasta_file,
remove_ref = FALSE,
ref_row = 1,
min_seq_length = 1700) {
require(phylotools) # for reading the FASTA file
require(tidyverse)
# a list of US states for converting atates to USA
states <- c("Alabama", "Alaska", "Arizona", "Arkansas", "California", "Colorado",
"Connecticut", "Delaware", "Florida","Georgia", "Hawaii",
"Idaho","Illinois", "Indiana", "Iowa", "Kansas", "Kentucky", "Louisiana",
"Maine", "Maryland", "Massachusetts", "Michigan", "Minnesota",
"Mississippi", "Missouri", "Montana", "Nebraska", "Nevada",
"New_Hampshire", "New_Jersey", "New_Mexico", "New_York", "North_Carolina", "North_Dakota",
"Ohio", "Oklahoma", "Oregon", "Pennsylvania", "Rhode_Island",
"South_Carolina", "South_Dakota", "Tennessee", "Texas", "Utah",
"Vermont", "Virginia", "Washington", "West,Virginia", "Wisconsin", "Wyoming")
df <- read.fasta(fasta_file)
if(remove_ref) {
df <- df %>% slice(-ref_row)
}
# split sequence IDs and get rid of any sequence that doesn't have full information
df <- df %>% separate_wider_delim(seq.name,
delim = "|",
names = c("Isolate_name", "Isolate_ID", "Type", "Clade",
"Collection_date", "Segment", "Lineage", "DNA_Accession_no"),
too_many = "debug", too_few = "debug")
df <- df %>% filter(`seq.name_ok`)
df <- df %>%
separate_wider_delim(`Isolate_name`,
delim = "/",
names = c("Influenza_type", "Host", "Country", "ID", "Year"),
too_many = "debug", too_few = "debug")
df <- df %>% filter(`Isolate_name_ok`)
# map states to USA and map the German entries to Germany. Foramt a few others.
df <- df %>% mutate(Country = if_else(Country %in% states, "USA", Country))
df <- df %>% mutate(Country = if_else(grepl("Germany", Country), "Germany", Country))
df <- df %>% mutate(Country = if_else(grepl("Nordrhein-Westfalen", Country), "Germany", Country))
df <- df %>% mutate(Country = if_else(Country %in% c("Sao_Paulo", "Rio_de_Janeiro"), "Brazil", Country))
df <- df %>% mutate(Country = if_else(Country %in% c("Kagoshima", "Ishikawa", "Iwate", "Hokkaido", "Hiroshima"), "Japan", Country))
# Combine common hosts
df <- df %>% mutate(Host = str_to_upper(Host))
df <- df %>% mutate(Host = if_else(grepl("DUCK", Host), "DUCK", Host))
df <- df %>% mutate(Host = if_else(grepl("MALLARD", Host), "DUCK", Host))
df <- df %>% mutate(Host = if_else(grepl("HEN", Host), "CHICKEN", Host))
df <- df %>% mutate(Host = if_else(grepl("CAT", Host), "CAT", Host))
df <- df %>% mutate(Host = if_else(grepl("FELINE", Host), "CAT", Host))
df <- df %>% mutate(Host = if_else(grepl("GOOSE", Host), "GOOSE", Host))
df <- df %>% mutate(Host = if_else(grepl("CROW", Host), "CROW", Host))
df <- df %>% mutate(Host = if_else(grepl("FOX", Host), "FOX", Host))
# filter sequence columns
df <- df %>% filter(str_length(seq.text) >= min_seq_length)
df <- df %>% mutate(seq.text = str_to_upper(seq.text))
df <- df %>% mutate(seq_length = str_length(seq.text))
# Since we removed the rows with missing values, remove the debug columns
df <- df %>% select(-c("Isolate_name_ok", "Isolate_name_pieces", "Isolate_name_remainder"))
df <- df %>% select(-c("seq.name_ok", "seq.name_pieces", "seq.name_remainder"))
return(df)
}
There's a lot of detail there, but the transformations are pretty simple. The results are a dataframe with 16 columns. Here are the first 10 rows. I left out the column with the sequences.
Influenza_type | Host | Country | ID | Year | Isolate_name | Isolate_ID | Type | Clade | Collection_date | Segment | Lineage | DNA_Accession_no | seq.name |
A | ENVIRONMENT | India | 240005 | 2024 | A/environment/India/240005/2024 | EPI_ISL_19521827 | A_/_H5N1 | 2.3.4.4b | 6/15/2024 | HA | | EPI3625882 | A/environment/India/240005/2024|EPI_ISL_19521827|A_/_H5N1|2.3.4.4b|2024-06-15|HA||EPI3625882 |
A | CHICKEN | India | 240008 | 2024 | A/chicken/India/240008/2024 | EPI_ISL_19521829 | A_/_H5N1 | 2.3.4.4b | 6/15/2024 | HA | | EPI3625898 | A/chicken/India/240008/2024|EPI_ISL_19521829|A_/_H5N1|2.3.4.4b|2024-06-15|HA||EPI3625898 |
A | CROW | India | 240013 | 2024 | A/crow/India/240013/2024 | EPI_ISL_19521828 | A_/_H5N1 | 2.3.4.4b | 6/15/2024 | HA | | EPI3625890 | A/crow/India/240013/2024|EPI_ISL_19521828|A_/_H5N1|2.3.4.4b|2024-06-15|HA||EPI3625890 |
A | MUTE_SWAN | Mangystau | 02 | 2024 | A/Mute_Swan/Mangystau/02/2024 | EPI_ISL_19259709 | A_/_H5N1 | 2.3.4.4b | 1/9/2024 | HA | | EPI3433322 | A/Mute_Swan/Mangystau/02/2024|EPI_ISL_19259709|A_/_H5N1|2.3.4.4b|2024-01-09|HA||EPI3433322 |
A | CROW | Japan | B080 | 2024 | A/large-billed_crow/Hokkaido/B080/2024 | EPI_ISL_18932042 | A_/_H5N1 | 2.3.4.4b | 1/22/2024 | HA | | EPI3067350 | A/large-billed_crow/Hokkaido/B080/2024|EPI_ISL_18932042|A_/_H5N1|2.3.4.4b|2024-01-22|HA||EPI3067350 |
A | DAIRY_COW | USA | 24-008766-001 | 2024 | A/dairy_cow/Kansas/24-008766-001/2024 | EPI_ISL_19145035 | A_/_H5N1 | 2.3.4.4b | 3/21/2024 | HA | | EPI3297748 | A/dairy_cow/Kansas/24-008766-001/2024|EPI_ISL_19145035|A_/_H5N1|2.3.4.4b|2024-03-21|HA||EPI3297748 |
A | GOOSE | Germany | 2024AI01291 | 2024 | A/Barnacle_Goose/Germany-SH/2024AI01291/2024 | EPI_ISL_19014078 | A_/_H5N1 | 2.3.4.4b | 2/9/2024 | HA | | EPI3158594 | A/Barnacle_Goose/Germany-SH/2024AI01291/2024|EPI_ISL_19014078|A_/_H5N1|2.3.4.4b|2024-02-09|HA||EPI3158594 |
A | CHICKEN | Germany | 2024AI00860 | 2024 | A/chicken/Germany-MV/2024AI00860/2024 | EPI_ISL_19014079 | A_/_H5N1 | 2.3.4.4b | 2/2/2024 | HA | | EPI3158602 | A/chicken/Germany-MV/2024AI00860/2024|EPI_ISL_19014079|A_/_H5N1|2.3.4.4b|2024-02-02|HA||EPI3158602 |
A | DAIRY_COW | USA | 24-010303-002 | 2024 | A/dairy_cow/Michigan/24-010303-002/2024 | EPI_ISL_19194255 | A_/_H5N1 | 2.3.4.4b | 4/3/2024 | HA | | EPI3364348 | A/dairy_cow/Michigan/24-010303-002/2024|EPI_ISL_19194255|A_/_H5N1|2.3.4.4b|2024-04-03|HA||EPI3364348 |
A | CHICKEN | Italy | 24VIR9865-7 | 2024 | A/laying_hen/Italy/24VIR9865-7/2024 | EPI_ISL_19554808 | A_/_H5N1 | 2.3.4.4b | 11/9/2024 | HA | | EPI3651272 | A/laying_hen/Italy/24VIR9865-7/2024|EPI_ISL_19554808|A_/_H5N1|2.3.4.4b|2024-11-09|HA||EPI3651272 |
We can generate a list of the top 10 countries submitting sequences,
> df_H5N1 %>% group_by(Country) %>% summarise(n = n()) %>% arrange(desc(n)) %>% head(n = 10)
# A tibble: 10 × 2
Country n
<chr> <int>
1 USA 2627
2 Czech_Republic 231
3 Japan 64
4 Poland 62
5 BC 45
6 Italy 43
7 Germany 38
8 Austria 26
9 Bangladesh 26
10 Vietnam 18
>
We can also get a list of the most common hosts in the data.
> df_H5N1 %>% group_by(Host) %>% summarise(n = n()) %>% arrange(desc(n)) %>% head(n = 10)
# A tibble: 10 × 2
Host n
<chr> <int>
1 DAIRY_COW 1928
2 CHICKEN 539
3 TURKEY 167
4 DUCK 158
5 GOOSE 81
6 CROW 62
7 CAT 54
8 MUTE_SWAN 54
9 GOAT 15
10 SANDERLING 12
>
Next post, I'll look at the common mutations in the data set.
You can get the source code for creating the table on
GitHub.