The Analytic Garden: SARS-CoV-2 JN.1

COVID-19 is still with us and it's underlying virus, SARS-CoV-2, is still evolving. The latest variant of concern is JN.1. As of February 2, 2024, the CDC estimates that JN.1 is the viral source of about 93% of all US COVID cases.

JN.1 evolved from BA.2.86. BA.2.86 has a large number changes in the spike protein when compared to the original Wuhan sequence. The mutations gave BA.2.86 improved ability to evade the human immune system.

In order to compare JN.1 changes to to BA.2.86, on January 22 I downloaded metadata for 16,460,375 sequences from GISAID. I also downloaded the full collection of FASTA SARS-CoV-2 sequences.

The analysis that follows is similar to this post. GISAID FASTA headers changed so I had to update some of code. I also wanted to add argument types to the Python code.

I used R to extract the BA.2.86 and JN.1 sequences from the metadata file.

filter_chunk <- function(pango_lineage) {
    function(df, pos) {
      df <- suppressMessages(as_tibble(df, .name_repair = 'universal'))
      df <- df %>% 
        filter(Pango.lineage == pango_lineage) %>%
        mutate(Collection.date = 
                 as.Date(Collection.date, format = '%Y-%m-%d')) %>%
        mutate(Submission.date = 
                 as.Date(Submission.date, format = '%Y-%m-%d'))
    }
}

system.time(df_BA_2_86 <- read_tsv_chunked('data/metadata.tsv.gz',
                                           DataFrameCallback$new(filter_chunk(pango_lineage = 'BA.2.86')),
					   chunk_size = 1000000))

system.time(df_JN_1 <- read_tsv_chunked('data/metadata.tsv.gz',
                                        DataFrameCallback$new(filter_chunk(pango_lineage = 'JN.1')),
					chunk_size = 1000000))

write_tsv(df_BA_2_86, 'data/BA_2_86.tsv')
write_tsv(df_JN_1, 'data/JN_1.tsv')

I plotted the variant counts.

system.time(df_variants_world <- get_selected_variants_ch('data/metadata.tsv.gz', selected_variants = c('BA.2.86', 'JN.1'), start_date = as.Date('2023-01-01'), country = NULL, title = 'World'))
system.time(df_variants_usa <- get_selected_variants_ch('data/metadata.tsv.gz', selected_variants = c('BA.2.86', 'JN.1'), start_date = as.Date('2023-01-01'), title = 'USA'))

As you can see, BA.2.86 didn't get much traction. JN.1 quickly took over.

Some of the FASTA sequences were shorter than full length and some had a large number of Ns. The Ns represent area where sequence identity wasn't determined.

ggplot(df_BA_2_86, aes(x = N.Content)) + geom_histogram(color = 'black', fill = 'white') + ggtitle('BA_2_86')
ggplot(df_JN_1, aes(x = N.Content)) + geom_histogram(color = 'black', fill = 'white') + ggtitle('JN_1')

I selected the BA.2.86 and JN.1 sequences from the FASTA file.

time python select_pango_sequences.py -i data/sequences.fasta.gz -m /data/BA_2_86.tsv -o /data/BA_2_86.fasta
time python select_pango_sequences.py -i data/sequences.fasta.gz -m data/JN_1.tsv -o data/JN_1.fasta

Next, the short sequences and sequences with a larger fraction of Ns were removed.

time python gisaid_remove_seqs.py -i data/BA_2_86.fasta -m /data/BA_2_86.tsv -o data/BA_2_86_processed.fasta -n 0.05 -l 28000
time python gisaid_remove_seqs.py -i data/JN_1.fasta -m data/JN_1.tsv -o data/JN_1_processed.fasta -n 0.05 -l 28000

Each set of sequences was aligned with MAFTT using the original Wuhan sequence as a reference..

time mafft --auto --preservecase --thread -1 --addfragments data/BA_2_86_processed.fasta data/EPI_ISL_402124.fasta > data/BA_2_86_aln.fasta
time mafft --auto --preservecase --thread -1 --addfragments data/JN_1_processed.fasta data/EPI_ISL_402124.fasta > data/JN_1_aln.fasta

Once the sequences are aligned, I counted the mutations the positions that differed from the Wuhan sequence.

time python gisaid_mutations.py -i data/BA_2_86_aln.fasta -o output/BA_2_86_mutations.csv -g data/NC_045512.2.gb
time python gisaid_mutations.py -i data/JN_1_aln.fasta -o output/JN_1_mutations.csv -g data/NC_045512.2.gb

The full mutation tables can be found on GitHub.

Finally, we can read the mutations in R and see the differences.

> BA_2_86_mutations <- read_csv('output/BA_2_86_mutations.csv')
> JN_1_mutations <- read_csv('output/JN_1_mutations.csv')
> JN_1_mutations %>% filter(! (mutation %in% BA_2_86_mutations$mutation)) %>% select(mutation, product)

  mutation product             
  <chr>    <chr>               
1 NA       NA                  
2 G282G    nsp3                
3 K1155R   nsp3                
4 R252K    nsp6                
5 L44L     nsp9                
6 C285C    3'-to-5' exonuclease
7 L455S    surface glycoprotein
8 F19L     ORF7b

There is only one difference in the spike protein (surface glycoprotein) but it's enough to increase the transmissibility of JN.1 over BA.2.86.

The code for this analysis is similar to the code for this post, but has been updated to include argument types and the updated GISAID FASTA headers. The code can be found on GitHub.

The Analytic Garden

SARS-CoV-2 JN.1

No comments:

Post a Comment

Labels

Contributors

wfmu