JN.1 evolved from BA.2.86. BA.2.86 has a large number changes in the spike protein when compared to the original Wuhan sequence. The mutations gave BA.2.86 improved ability to evade the human immune system.
In order to compare JN.1 changes to to BA.2.86, on January 22 I downloaded metadata for 16,460,375 sequences from GISAID. I also downloaded the full collection of FASTA SARS-CoV-2 sequences.
The analysis that follows is similar to this post. GISAID FASTA headers changed so I had to update some of code. I also wanted to add argument types to the Python code.
I used R to extract the BA.2.86 and JN.1 sequences from the metadata file.
filter_chunk <- function(pango_lineage) { function(df, pos) { df <- suppressMessages(as_tibble(df, .name_repair = 'universal')) df <- df %>% filter(Pango.lineage == pango_lineage) %>% mutate(Collection.date = as.Date(Collection.date, format = '%Y-%m-%d')) %>% mutate(Submission.date = as.Date(Submission.date, format = '%Y-%m-%d')) } }
system.time(df_BA_2_86 <- read_tsv_chunked('data/metadata.tsv.gz', DataFrameCallback$new(filter_chunk(pango_lineage = 'BA.2.86')), chunk_size = 1000000)) system.time(df_JN_1 <- read_tsv_chunked('data/metadata.tsv.gz', DataFrameCallback$new(filter_chunk(pango_lineage = 'JN.1')), chunk_size = 1000000)) write_tsv(df_BA_2_86, 'data/BA_2_86.tsv') write_tsv(df_JN_1, 'data/JN_1.tsv')
I plotted the variant counts.
system.time(df_variants_world <- get_selected_variants_ch('data/metadata.tsv.gz', selected_variants = c('BA.2.86', 'JN.1'), start_date = as.Date('2023-01-01'), country = NULL, title = 'World')) system.time(df_variants_usa <- get_selected_variants_ch('data/metadata.tsv.gz', selected_variants = c('BA.2.86', 'JN.1'), start_date = as.Date('2023-01-01'), title = 'USA'))
As you can see, BA.2.86 didn't get much traction. JN.1 quickly took over.
Some of the FASTA sequences were shorter than full length and some had a large number of Ns. The Ns represent area where sequence identity wasn't determined.
ggplot(df_BA_2_86, aes(x = N.Content)) + geom_histogram(color = 'black', fill = 'white') + ggtitle('BA_2_86') ggplot(df_JN_1, aes(x = N.Content)) + geom_histogram(color = 'black', fill = 'white') + ggtitle('JN_1')
I selected the BA.2.86 and JN.1 sequences from the FASTA file.
time python select_pango_sequences.py -i data/sequences.fasta.gz -m /data/BA_2_86.tsv -o /data/BA_2_86.fasta time python select_pango_sequences.py -i data/sequences.fasta.gz -m data/JN_1.tsv -o data/JN_1.fasta
Next, the short sequences and sequences with a larger fraction of Ns were removed.
time python gisaid_remove_seqs.py -i data/BA_2_86.fasta -m /data/BA_2_86.tsv -o data/BA_2_86_processed.fasta -n 0.05 -l 28000 time python gisaid_remove_seqs.py -i data/JN_1.fasta -m data/JN_1.tsv -o data/JN_1_processed.fasta -n 0.05 -l 28000
Each set of sequences was aligned with MAFTT using the original Wuhan sequence as a reference..
time mafft --auto --preservecase --thread -1 --addfragments data/BA_2_86_processed.fasta data/EPI_ISL_402124.fasta > data/BA_2_86_aln.fasta time mafft --auto --preservecase --thread -1 --addfragments data/JN_1_processed.fasta data/EPI_ISL_402124.fasta > data/JN_1_aln.fasta
Once the sequences are aligned, I counted the mutations the positions that differed from the Wuhan sequence.
time python gisaid_mutations.py -i data/BA_2_86_aln.fasta -o output/BA_2_86_mutations.csv -g data/NC_045512.2.gb time python gisaid_mutations.py -i data/JN_1_aln.fasta -o output/JN_1_mutations.csv -g data/NC_045512.2.gb
The full mutation tables can be found on GitHub.
Finally, we can read the mutations in R and see the differences.
> BA_2_86_mutations <- read_csv('output/BA_2_86_mutations.csv') > JN_1_mutations <- read_csv('output/JN_1_mutations.csv') > JN_1_mutations %>% filter(! (mutation %in% BA_2_86_mutations$mutation)) %>% select(mutation, product) mutation product <chr> <chr> 1 NA NA 2 G282G nsp3 3 K1155R nsp3 4 R252K nsp6 5 L44L nsp9 6 C285C 3'-to-5' exonuclease 7 L455S surface glycoprotein 8 F19L ORF7b
There is only one difference in the spike protein (surface glycoprotein) but it's enough to increase the transmissibility of JN.1 over BA.2.86.
The code for this analysis is similar to the code for this post, but has been updated to include argument types and the updated GISAID FASTA headers. The code can be found on GitHub.
No comments:
Post a Comment