BA.2.86

The SARS-CoV-2 variants BA.2.86 and its subvariant BA.2.86.1 have been showing up in the news over the last several weeks. The cause for concern is that the virus contains a number of new mutations in the spike protein compared to the Omicron variants. However, it's unclear what effect these mutations have on severity or disease transmission.

I downloaded 105 FASTA sequences with Pango lineage of BA.2.86 or BA.2.86.1 and a metafile with 15,962,305 records from GISAID on September 11, 2023. After filtering to remove the sequences containing incomplete genomes, I was left with 98 sequences.

The number of BA.2.86 and BA.2.86.1 sequences has been growing slowly over the past few months.


The code for this plot can be found here.

Mutation Pipeline

To study the mutations in the BA.2.86 and BA.2.86.1 sequences compared to the Wuhan reference sequence, I used the following pipeline. It's not a fully automated pipeline, but a series of steps.
First, the metafile was filtered to find the records of interest using R.

require(tidyverse)

filter_chunk <- function(df, pos) {
  df <- suppressMessages(as_tibble(df, .name_repair = 'universal'))
  df <- df %>% 
    filter(Pango.lineage == 'BA.2.86' | Pango.lineage == 'BA.2.86.1') %>%
    mutate(Collection.date = as.Date(Collection.date, format = '%Y-%m-%d'))
}  


system.time(df_ba_2_86 <- read_tsv_chunked('data/metadata.tsv.gz',
	     		                   DataFrameCallback$new(filter_chunk),
					   chunk_size = 1000000))
write_tsv(df_ba_2_86, 'data/ba_2_86.tsv')

The BA.2.86 and BA.28.6.1 FASTA files were combined.

# remove short sequences and combine files

from Bio import SeqIO

# file1 contains BA.2.86, file2 contains BA.2.86.1
file1 = 'data/gisaid_hcov-19_2023_09_07_13.fasta'
file2 = 'data/gisaid_hcov-19_2023_09_07_13_1.fasta'

fasta_list = []

with open(file1) as f:
    for rec in SeqIO.parse(f, 'fasta'):
        if len(str(rec.seq)) > 29000:
            fasta_list.append(rec)

with open(file2) as f:
    for rec in SeqIO.parse(f, 'fasta'):
        if len(str(rec.seq)) > 29000:
            fasta_list.append(rec)

f = open('data/ba.fasta', 'w')
for rec in fasta_list:
    SeqIO.write(rec, f, 'fasta')
f.close()

The sequences were aligned to the reference sequence with MAFFT.

# align to reference
time mafft --auto --preservecase --thread -1 --addfragments ba.fasta EPI_ISL_402124.fasta > ba_aln.fasta

The FASTA headers were reformatted.

# reformat FASTA headers
time python /mnt/d/Documents/analytic_garden/covid-19/src/GISAID/gisaid_reformat_fasta_headers.py -i data/ba_aln.fasta -m data/ba_2_86.tsv -o data/ba_aln_reformat.fasta -n 0

I calculated the mutual information among the aligned columns.

# mutual information
python /mnt/d/Documents/analytic_garden/covid-19/src/GISAID/gisaid_MI.py -i data/ba_aln_reformat.fasta -o output/ba_mi.csv

Here are the first few rows.

Position_1

Position_2

MI

2527

3177

0.848649

6183

12815

0.813147

2127

2527

0.563014

2127

3177

0.563014


Calculating the mutations.

mutations
python /mnt/d/Documents/analytic_garden/covid-19/src/GISAID/gisaid_mutations.py -i /mnt/g/Covid-19/GISAID/2023_09_05/data/ba_aln_reformat.fasta -o /mnt/g/Covid-19/GISAID/2023_09_05/output/ba_mutations.csv -g /mnt/g/Covid-19/GISAID/2023_09_05/data/NC_045512.2.gb

The full table of mutations can be found here.

Finally, the features of the columns with significant mutual information are determined.

# MI features
time python /mnt/d/Documents/analytic_garden/covid-19/src/GISAID/gisaid_MI_features.py -i output/ba_mi.csv -o output/ba_mi_features.csv -m output/ba_mutations.csv

Full results and code can be found on GitHub.

No comments:

Post a Comment