The Analytic Garden: BA.2.86

The SARS-CoV-2 variants BA.2.86 and its subvariant BA.2.86.1 have been showing up in the news over the last several weeks. The cause for concern is that the virus contains a number of new mutations in the spike protein compared to the Omicron variants. However, it's unclear what effect these mutations have on severity or disease transmission.

I downloaded 105 FASTA sequences with Pango lineage of BA.2.86 or BA.2.86.1 and a metafile with 15,962,305 records from GISAID on September 11, 2023. After filtering to remove the sequences containing incomplete genomes, I was left with 98 sequences.

The number of BA.2.86 and BA.2.86.1 sequences has been growing slowly over the past few months.

The code for this plot can be found here.

Mutation Pipeline

To study the mutations in the BA.2.86 and BA.2.86.1 sequences compared to the Wuhan reference sequence, I used the following pipeline. It's not a fully automated pipeline, but a series of steps.

First, the metafile was filtered to find the records of interest using R.

require(tidyverse)

filter_chunk <- function(df, pos) {
  df <- suppressMessages(as_tibble(df, .name_repair = 'universal'))
  df <- df %>% 
    filter(Pango.lineage == 'BA.2.86' | Pango.lineage == 'BA.2.86.1') %>%
    mutate(Collection.date = as.Date(Collection.date, format = '%Y-%m-%d'))
}  


system.time(df_ba_2_86 <- read_tsv_chunked('data/metadata.tsv.gz',
	     		                   DataFrameCallback$new(filter_chunk),
					   chunk_size = 1000000))
write_tsv(df_ba_2_86, 'data/ba_2_86.tsv')

The BA.2.86 and BA.28.6.1 FASTA files were combined.

# remove short sequences and combine files

from Bio import SeqIO

# file1 contains BA.2.86, file2 contains BA.2.86.1
file1 = 'data/gisaid_hcov-19_2023_09_07_13.fasta'
file2 = 'data/gisaid_hcov-19_2023_09_07_13_1.fasta'

fasta_list = []

with open(file1) as f:
    for rec in SeqIO.parse(f, 'fasta'):
        if len(str(rec.seq)) > 29000:
            fasta_list.append(rec)

with open(file2) as f:
    for rec in SeqIO.parse(f, 'fasta'):
        if len(str(rec.seq)) > 29000:
            fasta_list.append(rec)

f = open('data/ba.fasta', 'w')
for rec in fasta_list:
    SeqIO.write(rec, f, 'fasta')
f.close()

The sequences were aligned to the reference sequence with MAFFT.

# align to reference
time mafft --auto --preservecase --thread -1 --addfragments ba.fasta EPI_ISL_402124.fasta > ba_aln.fasta

The FASTA headers were reformatted.

# reformat FASTA headers
time python /mnt/d/Documents/analytic_garden/covid-19/src/GISAID/gisaid_reformat_fasta_headers.py -i data/ba_aln.fasta -m data/ba_2_86.tsv -o data/ba_aln_reformat.fasta -n 0

I calculated the mutual information among the aligned columns.

# mutual information
python /mnt/d/Documents/analytic_garden/covid-19/src/GISAID/gisaid_MI.py -i data/ba_aln_reformat.fasta -o output/ba_mi.csv

Here are the first few rows.

Position_1	Position_2	MI
2527	3177	0.848649
6183	12815	0.813147
2127	2527	0.563014
2127	3177	0.563014

Calculating the mutations.

mutations
python /mnt/d/Documents/analytic_garden/covid-19/src/GISAID/gisaid_mutations.py -i /mnt/g/Covid-19/GISAID/2023_09_05/data/ba_aln_reformat.fasta -o /mnt/g/Covid-19/GISAID/2023_09_05/output/ba_mutations.csv -g /mnt/g/Covid-19/GISAID/2023_09_05/data/NC_045512.2.gb

The full table of mutations can be found here.

Finally, the features of the columns with significant mutual information are determined.

# MI features
time python /mnt/d/Documents/analytic_garden/covid-19/src/GISAID/gisaid_MI_features.py -i output/ba_mi.csv -o output/ba_mi_features.csv -m output/ba_mutations.csv

Full results and code can be found on GitHub.

The Analytic Garden

BA.2.86

Mutation Pipeline

No comments:

Post a Comment

Labels

Contributors

wfmu