The Analytic Garden: Finding B.1.1.7 Mutations

This variant (B.1.1.7) is considerably more contagious than the original virus. It has spread rapidly around the globe and likely accounts already for at least one-third of all cases in the United States.
~ Francis Collins

As viral sequencing ramps up, more variants will be detected. Currently, three variants appear to be of concern: B.1.1.7, B.1.351, and P.1, first detected in the UK, South Africa, and Brazil, respectively. B.1.1.7 has an unusually large number of mutations, including many in the spike protein. Although this is the best established for B.1.1.7, all of these variants may transmit better from person to person. (Posted January 19, 2021)

~ Kartik Chandran, PhD

There is so much SARS-CoV-2 sequence data available that just manipulating it is a problem. This article describes just a few of the bottlenecks facing researchers trying to deal with a flood of data. The large number of sequences available for download means that a seemingly simple task like confirming the reported mutations in a lineage requires a fair amount of work. I'll try and describe the process I went through to produce a table of mutations.

The first step is to download the latest sequence and metadata from GISAID. I downloaded a set of data on March 18, 2021 consisting of 818,105 sequences. That was two weeks ago. Now there are more than 900,000 sequences available and more coming each day.

The meta file contains 28 columns, representing a variety of properties of the sequences.

head -n 1 metadata.tsv
strain  virus   gisaid_epi_isl  genbank_accession       date    region  country division        location region_exposure  country_exposure        division_exposure       segment length  host    age     sex     Nextstrain_clade  pango_lineage   GISAID_clade    originating_lab submitting_lab  authors url     title   paper_url date_submitted  purpose_of_sequencing

In order to get an overview of what was in the data, I did some counting.

# Nextstrain Strains in metadata
cut -d$'\t' -f 18 metadata.tsv | tail -n +2 | sort | uniq -c > strains.txt

# Region
cut -d$'\t' -f 6 metadata.tsv | tail -n +2 | sort | uniq -c > region.txt

# Region of exposure
cut -d$'\t' -f 10 metadata.tsv | tail -n +2 | sort | uniq -c > region_exposure.txt

# Country
cut -d$'\t' -f 7 metadata.tsv | tail -n +2 | sort | uniq -c > country.txt

# Country of exposure
cut -d$'\t' -f 11 metadata.tsv | tail -n +2 | sort | uniq -c > country_exposure.txt

#Lineage
cut -d$'\t' -f 19 metadata.tsv | tail -n +2 | sort | uniq -c > lineage.txt

lineage.txt shows that there are 161,760 B.1.1.7 records in the metatable.

The next step was to reformat the FASTA headers of the sequence file to include the country of exposure and the PANDO lineage. Like any big data set, there are problems: sequence IDs not included in the metafile, metafile records that don't have a corresponding sequence record, duplicate IDs, etc. Reformatting the headers makes life simpler in the remaining steps. All of the code, but not the data, is available on GitHub.

python gisaid_reformat_fasta_headers.py -i hov_19_sequences.fasta -m metadata.tsv -o hcov_19_reformat.fasta -r 2> missing_ids.txt -n 20

missing_ids.txt contains the IDs of sequences not found in metafile. In addition, the program removes sequences that contain long stretches of N's (the -n 20 option). These sequences are not copied to output. After processing 496,936 sequences remain.

Next, I extracted the sequences identified as being from the B.1.1.7 lineage based on the information in the FASTA header.

python gisaid_extract_fasta.py -i /gisaid_hcov_19_reformat.fasta -o gisaid_B_1_1_7.fasta -l B.1.1.7

gisaid_extract_fasta.py will extract sequences from a particular country of exposure as well and optionally convert 'U' to 'T' in the sequence. gisaid_B_1_1_7.fasta contains 118,945 sequences; a bit too big to deal with efficiently. Instead of using all of this data, I sample a smaller subset at random.


python gisaid_sample_seqs.py -i gisaid_B_1_1_7.fasta -o gisaid_B_1_1_7_10000.fasta -n 10000

The new file contains 10,000 sequences.

I want to align these sequences to a reference sequence. For the purposes of this analysis a mutation is a nucleotide variation from the the reference sequence. I will need to extract the reference sequence from the larger sequence file.

 gisaid_extract_ref_seq.py -i gisaid_hcov_19_reformat.fasta -o wuhan_ref_seq.fata

Next, I align the 10,000 sampled sequences to the reference sequence using MAFFT v7.475.

mafft --6merpair --maxambiguous 0.05 --preservecase --thread -1 --addfragments gisaid_B_1_1_7_10000.fasta wuhan_ref_seq.fasta > gisaid_B_1_1_7_10000_aln.fasta

Finally, the aligned sequences can be searched for variation.

python gisaid_mutations.py -i gisaid_B_1_1_7_10000_aln.fasta -o gisaid_B_1_1_7_10000_aln.csv -g NC_045512.gb

NC_045512.gb is the GenBank record for the Wuhan reference sequences. The sequence is identical to the GISAID reference sequence Wuhan/WIV04/2019.

The CSV file contains a 21 columns. I read it into R to extract some columns.

mut_B_1_1_7 <- read.csv('gisaid_B_1_1_7_10000_aln.csv', header = TRUE)
library(tidyverse)
mut_B_1_1_7_2 <- mut_B_1_1_7 %>% select(reference.positions, ref_nucleotide, alt_nucleotide, codon, ref_aa, alt_codons, alt_aas, aa_pos, mutation)

The final table looks like this.

reference.positions	ref_nucleotide	alt_nucleotide	codon	ref_aa	alt_codons	alt_aas	mutation
241	C	T
913	C	T	TCC	S	TCT	S	S36S
3037	C	T	TTC	F	TTT	F	F106F
3267	C	T	ACT	T	ATT	I	T183I
5388	C	A	GCT	A	GAT	D	A890D
5986	C	T	TTC	F	TTT	F	F1089F
6954	T	C	ATA	I	ACA	T	I1412T
14408	C	T	CCT	P	CTT	L	P314L
14676	C	T	CCC	P	CCT	P	P403P
15279	C	T	CAC	H	CAT	H	H604H
16176	T	C	ACT	T	ACC	T	T903T
23063	A	T	AAT	N	TAT	Y	N501Y
23271	C	A	GCT	A	GAT	D	A570D
23403	A	G	GAT	D	GGT	G	D614G
23604	C	A	CCT	P	CAT	H	P681H
23709	C	T	ACA	T	ATA	I	T716I
24506	T	G	TCA	S	GCA	A	S982A
24914	G	C	GAC	D	CAC	H	D1118H
27972	C	T	CAA	Q	TAA	*	Q27*
28048	G	T	AGA	R	ATA	I	R52I
28111	A	G	TAC	Y	TGC	C	Y73C
28280	G	C	GAT	D	CAT	H	D3H
28281	A	T	GAT	D	GTT	V	D3V
28282	T	A	GAT	D	GAA	E	D3E
28881	G	A	AGG	R	AAG	K	R203K
28882	G	A	AGG	R	AGA	R	R203R
28883	G	C	GGA	G	CGA	R	G204R
28977	C	T	TCT	S	TTT	F	S235F

These results correspond to those reported here.

Code for manipulating the sequences can be found on GitHub. Data is available from GISAID.

The Analytic Garden

Finding B.1.1.7 Mutations

No comments:

Post a Comment

Labels

Contributors

wfmu