The Analytic Garden: Sars-CoV-2 Covid-19 Part 4, Continuing Mutation Story

I have thought for a long time that the most likely virus that might cause a new pandemic would be a coronavirus. We don't yet know how contagious it (SARS-CoV-2) is. We know that it is being spread person to person, but we don't know to what extent.

~ Eric Toner

I think, with every kind of creature and every kind of human, there is no better. We're all just mutations, and I think that each mutation should be celebrated.

~ Arca

The D614G Sars-CoV-2 mutation, mentioned in a previous post, has been in the news again. The mutation is a single nucleotide change from an A to a G at position 23403 in the reference sequence, NC_045512, that causes a non-synonymous amino acid change from Aspartic acid (D) to Glycine (G). In that previous post, I mentioned a pre-print describing the mutation. The full article has now been published in Cell. The authors present some in-vitro data showing possible increased transmissibility of the mutated virus, as well as some clinical data that indicate the same possibility. However, so far there is no evidence that the mutated virus is more virulent.

The D614G mutation is located in the S1 subunit of the viral spike protein, close to but outside the receptor binding domain. The spike protein binds to the host cell ACE2 receptors, so it's an important target for therapeutics. Along with the A to G change at position 23403, there are three other co-varying mutations: C to T at position 241 (in the non-coding 5'UTR upstream of ORF1ab), C to T at 3037 (a synonymous mutation, Phenylalanine F, at amino acid position 106), and C to T at position 14408 (a synonymous mutation, Leucine L, at amino acid position 323). Combined with the mutation at 23403, the four mutations form a quartet of changes, C-C-C-A to T-T-T-G, that co-vary.
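To make the synonymous versus non-synonymous distinction concrete, here is a minimal Python sketch that translates a reference codon and its mutated form with the standard genetic code. The function name `classify_change` is my own for illustration and is not taken from the code linked in this post; the codons come from the table of variable positions further down.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def classify_change(ref_codon, alt_codon):
    """Return (ref_aa, alt_aa, kind) for a single codon change."""
    ref_aa = CODON_TABLE[ref_codon.upper()]
    alt_aa = CODON_TABLE[alt_codon.upper()]
    kind = "synonymous" if ref_aa == alt_aa else "non-synonymous"
    return ref_aa, alt_aa, kind


# Position 23403: GAT -> GGT in the spike gene.
print(classify_change("GAT", "GGT"))
# Position 3037: TTC -> TTT in nsp3.
print(classify_change("TTC", "TTT"))
```

The first call reports the D to G change as non-synonymous, and the second reports the change at 3037 as synonymous because both codons encode Phenylalanine.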

Sequence Alignment

On July 14 2020, I downloaded 58043 aligned sequences from GISAID. Sequence collection dates ranged from Dec. 24 2019 to June 28 2020. I removed duplicate sequences from the alignment, leaving 51493 sequences. GISAID aligned the sequences with MAFFT. The quality of the downloaded alignment is poor, with long gaps and strings of N's, probably due to poor quality sequences included in the alignment. GISAID has a restrictive data sharing policy. You need a .edu or similar e-mail address to access the data; plebs with GMail or similar accounts need not apply. You can't reshare the data, despite the fact that most of it was generated using taxpayer-supported funds in various countries. You can't get the raw sequences. Most of the sequence data lacks GenBank or similar metadata links. Unfortunately, in terms of sequence data, it's the largest source.
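As a rough sketch of the duplicate-removal step, the Python below reads FASTA records and keeps the first record for each distinct aligned sequence. It assumes that "duplicate" means an identical aligned sequence string (ignoring case), which may not be the exact criterion used for the numbers above. The helper names `read_fasta` and `drop_duplicates` and the toy records are hypothetical.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def drop_duplicates(records):
    """Keep the first record seen for each distinct sequence string."""
    seen = set()
    unique = []
    for header, seq in records:
        key = seq.upper()
        if key not in seen:
            seen.add(key)
            unique.append((header, seq))
    return unique


# Toy alignment standing in for the GISAID file, which cannot be reshared.
toy = """>seq1
ACGT-ACGT
>seq2
ACGT-ACGT
>seq3
ACGTNACGT
""".splitlines()

unique_records = drop_duplicates(read_fasta(toy))
print(len(unique_records), "unique sequences")
```

With a real file, `read_fasta` can be handed an open file object in place of the toy list.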

The mutual information among the aligned columns shows how the four positions of the quartet vary with one another:

Nucleotide Pairs	MI
3037, 23403	0.78248
14408, 23403	0.78109
3037, 14408	0.77639
241, 23403	0.76604
241, 14408	0.75662
241, 3037	0.7554
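For readers who want to see how a value like those in the table can be computed, here is a small Python sketch of mutual information between two alignment columns, MI = sum over (x, y) of p(x, y) * log2(p(x, y) / (p(x) * p(y))). The base-2 logarithm and the function name `mutual_information` are assumptions on my part; the table above does not state its units or how gaps were handled.

```python
from collections import Counter
from math import log2


def mutual_information(col_x, col_y):
    """Mutual information (in bits) between two equal-length columns."""
    if len(col_x) != len(col_y):
        raise ValueError("columns must have the same length")
    n = len(col_x)
    count_x = Counter(col_x)
    count_y = Counter(col_y)
    count_xy = Counter(zip(col_x, col_y))
    mi = 0.0
    for (x, y), c in count_xy.items():
        # p(x,y) / (p(x) p(y)) simplifies to c * n / (count_x * count_y).
        mi += (c / n) * log2(c * n / (count_x[x] * count_y[y]))
    return mi


# Two columns that change together carry maximal information about each other.
linked = mutual_information("CCTT", "AAGG")
# Two columns that vary independently carry none.
unlinked = mutual_information("CCTT", "AGAG")
print(linked, unlinked)
```

Each column here is simply the string of characters found at one alignment position, read down all of the sequences.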

I counted the C-C-C-A and T-T-T-G quartets after eliminating quartets containing gap or ambiguous characters, resulting in 49959 observations.
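The counting step can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the code linked in this post: `count_quartets` is a hypothetical helper, and `columns` is assumed to hold the 0-based alignment columns that correspond to reference positions 241, 3037, 14408, and 23403 (in a real alignment those indices depend on where gaps fall).

```python
from collections import Counter


def count_quartets(sequences, columns):
    """Count quartets at the given columns, skipping gaps and ambiguity codes."""
    counts = Counter()
    for seq in sequences:
        quartet = [seq[c].upper() for c in columns]
        # Keep only quartets made entirely of unambiguous nucleotides.
        if all(base in "ACGT" for base in quartet):
            counts["-".join(quartet)] += 1
    return counts


# Toy four-column "alignment"; the last two rows contain a gap and an N.
toy_seqs = ["CCCA", "TTTG", "TTTG", "TT-G", "TNTG"]
quartet_counts = count_quartets(toy_seqs, columns=(0, 1, 2, 3))
print(quartet_counts)
```

Rows with a gap or an N at any of the four positions drop out, which is why the number of observations is smaller than the number of sequences.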

The table below shows the most variable positions.

reference columns	ref	consensus	consensus %	codons	aas	alt_codons	alt_aas	aa_pos	products
241	C	T	74.81935497	CGT					5'UTR
14408	C	T	75.14366417	CTA	L	TTA	L	323	RNA-dependent RNA polymerase
3037	C	T	75.21299203	TTC	F	TTT	F	106	nsp3
23403	A	G	75.32525709	GAT	D	GGT	G	614	surface glycoprotein
25563	G	G	76.97581116	CAG	Q	CAT	H	57	ORF3a protein
1059	C	C	82.06669138	ACC	T	ATC	I	85	nsp2
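The consensus and consensus % columns can be illustrated with a short Python sketch. I am assuming here that gaps and ambiguous characters are left out of the denominator; the table above may have been computed differently. `column_consensus` is a hypothetical name.

```python
from collections import Counter


def column_consensus(column):
    """Return (most common base, its percentage) for one alignment column."""
    counts = Counter(b.upper() for b in column if b.upper() in "ACGT")
    if not counts:
        return None, 0.0
    base, n = counts.most_common(1)[0]
    return base, 100.0 * n / sum(counts.values())


# Toy column: three T's and one C, plus a gap and an N that are ignored.
base, pct = column_consensus("TTTC-N")
print(base, pct)
```

Sorting columns by ascending consensus % would then put the most variable positions first, as in the table.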

The T-T-T-G quartet has become dominant in the data as shown by the following stacked bar chart.

The plot reflects the data available and not necessarily the prevalence of COVID-19 cases. As of this writing, the virus is raging in the USA with no decrease in intensity in sight. The US data is likely to contain the T-T-T-G quartet.

Summary data and code, minus the GISAID sequences, are available here.

Contributors

wfmu