SARS-CoV-2 COVID-19 Part 4, Continuing Mutation Story

I have thought for a long time that the most likely virus that might cause a new pandemic would be a coronavirus. We don't yet know how contagious it (SARS-CoV-2) is. We know that it is being spread person to person, but we don't know to what extent.  
~ Eric Toner 

I think, with every kind of creature and every kind of human, there is no better. We're all just mutations, and I think that each mutation should be celebrated.
~ Arca

The D614G SARS-CoV-2 mutation, mentioned in a previous post, has been in the news again. The mutation is a single nucleotide change of an A to a G at position 23403 in the reference sequence, NC_045512, that causes a non-synonymous amino acid change from aspartic acid (D) to glycine (G). In that previous post, I mentioned a pre-print describing the mutation. The full article has now been published in Cell. The authors present some in vitro data showing possible increased transmissibility of the mutated virus, as well as some clinical data that indicate the possibility of the same. However, so far there is no evidence that the mutated virus is more virulent.
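To make the coordinate arithmetic concrete, here is a minimal Python sketch that maps genome position 23403 to a codon of the spike gene and translates the reference and mutant codons. It assumes the spike coding sequence starts at 1-based position 21563 of NC_045512; that value comes from the GenBank annotation, not from this post. The function names and the two-entry codon table are purely illustrative.

```python
# Assumption: the spike (S) coding sequence starts at 1-based genome
# position 21563 in NC_045512. This value is not stated in the post.
SPIKE_CDS_START = 21563

# Only the two codons needed for this illustration.
CODON_TABLE = {"GAT": "D", "GGT": "G"}


def locate_in_cds(genome_pos, cds_start):
    """Return (1-based amino acid position, 0-based offset within the codon)."""
    offset = genome_pos - cds_start  # 0-based offset into the coding sequence
    return offset // 3 + 1, offset % 3


def mutate_codon(codon, offset_in_codon, new_base):
    """Return the codon with one base replaced."""
    bases = list(codon)
    bases[offset_in_codon] = new_base
    return "".join(bases)


aa_pos, codon_offset = locate_in_cds(23403, SPIKE_CDS_START)
ref_codon = "GAT"
alt_codon = mutate_codon(ref_codon, codon_offset, "G")
label = f"{CODON_TABLE[ref_codon]}{aa_pos}{CODON_TABLE[alt_codon]}"
print(label)
```

Under that assumption the A to G change falls on the second base of codon 614, turning GAT into GGT, which is where the name D614G comes from.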

The D614G mutation is located in the viral spike protein. The spike protein binds to the host cell ACE2 receptors, so it's an important target for therapeutics. Along with the A to G change at position 23403, there are three other co-varying mutations: C to T at position 241 (in the 5'UTR upstream of ORF1ab, which is not translated), C to T at position 3037 (a synonymous mutation, phenylalanine F, at amino acid position 106), and C to T at position 14408 (a synonymous mutation, leucine L, at amino acid position 323). Combined with the mutation at 23403, the four mutations form a quartet of changes, C-C-C-A to T-T-T-G, that co-vary.

Sequence Alignment

On July 14, 2020, I downloaded 58043 aligned sequences from GISAID. Sequence collection dates ranged from Dec. 24, 2019 to June 28, 2020. I removed duplicate sequences from the alignment, resulting in 51493 sequences. GISAID aligned the sequences with MAFFT. The quality of the downloaded alignment is poor, with long gaps and strings of N's, probably due to poor quality sequences included in the alignment. GISAID has a restrictive data sharing policy. You need a .edu or similar e-mail address to access the data; plebs with GMail or similar accounts need not apply. You can't reshare the data, despite the fact that most of it was generated using taxpayer-supported funds in various countries. You can't get the raw sequences. Most of the sequence data lacks GenBank or similar metadata links. Unfortunately, in terms of sequence data, it's the largest source.

The mutual information among the aligned columns shows how the four positions of the quartet vary with one another:

Nucleotide Pairs    MI
3037, 23403         0.78248
14408, 23403        0.78109
3037, 14408         0.77639
241, 23403          0.76604
241, 14408          0.75662
241, 3037           0.7554
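For readers who want to see how a pairwise value like these can be computed, here is a minimal sketch of mutual information between two alignment columns. The post does not say which logarithm base was used, or whether gaps and ambiguous characters were filtered first, so the base-2 logarithm and the toy columns below are assumptions and will not reproduce the numbers above.

```python
from collections import Counter
from math import log2


def mutual_information(col_x, col_y):
    """Mutual information (in bits, an assumed unit) between two columns.

    Each column is a sequence of characters, one per aligned sequence.
    """
    if len(col_x) != len(col_y):
        raise ValueError("columns must have the same length")
    n = len(col_x)
    count_x = Counter(col_x)
    count_y = Counter(col_y)
    count_xy = Counter(zip(col_x, col_y))
    mi = 0.0
    for (x, y), n_xy in count_xy.items():
        p_xy = n_xy / n
        # p_xy / (p_x * p_y), written in terms of raw counts
        ratio = n_xy * n / (count_x[x] * count_y[y])
        mi += p_xy * log2(ratio)
    return mi


# Toy columns: the first pair co-varies perfectly, the second is independent.
mi_linked = mutual_information("CCTT", "AAGG")
mi_independent = mutual_information("CCTT", "AGAG")
print(mi_linked, mi_independent)
```

Two columns that always change together give the maximum value for their base composition, while columns that vary independently give zero.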


I counted the C-C-C-A and T-T-T-G quartets after eliminating quartets containing gap or ambiguous characters, resulting in 49959 observations.
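The counting step can be sketched roughly as follows. This is not the code used for the post: the function name, the toy sequences, and the column indices are hypothetical. In a real alignment the four columns have to be derived from the reference coordinates (241, 3037, 14408, 23403), because gaps in the reference row shift the column numbers.

```python
from collections import Counter

VALID_BASES = set("ACGT")


def count_quartets(sequences, columns):
    """Count the bases observed together at the given 0-based columns.

    Quartets containing a gap or any ambiguous character are skipped.
    """
    counts = Counter()
    for seq in sequences:
        quartet = tuple(seq[c].upper() for c in columns)
        if all(base in VALID_BASES for base in quartet):
            counts["-".join(quartet)] += 1
    return counts


# Toy data: each string holds only the four quartet columns.
toy = ["CCCA", "TTTG", "TTTG", "TNTG", "T-TG", "CCCA", "TTTG"]
quartet_counts = count_quartets(toy, columns=(0, 1, 2, 3))
print(quartet_counts.most_common())
```

In the toy data two of the seven sequences are dropped, one for an N and one for a gap, which mirrors how the filtering reduces the number of usable observations.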

The table below shows the most varying positions.

ref position  ref  consensus  consensus %   codons  aas  alt_codons  alt_aas  aa_pos  products
241           C    T          74.81935497   CGT     -    -           -        -       5'UTR
14408         C    T          75.14366417   CTA     L    TTA         L        323     RNA-dependent RNA polymerase
3037          C    T          75.21299203   TTC     F    TTT         F        106     nsp3
23403         A    G          75.32525709   GAT     D    GGT         G        614     surface glycoprotein
25563         G    G          76.97581116   CAG     Q    CAT         H        57      ORF3a protein
1059          C    C          82.06669138   ACC     T    ATC         I        85      nsp2
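A consensus base and its percentage for a single alignment column can be computed along these lines. The post does not say whether gaps and ambiguous characters were included in the denominator, so excluding them here is an assumption, and the function name and toy data are illustrative.

```python
from collections import Counter


def column_consensus(sequences, column):
    """Return (consensus base, percent of unambiguous sequences carrying it).

    Assumption: gaps and ambiguous characters are left out of the total.
    """
    bases = (seq[column].upper() for seq in sequences)
    counts = Counter(b for b in bases if b in "ACGT")
    total = sum(counts.values())
    if total == 0:
        return None, 0.0
    base, n = counts.most_common(1)[0]
    return base, 100.0 * n / total


# Toy column: three T, one C, one ambiguous N.
toy_column = ["C", "T", "T", "T", "N"]
consensus_base, consensus_pct = column_consensus(toy_column, column=0)
print(consensus_base, consensus_pct)
```

Sorting columns by ascending consensus percentage is one way to rank the most varying positions: the lower the percentage, the more the column varies.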


The T-T-T-G quartet has become dominant in the data as shown by the following stacked bar chart.

The plot reflects the data available and not necessarily the prevalence of COVID-19 cases. As of this writing the virus is raging in the USA with no decrease in intensity in sight. The US data is likely to contain the T-T-T-G quartet.

Summary data and code, minus the GISAID sequences, are  available here.

