I have thought for a long time that the most likely virus that might cause a new pandemic would be a coronavirus. We don't yet know how contagious it (SARS-CoV-2) is. We know that it is being spread person to person, but we don't know to what extent.
~ Eric Toner
I think, with every kind of creature and every kind of human, there is
no better. We're all just mutations, and I think that each mutation
should be celebrated.
~ Arca
The D614G SARS-CoV-2 mutation, mentioned in a previous post, has been in the news again. The mutation is a single nucleotide change from A to G at position 23403 in the reference sequence, NC_045512, that causes a non-synonymous amino acid change from aspartic acid (D) to glycine (G). In that previous post, I mentioned a preprint describing the mutation. The full article has now been published in Cell. The authors present some in vitro data showing possible increased transmissibility of the mutated virus, as well as some clinical data that indicate the same possibility. However, so far there is no evidence that the mutated virus is more virulent.
The D614G mutation is located in the viral spike protein, downstream of the receptor binding domain. The spike protein binds to the host cell ACE2 receptor, so it's an important target for therapeutics. Along with the A to G change at position 23403, there are three other co-varying mutations: C to T at position 241 (in the non-coding 5'UTR), C to T at position 3037 (phenylalanine F synonymous mutation at amino acid position 106 of nsp3), and C to T at position 14408 (leucine L synonymous mutation at amino acid position 323 of the RNA-dependent RNA polymerase). Combined with the mutation at 23403, the four mutations form a quartet of changes, C-C-C-A to T-T-T-G, that co-vary.
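To make the link between nucleotide position 23403 and amino acid 614 concrete, here is a minimal sketch of the arithmetic. It assumes the spike coding sequence starts at position 21563 in NC_045512 (taken from the GenBank annotation, not from this post), and the function names are only illustrative.

```python
# Assumption: the spike (S) coding sequence starts at genome position 21563
# (1-based) in NC_045512, per the GenBank annotation.
S_START = 21563

# Only the two codons needed for this example (standard genetic code).
CODON_TABLE = {"GAT": "D", "GGT": "G"}


def locate_in_cds(genome_pos, cds_start):
    """Return (1-based amino acid position, 0-based offset within the codon)."""
    offset = genome_pos - cds_start
    return offset // 3 + 1, offset % 3


def mutate_codon(codon, offset, new_base):
    """Return `codon` with the base at `offset` replaced by `new_base`."""
    return codon[:offset] + new_base + codon[offset + 1:]


aa_pos, codon_offset = locate_in_cds(23403, S_START)
ref_codon = "GAT"  # reference codon at spike position 614
alt_codon = mutate_codon(ref_codon, codon_offset, "G")
label = f"{CODON_TABLE[ref_codon]}{aa_pos}{CODON_TABLE[alt_codon]}"
print(label)
```

The A to G change falls on the middle base of the codon, which is why GAT becomes GGT and the residue changes rather than staying synonymous.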
Sequence Alignment
On July 14, 2020, I downloaded 58043 aligned sequences from GISAID. Sequence collection dates ranged from Dec. 24, 2019 to June 28, 2020. I removed duplicate sequences from the alignment, resulting in 51493 sequences. GISAID aligned the sequences with MAFFT. The quality of the downloaded alignment is poor, with long gaps and strings of N's, probably due to poor quality sequences included in the alignment. GISAID has a restrictive data sharing policy. You need a .edu or similar e-mail address to access the data; plebs with GMail or similar accounts need not apply. You can't reshare the data, despite the fact that most of it was generated using taxpayer-supported funds in various countries. You can't get the raw sequences. Most of the sequence data lacks GenBank or similar metadata links. Unfortunately, in terms of sequence data, it's the largest source.
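The duplicate-removal step can be sketched roughly as follows. This is an illustration only: it assumes duplicates are defined as identical aligned sequence strings (compared case-insensitively) and keeps the first record seen. The helper names `read_fasta` and `drop_duplicates` and the toy records are hypothetical.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def drop_duplicates(records):
    """Keep the first record for each distinct aligned sequence."""
    seen = set()
    unique = []
    for header, seq in records:
        key = seq.upper()  # assumption: case differences are not meaningful
        if key not in seen:
            seen.add(key)
            unique.append((header, seq))
    return unique


# Toy alignment: seq2 duplicates seq1 apart from letter case.
example = """>seq1
ACGT-N
>seq2
acgt-n
>seq3
ACGTTN
""".splitlines()

unique = drop_duplicates(read_fasta(example))
print([header for header, _ in unique])
```

Keeping the first occurrence is an arbitrary choice; with real data one might prefer the record with the earliest collection date.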
The mutual information among the aligned columns shows how the four positions of the quartet vary with one another.
| Nucleotide Pairs | MI |
|---|---|
| 3037, 23403 | 0.78248 |
| 14408, 23403 | 0.78109 |
| 3037, 14408 | 0.77639 |
| 241, 23403 | 0.76604 |
| 241, 14408 | 0.75662 |
| 241, 3037 | 0.7554 |
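For readers unfamiliar with mutual information between alignment columns, here is a minimal sketch of the standard estimate from observed frequencies. It assumes base-2 logarithms (bits) and treats every character, including gaps, as a symbol; the values in the table above may have been computed with different conventions.

```python
from collections import Counter
from math import log2


def mutual_information(col_x, col_y):
    """Mutual information (in bits) between two equal-length alignment columns."""
    n = len(col_x)
    if n == 0 or n != len(col_y):
        raise ValueError("columns must be non-empty and of equal length")
    count_x = Counter(col_x)
    count_y = Counter(col_y)
    count_xy = Counter(zip(col_x, col_y))
    mi = 0.0
    for (x, y), c in count_xy.items():
        # p(x,y) * log2( p(x,y) / (p(x) * p(y)) ), written with raw counts
        mi += (c / n) * log2(c * n / (count_x[x] * count_y[y]))
    return mi


# Two columns that always change together carry 1 bit of shared information.
linked = mutual_information("CCTT", "AAGG")
# Two columns that vary independently carry none.
unlinked = mutual_information("CCTT", "AGAG")
print(linked, unlinked)
```

Values near the entropy of the individual columns, as in the table, indicate that knowing the base at one position almost determines the base at the other.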
I counted the C-C-C-A and T-T-T-G quartets after eliminating quartets containing gap or ambiguous characters, resulting in 49959 observations.
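The counting step could look something like the sketch below. It assumes the alignment columns corresponding to reference positions 241, 3037, 14408 and 23403 have already been located (in a gapped alignment the column index is generally not the reference position), and that only A, C, G and T count as unambiguous. The toy sequences and column indexes are made up for illustration.

```python
from collections import Counter

VALID_BASES = set("ACGT")


def count_quartets(sequences, columns):
    """Count quartets at the given 0-based alignment columns.

    Quartets containing a gap or an ambiguous character are skipped.
    """
    counts = Counter()
    for seq in sequences:
        quartet = [seq[c].upper() for c in columns]
        if all(base in VALID_BASES for base in quartet):
            counts["-".join(quartet)] += 1
    return counts


# Toy data: each string holds only the four quartet columns, in the order
# 241, 3037, 14408, 23403.
toy = ["CCCA", "TTTG", "TTTG", "TT-G", "TNTG", "cccA"]
counts = count_quartets(toy, columns=(0, 1, 2, 3))
print(counts)
```

Here the two records containing `-` or `N` are dropped, leaving four observations.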
The table below shows the most varying positions.
| reference columns | ref | consensus | consensus % | codons | aas | alt_codons | alt_aas | aa_pos | products |
|---|---|---|---|---|---|---|---|---|---|
| 241 | C | T | 74.81935497 | CGT | | | | | 5'UTR |
| 14408 | C | T | 75.14366417 | CTA | L | TTA | L | 323 | RNA-dependent RNA polymerase |
| 3037 | C | T | 75.21299203 | TTC | F | TTT | F | 106 | nsp3 |
| 23403 | A | G | 75.32525709 | GAT | D | GGT | G | 614 | surface glycoprotein |
| 25563 | G | G | 76.97581116 | CAG | Q | CAT | H | 57 | ORF3a protein |
| 1059 | C | C | 82.06669138 | ACC | T | ATC | I | 85 | nsp2 |
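As an illustration of how a consensus base and its percentage can be derived for one alignment column, here is a small sketch. It assumes gaps and ambiguous characters are excluded from the denominator, which may or may not match how the percentages in the table were calculated.

```python
from collections import Counter


def column_consensus(column, valid="ACGT"):
    """Return (consensus base, percent of valid characters it accounts for)."""
    counts = Counter(base for base in column.upper() if base in valid)
    total = sum(counts.values())
    if total == 0:
        return None, 0.0
    base, n = counts.most_common(1)[0]
    return base, 100.0 * n / total


# Toy column: three T, one C, plus a gap and an N that are ignored.
base, percent = column_consensus("TTTC-N")
print(base, percent)
```

The lower the consensus percentage, the more variable the column, so sorting columns by this value in ascending order surfaces the most variable positions first.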
The T-T-T-G quartet has become dominant in the data as shown by the following stacked bar chart.
The plot reflects the data available and not necessarily the prevalence of COVID-19 cases. As of this writing, the virus is raging in the USA with no decrease in intensity in sight. The US data is likely to contain the T-T-T-G quartet.
Summary data and code, minus the GISAID sequences, are available here.