I have been tracking mutations in the SARS-CoV-2 genome since the beginning of the pandemic. Many mysteries remain about the changes in the genome, but we have also learned a lot. By "we", I mean the scientific and public health community in general. What follows is mostly a "natural science" approach to studying SARS-CoV-2 mutations.
Coronaviruses have evolved large single-stranded RNA genomes. Regulation of mRNA transcription and translation is facilitated by cis-acting structures that interact with each other over long genomic distances. See https://www.cell.com/molecular-cell/fulltext/S1097-2765(20)30782-6.
On November 11, 2020, I downloaded 26,434 SARS-CoV-2 genome sequences (data updated on Nov. 9, 2020) and their related GenBank records from https://www.ncbi.nlm.nih.gov/datasets/coronavirus/genomes/. Since some GenBank records list the collection date only as a month or even just a year, I removed the sequences without a complete collection date. This left 24,769 sequences. I aligned the sequences to the reference sequence NC_045512.2 with MAFFT version 7.471.
time ~/anaconda3/bin/mafft --auto --preservecase --thread -1 --addfragments sequences_valid_dates.fasta NC_045512.fasta > sequences_valid_dates_aln.fasta
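The date filter applied before the alignment can be sketched as follows. This is a minimal illustration, not the code used for this post: it assumes the collection dates have already been pulled out of the GenBank records as ISO-style strings, and `has_full_date` and the accession names are hypothetical.

```python
from datetime import datetime


def has_full_date(collection_date: str) -> bool:
    """Return True only for a complete, valid YYYY-MM-DD calendar date."""
    try:
        datetime.strptime(collection_date.strip(), "%Y-%m-%d")
    except ValueError:
        # Month-only ("2020-03") and year-only ("2020") strings end up here.
        return False
    return True


# Hypothetical accession -> collection date pairs, for illustration only.
records = {
    "seq_a": "2020-03-15",  # full date: keep
    "seq_b": "2020-03",     # month only: drop
    "seq_c": "2020",        # year only: drop
}

kept = [acc for acc, date in records.items() if has_full_date(date)]
print(kept)
```

Only the sequences whose accession appears in `kept` would then be written to the FASTA file passed to the aligner.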
Mutual Information
Next, I calculated mutual information among the columns with significant variation in nucleotide content in the aligned sequences, i.e. positions with significant numbers of mutations (variations from the reference sequence).
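One way to pick out columns with enough variation is to count, for each column, how many sequences differ from the most common character. The sketch below assumes that approach; the cutoff `min_minor_count` is a hypothetical parameter, since the threshold used for this post is not stated.

```python
from collections import Counter


def variable_columns(sequences, min_minor_count):
    """Return 0-based column indices where at least `min_minor_count`
    sequences differ from the column's most common character.

    Note: gaps and ambiguity codes are counted like any other character here.
    """
    n_cols = len(sequences[0])
    selected = []
    for col in range(n_cols):
        counts = Counter(seq[col] for seq in sequences)
        majority_count = counts.most_common(1)[0][1]
        if len(sequences) - majority_count >= min_minor_count:
            selected.append(col)
    return selected


# Toy alignment (made up): only column 1 has two non-majority characters.
toy_alignment = ["ACGT", "ACGT", "ATGT", "ATGA"]
print(variable_columns(toy_alignment, min_minor_count=2))  # [1]
```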
The mutual information (MI) between two columns tells us how strongly the two columns vary with one another. That is, it quantifies the amount of information obtained about one column by observing the other. MI is measured in bits. For example, if column 1 contained 500 A's followed by 500 C's and column 2 contained 500 C's followed by 500 A's, their mutual information would be 1 bit. In other words, they vary in complete tandem. If you see an A in column 1, you will see a C in column 2 in the same row. A C in column 1 indicates that there will be an A in column 2.
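The 500/500 example can be checked directly. The sketch below uses the standard definition, MI = sum over (x, y) of p(x, y) * log2(p(x, y) / (p(x) * p(y))), with probabilities estimated from the observed counts; the function name is mine, and this is not necessarily the implementation used for the table below.

```python
from collections import Counter
from math import log2


def mutual_information(col_x, col_y):
    """Mutual information, in bits, between two equal-length columns."""
    n = len(col_x)
    count_x = Counter(col_x)
    count_y = Counter(col_y)
    count_xy = Counter(zip(col_x, col_y))
    mi = 0.0
    for (x, y), joint in count_xy.items():
        p_xy = joint / n
        p_x = count_x[x] / n
        p_y = count_y[y] / n
        mi += p_xy * log2(p_xy / (p_x * p_y))
    return mi


# The example from the text: the two columns vary in complete tandem.
column_1 = "A" * 500 + "C" * 500
column_2 = "C" * 500 + "A" * 500
print(mutual_information(column_1, column_2))  # 1.0
```

A column that never changes carries no information about any other column, so its MI with anything is 0 bits.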
Obviously, columns in the aligned data will not vary perfectly with one another, but some columns vary quite strongly with others. Here are the aligned columns with MI >= 0.48 bits. Positions are locations in the reference genome NC_045512.
| Position_1 | Position_2 | MI |
| 28882 | 28883 | 0.983502 |
| 28881 | 28882 | 0.979255 |
| 28881 | 28883 | 0.978091 |
| 7540 | 23401 | 0.895542 |
| 22992 | 23401 | 0.892951 |
| 7540 | 16647 | 0.892555 |
| 7540 | 22992 | 0.892441 |
| 16647 | 23401 | 0.892054 |
| 16647 | 22992 | 0.888951 |
| 18555 | 23401 | 0.879099 |
| 7540 | 18555 | 0.878659 |
| 18555 | 22992 | 0.876136 |
| 16647 | 18555 | 0.875278 |
| 1163 | 7540 | 0.815033 |
| 1163 | 23401 | 0.814758 |
| 1163 | 22992 | 0.811634 |
| 1163 | 16647 | 0.811379 |
| 1163 | 18555 | 0.798756 |
| 1059 | 25563 | 0.571521 |
| 1163 | 28882 | 0.568999 |
| 1163 | 28883 | 0.568928 |
| 1163 | 28881 | 0.567995 |
| 241 | 23403 | 0.540602 |
| 7540 | 28882 | 0.533841 |
| 23401 | 28882 | 0.533792 |
| 7540 | 28883 | 0.533778 |
| 23401 | 28883 | 0.533728 |
| 241 | 3037 | 0.533267 |
| 7540 | 28881 | 0.532923 |
| 23401 | 28881 | 0.532874 |
| 16647 | 28882 | 0.53254 |
| 16647 | 28883 | 0.532477 |
| 16647 | 28881 | 0.531621 |
| 22992 | 28882 | 0.530918 |
| 22992 | 28883 | 0.530854 |
| 22992 | 28881 | 0.530001 |
| 3037 | 23403 | 0.52915 |
| 18555 | 28882 | 0.518494 |
| 18555 | 28883 | 0.51843 |
| 18555 | 28881 | 0.517577 |
| 241 | 14408 | 0.504523 |
| 14408 | 23403 | 0.497725 |
| 3037 | 14408 | 0.490892 |
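Because the positions above are reference coordinates rather than alignment columns, any column where the aligned reference has a gap (an insertion in some other sequence) has to be skipped when numbering. A minimal sketch of that conversion, assuming the aligned NC_045512.2 row is available as a string; the function name and the toy sequence are hypothetical:

```python
from typing import Dict


def column_to_reference_position(aligned_reference: str) -> Dict[int, int]:
    """Map 0-based alignment columns to 1-based reference positions.

    Columns where the reference has a gap are left out of the mapping.
    """
    mapping = {}
    ref_pos = 0
    for col, base in enumerate(aligned_reference):
        if base != "-":
            ref_pos += 1
            mapping[col] = ref_pos
    return mapping


# Toy aligned reference (made up) with one gap column at index 2.
toy_reference = "AC-GT"
print(column_to_reference_position(toy_reference))
# {0: 1, 1: 2, 3: 3, 4: 4}
```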
Notice from the table how the positions fall in groups. For example, positions 28,881, 28,882, and 28,883 fall into a group of positions strongly varying together. If we plot the positions and their relations to one another we can get an idea of which positions have significant MI with others.
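One simple way to recover such groups without a plot is to treat each pair above a cutoff as a link and collect the connected components. The sketch below does this with a small union-find over a subset of the pairs from the table; the function name and the cutoffs are my choices for illustration, not the method behind the plot. At the 0.48 cutoff the 28,881 to 28,883 triple is linked to the larger group through position 1163 (MI about 0.57), whereas a stricter cutoff such as 0.75 keeps them apart.

```python
def group_positions(pairs, threshold):
    """Group positions into connected components, linking every pair
    whose MI is at least `threshold`. Returns sorted lists of positions."""
    parent = {}

    def find(p):
        parent.setdefault(p, p)
        while parent[p] != p:
            parent[p] = parent[parent[p]]  # path halving
            p = parent[p]
        return p

    for a, b, mi in pairs:
        if mi >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for p in list(parent):
        groups.setdefault(find(p), set()).add(p)
    return sorted((sorted(g) for g in groups.values()), key=lambda g: g[0])


# A subset of (Position_1, Position_2, MI) rows copied from the table above.
pairs = [
    (28882, 28883, 0.983502),
    (28881, 28882, 0.979255),
    (7540, 23401, 0.895542),
    (1163, 7540, 0.815033),
    (1059, 25563, 0.571521),
    (1163, 28882, 0.568999),
    (241, 23403, 0.540602),
    (241, 3037, 0.533267),
    (241, 14408, 0.504523),
]

print(group_positions(pairs, threshold=0.75))
print(group_positions(pairs, threshold=0.48))
```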
The co-varying positions fall into four groups. We have seen the first group before.
| reference positions | ref_nucleotide | alt_nucleotide | feature_type | codon | alt_codons | mutation | product |
| 241 | C | T | 5'UTR | | | | |
| 14408 | C | T | mat_peptide | CCT | CTT | P314L | RNA-dependent RNA polymerase |
| 3037 | C | T | mat_peptide | TTC | TTT | F106F | nsp3 |
| 23403 | A | G | CDS | GAT | GGT | D614G | surface glycoprotein |
The alternate nucleotides have become dominant in the population of database sequences, and thus among cases, lending evidence to the idea that the D614G mutation makes the virus more transmissible.
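The mutation labels in these tables follow from the codon columns: translate the reference and alternate codons with the standard genetic code and put the codon number in between. A minimal sketch, in which the helper names are hypothetical and the spike CDS start of 21563 is an assumption taken from the NC_045512.2 annotation rather than from the tables in this post:

```python
# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def codon_number(position, feature_start):
    """1-based codon index of a genome position within a feature
    that starts at `feature_start` (both in 1-based genome coordinates)."""
    return (position - feature_start) // 3 + 1


def mutation_label(ref_codon, alt_codon, number):
    """Build a label such as D614G from reference and alternate codons."""
    return f"{CODON_TABLE[ref_codon]}{number}{CODON_TABLE[alt_codon]}"


SPIKE_START = 21563  # assumption: start of the spike CDS in NC_045512.2

print(mutation_label("GAT", "GGT", codon_number(23403, SPIKE_START)))  # D614G
print(mutation_label("TTC", "TTT", 106))  # F106F, a synonymous change
```

When both codons translate to the same amino acid, as in F106F, the nucleotide change is synonymous and leaves the protein sequence unchanged.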
A Co-varying Pair
| reference positions | ref_nucleotide | alt_nucleotide | feature_type | codon | alt_codons | mutation | product |
| 1059 | C | T | mat_peptide | ACC | ATC | T85I | nsp2 |
| 25563 | G | T | CDS | CAG | CAT | Q57H | ORF3a protein |
A Strongly Co-varying Triple
| reference positions | ref_nucleotide | alt_nucleotide | feature_type | codon | alt_codons | mutation | product | notes |
| 28881 | G | A | CDS | AGG | AAG | R203K | nucleocapsid phosphoprotein | ORF9; structural protein |
| 28882 | G | A | CDS | AGG | AGA | R203R | nucleocapsid phosphoprotein | ORF9; structural protein |
| 28883 | G | C | CDS | GGA | CGA | G204R | nucleocapsid phosphoprotein | ORF9; structural protein |
| reference positions | ref_nucleotide | alt_nucleotide | feature_type | codon | alt_codons | mutation | product |
| 28881 | G | A | CDS | AGG | AAG | R203K | nucleocapsid phosphoprotein |
| 28882 | G | A | CDS | AGG | AGA | R203R | nucleocapsid phosphoprotein |
| 28883 | G | C | CDS | GGA | CGA | G204R | nucleocapsid phosphoprotein |
| 1163 | A | T | mat_peptide | ATT | TTT | I120F | nsp2 |
| 18555 | C | T | mat_peptide | GAC | GAT | D172D | 3'-to-5' exonuclease |
| 16647 | G | T | mat_peptide | ACG | ACT | T137T | helicase |
| 23401 | G | A | CDS | CAG | CAA | Q613Q | surface glycoprotein spike protein |
| 7540 | T | C | mat_peptide | ACT | ACC | T1607T | nsp3 |
| 22992 | G | A | CDS | AGC | AAC | S477N | surface glycoprotein spike protein |
The following plots show the variation of mutations over the course of the pandemic.
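The quantity behind a plot of this kind can be sketched as the fraction of sequences carrying the alternate nucleotide in each collection month. The code below is an illustration under that assumption, with made-up observations; it is not the code that produced the plots.

```python
from collections import defaultdict


def monthly_alt_frequency(observations, alt):
    """Fraction of sequences per month that carry the nucleotide `alt`.

    `observations` is an iterable of (collection_date, nucleotide) pairs
    for a single position, with dates given as 'YYYY-MM-DD' strings.
    """
    totals = defaultdict(int)
    alts = defaultdict(int)
    for date, base in observations:
        month = date[:7]  # 'YYYY-MM'
        totals[month] += 1
        if base == alt:
            alts[month] += 1
    return {month: alts[month] / totals[month] for month in sorted(totals)}


# Made-up observations at one position, for illustration only.
toy_observations = [
    ("2020-02-10", "A"), ("2020-02-20", "A"),
    ("2020-03-05", "G"), ("2020-03-18", "A"),
    ("2020-04-02", "G"), ("2020-04-09", "G"),
]
print(monthly_alt_frequency(toy_observations, alt="G"))
# {'2020-02': 0.0, '2020-03': 0.5, '2020-04': 1.0}
```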