SARS-CoV-2 Co-Varying Changes

Understanding the origin of SARS-CoV-2 has implications for public health, economic, and political stability as well as basic science.
W. Ian Lipkin, MD

“‘I wish it need not have happened in my time,’ said Frodo. ‘So do I,’ said Gandalf, ‘and so do all who live to see such times. But that is not for them to decide. All we have to decide is what to do with the time that is given us.’”
- J.R.R. Tolkien

I have been tracking mutations in the SARS-CoV-2 genome since the beginning of the pandemic. There are many mysteries about the changes in the genome, but we have also learned a lot. When I say "we", I mean the scientific and public health community in general. What follows is mostly a "natural science" approach to studying the SARS-CoV-2 mutations.

Coronaviruses have evolved large single-stranded RNA genomes. Regulation of mRNA transcription and translation is facilitated by cis-acting structures that interact with each other over long genomic distances. See https://www.cell.com/molecular-cell/fulltext/S1097-2765(20)30782-6.

On November 11, 2020, I downloaded 26,434 SARS-CoV-2 genome sequences (data updated on Nov. 9, 2020) and their related GenBank records from https://www.ncbi.nlm.nih.gov/datasets/coronavirus/genomes/. Since some GenBank records list the collection date only as a month or even just a year, I removed sequences with invalid dates. This left 24,769 sequences. I aligned the sequences to the reference sequence NC_045512.2 with MAFFT version 7.471.
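As a rough illustration of the date filter, here is a minimal Python sketch. It assumes that "invalid" means "not a complete YYYY-MM-DD date", which may be narrower or broader than the rule I actually applied. The function name and the toy records are hypothetical, not taken from the GenBank download.

```python
from datetime import datetime

def has_full_date(collection_date: str) -> bool:
    """Return True only for complete year-month-day collection dates."""
    try:
        datetime.strptime(collection_date, "%Y-%m-%d")
    except ValueError:
        # Month-only ("2020-03") and year-only ("2020") dates land here.
        return False
    return True

# Hypothetical (accession, collection_date) pairs for illustration only.
records = [
    ("seq_a", "2020-03-15"),
    ("seq_b", "2020-03"),
    ("seq_c", "2020"),
    ("seq_d", "2020-07-01"),
]

kept = [accession for accession, date in records if has_full_date(date)]
print(kept)
```

Only the records with a full date survive the filter; the month-only and year-only records are dropped.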

 time ~/anaconda3/bin/mafft --auto --preservecase --thread -1 --addfragments sequences_valid_dates.fasta NC_045512.fasta > sequences_valid_dates_aln.fasta  

Mutual Information

Next, I calculated the mutual information among the columns of the alignment that show significant variation in nucleotide content, i.e., positions with a significant number of mutations (differences from the reference sequence).
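One way to pick out such columns is sketched below in Python. The cutoff (`min_fraction`) and the function name are assumptions made for illustration; the post does not state the exact criterion for "significant", and this toy version does not treat gaps or ambiguous bases specially.

```python
def variable_columns(reference, aligned, min_fraction=0.05):
    """Return 0-based column indices where enough sequences differ from the reference.

    reference    -- aligned reference sequence (string)
    aligned      -- list of aligned sequences, all the same length as reference
    min_fraction -- assumed cutoff: minimum fraction of sequences that must differ
    """
    selected = []
    for col, ref_base in enumerate(reference):
        n_diff = sum(1 for seq in aligned if seq[col] != ref_base)
        if n_diff / len(aligned) >= min_fraction:
            selected.append(col)
    return selected

# Toy alignment, not real data.
reference = "ACGT"
aligned = ["ACGT", "ATGT", "ATGT", "ATGA"]
print(variable_columns(reference, aligned, min_fraction=0.25))  # [1, 3]
```

Restricting the analysis to these columns keeps the number of column pairs manageable, since almost all of the roughly 30,000 positions in the genome barely vary.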

The mutual information (MI) between two columns tells us how the two columns vary with one another. That is, it quantifies the amount of information obtained about one column by observing the other. MI is measured in bits. For example, if column 1 contained 500 A's followed by 500 C's and column 2 contained 500 C's followed by 500 A's, their mutual information would be 1 bit. In other words, they vary in complete tandem. If you see an A in column 1, you will see a C in column 2 in the same row. A C in column 1 indicates that there will be an A in column 2.
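The example can be checked with a few lines of Python. This is a minimal sketch of the standard plug-in estimate, MI = sum over (x, y) of p(x, y) * log2(p(x, y) / (p(x) * p(y))), with probabilities taken as observed frequencies. It is not necessarily the code used for the results in this post, and the function name is my own.

```python
from collections import Counter
from math import log2

def mutual_information(col_x, col_y):
    """Mutual information, in bits, between two equal-length alignment columns."""
    n = len(col_x)
    count_x = Counter(col_x)
    count_y = Counter(col_y)
    count_xy = Counter(zip(col_x, col_y))
    mi = 0.0
    for (x, y), count in count_xy.items():
        p_joint = count / n
        p_x = count_x[x] / n
        p_y = count_y[y] / n
        mi += p_joint * log2(p_joint / (p_x * p_y))
    return mi

# The example from the text: the two columns vary in complete tandem.
col_1 = "A" * 500 + "C" * 500
col_2 = "C" * 500 + "A" * 500
print(mutual_information(col_1, col_2))  # 1.0

# A column that alternates A, C, A, C... carries no information about col_1.
col_3 = "AC" * 500
print(mutual_information(col_1, col_3))  # 0.0
```

With two equally frequent nucleotides per column, 1 bit is the maximum possible value, which is why the strongest pairs in the table below come close to 1.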

Obviously, columns in the aligned data will not vary perfectly with one another, but some columns vary quite strongly with others. Here are the aligned columns with MI >= 0.48 bits. Positions are locations in the reference genome NC_045512.

| Position_1 | Position_2 | MI |
|---|---|---|
| 28882 | 28883 | 0.983502 |
| 28881 | 28882 | 0.979255 |
| 28881 | 28883 | 0.978091 |
| 7540 | 23401 | 0.895542 |
| 22992 | 23401 | 0.892951 |
| 7540 | 16647 | 0.892555 |
| 7540 | 22992 | 0.892441 |
| 16647 | 23401 | 0.892054 |
| 16647 | 22992 | 0.888951 |
| 18555 | 23401 | 0.879099 |
| 7540 | 18555 | 0.878659 |
| 18555 | 22992 | 0.876136 |
| 16647 | 18555 | 0.875278 |
| 1163 | 7540 | 0.815033 |
| 1163 | 23401 | 0.814758 |
| 1163 | 22992 | 0.811634 |
| 1163 | 16647 | 0.811379 |
| 1163 | 18555 | 0.798756 |
| 1059 | 25563 | 0.571521 |
| 1163 | 28882 | 0.568999 |
| 1163 | 28883 | 0.568928 |
| 1163 | 28881 | 0.567995 |
| 241 | 23403 | 0.540602 |
| 7540 | 28882 | 0.533841 |
| 23401 | 28882 | 0.533792 |
| 7540 | 28883 | 0.533778 |
| 23401 | 28883 | 0.533728 |
| 241 | 3037 | 0.533267 |
| 7540 | 28881 | 0.532923 |
| 23401 | 28881 | 0.532874 |
| 16647 | 28882 | 0.53254 |
| 16647 | 28883 | 0.532477 |
| 16647 | 28881 | 0.531621 |
| 22992 | 28882 | 0.530918 |
| 22992 | 28883 | 0.530854 |
| 22992 | 28881 | 0.530001 |
| 3037 | 23403 | 0.52915 |
| 18555 | 28882 | 0.518494 |
| 18555 | 28883 | 0.51843 |
| 18555 | 28881 | 0.517577 |
| 241 | 14408 | 0.504523 |
| 14408 | 23403 | 0.497725 |
| 3037 | 14408 | 0.490892 |


Notice from the table how the positions fall in groups. For example, positions 28,881, 28,882, and 28,883 fall into a group of positions strongly varying together. If we plot the positions and their relations to one another we can get an idea of which positions have significant MI with others.

It is important to note that the connections among the columns do not mean that a change in one position causes a change in another.
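A simple way to turn a list of high-MI pairs into groups is to treat positions as nodes, pairs as edges, and collect the connected sets. The Python sketch below does this for a handful of pairs copied from the MI table. It is an illustration only: the function name is hypothetical, and the groups discussed in this post are read off the plot, so they need not match connected sets exactly (for example, the 28881-28883 triple is discussed separately even though it is linked to other positions).

```python
from collections import defaultdict

def covarying_groups(pairs, min_mi=0.48):
    """Group positions linked, directly or indirectly, by pairs with MI >= min_mi."""
    graph = defaultdict(set)
    for pos_1, pos_2, mi in pairs:
        if mi >= min_mi:
            graph[pos_1].add(pos_2)
            graph[pos_2].add(pos_1)

    seen = set()
    groups = []
    for start in sorted(graph):
        if start in seen:
            continue
        # Depth-first walk over everything reachable from this position.
        stack, group = [start], set()
        while stack:
            node = stack.pop()
            if node in group:
                continue
            group.add(node)
            stack.extend(graph[node] - group)
        seen |= group
        groups.append(sorted(group))
    return groups

# A subset of (Position_1, Position_2, MI) rows from the MI table.
pairs = [
    (28882, 28883, 0.983502),
    (28881, 28882, 0.979255),
    (7540, 23401, 0.895542),
    (1163, 7540, 0.815033),
    (1059, 25563, 0.571521),
    (1163, 28882, 0.568999),
    (241, 23403, 0.540602),
    (241, 3037, 0.533267),
    (241, 14408, 0.504523),
]

for group in covarying_groups(pairs):
    print(group)
```

Like the plot, this grouping only records which positions vary together. It says nothing about one change causing another.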

Co-varying Genome Positions

D614G and Friends


The co-varying positions fall into four groups. We have seen the first group before.

| reference position | ref_nucleotide | alt_nucleotide | feature_type | codon | alt_codons | mutation | product |
|---|---|---|---|---|---|---|---|
| 241 | C | T | 5'UTR | | | | |
| 14408 | C | T | mat_peptide | CCT | CTT | P314L | RNA-dependent RNA polymerase |
| 3037 | C | T | mat_peptide | TTC | TTT | F106F | nsp3 |
| 23403 | A | G | CDS | GAT | GGT | D614G | surface glycoprotein |


The D614G (aspartate to glycine in protein position 614) mutation in the spike protein has been implicated in increased transmissibility of the virus. D614G has become the most prevalent form globally. In addition, in infected individuals, G614 is associated with lower RT-PCR cycle thresholds, suggestive of higher upper respiratory tract viral loads, but not with increased disease severity. See https://www.cell.com/action/showPdf?pii=S0092-8674%2820%2930820-5 and https://www.biorxiv.org/content/10.1101/2020.06.14.151357v2.
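The mutation labels in the table follow directly from the codons: translate the reference codon and the alternate codon, and put the protein position between the two amino acid letters. Here is a minimal Python sketch using the standard genetic code. The helper name `mutation_label` is my own for this illustration, and codons are written with T rather than U, as in the table.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def mutation_label(ref_codon, alt_codon, protein_position):
    """Build a label such as D614G from the reference and alternate codons."""
    ref_aa = CODON_TABLE[ref_codon]
    alt_aa = CODON_TABLE[alt_codon]
    return f"{ref_aa}{protein_position}{alt_aa}"

print(mutation_label("GAT", "GGT", 614))  # D614G, non-synonymous
print(mutation_label("CCT", "CTT", 314))  # P314L, non-synonymous
print(mutation_label("TTC", "TTT", 106))  # F106F, synonymous
```

When the two letters match, as in F106F, the nucleotide change is synonymous: the codon changes but the amino acid does not.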

The plot below shows how the percentage of each of the above nucleotides in the database has varied over the course of the pandemic. Notice how the nucleotide group TTTG has completely replaced the original CCCA group.


The alternate nucleotides have become dominant in the population of database sequences, and thus among cases, lending evidence to the idea that the D614G mutation makes the virus more transmissible.

The plot below shows how the combination TTTG came to dominate the data collection. For clarity, this plot ignores combinations with a smaller number of variations.
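The numbers behind a plot like this are just counts of each nucleotide combination per time period, converted to percentages. The Python sketch below groups by calendar month. The binning, the function name, and the toy samples are assumptions made for illustration; they are not the data or necessarily the binning behind the plot.

```python
from collections import Counter, defaultdict

def monthly_combination_percent(samples):
    """Percentage of each nucleotide combination per month.

    samples -- iterable of (collection_date as 'YYYY-MM-DD', combination string)
    Returns {month 'YYYY-MM': {combination: percent}} in chronological order.
    """
    by_month = defaultdict(Counter)
    for date, combination in samples:
        by_month[date[:7]][combination] += 1
    return {
        month: {
            combination: 100 * count / sum(counts.values())
            for combination, count in counts.items()
        }
        for month, counts in sorted(by_month.items())
    }

# Made-up samples: nucleotides at positions 241, 14408, 3037, and 23403.
samples = [
    ("2020-02-10", "CCCA"),
    ("2020-02-20", "CCCA"),
    ("2020-02-25", "TTTG"),
    ("2020-02-27", "CCCA"),
    ("2020-06-01", "TTTG"),
    ("2020-06-15", "TTTG"),
]

for month, percents in monthly_combination_percent(samples).items():
    print(month, percents)
```

In this toy data, CCCA makes up 75% of the February samples and TTTG makes up all of the June samples.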


A Co-varying Pair


A pair of nucleotides, 1059 and 25563, can be seen as co-varying in the figure at the top. 

| reference position | ref_nucleotide | alt_nucleotide | feature_type | codon | alt_codons | mutation | product |
|---|---|---|---|---|---|---|---|
| 1059 | C | T | mat_peptide | ACC | ATC | T85I | nsp2 |
| 25563 | G | T | CDS | CAG | CAT | Q57H | ORF3a protein |


As seen in the figure below, the original CG nucleotide pair is replaced by a TT or CT pair, but not completely. The CT combination shows up less frequently, contained in about 7% of the total sequences.




A Strongly Co-varying Triple


In the group of nine positions in the lower left of the mutual information network, there is a group of three positions that co-vary strongly: 28881, 28882, and 28883.

| reference position | ref_nucleotide | alt_nucleotide | feature_type | codon | alt_codons | mutation | product | notes |
|---|---|---|---|---|---|---|---|---|
| 28881 | G | A | CDS | AGG | AAG | R203K | nucleocapsid phosphoprotein | ORF9; structural protein |
| 28882 | G | A | CDS | AGG | AGA | R203R | nucleocapsid phosphoprotein | ORF9; structural protein |
| 28883 | G | C | CDS | GGA | CGA | G204R | nucleocapsid phosphoprotein | ORF9; structural protein |


The AAC combination seemed to dominate through the summer months, but GGG may make a comeback.



The full group in the lower left contains two spike protein mutations: a synonymous change, Q613Q, and a non-synonymous change, S477N.

| reference position | ref_nucleotide | alt_nucleotide | feature_type | codon | alt_codons | mutation | product | notes |
|---|---|---|---|---|---|---|---|---|
| 28881 | G | A | CDS | AGG | AAG | R203K | nucleocapsid phosphoprotein | |
| 28882 | G | A | CDS | AGG | AGA | R203R | nucleocapsid phosphoprotein | |
| 28883 | G | C | CDS | GGA | CGA | G204R | nucleocapsid phosphoprotein | |
| 1163 | A | T | mat_peptide | ATT | TTT | I120F | nsp2 | |
| 18555 | C | T | mat_peptide | GAC | GAT | D172D | 3'-to-5' exonuclease | |
| 16647 | G | T | mat_peptide | ACG | ACT | T137T | helicase | |
| 23401 | G | A | CDS | CAG | CAA | Q613Q | surface glycoprotein | spike protein |
| 7540 | T | C | mat_peptide | ACT | ACC | T1607T | nsp3 | |
| 22992 | G | A | CDS | AGC | AAC | S477N | surface glycoprotein | spike protein |


The following plots show the variation of mutations over the course of the pandemic.



All code and results can be downloaded from https://github.com/analytic-garden/Sars-Cov-2-Mutations

