The Analytic Garden: Sars-CoV-2 Co-Varying Changes

Understanding the origin of SARS-CoV-2 has implications for public health, economic, and political stability as well as basic science.

~ W. Ian Lipkin, MD

“‘I wish it need not have happened in my time,’ said Frodo. ‘So do I,’ said Gandalf, ‘and so do all who live to see such times. But that is not for them to decide. All we have to decide is what to do with the time that is given us.’”

- J.R.R. Tolkien

I have been tracking mutations in the Sars-CoV-2 genome since the beginning of the pandemic. There are many mysteries about the changes in the genome, but we have also learned a lot. When I say "we", I mean the scientific/public health community in general. What follows is mostly a "natural science" approach to studying the Sars-CoV-2 mutations.

Coronaviruses have evolved large single-stranded RNA genomes. Regulation of mRNA transcription and translation is facilitated by cis-acting structures that interact with each other over long genomic distances. See https://www.cell.com/molecular-cell/fulltext/S1097-2765(20)30782-6.

On November 11, 2020, I downloaded 26,434 Sars-CoV-2 genome sequences (data updated on Nov. 9, 2020) and their related GenBank records from https://www.ncbi.nlm.nih.gov/datasets/coronavirus/genomes/. Since some GenBank records list the collection date only as a month or even just a year, I removed the sequences with such incomplete dates. This left 24,769 sequences. I aligned the sequences to the reference sequence NC_045512.2 with MAFFT version 7.471.
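The date filter can be sketched roughly as follows. This is a minimal illustration rather than the exact code used for this post: it assumes collection dates arrive as ISO-style strings such as 2020-03-15, and the function name has_full_date and the toy records are made up for the example.

```python
from datetime import datetime


def has_full_date(collection_date):
    """Return True only when the string parses as a complete year-month-day date."""
    try:
        datetime.strptime(collection_date.strip(), "%Y-%m-%d")
    except ValueError:
        # Month-only ("2020-03") or year-only ("2020") dates end up here.
        return False
    return True


# Hypothetical records: sequence name -> collection date as listed in the record.
records = {"seq_a": "2020-03-15", "seq_b": "2020-03", "seq_c": "2020"}

kept = [name for name, date in records.items() if has_full_date(date)]
print(kept)  # ['seq_a']
```

Records whose dates use a different layout would need a different format string.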

 time ~/anaconda3/bin/mafft --auto --preservecase --thread -1 --addfragments sequences_valid_dates.fasta NC_045512.fasta > sequences_valid_dates_aln.fasta

Mutual Information

Next, I calculated mutual information among the columns with significant variation in nucleotide content in the aligned sequences, i.e. positions with significant numbers of mutations (variations from the reference sequence).
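One way to pick out such columns is sketched below. It is only an illustration: the 5% default cutoff, the function name variable_columns, and the toy alignment are assumptions for the example, not the criterion actually used for the results that follow.

```python
VALID = set("ACGT")


def variable_columns(ref, seqs, min_fraction=0.05):
    """Return 0-based alignment columns where the share of called bases
    (A, C, G, T only) that differ from the reference is at least min_fraction."""
    columns = []
    for i, ref_base in enumerate(ref.upper()):
        # Ignore gaps and ambiguity codes such as N when counting.
        called = [s[i].upper() for s in seqs if s[i].upper() in VALID]
        if not called:
            continue
        n_alt = sum(1 for base in called if base != ref_base)
        if n_alt / len(called) >= min_fraction:
            columns.append(i)
    return columns


# Toy alignment: only column 1 has a substantial share of non-reference bases.
ref = "ACGTAC"
seqs = ["ACGTAC", "ATGTAC", "ATGTAN", "AC-TAC"]
print(variable_columns(ref, seqs, min_fraction=0.25))  # [1]
```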

The mutual information (MI) between two columns tells us how the two columns vary with one another. That is, it quantifies the amount of information obtained about one column by observing the other. MI is measured in bits. For example, if column 1 contained 500 A's followed by 500 C's and column 2 contained 500 C's followed by 500 A's, their mutual information would be 1 bit. In other words, they vary in complete tandem. If you see an A in column 1, you will see a C in column 2 in the same row. A C in column 1 indicates that there will be an A in column 2.
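The 500/500 example can be checked with a few lines of Python. This is a minimal sketch of the standard definition, MI = sum over pairs (x, y) of p(x,y) * log2(p(x,y) / (p(x) * p(y))), estimated from raw counts; the function name mutual_information is just for illustration and the code in the repository may differ in details such as how gaps are handled.

```python
from collections import Counter
from math import log2


def mutual_information(col_x, col_y):
    """Estimate the mutual information, in bits, between two equal-length columns."""
    n = len(col_x)
    px = Counter(col_x)             # counts of each symbol in column x
    py = Counter(col_y)             # counts of each symbol in column y
    pxy = Counter(zip(col_x, col_y))  # counts of each row-wise pair
    mi = 0.0
    for (x, y), count in pxy.items():
        p_joint = count / n
        mi += p_joint * log2(p_joint / ((px[x] / n) * (py[y] / n)))
    return mi


col1 = "A" * 500 + "C" * 500
col2 = "C" * 500 + "A" * 500
print(mutual_information(col1, col2))  # 1.0

# A column that alternates A and C tells us nothing about col1.
col3 = "AC" * 500
print(mutual_information(col1, col3))  # 0.0
```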

Obviously, columns in the aligned data will not vary perfectly with one another, but some columns vary quite strongly with others. Here are the aligned columns with MI >= 0.48 bits. Positions are locations in the reference genome NC_045512.

Position_1	Position_2	MI
28882	28883	0.983502
28881	28882	0.979255
28881	28883	0.978091
7540	23401	0.895542
22992	23401	0.892951
7540	16647	0.892555
7540	22992	0.892441
16647	23401	0.892054
16647	22992	0.888951
18555	23401	0.879099
7540	18555	0.878659
18555	22992	0.876136
16647	18555	0.875278
1163	7540	0.815033
1163	23401	0.814758
1163	22992	0.811634
1163	16647	0.811379
1163	18555	0.798756
1059	25563	0.571521
1163	28882	0.568999
1163	28883	0.568928
1163	28881	0.567995
241	23403	0.540602
7540	28882	0.533841
23401	28882	0.533792
7540	28883	0.533778
23401	28883	0.533728
241	3037	0.533267
7540	28881	0.532923
23401	28881	0.532874
16647	28882	0.53254
16647	28883	0.532477
16647	28881	0.531621
22992	28882	0.530918
22992	28883	0.530854
22992	28881	0.530001
3037	23403	0.52915
18555	28882	0.518494
18555	28883	0.51843
18555	28881	0.517577
241	14408	0.504523
14408	23403	0.497725
3037	14408	0.490892

Notice from the table how the positions fall in groups. For example, positions 28,881, 28,882, and 28,883 fall into a group of positions strongly varying together. If we plot the positions and their relations to one another we can get an idea of which positions have significant MI with others.

It is important to note that the connections among the columns do not mean that a change in one position causes a change in another.
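The grouping can also be found mechanically by treating each table row as an edge between two positions and collecting the connected sets. The sketch below uses only a handful of rows copied from the table, so it is a simplification: with the full table, the 28881/28882/28883 triple would merge with the other positions it shares rows with, such as 1163. The function name connected_groups is made up for the example.

```python
from collections import defaultdict


def connected_groups(pairs):
    """Group positions that are linked, directly or indirectly, by a listed pair."""
    neighbors = defaultdict(set)
    for a, b in pairs:
        neighbors[a].add(b)
        neighbors[b].add(a)

    seen = set()
    groups = []
    for start in sorted(neighbors):
        if start in seen:
            continue
        # Depth-first walk over everything reachable from this position.
        group = set()
        stack = [start]
        while stack:
            node = stack.pop()
            if node in group:
                continue
            group.add(node)
            stack.extend(neighbors[node] - group)
        seen |= group
        groups.append(sorted(group))
    return groups


# A few (Position_1, Position_2) rows from the table above.
pairs = [
    (28882, 28883), (28881, 28882),
    (1059, 25563),
    (241, 23403), (241, 3037), (3037, 14408),
]
print(connected_groups(pairs))
# [[241, 3037, 14408, 23403], [1059, 25563], [28881, 28882, 28883]]
```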

Co-varying Genome Positions

D614G and Friends

The co-varying positions fall into four groups. We have seen the first group before.

reference positions	ref_nucleotide	alt_nucleotide	feature_type	codon	alt_codons	mutation	product
241	C	T	5'UTR
14408	C	T	mat_peptide	CCT	CTT	P314L	RNA-dependent RNA polymerase
3037	C	T	mat_peptide	TTC	TTT	F106F	nsp3
23403	A	G	CDS	GAT	GGT	D614G	surface glycoprotein

The D614G (aspartate to glycine in protein position 614) mutation in the spike protein has been implicated in increased transmissibility of the virus. D614G has become the most prevalent form globally. In addition, in infected individuals, G614 is associated with lower RT-PCR cycle thresholds, suggestive of higher upper respiratory tract viral loads, but not with increased disease severity. See https://www.cell.com/action/showPdf?pii=S0092-8674%2820%2930820-5 and https://www.biorxiv.org/content/10.1101/2020.06.14.151357v2.
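Mutation names such as D614G and F106F follow directly from the codon and alt_codons columns in the table. As a small sketch, the standard genetic code can be built from the usual T, C, A, G ordering and used to label a change as synonymous or non-synonymous. The helper name describe_change is made up for the example; the protein positions are taken from the table rather than computed.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def describe_change(ref_codon, alt_codon, protein_position):
    """Return the mutation label (e.g. 'D614G') and whether it changes the amino acid."""
    ref_aa = CODON_TABLE[ref_codon.upper()]
    alt_aa = CODON_TABLE[alt_codon.upper()]
    kind = "synonymous" if ref_aa == alt_aa else "non-synonymous"
    return f"{ref_aa}{protein_position}{alt_aa}", kind


print(describe_change("GAT", "GGT", 614))  # ('D614G', 'non-synonymous')
print(describe_change("CCT", "CTT", 314))  # ('P314L', 'non-synonymous')
print(describe_change("TTC", "TTT", 106))  # ('F106F', 'synonymous')
```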

The plot below shows how the percentage of each of the above nucleotides in the database has varied over the course of the pandemic. Notice how the nucleotide group TTTG has completely replaced the original CCCA group.

The alternate nucleotides have become dominant in the population of database sequences, and thus among cases, lending evidence to the idea that the D614G mutation makes the virus more transmissible.

The plot below shows how the combination TTTG came to dominate the data collection. For clarity, this plot ignores combinations with a smaller number of variations.
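The counting behind a plot like this can be sketched as follows. The records here are invented toy data, not numbers from the database, and the helper names combination and monthly_percentages are made up. The sketch also assumes that alignment columns line up with 1-based reference positions, which only holds if the alignment introduced no gaps in the reference.

```python
from collections import Counter, defaultdict


def combination(aligned_seq, positions):
    """Join the bases found at the given 1-based positions, e.g. 'TTTG'."""
    return "".join(aligned_seq[p - 1].upper() for p in positions)


def monthly_percentages(records):
    """records: iterable of (month, combination). Returns month -> {combination: percent}."""
    counts = defaultdict(Counter)
    for month, combo in records:
        counts[month][combo] += 1
    result = {}
    for month in sorted(counts):
        total = sum(counts[month].values())
        result[month] = {c: 100.0 * n / total for c, n in counts[month].items()}
    return result


# Toy example of pulling a combination out of a short made-up sequence.
print(combination("ACGTACGT", [1, 4, 6]))  # ATC

# Invented records purely to show the shape of the output.
records = [
    ("2020-01", "CCCA"), ("2020-01", "CCCA"),
    ("2020-03", "CCCA"), ("2020-03", "TTTG"), ("2020-03", "TTTG"), ("2020-03", "TTTG"),
    ("2020-06", "TTTG"),
]
for month, shares in monthly_percentages(records).items():
    print(month, shares)
```

For the real data, positions would be [241, 3037, 14408, 23403] and the month would come from each record's collection date.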

A Co-varying Pair

A pair of nucleotides, 1059 and 25563, can be seen as co-varying in the figure at the top.

reference positions	ref_nucleotide	alt_nucleotide	feature_type	codon	alt_codons	mutation	product
1059	C	T	mat_peptide	ACC	ATC	T85I	nsp2
25563	G	T	CDS	CAG	CAT	Q57H	ORF3a protein

As seen in the figure below, the original CG nucleotide pair is replaced by a TT or CT pair, but not completely. The CT combination shows up less frequently, contained in about 7% of the total sequences.

A Strongly Co-varying Triple

In the group of nine positions in the lower left of the mutual information network, there is a group of three positions which co-vary strongly: 28881, 28882, and 28883.

reference positions	ref_nucleotide	alt_nucleotide	feature_type	codon	alt_codons	mutation	product	notes
28881	G	A	CDS	AGG	AAG	R203K	nucleocapsid phosphoprotein	ORF9; structural protein
28882	G	A	CDS	AGG	AGA	R203R	nucleocapsid phosphoprotein	ORF9; structural protein
28883	G	C	CDS	GGA	CGA	G204R	nucleocapsid phosphoprotein	ORF9; structural protein

The AAC combination seemed to dominate through the summer months, but GGG may make a comeback.

The full group in the lower left contains two spike protein mutations: a synonymous change, Q613Q, and a non-synonymous change, S477N.

reference positions	ref_nucleotide	alt_nucleotide	feature_type	codon	alt_codons	mutation	product
28881	G	A	CDS	AGG	AAG	R203K	nucleocapsid phosphoprotein
28882	G	A	CDS	AGG	AGA	R203R	nucleocapsid phosphoprotein
28883	G	C	CDS	GGA	CGA	G204R	nucleocapsid phosphoprotein
1163	A	T	mat_peptide	ATT	TTT	I120F	nsp2
18555	C	T	mat_peptide	GAC	GAT	D172D	3'-to-5' exonuclease
16647	G	T	mat_peptide	ACG	ACT	T137T	helicase
23401	G	A	CDS	CAG	CAA	Q613Q	surface glycoprotein spike protein
7540	T	C	mat_peptide	ACT	ACC	T1607T	nsp3
22992	G	A	CDS	AGC	AAC	S477N	surface glycoprotein spike protein

The following plots show the variation of mutations over the course of the pandemic.

All code and results can be downloaded from https://github.com/analytic-garden/Sars-Cov-2-Mutations
