SARS-CoV-2, COVID-19, and all that

The coronavirus pandemic is a world-changing event, like 9/11. There was a world before Covid-19. And there will be a world after Covid-19. But it won't be the same.
~Oliver Markus Malloy

Hello SARS, goodbye world.
~Steven Magee



https://en.wikipedia.org/wiki/Severe_acute_respiratory_syndrome_coronavirus_2#/media/File:2019-nCoV-CDC-23312_without_background.png

Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) is the virus that causes the disease COVID-19. It's a single-stranded, positive-sense (+) RNA virus. It is highly contagious and is the source of a worldwide pandemic. Here in the US, we're being told to isolate ourselves as much as possible. Bars, restaurants, etc. are closed. There are many unknowns about the virus and the disease at present. For example, how infectious is the disease during the incubation period?

The first known infections from the SARS-CoV-2 strain were discovered in Wuhan, China. The virus shares strong homology with many SARS-like bat coronaviruses. Bats are the most likely reservoir of the virus. It probably reached humans through an intermediate host, possibly pangolins.


https://en.wikipedia.org/wiki/Pangolin


In addition to the overall social and economic implications for the world, here at home, we have our own concerns. Bev is an EMT, so she is on the front lines of the pandemic. I am in the COVID-19 high-risk group because of an underlying condition.

Since we're stuck at home, we decided to see what we could learn about the virus. What follows is our quick and incomplete analysis of the viral genome.

SARS-CoV-2

SARS-CoV-2 belongs to the coronavirus family. Other coronaviruses cause diseases such as the common cold, Middle East respiratory syndrome (MERS), and severe acute respiratory syndrome (SARS). Its genome is approximately 30,000 bases long.

As of March 18, 2020, there were 93 complete SARS-CoV-2 genomes deposited at NCBI Virus. We downloaded them and aligned them with MUSCLE version 3.8.31. We also downloaded the full GenBank records for all 93 genomes. Among the sequences was the reference sequence NC_045512.

The alignment showed that almost all aligned positions were 100% identical. However, two reference sequence positions, 8782 and 28144 (alignment positions 8785 and 28150), showed significant (73%) variation. GenBank shows that position 8782 lies in orf1ab, which codes for a polyprotein (ID: YP_009724389.1). Within the polyprotein, position 8782 falls in a peptide labeled nsp4B_TM, a transmembrane domain 2 (TM2) peptide (ID: YP_009725300.1). Position 28144 is in orf8 (ID: YP_009724396.1), whose product has 100% amino acid identity with a hypothetical bat SARS protein.
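For anyone who wants to repeat this kind of column-by-column comparison, here is a minimal Python sketch. It is not necessarily how the analysis above was done. It assumes the alignment has already been loaded as a dict of equal-length strings, and the helper names (`column_counts`, `ref_to_alignment_pos`, `variable_columns`) and the toy sequences are made up for illustration. It also shows why a reference position such as 8782 can land on a later alignment column such as 8785: gaps inserted into the aligned reference shift everything after them.

```python
from collections import Counter

def column_counts(alignment, col):
    """Count the symbols in one alignment column (1-based index)."""
    return Counter(seq[col - 1] for seq in alignment.values())

def ref_to_alignment_pos(aligned_ref, ref_pos):
    """Map a 1-based reference position to its 1-based alignment column.

    Gap characters ('-') in the aligned reference are skipped, so a
    reference position can map to a larger alignment column number.
    """
    seen = 0
    for col, base in enumerate(aligned_ref, start=1):
        if base != "-":
            seen += 1
            if seen == ref_pos:
                return col
    raise ValueError("reference position is beyond the end of the sequence")

def variable_columns(alignment):
    """Return the 1-based columns where the sequences are not all identical."""
    length = len(next(iter(alignment.values())))
    return [col for col in range(1, length + 1)
            if len(column_counts(alignment, col)) > 1]

# Toy alignment (hypothetical data, not the real genomes).
toy = {
    "ref":  "AC-GTCA",
    "seq1": "AC-GTCA",
    "seq2": "ACTGTTA",
}

print(variable_columns(toy))                 # columns 3 and 6 vary
print(ref_to_alignment_pos(toy["ref"], 3))   # reference base 3 is in column 4
print(column_counts(toy, 6))                 # two C and one T
```

On a real alignment, the fraction of sequences carrying each nucleotide in a column follows directly from `column_counts`.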

We weren't the first to spot the variation in these positions. David Sinclair pointed this out on March 14 on Twitter. It is also described in this preprint.

The interesting thing about the variation at these positions can be seen in the following table: the nucleotides vary together. Both aligned positions contain only T or C, and where one has a T, the other has a C. This is true of all 93 sequences, even though they were collected in different locations and at different times. (The nucleotide code Y means T or C.)

| ID | Collection Date | Country | Pos 8782 | Pos 28144 |
|---|---|---|---|---|
| MT007544 | 25-Jan-20 | Australia: Victoria | C | T |
| MT126808 | 28-Feb-20 | Brazil | C | T |
| NC_045512 | Dec-19 | China | C | T |
| MN908947 | Dec-19 | China | C | T |
| MN988668 | 2-Jan-20 | China | C | T |
| MN988669 | 2-Jan-20 | China | C | T |
| MT093631 | 8-Jan-20 | China | C | T |
| MN975262 | 11-Jan-20 | China | T | C |
| MT135041 | 26-Jan-20 | China: Beijing | T | C |
| MT135042 | 28-Jan-20 | China: Beijing | T | C |
| MT135044 | 28-Jan-20 | China: Beijing | T | C |
| MT135043 | 28-Jan-20 | China: Beijing | T | C |
| MT123290 | 5-Feb-20 | China: Guangdong, Guangzhou | C | T |
| MT123292 | 27-Jan-20 | China: Guangzhou | T | C |
| MT123293 | 29-Jan-20 | China: Guangzhou | C | T |
| MT123291 | 29-Jan-20 | China: Guangzhou | C | T |
| MT039873 | 20-Jan-20 | China: Hangzhou | C | T |
| MT019529 | 23-Dec-19 | China: Hubei, Wuhan | C | T |
| MT019530 | 30-Dec-19 | China: Hubei, Wuhan | C | T |
| MT019531 | 30-Dec-19 | China: Hubei, Wuhan | C | T |
| MT019532 | 30-Dec-19 | China: Hubei, Wuhan | C | T |
| MT019533 | 1-Jan-20 | China: Hubei, Wuhan | C | T |
| MT121215 | 2-Feb-20 | China: Shanghai | C | T |
| MN938384 | 10-Jan-20 | China: Shenzhen | T | C |
| MN996531 | 30-Dec-19 | China: Wuhan | C | T |
| MN996529 | 30-Dec-19 | China: Wuhan | C | T |
| MN996527 | 30-Dec-19 | China: Wuhan | C | T |
| MN996530 | 30-Dec-19 | China: Wuhan | C | T |
| MN996528 | 30-Dec-19 | China: Wuhan | C | T |
| MT049951 | 17-Jan-20 | China: Yunnan | T | C |
| MT012098 | 27-Jan-20 | India: Kerala State | C | T |
| MT050493 | 31-Jan-20 | India: Kerala State | T | C |
| MT066156 | 30-Jan-20 | Italy | C | T |
| MT072688 | 13-Jan-20 | Nepal | C | T |
| MT039890 | Jan-20 | South Korea | C | T |
| MT093571 | 7-Feb-20 | Sweden | C | T |
| MT192759 | 25-Jan-20 | Taiwan | C | T |
| MT066175 | 31-Jan-20 | Taiwan | T | C |
| MT066176 | 5-Feb-20 | Taiwan | C | T |
| MN985325 | 19-Jan-20 | USA | T | C |
| MT184911 | 17-Feb-20 | USA | C | T |
| MT159705 | 17-Feb-20 | USA | C | T |
| MT159717 | 17-Feb-20 | USA | C | T |
| MT184912 | 17-Feb-20 | USA | C | T |
| MT159708 | 17-Feb-20 | USA | C | T |
| MT159707 | 17-Feb-20 | USA | C | T |
| MT159706 | 17-Feb-20 | USA | C | T |
| MT159710 | 17-Feb-20 | USA | C | T |
| MT184910 | 18-Feb-20 | USA | C | T |
| MT159718 | 18-Feb-20 | USA | C | T |
| MT159719 | 18-Feb-20 | USA | C | T |
| MT159713 | 18-Feb-20 | USA | C | T |
| MT159714 | 18-Feb-20 | USA | C | T |
| MT184907 | 18-Feb-20 | USA | C | T |
| MT159709 | 20-Feb-20 | USA | C | T |
| MT159711 | 20-Feb-20 | USA | C | T |
| MT184908 | 21-Feb-20 | USA | C | T |
| MT159722 | 21-Feb-20 | USA | C | T |
| MT159720 | 21-Feb-20 | USA | C | T |
| MT159721 | 21-Feb-20 | USA | C | T |
| MT184909 | 21-Feb-20 | USA | C | T |
| MT184913 | 24-Feb-20 | USA | C | T |
| MT159716 | 24-Feb-20 | USA | C | T |
| MT159715 | 24-Feb-20 | USA | C | T |
| MT159712 | 25-Feb-20 | USA | C | T |
| MN997409 | 22-Jan-20 | USA: AZ | T | C |
| MN994468 | 22-Jan-20 | USA: CA | C | T |
| MN994467 | 23-Jan-20 | USA: CA | T | C |
| MT044258 | 27-Jan-20 | USA: CA | C | T |
| MT027063 | 29-Jan-20 | USA: CA | C | T |
| MT027062 | 29-Jan-20 | USA: CA | C | T |
| MT027064 | 29-Jan-20 | USA: CA | C | T |
| MT106052 | 6-Feb-20 | USA: CA | T | C |
| MT106053 | 10-Feb-20 | USA: CA | C | T |
| MT118835 | 23-Feb-20 | USA: CA | C | T |
| MT192765 | 11-Mar-20 | USA: CA, San Diego County | C | T |
| MT044257 | 28-Jan-20 | USA: IL | T | C |
| MT039888 | 29-Jan-20 | USA: MA | C | T |
| MT188341 | 5-Mar-20 | USA: MN | T | C |
| MT188340 | 7-Mar-20 | USA: MN | C | T |
| MT188339 | 9-Mar-20 | USA: MN | T | C |
| MT152824 | 24-Feb-20 | USA: Snohomish County, WA | T | C |
| MT106054 | 11-Feb-20 | USA: TX | T | C |
| MT020880 | 25-Jan-20 | USA: WA | T | C |
| MT020881 | 25-Jan-20 | USA: WA | T | C |
| MT163716 | 27-Feb-20 | USA: WA | C | T |
| MT163717 | 28-Feb-20 | USA: WA | T | C |
| MT163718 | 29-Feb-20 | USA: WA | T | C |
| MT163719 | 1-Mar-20 | USA: WA | T | C |
| MT039887 | 31-Jan-20 | USA: WI | C | T |
| MT192773 | 22-Jan-20 | Viet Nam: Ho Chi Minh city | C | T |
| MT192772 | 22-Jan-20 | Viet Nam: Ho Chi Minh city | C | T |


The two sites sit at different positions in their respective codons. Position 8782 is the third position of a codon that is either AGC or AGT, both of which code for Serine. Position 28144 is the second position of a codon that is either TTA or TCA. TTA codes for Leucine. TCA codes for Serine. The substitution at 8782 is therefore a synonymous mutation, i.e. it doesn't change the amino acid sequence, while the substitution at 28144 changes the amino acid between Leucine and Serine.
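The codon bookkeeping is easy to get wrong by hand, so here is a small Python sketch that checks it. It is a rough illustration under stated assumptions: the CDS start coordinates (266 for orf1ab and 27894 for orf8) are our reading of the NC_045512 annotation and should be verified against the GenBank record, and the function names `codon_position` and `classify` are made up for this example. The codon table is the standard genetic code.

```python
from itertools import product

# Standard genetic code, listed in the conventional TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def codon_position(genome_pos, cds_start):
    """Return 1, 2 or 3: where a 1-based genome position falls in its codon."""
    return (genome_pos - cds_start) % 3 + 1

def classify(codon_a, codon_b):
    """Translate two codons and say whether the change is synonymous."""
    aa_a, aa_b = CODON_TABLE[codon_a], CODON_TABLE[codon_b]
    kind = "synonymous" if aa_a == aa_b else "nonsynonymous"
    return aa_a, aa_b, kind

# ASSUMPTION: CDS start coordinates as read from the NC_045512 annotation.
ORF1AB_START = 266
ORF8_START = 27894

print(codon_position(8782, ORF1AB_START))   # 3
print(codon_position(28144, ORF8_START))    # 2
print(classify("AGC", "AGT"))               # ('S', 'S', 'synonymous')
print(classify("TTA", "TCA"))               # ('L', 'S', 'nonsynonymous')
```

The simple modulo works for position 8782 only because it lies upstream of the ribosomal frameshift in orf1ab; positions past the frameshift would need the joined coordinates from the annotation.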

Given the high degree of covariation between these two columns, it's no surprise that the mutual information of the two columns is high: MI = 0.90 bits.
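For readers who haven't met mutual information before, here is a minimal Python sketch of the calculation for two alignment columns. The function name `mutual_information` is ours, and the input is an assumption: it uses only the 92 rows listed in the table (68 C/T pairs and 24 T/C pairs), so the result need not match the figure quoted above, which came from the full 93-sequence alignment.

```python
from collections import Counter
from math import log2

def mutual_information(col_x, col_y):
    """Mutual information, in bits, between two equal-length columns."""
    if len(col_x) != len(col_y):
        raise ValueError("columns must have the same length")
    n = len(col_x)
    count_x = Counter(col_x)
    count_y = Counter(col_y)
    count_xy = Counter(zip(col_x, col_y))
    # Sum p(x,y) * log2( p(x,y) / (p(x) * p(y)) ) over the observed pairs.
    return sum(
        (c / n) * log2((c / n) / ((count_x[x] / n) * (count_y[y] / n)))
        for (x, y), c in count_xy.items()
    )

# ASSUMPTION: counts taken from the table rows, not from the raw alignment.
col_8782 = "C" * 68 + "T" * 24
col_28144 = "T" * 68 + "C" * 24

print(round(mutual_information(col_8782, col_28144), 2))  # 0.83
```

When two columns covary perfectly, as here, the mutual information equals the entropy of either column, which is the largest value it can take for that nucleotide mix.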

We trimmed the genomic sequences at each end so that all the sequences were the same length and built a neighbor-joining phylogenetic tree using Jalview 2.11.0. The highlighted IDs are for the sequences with T at position 8782.
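One simple way to do this kind of end trimming is sketched below in Python. This is an assumption about the approach rather than a record of the exact steps taken: it treats leading and trailing runs of gap characters as missing sequence ends and keeps only the alignment columns that every sequence covers. The function name `trim_to_common_region` and the toy sequences are hypothetical.

```python
def trim_to_common_region(alignment):
    """Trim an alignment to the columns covered by every sequence.

    Leading and trailing runs of '-' are treated as missing ends.
    Internal gaps are left untouched.
    """
    # Number of leading gap characters in the worst case.
    start = max(len(s) - len(s.lstrip("-")) for s in alignment.values())
    # Last covered column in the worst case.
    end = min(len(s.rstrip("-")) for s in alignment.values())
    return {name: s[start:end] for name, s in alignment.items()}

# Toy alignment (hypothetical data).
toy = {
    "a": "--ACGTAC",
    "b": "TTACG-AC",
    "c": "TTACGTA-",
}

for name, seq in trim_to_common_region(toy).items():
    print(name, seq)
```

After trimming, every sequence has the same length and none begins or ends with a gap, which keeps ragged ends from being counted as differences when the tree is built.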





The only other features of note are a 15-base gap in sequence MT159716 and a three-base insertion, TTC, in MT188341 at position 21331, relative to the other sequences. There are other mutations scattered throughout the sequences, but no other positions show the degree of variation seen at the two positions above. These mutations may or may not be significant. It's hard to tell at this point.
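Gaps and insertions like these can be picked out of an alignment automatically. The Python sketch below is one hedged way to do it, with a made-up function name (`gap_runs`) and toy sequences: an internal run of gaps in a sequence marks a deletion in that sequence, while an internal run of gaps in the aligned reference marks an insertion somewhere else. Coordinates are alignment columns, not reference positions.

```python
import re

def gap_runs(aligned_seq):
    """Return (start_column, length) for each internal run of '-'.

    Columns are 1-based. Leading and trailing gap runs are ignored
    because they usually mean the sequence ends are missing.
    """
    core_start = len(aligned_seq) - len(aligned_seq.lstrip("-"))
    core_end = len(aligned_seq.rstrip("-"))
    return [
        (m.start() + 1, m.end() - m.start())
        for m in re.finditer(r"-+", aligned_seq)
        if m.start() >= core_start and m.end() <= core_end
    ]

# Toy alignment (hypothetical data, not the real genomes).
toy = {
    "ref":  "ACG---TACGT",
    "seq1": "ACGTTCTACGT",   # carries a three-base insertion
    "seq2": "ACG---TA--T",   # carries a two-base deletion
}

print(gap_runs(toy["ref"]))    # [(4, 3)]
print(gap_runs(toy["seq1"]))   # []
print(gap_runs(toy["seq2"]))   # [(4, 3), (9, 2)]
```

Runs that appear in the reference and in most other sequences at the same columns point to an insertion in the few sequences without them.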

What does it all mean?

We don't know. The full genome data don't contain enough information to do forensic tracking of where the variations arose. For example, many sequences simply list USA as the location. The fact that the two positions covary suggests that they may be important to the structure of the virus, although covariation can also arise simply because related viruses inherit the same mutations together. We would like to hear from anyone more knowledgeable about these subjects.

You can download the data here.
