https://en.wikipedia.org/wiki/Severe_acute_respiratory_syndrome_coronavirus_2#/media/File:2019-nCoV-CDC-23312_without_background.png
The first known infections with SARS-CoV-2 were discovered in Wuhan, China. The virus shares strong sequence similarity with many SARS-like bat coronaviruses, so bats are the most likely reservoir of the virus. It probably reached humans through an intermediate host, possibly pangolins.
https://en.wikipedia.org/wiki/Pangolin
In addition to the overall social and economic implications for the world, we have our own concerns here at home. Bev is an EMT, so she is on the front lines of the pandemic. I am in the COVID-19 high-risk group because of an underlying condition.
Since we're stuck at home, we decided to see what we could learn about the virus. What follows is our quick and incomplete analysis of the viral genome.
SARS-CoV-2
SARS-CoV-2 belongs to the coronavirus family. Other coronaviruses cause diseases such as the common cold, Middle East respiratory syndrome (MERS), and severe acute respiratory syndrome (SARS). Its genome is single-stranded RNA approximately 30,000 bases long.
As of March 18, 2020, there were 93 complete SARS-CoV-2 genomes deposited at NCBI Virus. We downloaded them and aligned them with MUSCLE version 3.8.31. We also downloaded the full GenBank records for all 93 genomes. Among the sequences was the reference sequence NC_045512.
The alignment showed that almost all aligned positions were 100% identical. However, two positions, 8782 and 28144 in reference sequence coordinates (alignment columns 8785 and 28150), showed substantial variation: the most common nucleotide at each appears in only about 73% of the sequences. GenBank shows that position 8782 lies in orf1ab, which codes for a polyprotein (ID: YP_009724389.1). Within the polyprotein, position 8782 falls in a peptide labeled nsp4B_TM, which contains transmembrane domain 2 (TM2) (ID: YP_009725300.1). Position 28144 is in orf8 (YP_009724396.1), whose product has 100% amino acid identity to a hypothetical bat SARS protein.
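The relationship between reference coordinates and alignment columns, and the per-column variation, can be sketched in a few lines of Python. This is a minimal illustration rather than the exact procedure used for this post: the function names and the toy alignment are hypothetical, and it assumes the alignment has already been read into a dict of equal-length, gapped strings.

```python
from collections import Counter


def ref_to_alignment_column(ref_aligned: str, ref_pos: int) -> int:
    """Return the 1-based alignment column that holds 1-based reference position ref_pos."""
    seen = 0
    for col, base in enumerate(ref_aligned, start=1):
        if base != "-":          # gaps do not consume a reference coordinate
            seen += 1
            if seen == ref_pos:
                return col
    raise ValueError("reference position is beyond the end of the sequence")


def column_counts(alignment: dict, col: int) -> Counter:
    """Count the symbols found in a 1-based alignment column."""
    return Counter(seq[col - 1].upper() for seq in alignment.values())


def major_fraction(counts: Counter) -> float:
    """Fraction of sequences carrying the most common symbol in a column."""
    return counts.most_common(1)[0][1] / sum(counts.values())


# Toy alignment (made-up sequences, not real genomes).
toy = {
    "ref":  "AC-GTCA",
    "seq1": "AC-GTCA",
    "seq2": "ACTGCCA",
    "seq3": "AC-GTCA",
}

col = ref_to_alignment_column(toy["ref"], 4)  # reference base 4 sits after the gap
counts = column_counts(toy, col)
print(col, dict(counts), major_fraction(counts))
```

Because every gap inserted into the reference shifts later columns to the right, the alignment column is always greater than or equal to the reference position, which is why 8782 and 28144 appear as columns 8785 and 28150.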
We weren't the first to spot the variation in these positions. David Sinclair pointed this out on March 14 on Twitter. It is also described in this preprint.
The interesting thing about the variation at these positions can be seen in the following table: the nucleotides vary together. Both aligned positions contain only T or C, and where one has a T, the other has a C. This is true of all 93 sequences, even though they were collected in different locations and at different times. (Y is the IUPAC code for a nucleotide that is either T or C.)
ID | Collection Date | Country | Pos 8782 | Pos 28144
MT007544 | 25-Jan-20 | Australia: Victoria | C | T
MT126808 | 28-Feb-20 | Brazil | C | T
NC_045512 | Dec-19 | China | C | T
MN908947 | Dec-19 | China | C | T
MN988668 | 2-Jan-20 | China | C | T
MN988669 | 2-Jan-20 | China | C | T
MT093631 | 8-Jan-20 | China | C | T
MN975262 | 11-Jan-20 | China | T | C
MT135041 | 26-Jan-20 | China: Beijing | T | C
MT135042 | 28-Jan-20 | China: Beijing | T | C
MT135044 | 28-Jan-20 | China: Beijing | T | C
MT135043 | 28-Jan-20 | China: Beijing | T | C
MT123290 | 5-Feb-20 | China: Guangdong, Guangzhou | C | T
MT123292 | 27-Jan-20 | China: Guangzhou | T | C
MT123293 | 29-Jan-20 | China: Guangzhou | C | T
MT123291 | 29-Jan-20 | China: Guangzhou | C | T
MT039873 | 20-Jan-20 | China: Hangzhou | C | T
MT019529 | 23-Dec-19 | China: Hubei, Wuhan | C | T
MT019530 | 30-Dec-19 | China: Hubei, Wuhan | C | T
MT019531 | 30-Dec-19 | China: Hubei, Wuhan | C | T
MT019532 | 30-Dec-19 | China: Hubei, Wuhan | C | T
MT019533 | 1-Jan-20 | China: Hubei, Wuhan | C | T
MT121215 | 2-Feb-20 | China: Shanghai | C | T
MN938384 | 10-Jan-20 | China: Shenzhen | T | C
MN996531 | 30-Dec-19 | China: Wuhan | C | T
MN996529 | 30-Dec-19 | China: Wuhan | C | T
MN996527 | 30-Dec-19 | China: Wuhan | C | T
MN996530 | 30-Dec-19 | China: Wuhan | C | T
MN996528 | 30-Dec-19 | China: Wuhan | C | T
MT049951 | 17-Jan-20 | China: Yunnan | T | C
MT012098 | 27-Jan-20 | India: Kerala State | C | T
MT050493 | 31-Jan-20 | India: Kerala State | T | C
MT066156 | 30-Jan-20 | Italy | C | T
MT072688 | 13-Jan-20 | Nepal | C | T
MT039890 | Jan-20 | South Korea | C | T
MT093571 | 7-Feb-20 | Sweden | C | T
MT192759 | 25-Jan-20 | Taiwan | C | T
MT066175 | 31-Jan-20 | Taiwan | T | C
MT066176 | 5-Feb-20 | Taiwan | C | T
MN985325 | 19-Jan-20 | USA | T | C
MT184911 | 17-Feb-20 | USA | C | T
MT159705 | 17-Feb-20 | USA | C | T
MT159717 | 17-Feb-20 | USA | C | T
MT184912 | 17-Feb-20 | USA | C | T
MT159708 | 17-Feb-20 | USA | C | T
MT159707 | 17-Feb-20 | USA | C | T
MT159706 | 17-Feb-20 | USA | C | T
MT159710 | 17-Feb-20 | USA | C | T
MT184910 | 18-Feb-20 | USA | C | T
MT159718 | 18-Feb-20 | USA | C | T
MT159719 | 18-Feb-20 | USA | C | T
MT159713 | 18-Feb-20 | USA | C | T
MT159714 | 18-Feb-20 | USA | C | T
MT184907 | 18-Feb-20 | USA | C | T
MT159709 | 20-Feb-20 | USA | C | T
MT159711 | 20-Feb-20 | USA | C | T
MT184908 | 21-Feb-20 | USA | C | T
MT159722 | 21-Feb-20 | USA | C | T
MT159720 | 21-Feb-20 | USA | C | T
MT159721 | 21-Feb-20 | USA | C | T
MT184909 | 21-Feb-20 | USA | C | T
MT184913 | 24-Feb-20 | USA | C | T
MT159716 | 24-Feb-20 | USA | C | T
MT159715 | 24-Feb-20 | USA | C | T
MT159712 | 25-Feb-20 | USA | C | T
MN997409 | 22-Jan-20 | USA: AZ | T | C
MN994468 | 22-Jan-20 | USA: CA | C | T
MN994467 | 23-Jan-20 | USA: CA | T | C
MT044258 | 27-Jan-20 | USA: CA | C | T
MT027063 | 29-Jan-20 | USA: CA | C | T
MT027062 | 29-Jan-20 | USA: CA | C | T
MT027064 | 29-Jan-20 | USA: CA | C | T
MT106052 | 6-Feb-20 | USA: CA | T | C
MT106053 | 10-Feb-20 | USA: CA | C | T
MT118835 | 23-Feb-20 | USA: CA | C | T
MT192765 | 11-Mar-20 | USA: CA, San Diego County | C | T
MT044257 | 28-Jan-20 | USA: IL | T | C
MT039888 | 29-Jan-20 | USA: MA | C | T
MT188341 | 5-Mar-20 | USA: MN | T | C
MT188340 | 7-Mar-20 | USA: MN | C | T
MT188339 | 9-Mar-20 | USA: MN | T | C
MT152824 | 24-Feb-20 | USA: Snohomish County, WA | T | C
MT106054 | 11-Feb-20 | USA: TX | T | C
MT020880 | 25-Jan-20 | USA: WA | T | C
MT020881 | 25-Jan-20 | USA: WA | T | C
MT163716 | 27-Feb-20 | USA: WA | C | T
MT163717 | 28-Feb-20 | USA: WA | T | C
MT163718 | 29-Feb-20 | USA: WA | T | C
MT163719 | 1-Mar-20 | USA: WA | T | C
MT039887 | 31-Jan-20 | USA: WI | C | T
MT192773 | 22-Jan-20 | Viet Nam: Ho Chi Minh city | C | T
MT192772 | 22-Jan-20 | Viet Nam: Ho Chi Minh city | C | T
The two sites occupy different positions in their respective codons. Position 8782 is the third position of a codon that is either AGC or AGT, both of which code for serine, so the substitution at 8782 is synonymous, i.e., it doesn't change the amino acid sequence. Position 28144 is the second position of a codon that is either TTA or TCA. TTA codes for leucine and TCA codes for serine, so the substitution at 28144 does change the amino acid.
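The codon reasoning can be checked with a short Python sketch. The lookup table below is deliberately partial (only the four codons named in the text, taken from the standard genetic code), and the function name is a hypothetical helper introduced for illustration.

```python
# Only the four codons mentioned in the text; the full standard table has 64.
CODON_TO_AA = {"AGC": "Ser", "AGT": "Ser", "TTA": "Leu", "TCA": "Ser"}


def compare_codons(codon_a: str, codon_b: str) -> tuple:
    """Return (changed codon position, amino acid a, amino acid b, kind)."""
    diffs = [i + 1 for i, (x, y) in enumerate(zip(codon_a, codon_b)) if x != y]
    if len(diffs) != 1:
        raise ValueError("expected codons that differ at exactly one position")
    aa_a, aa_b = CODON_TO_AA[codon_a], CODON_TO_AA[codon_b]
    kind = "synonymous" if aa_a == aa_b else "nonsynonymous"
    return diffs[0], aa_a, aa_b, kind


print(compare_codons("AGC", "AGT"))  # codon containing position 8782
print(compare_codons("TTA", "TCA"))  # codon containing position 28144
```

The first call reports a change at codon position 3 with serine on both sides; the second reports a change at codon position 2 that swaps leucine for serine.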
Given the high degree of covariation between these two columns, it's no surprise that the mutual information of the two columns is high: MI = 0.90 bits.
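For readers who want to see how such a number is obtained, here is a minimal Python sketch of mutual information between two alignment columns, computed as H(X) + H(Y) - H(X, Y) in bits. The function names are hypothetical, and the example columns are built only from the rows listed in the table above (68 with C/T and 24 with T/C). The result depends on which sequences are included and on how ambiguous symbols such as Y are counted, so it need not match the 0.90 bits quoted above exactly.

```python
from collections import Counter
from math import log2


def entropy(symbols: list) -> float:
    """Shannon entropy, in bits, of a list of hashable symbols."""
    n = len(symbols)
    return -sum((c / n) * log2(c / n) for c in Counter(symbols).values())


def mutual_information(col_x: list, col_y: list) -> float:
    """MI in bits between two equal-length alignment columns."""
    if len(col_x) != len(col_y):
        raise ValueError("columns must have the same number of sequences")
    joint = list(zip(col_x, col_y))  # one (x, y) pair per sequence
    return entropy(col_x) + entropy(col_y) - entropy(joint)


# Columns rebuilt from the table rows: 68 sequences with C/T, 24 with T/C.
col_8782 = ["C"] * 68 + ["T"] * 24
col_28144 = ["T"] * 68 + ["C"] * 24

mi = mutual_information(col_8782, col_28144)
print(f"MI = {mi:.2f} bits")
```

When two columns covary perfectly, as they do here, the mutual information equals the entropy of either column, so its upper limit for a two-symbol column is 1 bit (reached when the two variants are equally common).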
We trimmed the genomic sequences at each end so that all the sequences were the same length and built a neighbor-joining phylogenetic tree using Jalview 2.11.0. The highlighted IDs are for the sequences with T at position 8782.
Other features of note are a 15-base gap in sequence MT159716 and a three-base insertion, TTC, in MT188341 at position 21331 relative to the other sequences. There are other mutations scattered throughout the sequences, but no other positions show the degree of variation seen at the two positions above. These mutations may or may not be significant. It's hard to tell at this point.
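One possible way to do this kind of end trimming on an existing alignment is sketched below in Python. It is an illustration under assumptions, not necessarily the procedure used for the tree: the function name and the toy alignment are hypothetical, and it assumes ragged ends show up as runs of leading or trailing gap characters.

```python
def trim_ragged_ends(alignment: dict) -> dict:
    """Keep only the columns where no sequence is still inside its leading or trailing gap run."""
    # Longest run of leading gaps across all sequences.
    start = max(len(seq) - len(seq.lstrip("-")) for seq in alignment.values())
    # Earliest point at which some sequence's trailing gap run begins.
    end = min(len(seq.rstrip("-")) for seq in alignment.values())
    return {name: seq[start:end] for name, seq in alignment.items()}


# Toy alignment (made-up sequences, not real genomes).
toy = {
    "a": "--ACGTAC--",
    "b": "ACACG-ACGT",
    "c": "-CACGTACG-",
}

trimmed = trim_ragged_ends(toy)
for name, seq in trimmed.items():
    print(name, seq)
```

Internal gaps, such as the one in sequence "b", are left in place; only the uneven ends are removed, so every trimmed row still has the same number of columns.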
What does it all mean?
We don't know. The full genome data doesn't carry enough information to do forensic tracking of where the variations arose. For example, many sequences simply list USA as the location. The fact that the two positions covary may mean that they are important to the structure of the virus, though it could also simply reflect descent from a common ancestor. We would like to hear from anyone more knowledgeable about these subjects.
You can download the data here.