Where the Deer and the Antelope Play

 SARS-CoV-2, in the realm of coronaviruses, has a very broad species range.
~ Laura Bashor Colorado State University

It raises the urgent question -- we know the deer are effectively transmitting virus among themselves, and then who are they giving it to?
~ Suresh Kuchipudi Pennsylvania State University, 

I was reading a news article about the spread of SARS-CoV-2 in the white-tailed deer 
(Odocoileus virginianus) population. I looked out my window and there were two white-tailed deer munching grass in the field next to the house.  

The news article was based on this paper that described sampling 283 white-tailed deer retropharyngeal lymph nodes and found that 94 (33%) of the deer tested positive for SARS-CoV-2. The authors indicate that infections likely resulted from multiple human-to-deer spillover and deer-to-deer transmission events.

I thought it might be interesting to see if there was genomic data with Odocoileus virginianus as a host for SARS-CoV-2. I downloaded 108 sequences with host labeled Odocoileus virginianus from GISAID on November 23, 2021. I added the reference sequence, EPI_ISL_402124, rearranged the FASTA headers  so that the accession ID was the sequence ID. I also eliminated sequences with greater than 20 N's in a row. This left 102 sequences, including the reference.

I also downloaded the metadata file containing 5,376,937 records including the 108 deer sequences. The metadata showed that the deer sequences have been assigned to a variety of different Pango lineages.

meta_data %>% filter(Host == 'Odocoileus virginianus') %>% group_by(Pango.lineage) %>% count() %>% rename(Count = n)

Pango Lineage

Count

B.1         

11

B.1.119     

2

B.1.2       

52

B.1.234     

6

B.1.240     

1

B.1.264     

1

B.1.311     

20

B.1.396     

1

B.1.582     

2

B.1.596     

10

C.36.3      

1

None  

1


These sequences are sublineages of B.1.  B.1 is one of the original lineages, showing up early in the pandemic, 2020-01-01.

Most were collected during December 2020. It would be good to have uniform sampling of the deer population in the future. Until then, it's hard to predict trends.

df_deer <- meta_data %>% filter(Host == 'Odocoileus virginianus') 
ggplot(df_deer) + geom_histogram(aes(x = Collection.date, fill = Pango.lineage))  +
  xlab('Collection Date') + 
  ggtitle('Odocoileus virginianus       2020-09-28 to 2021-02-25')



You can see that, so far, the samples came from limited regions.

temp <- df_deer %>% 
     select(Location) %>%
     separate(Location, into = c('Region', 'Country', 'State'), sep = ' / ') %>%
     group_by(State) %>%
     count() %>%
     rename(Count = n)


State

Count

Iowa

94

Ohio

14


Mutations


I aligned the downloaded sequences with  MAFFT v7.490.  MAFFT converts the sequences to lower case, so it was necessary to convert them back to upper case.

mafft --retree 2 --maxiterate 1000 deer_seqs_filter.fasta > deer_aln.fasta


from Bio import SeqIO
from Bio import SeqRecord
from Bio import Seq

with open('deer_aln_uc.fasta', 'w') as fd:
    for rec in SeqIO.parse('deer_aln.fasta', 'fasta'):
        seq = str(rec.seq).upper()
        new_seq = SeqRecord.SeqRecord(seq = Seq.Seq(seq),
                                    id = rec.id,
                                    description = rec.description)
        SeqIO.write(new_seq, fd, 'fasta')
        

Once the sequences were reformatted, I created a table of mutations, i.e. variations from the GISAID reference sequence, EPI_ISL_402124. 

python gisaid_mutations.py -i deer_aln_uc.fasta -o deer_mut.csv -g NC_045512.gb       

gisaid_mutations.py compares aligned sequences against the the reference sequence. It uses the sequence NC_045512.2 GenBank annotation to identify genomic regions of interest.


The program identifies 537 nucleotide variations. Including 77 variations in the spike protein. You can see the genomic regions where the mutations were found in following table.


df_mut <- read.csv('output/deer_mut.csv')
df_mut %>% group_by(product) %>% count() %>% rename(Count = n)



product

Count

36

2'-O-ribose methyltransferase

7

3'-to-5' exonuclease

35

3C-like proteinase

12

endoRNAse

10

envelope protein

2

helicase

26

leader protein

27

membrane glycoprotein

8

nsp10

6

nsp2

38

nsp3

62

nsp4

35

nsp6

16

nsp7

7

nsp8

10

nsp9

8

nucleocapsid phosphoprotein

33

ORF10 protein

3

ORF3a protein

23

ORF7a protein

19

ORF7b

1

ORF8 protein

13

RNA-dependent RNA polymerase

23

surface glycoprotein

77


Here are the spike mutations.


df_mut %>% 
filter(product == 'surface glycoprotein') %>%
select(reference.positions, ref_nucleotide, alt_nucleotide, codon, alt_codons, mutation)



reference.positions

ref_nucleotide

alt_nucleotide

codon

alt_codons

mutation

21575

C

T

CTT

TTT

L5F

21597

C

T

TCT

TTT

S12F

21614

C

T

CTT

TTT

L18F

21618

C

G

ACA

AGA

T19R

21622

C

T

ACC

ACT

T20T

21627

C

T

ACT

ATT

T22I

21648

C

T

ACT

ATT

T29I

21658

C

T

TTC

TTT

F32F

21705

T

C

TTA

TCA

L48S

21707

C

T

CAT

TAT

H49Y

21721

C

T

GAC

GAT

D53D

21770

G

C

GTC

CTC

V70L

21776

G

A

GGG

AGG

G72R

21802

T

C

GAT

GAC

D80D

21981

T

-

TTT

T-T

F140del

21982

T

-

TTT

TT-

F140del

21983

T

-

TTG

-TG

L141del

21984

T

-

TTG

T-G

L141del

21985

G

-

TTG

TT-

L141del

21986

G

-

GGT

-GT

G142del

21987

G

-

GGT

G-T

G142del

21988

T

-

GGT

GG-

G142del

21989

G

-

GTT

-TT

V143del

21990

T

-

GTT

G-T

V143del

21991

T

-

GTT

GT-

V143del

21992

T

-

TAT

-AT

Y144del

21993

A

T

TAT

TTT

Y144F

22021

G

T

ATG

ATT

M153I

22029

A

-

GAG

G-G

E156del

22030

G

-

GAG

GA-

E156del

22031

T

-

TTC

-TC

F157del

22032

T

-

TTC

T-C

F157del

22033

C

-

TTC

TT-

F157del

22034

A

-

AGA

-AG

R158del

22149

A

-

AAT

A-T

N196del

22150

T

-

AAT

AA-

N196del

22162

T

C

TAT

TAC

Y200Y

22295

C

T

CAT

TAT

H245Y

22350

C

T

GCA

GTA

A263V

22432

C

T

GAC

GAT

D290D

22450

C

T

CTC

CTT

L296L

22498

C

T

ATC

ATT

I312I

22627

G

A

AGG

AGA

R355R

22669

T

C

TAT

TAC

Y369Y

22687

C

T

TCC

TCT

S375S

22750

T

C

TAT

TAC

Y396Y

22798

A

G

CCA

CCG

P412P

22995

C

A

ACA

AAA

T478K

23014

A

C

GAA

GAC

E484D

23066

G

T

GGT

TGT

G502C

23266

C

T

GAC

GAT

D568D

23277

C

T

ACT

ATT

T572I

23282

G

T

GAT

TAT

D574Y

23284

T

C

GAT

GAC

D574D

23401

G

T

CAG

CAT

Q613H

23403

A

G

GAT

GGT

D614G

23604

C

G

CCT

CGT

P681R

23664

C

T

GCA

GTA

A701V

23740

T

C

ATT

ATC

I726I

24055

T

C

GCT

GCC

A831A

24206

A

T

ATC

TTC

I882F

24232

A

G

GCA

GCG

A890A

24320

C

A

CAA

AAA

Q920K

24337

C

T

AAC

AAT

N925N

24410

G

A

GAT

AAT

D950N

24418

C

A

GTC

GTA

V952V

24542

G

T

GAT

TAT

D994Y

24544

T

C

GAT

GAC

D994D

24781

G

A

AAG

AAA

K1073K

24900

A

-

CAA

C-A

Q1113del

24997

A

G

TTA

TTG

L1145L

24998

G

T

GAC

TAC

D1146Y

25018

A

G

TTA

TTG

L1152L

25169

C

T

CTT

TTT

L1203F

25244

G

T

GTA

TTA

V1228L

25292

C

A

CTC

ATC

L1244I

25350

C

A

CCA

CAA

P1263Q


Notice that the D614G mutation appears. It shows up in all of the sequences. This mutation appeared relatively early in the pandemic.  It has been found that to enhance viral replication in human lung airway tissues. Other mutations oif note are highlighted in the table along with links to descriptions of their relevance. The full set of mutations for all the genomic regions is available on GitHub.

Mutual Information

I also calculated the mutual information between the aligned sequence columns. 

python gisaid_MI.py -i deer_aln_uc.fasta -o deer_mi.csv

The table below shows the mutual information between positions of the aligned sequences with MI >= 0.7.

Position_1

Position_2

MI

10319

25907

0.941045455

25907

28472

0.883140787

25907

27964

0.844033431

27964

28472

0.816303445

10319

28472

0.816298105

10319

27964

0.768796929

204

24997

0.764232905

12525

13665

0.74484856

11919

12737

0.729264896

13665

28893

0.728748237

12525

28893

0.720429105

11195

11919

0.714923206

19180

21776

0.711697689

12737

12792

0.711413545

11195

12792

0.704167711

18424

25907

0.701801109

10319

18424

0.701731398


You can see the connections among the pairs in the following plot. The connections do not mean that one position necessarily influences the other. It just means that they vary together in this small data set. That correlation may disappear as more deer sequences are added to teh database.



We can examine the products that are involved in these pairs.

python gisaid_MI_features.py -i deer_mi.csv -o deer_mi_features.csv -m deer_mut.csv -c 0.7

ref.positions

ref_nucleotide

alt_nucleotide

codon

alt_codons

mutation

product

10319

C

T

CTT

TTT

L89F

3C-like proteinase

25907

G

T

GGT

GTT

G172V

ORF3a protein

28472

C

T

CCT

TCT

P67S

nucleocapsid phosphoprotein

27964

C

T

TCA

TTA

S24L

ORF8 protein

204

G

T

24997

A

G

TTA

TTG

L1145L

surface glycoprotein

12525

C

T

ACA

ATA

T145I

nsp8

13665

C

T

CAC

CAT

H66H

RNA-dependent RNA polymerase

11919

C

T

TCT

TTT

S26F

nsp7

12737

A

G

ACT

GCT

T18A

nsp9

28893

C

T

CCT

CTT

P207L

nucleocapsid phosphoprotein

11195

C

T

CTT

TTT

L75F

nsp6

19180

G

T

GTA

TTA

V381L

3'-to-5' exonuclease

21776

G

A

GGG

AGG

G72R

surface glycoprotein

12792

A

T

AAG

ATG

K36M

nsp9

18424

A

G

AAT

GAT

N129D

3'-to-5' exonuclease



The presence of COVID in common animal hosts is worrisome. They provide a reservoir for the disease and a possible breeding ground for mutations.

No comments:

Post a Comment