Finding B.1.1.7 Mutations

This variant (B.1.1.7) is considerably more contagious than the original virus. It has spread rapidly around the globe and likely accounts already for at least one-third of all cases in the United States.
~ Francis Collins

As viral sequencing ramps up, more variants will be detected. Currently, three variants appear to be of concern: B.1.1.7, B.1.351, and P.1, first detected in the UK, South Africa, and Brazil, respectively. B.1.1.7 has an unusually large number of mutations, including many in the spike protein. Although this is the best established for B.1.1.7, all of these variants may transmit better from person to person. (Posted January 19, 2021) 
~ Kartik Chandran, PhD 

There is so much SARS-CoV-2 sequence data available that just manipulating it is a problem. This article describes just a few of the bottlenecks facing researchers trying to deal with a flood of data. The large number of sequences available for download means that a seemingly simple task like confirming the reported mutations in a lineage requires a fair amount of work. I'll try and describe the process I went through to produce a table of mutations.

The first step is to download the latest sequence and metadata from GISAID. I downloaded a set of data on March 18, 2021 consisting of 818,105 sequences. That was two weeks ago. Now there are more than 900,000 sequences available and more coming each day.

The meta file contains 28 columns, representing a variety of properties of the sequences.

head -n 1 metadata.tsv
strain  virus   gisaid_epi_isl  genbank_accession       date    region  country division        location region_exposure  country_exposure        division_exposure       segment length  host    age     sex     Nextstrain_clade  pango_lineage   GISAID_clade    originating_lab submitting_lab  authors url     title   paper_url date_submitted  purpose_of_sequencing

In order to get an overview of what was in the data, I did some counting. 

# Nextstrain Strains in metadata
cut -d$'\t' -f 18 metadata.tsv | tail -n +2 | sort | uniq -c > strains.txt

# Region
cut -d$'\t' -f 6 metadata.tsv | tail -n +2 | sort | uniq -c > region.txt

# Region of exposure
cut -d$'\t' -f 10 metadata.tsv | tail -n +2 | sort | uniq -c > region_exposure.txt

# Country
cut -d$'\t' -f 7 metadata.tsv | tail -n +2 | sort | uniq -c > country.txt

# Country of exposure
cut -d$'\t' -f 11 metadata.tsv | tail -n +2 | sort | uniq -c > country_exposure.txt

#Lineage
cut -d$'\t' -f 19 metadata.tsv | tail -n +2 | sort | uniq -c > lineage.txt


lineage.txt shows that there are 161,760 B.1.1.7 records in the metatable. 

The next step was to reformat the FASTA headers of the sequence file to include the country of exposure and the PANDO lineage. Like any big data set, there are problems: sequence IDs not included in the metafile, metafile records that don't have a corresponding sequence record, duplicate IDs, etc. Reformatting the headers makes life simpler in the remaining steps. All of the code, but not the data, is available on GitHub.
python gisaid_reformat_fasta_headers.py -i hov_19_sequences.fasta -m metadata.tsv -o hcov_19_reformat.fasta -r 2> missing_ids.txt -n 20
missing_ids.txt contains the IDs of sequences not found in metafile. In addition, the program removes sequences that contain long stretches of  N's (the -n 20 option). These sequences are not copied to output. After processing 496,936 sequences remain.

Next, I extracted the sequences identified as being from the B.1.1.7 lineage based on the information in the FASTA header. 
python gisaid_extract_fasta.py -i /gisaid_hcov_19_reformat.fasta -o gisaid_B_1_1_7.fasta -l B.1.1.7
gisaid_extract_fasta.py will extract sequences from a particular country of exposure as well and optionally convert 'U' to 'T' in the sequence. gisaid_B_1_1_7.fasta contains 118,945 sequences; a bit too big to deal with efficiently. Instead of using all of this data, I sample a smaller subset at random.

python gisaid_sample_seqs.py -i gisaid_B_1_1_7.fasta -o gisaid_B_1_1_7_10000.fasta -n 10000
The new file contains 10,000 sequences.

I want to align these sequences to a reference sequence. For the purposes of this analysis a mutation is a nucleotide variation from the the reference sequence. I will need to extract the reference sequence from the larger sequence file.
 gisaid_extract_ref_seq.py -i gisaid_hcov_19_reformat.fasta -o wuhan_ref_seq.fata
Next, I align the 10,000 sampled sequences to the reference sequence using MAFFT v7.475.
mafft --6merpair --maxambiguous 0.05 --preservecase --thread -1 --addfragments gisaid_B_1_1_7_10000.fasta wuhan_ref_seq.fasta > gisaid_B_1_1_7_10000_aln.fasta
Finally, the aligned sequences can be searched for variation.
python gisaid_mutations.py -i gisaid_B_1_1_7_10000_aln.fasta -o gisaid_B_1_1_7_10000_aln.csv -g NC_045512.gb
NC_045512.gb is the GenBank record for the Wuhan reference sequences. The sequence is identical to the  GISAID reference sequence Wuhan/WIV04/2019.

The  CSV file contains a 21 columns. I read it into R to extract some columns.

mut_B_1_1_7 <- read.csv('gisaid_B_1_1_7_10000_aln.csv', header = TRUE)
library(tidyverse)
mut_B_1_1_7_2 <- mut_B_1_1_7 %>% select(reference.positions, ref_nucleotide, alt_nucleotide, codon, ref_aa, alt_codons, alt_aas, aa_pos, mutation)

The final table looks like this.

reference.positions

ref_nucleotide

alt_nucleotide

codon

ref_aa

alt_codons

alt_aas

mutation

241

C

T

913

C

T

TCC

S

TCT

S

S36S

3037

C

T

TTC

F

TTT

F

F106F

3267

C

T

ACT

T

ATT

I

T183I

5388

C

A

GCT

A

GAT

D

A890D

5986

C

T

TTC

F

TTT

F

F1089F

6954

T

C

ATA

I

ACA

T

I1412T

14408

C

T

CCT

P

CTT

L

P314L

14676

C

T

CCC

P

CCT

P

P403P

15279

C

T

CAC

H

CAT

H

H604H

16176

T

C

ACT

T

ACC

T

T903T

23063

A

T

AAT

N

TAT

Y

N501Y

23271

C

A

GCT

A

GAT

D

A570D

23403

A

G

GAT

D

GGT

G

D614G

23604

C

A

CCT

P

CAT

H

P681H

23709

C

T

ACA

T

ATA

I

T716I

24506

T

G

TCA

S

GCA

A

S982A

24914

G

C

GAC

D

CAC

H

D1118H

27972

C

T

CAA

Q

TAA

*

Q27*

28048

G

T

AGA

R

ATA

I

R52I

28111

A

G

TAC

Y

TGC

C

Y73C

28280

G

C

GAT

D

CAT

H

D3H

28281

A

T

GAT

D

GTT

V

D3V

28282

T

A

GAT

D

GAA

E

D3E

28881

G

A

AGG

R

AAG

K

R203K

28882

G

A

AGG

R

AGA

R

R203R

28883

G

C

GGA

G

CGA

R

G204R

28977

C

T

TCT

S

TTT

F

S235F


These results correspond to those reported here.

Code for manipulating the sequences can be found on GitHub. Data is available from GISAID.

No comments:

Post a Comment