This variant (B.1.1.7) is considerably more contagious than the original virus. It has spread rapidly around the globe and likely accounts already for at least one-third of all cases in the United States.
~ Francis Collins
~ Francis Collins
As viral sequencing ramps up, more variants will be detected. Currently, three variants appear to be of concern: B.1.1.7, B.1.351, and P.1, first detected in the UK, South Africa, and Brazil, respectively. B.1.1.7 has an unusually large number of mutations, including many in the spike protein. Although this is the best established for B.1.1.7, all of these variants may transmit better from person to person. (Posted January 19, 2021)
~ Kartik Chandran, PhD
There is so much SARS-CoV-2 sequence data available that just manipulating it is a problem. This article describes just a few of the bottlenecks facing researchers trying to deal with a flood of data. The large number of sequences available for download means that a seemingly simple task like confirming the reported mutations in a lineage requires a fair amount of work. I'll try and describe the process I went through to produce a table of mutations.
The first step is to download the latest sequence and metadata from GISAID. I downloaded a set of data on March 18, 2021 consisting of 818,105 sequences. That was two weeks ago. Now there are more than 900,000 sequences available and more coming each day.
The meta file contains 28 columns, representing a variety of properties of the sequences.
head -n 1 metadata.tsv
strain virus gisaid_epi_isl genbank_accession date region country division location region_exposure country_exposure division_exposure segment length host age sex Nextstrain_clade pango_lineage GISAID_clade originating_lab submitting_lab authors url title paper_url date_submitted purpose_of_sequencing
In order to get an overview of what was in the data, I did some counting.
# Nextstrain Strains in metadata
cut -d$'\t' -f 18 metadata.tsv | tail -n +2 | sort | uniq -c > strains.txt
# Region
cut -d$'\t' -f 6 metadata.tsv | tail -n +2 | sort | uniq -c > region.txt
# Region of exposure
cut -d$'\t' -f 10 metadata.tsv | tail -n +2 | sort | uniq -c > region_exposure.txt
# Country
cut -d$'\t' -f 7 metadata.tsv | tail -n +2 | sort | uniq -c > country.txt
# Country of exposure
cut -d$'\t' -f 11 metadata.tsv | tail -n +2 | sort | uniq -c > country_exposure.txt
#Lineage
cut -d$'\t' -f 19 metadata.tsv | tail -n +2 | sort | uniq -c > lineage.txt
lineage.txt shows that there are 161,760 B.1.1.7 records in the metatable.
The next step was to reformat the FASTA headers of the sequence file to include the country of exposure and the PANDO lineage. Like any big data set, there are problems: sequence IDs not included in the metafile, metafile records that don't have a corresponding sequence record, duplicate IDs, etc. Reformatting the headers makes life simpler in the remaining steps. All of the code, but not the data, is available on GitHub.
python gisaid_reformat_fasta_headers.py -i hov_19_sequences.fasta -m metadata.tsv -o hcov_19_reformat.fasta -r 2> missing_ids.txt -n 20
missing_ids.txt contains the IDs of sequences not found in metafile. In addition, the program removes sequences that contain long stretches of N's (the -n 20 option). These sequences are not copied to output. After processing 496,936 sequences remain.
Next, I extracted the sequences identified as being from the B.1.1.7 lineage based on the information in the FASTA header.
python gisaid_extract_fasta.py -i /gisaid_hcov_19_reformat.fasta -o gisaid_B_1_1_7.fasta -l B.1.1.7
gisaid_extract_fasta.py will extract sequences from a particular country of exposure as well and optionally convert 'U' to 'T' in the sequence. gisaid_B_1_1_7.fasta contains 118,945 sequences; a bit too big to deal with efficiently. Instead of using all of this data, I sample a smaller subset at random.
python gisaid_sample_seqs.py -i gisaid_B_1_1_7.fasta -o gisaid_B_1_1_7_10000.fasta -n 10000
The new file contains 10,000 sequences.I want to align these sequences to a reference sequence. For the purposes of this analysis a mutation is a nucleotide variation from the the reference sequence. I will need to extract the reference sequence from the larger sequence file.
gisaid_extract_ref_seq.py -i gisaid_hcov_19_reformat.fasta -o wuhan_ref_seq.fata
Next, I align the 10,000 sampled sequences to the reference sequence using MAFFT v7.475.
mafft --6merpair --maxambiguous 0.05 --preservecase --thread -1 --addfragments gisaid_B_1_1_7_10000.fasta wuhan_ref_seq.fasta > gisaid_B_1_1_7_10000_aln.fasta
Finally, the aligned sequences can be searched for variation.
python gisaid_mutations.py -i gisaid_B_1_1_7_10000_aln.fasta -o gisaid_B_1_1_7_10000_aln.csv -g NC_045512.gb
NC_045512.gb is the GenBank record for the Wuhan reference sequences. The sequence is identical to the GISAID reference sequence Wuhan/WIV04/2019.The CSV file contains a 21 columns. I read it into R to extract some columns.
mut_B_1_1_7 <- read.csv('gisaid_B_1_1_7_10000_aln.csv', header = TRUE)
library(tidyverse)
mut_B_1_1_7_2 <- mut_B_1_1_7 %>% select(reference.positions, ref_nucleotide, alt_nucleotide, codon, ref_aa, alt_codons, alt_aas, aa_pos, mutation)
The final table looks like this.
reference.positions |
ref_nucleotide |
alt_nucleotide |
codon |
ref_aa |
alt_codons |
alt_aas |
mutation |
241 |
C |
T |
|||||
913 |
C |
T |
TCC |
S |
TCT |
S |
S36S |
3037 |
C |
T |
TTC |
F |
TTT |
F |
F106F |
3267 |
C |
T |
ACT |
T |
ATT |
I |
T183I |
5388 |
C |
A |
GCT |
A |
GAT |
D |
A890D |
5986 |
C |
T |
TTC |
F |
TTT |
F |
F1089F |
6954 |
T |
C |
ATA |
I |
ACA |
T |
I1412T |
14408 |
C |
T |
CCT |
P |
CTT |
L |
P314L |
14676 |
C |
T |
CCC |
P |
CCT |
P |
P403P |
15279 |
C |
T |
CAC |
H |
CAT |
H |
H604H |
16176 |
T |
C |
ACT |
T |
ACC |
T |
T903T |
23063 |
A |
T |
AAT |
N |
TAT |
Y |
N501Y |
23271 |
C |
A |
GCT |
A |
GAT |
D |
A570D |
23403 |
A |
G |
GAT |
D |
GGT |
G |
D614G |
23604 |
C |
A |
CCT |
P |
CAT |
H |
P681H |
23709 |
C |
T |
ACA |
T |
ATA |
I |
T716I |
24506 |
T |
G |
TCA |
S |
GCA |
A |
S982A |
24914 |
G |
C |
GAC |
D |
CAC |
H |
D1118H |
27972 |
C |
T |
CAA |
Q |
TAA |
* |
Q27* |
28048 |
G |
T |
AGA |
R |
ATA |
I |
R52I |
28111 |
A |
G |
TAC |
Y |
TGC |
C |
Y73C |
28280 |
G |
C |
GAT |
D |
CAT |
H |
D3H |
28281 |
A |
T |
GAT |
D |
GTT |
V |
D3V |
28282 |
T |
A |
GAT |
D |
GAA |
E |
D3E |
28881 |
G |
A |
AGG |
R |
AAG |
K |
R203K |
28882 |
G |
A |
AGG |
R |
AGA |
R |
R203R |
28883 |
G |
C |
GGA |
G |
CGA |
R |
G204R |
28977 |
C |
T |
TCT |
S |
TTT |
F |
S235F |
These results correspond to those reported here.
Code for manipulating the sequences can be found on GitHub. Data is available from GISAID.
No comments:
Post a Comment