Posts

Showing posts from April, 2021

Data Wrangling with Julia, R, and Python

What we have is a data glut. ~ Vernor Vinge

It is a capital mistake to theorize before one has data. ~ Sherlock Holmes

GISAID provides a metafile describing the contents of genome sequence files. The metafile is a tab-delimited table. The version I downloaded on April 15, 2021 contains 1,107,140 records, each with 22 columns. The data columns are

[1] "Virus.name" "Type" "Accession.ID"
[4] "Collection.date" "Location" "Additional.location.information"
[7] "Sequence.length" "Host" "Patient.age"
[10] "Gender" "Clade" "Pango.lineage"
[13] "Pangolin.version" "Variant" ...
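As a rough illustration of reading a tab-delimited table like this one, here is a minimal Python sketch that uses only the standard library. The sample rows and accession IDs are made up for the example, and the dotted column names follow the listing above, which looks like R output; the header row in the raw file may be spelled differently.

```python
import csv
import io

# Hypothetical two-record sample standing in for the real metafile.
# A real run would pass an open file handle instead of a StringIO.
SAMPLE = (
    "Virus.name\tAccession.ID\tCollection.date\tPango.lineage\n"
    "hCoV-19/Country/ABC-1/2021\tEPI_ISL_0000001\t2021-01-05\tB.1.1.7\n"
    "hCoV-19/Country/ABC-2/2021\tEPI_ISL_0000002\t2021-02-11\tP.1\n"
)


def summarize(handle):
    """Return (column names, record count) for a tab-delimited table."""
    reader = csv.DictReader(handle, delimiter="\t")
    # Stream the rows so a file with over a million records
    # is never held in memory all at once.
    count = sum(1 for _ in reader)
    return reader.fieldnames, count


columns, n_records = summarize(io.StringIO(SAMPLE))
print(columns)
print(n_records)
```

Streaming the rows instead of loading the whole table is a deliberate choice for a file of this size; a data-frame library would be more convenient for anything beyond counting.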

Julia as Glue

A language that doesn't affect the way you think about programming is not worth knowing. ~ Alan J. Perlis

Is it possible that software is not like anything else, that it is meant to be discarded: that the whole point is to always see it as a soap bubble? ~ Alan J. Perlis

Is Julia the solution to the two-language problem? The two-language problem is usually described as arising from using a language like Python or R and discovering that it is too slow for the task at hand. In order to get adequate performance, key routines are written in C or C++. A prime example is NumPy for Python. Another example is using Python for data munging and R for statistical analysis and visualization. I have been using the Python and R combo for some time now. I have started looking at Julia as a possible replacement for both which, if it is as fast as advertised, has the bonus of speeding up processing. One of the key tasks in data munging is gl...

Brazil and P.1

Normal led to this. ~ Ed Yong, writer for The Atlantic

Be fast, have no regrets... If you need to be right before you move, you will never win. ~ Mike Ryan, epidemiologist at WHO

I had intended to continue exploring the use of the Julia Language for analyzing SARS-CoV-2 genomes, but first I wanted to take a quick look at the growth of the P.1 variant, a.k.a. the Brazilian variant. This lineage is considered a variant of concern. It has 17 amino acid changes, ten of which are in its spike protein, including these three designated to be of particular concern: N501Y, E484K, and K417T. We have seen these before. On April 15, 2021, I downloaded the metafile from GISAID. The file contains 1,107,140 records of SARS-CoV-2 observations, mostly drawn from humans. I read the file into R and tried to use the routine presented here to show a plot of the growth of the P.1 variant in Brazil. The code failed. It didn't take long to figure out why the program di...
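The R routine itself is not shown in this excerpt, so the following is a separate Python sketch of the underlying idea: counting records of one lineage in one country per collection month. It assumes columns named "Collection.date", "Location", and "Pango.lineage", a location written as "Region / Country / ...", and dates in YYYY-MM-DD form that may be truncated to a year or a year and month. All of these are assumptions, and the sample rows are invented.

```python
import csv
import io
from collections import Counter

# Hypothetical sample rows; the column names and the "Region / Country / ..."
# location format are assumptions for this sketch.
SAMPLE = (
    "Collection.date\tLocation\tPango.lineage\n"
    "2021-01-12\tSouth America / Brazil / Amazonas\tP.1\n"
    "2021-02-03\tSouth America / Brazil / Sao Paulo\tP.1\n"
    "2021-02-20\tSouth America / Brazil / Sao Paulo\tB.1.1.28\n"
    "2021-02\tSouth America / Brazil / Amazonas\tP.1\n"
    "2021\tSouth America / Brazil\tP.1\n"
    "2021-02-11\tEurope / Italy / Umbria\tP.1\n"
)


def monthly_counts(handle, lineage, country):
    """Count records of `lineage` in `country`, keyed by 'YYYY-MM'."""
    counts = Counter()
    for row in csv.DictReader(handle, delimiter="\t"):
        parts = [p.strip() for p in row["Location"].split("/")]
        if len(parts) < 2 or parts[1] != country:
            continue  # different country, or location too short to tell
        if row["Pango.lineage"] != lineage:
            continue
        date = row["Collection.date"]
        if len(date) < 7:
            continue  # year-only dates cannot be assigned to a month
        counts[date[:7]] += 1
    return dict(sorted(counts.items()))


p1_brazil = monthly_counts(io.StringIO(SAMPLE), "P.1", "Brazil")
print(p1_brazil)
```

The resulting month-to-count mapping is the kind of series a growth plot would be drawn from; the plotting step is left out here.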

BioJulia and BioPython

Julia is dynamically typed, feels like a scripting language, and has good support for interactive use. ~ The Julia Programming Language

Did we mention it should be as fast as C? ~ Jeff Bezanson, Stefan Karpinski, Viral B. Shah, Alan Edelman

I have been interested in the Julia language for some time, but haven't had a chance to do much with it. The recent release of version 1.6 piqued my interest. I decided to try a simple exercise to get back into Julia programming. In particular, I wanted to try BioJulia. In a previous post, I discussed searching for mutations in SARS-CoV-2 sequence data. The original code was written in Python and used BioPython and pandas. I wanted to compare the Python implementation to a Julia version. One of the first tasks in the mutation pipeline was to reformat the FASTA headers in the genome sequences downloaded from GISAID. This is a simple exercise: read the sequence file, one record at a time; reformat the FASTA header ID to match the strain ID...
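The record-at-a-time header rewrite described above can be sketched in plain Python without BioPython. This is not the code from the post: the header layout ("hCoV-19/" prefix, fields separated by "|") and the target ID format are assumptions made for illustration, and the sample records are invented.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs one record at a time."""
    header, chunks = None, []
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def reformat_header(header):
    """Keep only the strain name (assumed to be the first '|' field)."""
    strain = header.split("|")[0]
    prefix = "hCoV-19/"  # assumed prefix, dropped if present
    if strain.startswith(prefix):
        strain = strain[len(prefix):]
    return strain


def rewrite(src, dst):
    """Copy records from src to dst with reformatted headers; return the count."""
    n = 0
    for header, seq in read_fasta(src):
        # Each sequence is written on a single line (no re-wrapping).
        dst.write(f">{reformat_header(header)}\n{seq}\n")
        n += 1
    return n


# Hypothetical two-record input standing in for a downloaded sequence file.
SAMPLE = (
    ">hCoV-19/Country/ABC-1/2021|EPI_ISL_0000001|2021-01-05\n"
    "ACGT\nTTGA\n"
    ">hCoV-19/Country/ABC-2/2021|EPI_ISL_0000002|2021-01-06\n"
    "GGCC\n"
)
out = io.StringIO()
n_written = rewrite(io.StringIO(SAMPLE), out)
print(n_written)
print(out.getvalue(), end="")
```

Because `read_fasta` is a generator, only one record is in memory at a time, which matters for multi-gigabyte sequence downloads.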

Finding B.1.1.7 Mutations

This variant (B.1.1.7) is considerably more contagious than the original virus. It has spread rapidly around the globe and likely accounts already for at least one-third of all cases in the United States. ~ Francis Collins

As viral sequencing ramps up, more variants will be detected. Currently, three variants appear to be of concern: B.1.1.7, B.1.351, and P.1, first detected in the UK, South Africa, and Brazil, respectively. B.1.1.7 has an unusually large number of mutations, including many in the spike protein. Although this is the best established for B.1.1.7, all of these variants may transmit better from person to person. (Posted January 19, 2021) ~ Kartik Chandran, PhD

There is so much SARS-CoV-2 sequence data available that just manipulating it is a problem. This article describes just a few of the bottlenecks facing researchers trying to deal with a flood of data. The large number of sequences available for download means that a seemingly simple task like confirming t...