Posts

Showing posts from November, 2022

COVID-19 Variants and Chunking

On November 22, 2022, I downloaded a metadata file containing 13,932,236 records from GISAID in order to view the growth of several emerging SARS-CoV-2 variants of interest.

First, the results. I plotted the growth of the variants BA.2.75, BQ.1, BQ.1.1, BQ.1.18, CL.1, and XBB in the USA from June 1 to the present. BQ.1 and BQ.1.1 are growing in the data. It's unclear yet whether they will have a large impact. We'll see after the Thanksgiving holidays.

Memory Issues

The metadata file is large, 9.8 GB on disk. Loading it into R on Windows takes a little over two minutes and uses about 8 GB for the data frame. For the purposes of demonstration, I'm using Windows 10 22H2, RStudio 2022.07.2, and R 4.2.2.

    > system.time(meta_data <- read_tsv('data/metadata.tsv', name_repair = 'universal'))
       user  system elapsed
     197.84    9.44  146.04
    > object.size(meta_data)
    8087686552 bytes

RStudio uses 9.6 GB once the data is loaded. PS...
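One common way to ease this kind of memory pressure is to read the file in chunks and keep only the rows needed from each chunk. The following is a minimal sketch of that idea using `read_tsv_chunked` from readr, not necessarily the approach taken in the full post. It runs against a tiny temporary file, and the column names `lineage` and `country` are hypothetical stand-ins rather than the real GISAID headers.

```r
library(readr)

# Toy stand-in for the real metadata file. The column names (lineage, country)
# are hypothetical, chosen only for this illustration.
toy_file <- tempfile(fileext = ".tsv")
toy <- data.frame(
  lineage = c("BQ.1", "BQ.1.1", "XBB", "BA.5", "BQ.1", "XBB"),
  country = c("USA", "USA", "USA", "USA", "Canada", "USA")
)
write_tsv(toy, toy_file)

# Variants plotted in the post.
variants <- c("BA.2.75", "BQ.1", "BQ.1.1", "BQ.1.18", "CL.1", "XBB")

# Callback applied to each chunk: keep only USA rows for the chosen variants.
keep_rows <- function(chunk, pos) {
  chunk[chunk$country == "USA" & chunk$lineage %in% variants, ]
}

# Read two rows at a time (tiny only because the toy file is tiny) and
# row-bind the filtered chunks into a single data frame.
subset_data <- read_tsv_chunked(
  toy_file,
  callback = DataFrameCallback$new(keep_rows),
  chunk_size = 2,
  col_types = cols(.default = col_character())
)

print(subset_data)
```

With a real file, a much larger `chunk_size` would be the natural choice. Only the filtered rows are kept between chunks, so peak memory depends on the chunk size and the size of the subset rather than on the whole file.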

Another Logistic Regression from Scratch

The world doesn't really need another description of how to code logistic regression. A good description of how to implement logistic regression can already be found here. In addition, there are many great packages for logistic regression: sklearn.linear_model.LogisticRegression in Python, glm in R, GLM in Julia, and many more.

I started following this course on Udemy. The course began with a brief discussion of logistic regression. I have used logistic regression techniques many times, but I didn't have a clear idea of how to implement it. I thought I might as well try. What follows is a very simple implementation of binary logistic regression.

Logistic Regression - Who lives, who dies?

Consider the following data. It's from the R package alr4. It describes the fate of the infamous Donner party. It consists of 91 observations of five variables. The important columns for our purposes are age, y (survival), and sex. We want to know: how did age and sex affect s...
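As a rough illustration of what a from-scratch fit can look like, here is a minimal sketch of binary logistic regression fitted by Newton-Raphson in base R. It is not the implementation from the full post. To stay self-contained it uses simulated data instead of the Donner data from alr4; the function name `fit_logistic`, the variables `age` and `male`, and the coefficients used for simulation are all made up for this example.

```r
# Sigmoid maps the linear predictor to a probability in (0, 1).
sigmoid <- function(z) 1 / (1 + exp(-z))

# Fit binary logistic regression by Newton-Raphson.
# X: numeric design matrix including an intercept column; y: vector of 0/1.
fit_logistic <- function(X, y, max_iter = 25, tol = 1e-8) {
  beta <- rep(0, ncol(X))
  for (i in seq_len(max_iter)) {
    p <- as.vector(sigmoid(X %*% beta))
    gradient <- t(X) %*% (y - p)               # gradient of the log-likelihood
    information <- t(X) %*% (X * (p * (1 - p))) # X' W X with W = diag(p * (1 - p))
    step <- as.vector(solve(information, gradient))
    beta <- beta + step
    if (max(abs(step)) < tol) break
  }
  beta
}

# Simulated data (hypothetical, NOT the Donner party data).
set.seed(1)
n <- 200
age <- runif(n, 1, 70)
male <- rbinom(n, 1, 0.5)
y <- rbinom(n, 1, sigmoid(1.5 - 0.04 * age - 0.8 * male))
X <- cbind(intercept = 1, age = age, male = male)

beta_hat <- fit_logistic(X, y)

# Compare against R's built-in glm as a sanity check.
glm_fit <- glm(y ~ age + male, family = binomial)
print(cbind(scratch = beta_hat, glm = coef(glm_fit)))
```

Newton-Raphson is used here because it is close to what `glm` does internally (iteratively reweighted least squares), so the two sets of estimates should agree closely. Plain gradient ascent would also work but typically needs many more iterations.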