My wife asked me to write a little program for her. She wanted to download some COVID-19 data from health.data.ny.gov and calculate a moving average of the sum of two of the columns in the file. The URL of the data is https://health.data.ny.gov/api/views/jw46-jpb7/rows.csv?accessType=DOWNLOAD.
The data file has 37 columns and 148,011+ rows. We only care about the first few. Here is the header and the first four rows and several columns
As of
Date |
Facility
PFI |
Facility
Name |
DOH
Region |
Facility
County |
Facility
Network |
NY
Forward Region |
Patients
Currently Hospitalized |
Patients
Admitted Due to COVID |
Patients
Admitted Not Due to COVID |
Patients
Newly Admitted |
Patients
Positive After Admission |
3/26/2020 |
1 |
ALBANY
MEDICAL CENTER HOSPITAL |
CAPITAL
DISTRICT REGIONAL OFFICE |
ALBANY |
ALBANY
MEDICAL CENTER |
CAPITAL
REGION |
12 |
1 |
1 |
||
3/26/2020 |
5 |
ST
PETERS HOSPITAL |
CAPITAL
DISTRICT REGIONAL OFFICE |
ALBANY |
ST.
PETERS HEALTH PARTNERS |
CAPITAL
REGION |
10 |
0 |
2 |
||
3/26/2020 |
37 |
CUBA
MEMORIAL HOSPITAL INC |
WESTERN
REGIONAL OFFICE |
ALLEGANY |
INDEPENDENT |
WESTERN
NEW YORK |
0 |
0 |
0 |
||
3/26/2020 |
39 |
MEMORIAL
HOSP OF WM F AND GERTRUDE F JONES AKA JONES MEMORIAL HOSP |
WESTERN
REGIONAL OFFICE |
ALLEGANY |
UNIVERSITY
OF ROCHESTER MEDICAL CENTER |
WESTERN
NEW YORK |
0 |
0 |
0 |
Luckily, the data is tidy. This means the data files doesn't have to be extensively reformatted before processing.
R
When faced with small project to manipulate a data table, my inclination is to use R. R's tidyverse libraries make manipulating a data table like the one above relatively simple.
covid_hospitalizations <- function(lag = 7) { require(tidyverse) require(zoo) url <- 'https://health.data.ny.gov/api/views/jw46-jpb7/rows.csv?accessType=DOWNLOAD' hospitals <-c('ALBANY MEDICAL CENTER HOSPITAL', 'ST PETERS HOSPITAL', 'COLUMBIA MEMORIAL HOSPITAL', 'SAMARITAN HOSPITAL', 'SARATOGA HOSPITAL', 'ELLIS HOSPITAL')
df <- read.csv(url) %>%
select(As.of.Date, Facility.Name, Patients.Newly.Admitted, Patients.Positive.After.Admission) %>% filter(Facility.Name %in% {{ hospitals }}) %>% mutate(As.of.Date = as.Date(As.of.Date, format = '%m/%d/%Y')) %>% group_by(As.of.Date) %>% summarise(Patients.Newly.Admitted = sum(Patients.Newly.Admitted), Patients.Positive.After.Admission = sum(Patients.Positive.After.Admission)) %>% mutate(New.Covid.Cases = Patients.Newly.Admitted + Patients.Positive.After.Admission) %>% mutate(Moving.Avg = rollmean(New.Covid.Cases, lag, align = 'right', na.pad = TRUE)) %>% select(As.of.Date, New.Covid.Cases, Moving.Avg) %>%
map_df(rev)
return(df)
}
> df <- read.csv('https://health.data.ny.gov/api/views/jw46-jpb7/rows.csv?accessType=DOWNLOAD') > df <- df[, c("As.of.Date", "Facility.Name", "Patients.Newly.Admitted", "Patients.Positive.After.Admission")]
Python
import pandas as pd def main(): url = 'https://health.data.ny.gov/api/views/jw46-jpb7/rows.csv?accessType=DOWNLOAD' out_file = 'hospitalizations_7day_ma.csv' hospitals = ('ALBANY MEDICAL CENTER HOSPITAL', 'ST PETERS HOSPITAL', 'COLUMBIA MEMORIAL HOSPITAL', 'SAMARITAN HOSPITAL', 'SARATOGA HOSPITAL', 'ELLIS HOSPITAL') cols = ['As of Date', 'Facility Name', 'Patients Newly Admitted', 'Patients Positive After Admission'] lag = 7 df = pd.read_csv(url) df = df[cols] df = df.loc[df['Facility Name'].isin(hospitals)] df['As of Date'] = pd.to_datetime(df['As of Date'], format = '%m/%d/%Y') df = df.groupby(['As of Date'])[['Patients Newly Admitted', 'Patients Positive After Admission']].sum() df['New Covid Cases'] = df['Patients Newly Admitted'] + df['Patients Positive After Admission'] df['Moving Average'] = df['New Covid Cases'].rolling(window = lag).mean() df = df.reset_index() df = df.dropna() df = df[['As of Date', 'New Covid Cases', 'Moving Average']] df = df.sort_values(['As of Date'], ascending = 0) df.to_csv(out_file, index = False) if __name__ == "__main__": main()
import pandas as pd def select_hospitals(df, hospitals): return df.loc[df['Facility Name'].isin(hospitals)] def main(): url = 'https://health.data.ny.gov/api/views/jw46-jpb7/rows.csv?accessType=DOWNLOAD' out_file = 'hospitalizations_7day_ma_2.csv' hospitals = ('ALBANY MEDICAL CENTER HOSPITAL', 'ST PETERS HOSPITAL', 'COLUMBIA MEMORIAL HOSPITAL', 'SAMARITAN HOSPITAL', 'SARATOGA HOSPITAL', 'ELLIS HOSPITAL') cols = ['As of Date', 'Facility Name', 'Patients Newly Admitted', 'Patients Positive After Admission'] lag = 7 df = (pd.read_csv(url) .filter(cols) .assign(As_of_Date = lambda x: pd.to_datetime(x['As of Date'], format = '%m/%d/%Y')) .pipe(select_hospitals, hospitals) # I couldn't use a lambda here .groupby(['As_of_Date'])[['Patients Newly Admitted', 'Patients Positive After Admission']] .sum() .reset_index()
.assign(New_Covid_Cases = lambda x: x['Patients Newly Admitted'] + x['Patients Positive After Admission']) .assign(Moving_Average = lambda x: x['New_Covid_Cases'].rolling(window = lag).mean()).drop_na()
.filter(['As_of_Date', 'New_Covid_Cases', 'Moving_Average']) .sort_values(['As_of_Date'], ascending = 0) ) df.to_csv(out_file, index = False) if __name__ == "__main__": main()
Julia
using DataFrames using CSV using Dates using RollingFunctions function main() url = "https://health.data.ny.gov/api/views/jw46-jpb7/rows.csv?accessType=DOWNLOAD" out_file = "hospitalizations_7day_ma_3.csv" hospitals = ["ALBANY MEDICAL CENTER HOSPITAL", "ST PETERS HOSPITAL", "COLUMBIA MEMORIAL HOSPITAL", "SAMARITAN HOSPITAL", "SARATOGA HOSPITAL", "ELLIS HOSPITAL"] cols = ["As of Date", "Facility Name", "Patients Newly Admitted", "Patients Positive After Admission"] lag = 7 df = CSV.read(download(url), DataFrame) df = select(df, cols) df = filter(row -> row."Facility Name" ∈ hospitals, df) df[!, "As of Date"] = Date.(df."As of Date", "mm/dd/yyyy") df = combine(groupby(df, "As of Date"), ["Patients Newly Admitted", "Patients Positive After Admission"] .=> sum) df."New Covid Cases" = df."Patients Newly Admitted_sum" + df."Patients Positive After Admission_sum" df."Moving Average" = runmean(df."New Covid Cases", lag) df = select(df, ["As of Date", "New Covid Cases", "Moving Average"]) sort!(df, "As of Date", rev = true) CSV.write(out_file, df) end main()
using DataFrames using CSV using Dates using RollingFunctions using Pipe function main() url = "https://health.data.ny.gov/api/views/jw46-jpb7/rows.csv?accessType=DOWNLOAD" out_file = "hospitalizations_7day_ma_4.csv" hospitals = ["ALBANY MEDICAL CENTER HOSPITAL", "ST PETERS HOSPITAL", "COLUMBIA MEMORIAL HOSPITAL", "SAMARITAN HOSPITAL", "SARATOGA HOSPITAL", "ELLIS HOSPITAL"] cols = ["As of Date", "Facility Name", "Patients Newly Admitted", "Patients Positive After Admission"] lag = 7 # helper for runmean mva(x) = runmean(x, lag) df = @pipe CSV.read(download(url), DataFrame, dateformat = "mm/dd/yyyy") |> select(_, cols) |> filter(row -> row."Facility Name" ∈ hospitals, _) |> combine(groupby(_, "As of Date"), ["Patients Newly Admitted", "Patients Positive After Admission"] .=> sum) |> transform!(_, ["Patients Newly Admitted_sum", "Patients Positive After Admission_sum"] => (+) => "New Covid Cases") |> transform!(_, ["New Covid Cases"] => mva => "Moving Average") |> select(_, ["As of Date", "New Covid Cases", "Moving Average"]) |> sort!(_, "As of Date", rev = true) CSV.write(out_file, df) end main()
No comments:
Post a Comment