Don't Let Your Pipes Leak



My wife asked me to write a little program for her. She wanted to download some COVID-19 data from health.data.ny.gov and calculate a moving average of the sum of two of the columns in the file. The URL of the data is https://health.data.ny.gov/api/views/jw46-jpb7/rows.csv?accessType=DOWNLOAD.

The data file has 37 columns and 148,011+ rows. We only care about the first few. Here is the header and the first four rows and several columns

As of Date

Facility PFI

Facility Name

DOH Region

Facility County

Facility Network

NY Forward Region

Patients Currently Hospitalized

Patients Admitted Due to COVID

Patients Admitted Not Due to COVID

Patients Newly Admitted

Patients Positive After Admission

3/26/2020

1

ALBANY MEDICAL CENTER HOSPITAL

CAPITAL DISTRICT REGIONAL OFFICE

ALBANY

ALBANY MEDICAL CENTER

CAPITAL REGION

12

1

1

3/26/2020

5

ST PETERS HOSPITAL

CAPITAL DISTRICT REGIONAL OFFICE

ALBANY

ST. PETERS HEALTH PARTNERS

CAPITAL REGION

10

0

2

3/26/2020

37

CUBA MEMORIAL HOSPITAL INC

WESTERN REGIONAL OFFICE

ALLEGANY

INDEPENDENT

WESTERN NEW YORK

0

0

0

3/26/2020

39

MEMORIAL HOSP OF WM F AND GERTRUDE F JONES AKA JONES MEMORIAL HOSP

WESTERN REGIONAL OFFICE

ALLEGANY

UNIVERSITY OF ROCHESTER MEDICAL CENTER

WESTERN NEW YORK

0

0

0


The task is to select rows that contain one of a small group of facilities, sum Patients Currently Hospitalized and Patients Positive After Admission, and calculate a seven day moving average ofthe sum.

Luckily, the data is tidy. This means the data files doesn't have to be extensively reformatted before processing.

R

When faced with small project to manipulate a data table, my inclination is to use R. R's tidyverse libraries make manipulating a data table like the one above relatively simple.

covid_hospitalizations <- function(lag = 7) {
  require(tidyverse)
  require(zoo)
  
  url <- 'https://health.data.ny.gov/api/views/jw46-jpb7/rows.csv?accessType=DOWNLOAD'
  hospitals <-c('ALBANY MEDICAL CENTER HOSPITAL', 'ST PETERS HOSPITAL', 'COLUMBIA MEMORIAL HOSPITAL',
                'SAMARITAN HOSPITAL', 'SARATOGA HOSPITAL', 'ELLIS HOSPITAL')
   df <- read.csv(url) %>%
    select(As.of.Date, Facility.Name, Patients.Newly.Admitted, Patients.Positive.After.Admission) %>%
    filter(Facility.Name %in% {{ hospitals }}) %>%
    mutate(As.of.Date = as.Date(As.of.Date, format = '%m/%d/%Y')) %>%
    group_by(As.of.Date) %>%
    summarise(Patients.Newly.Admitted = sum(Patients.Newly.Admitted),
              Patients.Positive.After.Admission = sum(Patients.Positive.After.Admission)) %>%
    mutate(New.Covid.Cases = Patients.Newly.Admitted + Patients.Positive.After.Admission) %>%
    mutate(Moving.Avg = rollmean(New.Covid.Cases, lag, align = 'right', na.pad = TRUE)) %>%
    select(As.of.Date, New.Covid.Cases, Moving.Avg) %>%
    map_df(rev)      

  return(df)
}

The function downloads the data from the URL and selects a subset of the columns. The %>% symbol is a magrittr pipe. It "pipes" the left hand side values into the expression on the right-hand side. The dataframe created by read.csv(url) is piped to the select function which selects four columns from among the 37 columns in the original dataframe.

You can think of the first two steps as performing these operations:

> df <- read.csv('https://health.data.ny.gov/api/views/jw46-jpb7/rows.csv?accessType=DOWNLOAD')
> df <- df[, c("As.of.Date", "Facility.Name", "Patients.Newly.Admitted", "Patients.Positive.After.Admission")]

The next two step select rows where the Facility.Name is one of a list of hospitals and converts the date column to R's datetime format for easy sorting. group_by takes a table and converts it into a grouped table so operations can be performed on the group.  summarise (British spelling) sums two columns for each date and mutate combines the two into a new column. We next calculate a new moving average column, select a few output columns and reverse the table so that the most recent date is at the top.

This style of programming encourages function thinking. The program consists of a series of transformations on a data object. This approach favors immutable objects and copy-on-modify. It avoids procedure programming techniques like for loops.

Python

Unfortunately, my wife didn't have R installed on her computer.  She did have Python installed from a previous project.  Python's pandas library provides a collection of function for manipulation data tables.

Here's my first pass at a Python version.

import pandas as pd

def main():
    url = 'https://health.data.ny.gov/api/views/jw46-jpb7/rows.csv?accessType=DOWNLOAD'
    out_file = 'hospitalizations_7day_ma.csv'
    hospitals = ('ALBANY MEDICAL CENTER HOSPITAL', 'ST PETERS HOSPITAL', 'COLUMBIA MEMORIAL HOSPITAL',
                 'SAMARITAN HOSPITAL', 'SARATOGA HOSPITAL', 'ELLIS HOSPITAL')
    cols = ['As of Date', 'Facility Name',
            'Patients Newly Admitted', 'Patients Positive After Admission']
    lag = 7

    df = pd.read_csv(url)
    df = df[cols]
    df = df.loc[df['Facility Name'].isin(hospitals)]
    df['As of Date'] = pd.to_datetime(df['As of Date'], format = '%m/%d/%Y')
    df = df.groupby(['As of Date'])[['Patients Newly Admitted', 'Patients Positive After Admission']].sum()
    df['New Covid Cases'] = df['Patients Newly Admitted'] + df['Patients Positive After Admission']
    df['Moving Average'] = df['New Covid Cases'].rolling(window = lag).mean()
    df = df.reset_index()
    df = df.dropna()
    df = df[['As of Date', 'New Covid Cases', 'Moving Average']]
    df = df.sort_values(['As of Date'], ascending = 0)
    
    df.to_csv(out_file, index = False)
    
if __name__ == "__main__":
    main()

This works but it is a bit ugly. It performs essentially the same task as the R version, but looks hard to read and interpret. There's too much copying gong on.

Pandas has a pipe function. It's not quite the same as R's magrittr pipe, but it does allow you to chain dataframes together and apply a function to the dataframe. It's more common to chain function in Python by using the '.' notation to apply a function to an object. For example, df.to_csv(out_file, index = False). The to_csv function is applied to df, a dataframe objectIt's possible to chain multiple function to a dataframe if each function returns an object. 

The following code chains a series of functions to perform the same operations as the R version.

import pandas as pd

def select_hospitals(df, hospitals):
 return df.loc[df['Facility Name'].isin(hospitals)]

def main():
    url = 'https://health.data.ny.gov/api/views/jw46-jpb7/rows.csv?accessType=DOWNLOAD'
    out_file = 'hospitalizations_7day_ma_2.csv'
    hospitals = ('ALBANY MEDICAL CENTER HOSPITAL', 'ST PETERS HOSPITAL', 'COLUMBIA MEMORIAL HOSPITAL',
                 'SAMARITAN HOSPITAL', 'SARATOGA HOSPITAL', 'ELLIS HOSPITAL')
    cols = ['As of Date', 'Facility Name',
            'Patients Newly Admitted', 'Patients Positive After Admission']
    lag = 7

    df = (pd.read_csv(url)
        .filter(cols)
        .assign(As_of_Date = lambda x: pd.to_datetime(x['As of Date'], format = '%m/%d/%Y'))
        .pipe(select_hospitals, hospitals)  # I couldn't use a lambda here
        .groupby(['As_of_Date'])[['Patients Newly Admitted', 'Patients Positive After Admission']]
        .sum()
        .reset_index()
        .assign(New_Covid_Cases = lambda x: x['Patients Newly Admitted'] + x['Patients Positive After Admission'])
        .assign(Moving_Average = lambda x: x['New_Covid_Cases'].rolling(window = lag).mean())
        .drop_na()
        .filter(['As_of_Date', 'New_Covid_Cases', 'Moving_Average'])
        .sort_values(['As_of_Date'], ascending = 0)
    )
    
    df.to_csv(out_file, index = False)
    
if __name__ == "__main__":
    main()

Each of the functions operates on a dataframe and returns a dataframe. The lambda expressions in functions like assign are passed the current dataframe. Note that Python's filter is similar to R's select function.

Julia

Julia provides a DataFrames and CSV libraries that have similar functionality to pandas. The following Julia code is similar to the first Python version.

using DataFrames
using CSV
using Dates
using RollingFunctions

function main()
    url = "https://health.data.ny.gov/api/views/jw46-jpb7/rows.csv?accessType=DOWNLOAD"
    out_file = "hospitalizations_7day_ma_3.csv"
    hospitals = ["ALBANY MEDICAL CENTER HOSPITAL", "ST PETERS HOSPITAL", "COLUMBIA MEMORIAL HOSPITAL",
                 "SAMARITAN HOSPITAL", "SARATOGA HOSPITAL", "ELLIS HOSPITAL"]
    cols = ["As of Date", "Facility Name",
            "Patients Newly Admitted", "Patients Positive After Admission"]
    lag = 7

    df = CSV.read(download(url), DataFrame)
    df = select(df, cols)
    df = filter(row -> row."Facility Name"  hospitals, df)
    df[!, "As of Date"] = Date.(df."As of Date", "mm/dd/yyyy")
    df = combine(groupby(df, "As of Date"), ["Patients Newly Admitted", "Patients Positive After Admission"] .=> sum)
    df."New Covid Cases" = df."Patients Newly Admitted_sum" + df."Patients Positive After Admission_sum"
    df."Moving Average" = runmean(df."New Covid Cases", lag)
    df = select(df, ["As of Date", "New Covid Cases", "Moving Average"])
    sort!(df, "As of Date", rev = true)

    CSV.write(out_file, df)
end

main()

This is Julia 1.7.0. It's important to make sure what version of Julia you dealing with on the web. Julia is a moving target. Solutions older than a few years found on the web may contain functions that either no longer work or work differently.

Julia has a Pipe library. it actually has several different forms of pipes. They provide functionality similar to R's magrittr pipe. In the code below I use a simple pipe from the Pipe.jl library. The library exposes a macro @pipe and a pipe operator |> to chain functions.

using DataFrames
using CSV
using Dates
using RollingFunctions
using Pipe

function main()
    url = "https://health.data.ny.gov/api/views/jw46-jpb7/rows.csv?accessType=DOWNLOAD"
    out_file = "hospitalizations_7day_ma_4.csv"
    hospitals = ["ALBANY MEDICAL CENTER HOSPITAL", "ST PETERS HOSPITAL", "COLUMBIA MEMORIAL HOSPITAL",
                 "SAMARITAN HOSPITAL", "SARATOGA HOSPITAL", "ELLIS HOSPITAL"]
    cols = ["As of Date", "Facility Name",
            "Patients Newly Admitted", "Patients Positive After Admission"]
    lag = 7

    # helper for runmean
    mva(x) = runmean(x, lag)

    df = @pipe CSV.read(download(url), DataFrame, dateformat = "mm/dd/yyyy") |>
        select(_, cols) |>
        filter(row -> row."Facility Name"  hospitals, _) |>
        combine(groupby(_, "As of Date"), ["Patients Newly Admitted", "Patients Positive After Admission"] .=> sum) |>
        transform!(_, ["Patients Newly Admitted_sum", "Patients Positive After Admission_sum"] => (+) => "New Covid Cases") |>
        transform!(_, ["New Covid Cases"] => mva => "Moving Average") |>
        select(_, ["As of Date", "New Covid Cases", "Moving Average"]) |>
        sort!(_, "As of Date", rev = true)

    CSV.write(out_file, df)
end

main()

'_' is a placeholder for the current dataframe.

I think Julia as a language is great. Julia as software still has some problems before I could use it as my everyday programming tool. 

All code examples can be found on GitHub.

No comments:

Post a Comment