The Trouble with Tibbles

Obviously, tribbles are very perceptive creatures, Captain.
~ Spock (Leonard Nimoy)

I find it more helpful to think of R as having a programming language than being a programming language. 
John D. Cook

I use R quite frequently, and I often make use of the tidyverse library. Tidyverse implements the "tidy" data philosophy. Tidy data sets are stored in tables where each row is an observation and each column is a variable. The tidyverse library, along with several other tidy libraries, implements this idea in R.

To learn about tidyverse, the best place to start is Garrett Grolemund and Hadley Wickham's book R for Data Science. It's available online for free. What follows is not a tidyverse tutorial. There are plenty of those on the web. I want to point out a few difficulties I have encountered while trying to parameterize tidyverse operations.

Tibbles

In the tidyverse library, the basic table data structure is the tibble, an enhanced data frame. Tidyverse provides a collection of methods for easy manipulation of tibbles and data frames. For example, the following code creates a tibble with three columns and ten rows.
> library(tidyverse)
> df <- tibble(Index = 1:10, Norm = rnorm(10), Unif = runif(10, -1, 1))
> df
# A tibble: 10 x 3
   Index   Norm   Unif
   <int>  <dbl>  <dbl>
 1     1  0.286  0.296
 2     2  0.533  0.230
 3     3  1.55   0.228
 4     4  0.189  0.805
 5     5  1.18   0.157
 6     6 -0.831 -0.366
 7     7 -2.16   0.669
 8     8  1.30   0.688
 9     9  1.47   0.365
10    10  0.853 -0.452
Data manipulation with tidyverse consists of a series of transformations of data tables. Each transformation results in a new data table containing the transformed data. For instance, tidyverse provides a method to select columns. Selecting a subset of columns results in a new data table consisting of all rows of the selected columns.

> df %>% select(Index, Unif)
# A tibble: 10 x 2
   Index   Unif
   <int>  <dbl>
 1     1  0.296
 2     2  0.230
 3     3  0.228
 4     4  0.805
 5     5  0.157
 6     6 -0.366
 7     7  0.669
 8     8  0.688
 9     9  0.365
10    10 -0.452

The %>% symbol is a tidyverse pipe operator. It feeds the table preceding it into the method that follows. Notice that the select() statement returns a new tibble. Also, we don't have to quote the column names or supply column vectors such as df$Index. select() takes care of associating names with their bound columns.
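To make the pipe concrete, here is a minimal sketch comparing the piped call with the same call written without the pipe. It loads dplyr, the tidyverse package that supplies select(), and uses a small table named demo with made-up values; demo, piped and direct are names chosen for this illustration only.

```r
library(dplyr)

# Illustrative table with fixed values so the result is reproducible.
demo <- tibble(Index = 1:3, Norm = c(0.5, -1.2, 2.0), Unif = c(0.1, 0.9, -0.4))

piped  <- demo %>% select(Index, Unif)  # table fed in by the pipe
direct <- select(demo, Index, Unif)     # same call with the table as first argument

# Both forms build the same two-column tibble.
identical(piped, direct)
```

The pipe only changes how the call is written: the table on its left becomes the first argument of the function on its right.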
    
Our table is tiny, but suppose we had a table with hundreds of columns. We might want to parameterize the columns to be selected. For this small example, we might write a function like the following.

select_column <- function(df, column) {
  require(tidyverse)
  
  df %>% select(Index, column)
}

Running this function produces the following output.

> select_column(df,'Norm')
Note: Using an external vector in selections is ambiguous.
i Use `all_of(column)` instead of `column` to silence this message.
i See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
This message is displayed once per session.
# A tibble: 10 x 2
   Index   Norm
   <int>  <dbl>
 1     1  0.286
 2     2  0.533
 3     3  1.55 
 4     4  0.189
 5     5  1.18 
 6     6 -0.831
 7     7 -2.16 
 8     8  1.30 
 9     9  1.47 
10    10  0.853


Despite the warning, we get a new tibble containing the selected columns. We could avoid the warning message by using all_of(column) in the select() statement.
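A sketch of that all_of() variant might look like the following. The function name select_column_quiet and the table demo are hypothetical, introduced only for this illustration, and dplyr is loaded on its own rather than the whole tidyverse.

```r
library(dplyr)

# Hypothetical variant of select_column(): all_of() states explicitly that
# `column` is a character vector of column names held outside the table.
select_column_quiet <- function(df, column) {
  df %>% select(Index, all_of(column))
}

# Illustrative table with fixed values.
demo <- tibble(Index = 1:3, Norm = c(0.5, -1.2, 2.0), Unif = c(0.1, 0.9, -0.4))

result <- select_column_quiet(demo, "Norm")
names(result)
```

all_of() also fails with an error if a requested column is missing from the table, which can help catch misspelled names early.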

Data Filtering

If all you could do was select columns, tidyverse wouldn't be much use. The tidyverse libraries provide a large collection of table manipulation functions. For example, suppose we wanted to return only rows where the value in the selected column was greater than zero.

> df %>% select(Index, Norm) %>% filter(Norm > 0)
# A tibble: 8 x 2
  Index  Norm
  <int> <dbl>
1     1 0.286
2     2 0.533
3     3 1.55 
4     4 0.189
5     5 1.18 
6     8 1.30 
7     9 1.47 
8    10 0.853


filter() returns each row for which the condition evaluates to TRUE. In order to parameterize this action, we could modify our function.

select_column <- function(df, column) {
  require(tidyverse)
  
  df %>% 
    select(Index, column) %>% 
    filter(column > 0)
}


When we run this new version we get

> select_column(df, 'Norm')
# A tibble: 10 x 2
   Index   Norm
   <int>  <dbl>
 1     1  0.286
 2     2  0.533
 3     3  1.55 
 4     4  0.189
 5     5  1.18 
 6     6 -0.831
 7     7 -2.16 
 8     8  1.30 
 9     9  1.47 
10    10  0.853


There is something wrong here. We didn't get the same result as we would get from the R command line version and, even worse, the function didn't tell us that something was amiss.

I naively expected filter() to do something like this: look for column in the data table namespace; if it wasn't found there, search the enclosing environments for a value bound to column, find Norm bound to it, and then use the Norm column. Instead, filter() stops one step short: it finds the string 'Norm' bound to column and compares that string with 0. Since 'Norm' > 0 evaluates to TRUE, all rows are returned. This behavior is different from select()'s operation. What's going on?
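The comparison can be checked directly at the console. The sketch below uses dplyr and a small illustrative table named demo, which is not part of the original example. When a string is compared with a number, R converts the number to a string first, so the test becomes a comparison of 'Norm' with '0'.

```r
library(dplyr)

# Illustrative table; the values are made up so the result is reproducible.
demo <- tibble(Index = 1:3, Norm = c(0.5, -1.2, 2.0))

column <- "Norm"

# 0 is coerced to "0" and the two strings are compared.
string_test <- column > 0

# filter() receives one logical value instead of one value per row,
# so the row with the negative Norm is not dropped.
kept <- demo %>% filter(column > 0)
nrow(kept)
```

Because the condition is a single value that is recycled across every row, the filter silently keeps the whole table instead of raising an error.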

This StackOverflow post gives us a hint. The problem is that filter() expects a symbol that names a column in the condition. It won't turn a string held in a variable into a column reference for you. There are a couple of ways to supply such a column reference. We can use get() to look up the column whose name is stored in column.

select_column <- function(df, column) {
  require(tidyverse)
  
  df %>% 
    select(Index, column) %>% 
    filter(get(column) > 0)
}

This version produces the correct output. Another approach is to recognize that column is bound to a string, turn the string into a symbol using sym(), and use !! to insert that symbol into the filter() expression.

select_column <- function(df, column) {
  require(tidyverse)
  
  df %>% 
    select(Index, column) %>% 
    filter(!! sym(column) > 0)
}
This code also produces the correct results.
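A third option, not used in the versions above, is the .data pronoun described in the Programming with dplyr documentation: .data[[column]] looks up the column whose name is stored in a string. The sketch below is one possible way to write it; the function name select_column_data and the table demo are hypothetical.

```r
library(dplyr)

# Hypothetical variant: .data[[column]] refers to the column of the current
# table whose name is the string stored in `column`.
select_column_data <- function(df, column) {
  df %>%
    select(Index, all_of(column)) %>%
    filter(.data[[column]] > 0)
}

# Illustrative table with one negative value in Norm.
demo <- tibble(Index = 1:3, Norm = c(0.5, -1.2, 2.0), Unif = c(0.1, 0.9, -0.4))

positive <- select_column_data(demo, "Norm")
positive$Index
```

Unlike get(), .data only looks inside the table, so a variable of the same name in the calling environment cannot be picked up by accident.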

Multiple Parameters

Let's add some more columns to the data table. 

> df2 <- df %>% add_column(Type = rep(c('red', 'blue'), 5), Source = c('Air', 'Water', 'Ground', 'Grass', 'Trees', 'Air', 'Air', 'Air', 'Water', 'Water'))
> df2
# A tibble: 10 x 5
   Index   Norm   Unif Type  Source
   <int>  <dbl>  <dbl> <chr> <chr> 
 1     1  0.286  0.296 red   Air   
 2     2  0.533  0.230 blue  Water 
 3     3  1.55   0.228 red   Ground
 4     4  0.189  0.805 blue  Grass 
 5     5  1.18   0.157 red   Trees 
 6     6 -0.831 -0.366 blue  Air   
 7     7 -2.16   0.669 red   Air   
 8     8  1.30   0.688 blue  Air   
 9     9  1.47   0.365 red   Water 
10    10  0.853 -0.452 blue  Water 


We can group the data by the new columns and get a count of occurrences. group_by() groups the data by the columns passed to it and summarize() counts the number of occurrences in each group.

> df2 %>% group_by(Source, Type) %>% summarize(n = n())
`summarise()` regrouping output by 'Source' (override with `.groups` argument)
# A tibble: 7 x 3
# Groups:   Source [5]
  Source Type      n
  <chr>  <chr> <int>
1 Air    blue      2
2 Air    red       2
3 Grass  blue      1
4 Ground red       1
5 Trees  red       1
6 Water  blue      2
7 Water  red       1


To parameterize this process, I'll try the methods I used previously.

count_entries <- function(df, group_cols) {
  df %>% 
    group_by(get(group_cols)) %>% 
    summarize(n = n())
}

However, this method won't do what we want.

> count_entries(df2, c('Source', 'Type'))
`summarise()` ungrouping output (override with `.groups` argument)
# A tibble: 5 x 2
  `get(group_cols)`     n
  <chr>             <int>
1 Air                   4
2 Grass                 1
3 Ground                1
4 Trees                 1
5 Water                 3
> 

The problem is that get() only looks up the first element of the vector, so the table is grouped by Source alone. Using !! sym() instead produces the error "Error: Only strings can be converted to symbols." because group_cols is bound to a vector of two strings, not a single string.
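As an aside, the symbol approach can be stretched to cover a vector by using the plural helper syms() from rlang together with the splice operator !!!. This is only a sketch of an alternative, with a hypothetical function name and made-up data; it is not the approach the article goes on to use.

```r
library(dplyr)

# Hypothetical variant: syms() turns each string into a symbol and !!!
# splices the resulting list into group_by() as separate arguments.
count_entries_syms <- function(df, group_cols) {
  df %>%
    group_by(!!!rlang::syms(group_cols)) %>%
    summarize(n = n(), .groups = "drop")
}

# Illustrative table with three distinct Source/Type combinations.
demo <- tibble(
  Source = c("Air", "Air", "Water", "Air"),
  Type   = c("red", "blue", "red", "red")
)

counts <- count_entries_syms(demo, c("Source", "Type"))
counts
```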

The solution is to use another tidyverse method and some new syntax. This code produces the desired output.

count_entries <- function(df, group_cols) {
  df %>% 
    group_by(across({{ group_cols }})) %>% 
    summarize(n = n())
}

R uses lazy evaluation. That means that an argument like group_cols holds a promise, not a value. The promise records the expression the caller supplied and the environment in which to evaluate it, and it is not evaluated until the value is actually needed. The {{ }} notation is called an embrace. It captures the expression behind the argument and inserts it into the enclosing tidyverse call, so across() sees c('Source', 'Type') as if it had been typed in place. across() then selects those columns from the table and group_by() groups by them.
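One consequence of forwarding the caller's expression is that the same function accepts either quoted or bare column names. The sketch below assumes a copy of the function under the hypothetical name count_entries_demo, with .groups = "drop" added only to silence the regrouping message, and a small made-up table.

```r
library(dplyr)

# Same shape as count_entries(); the embrace passes the caller's expression
# through to across() unchanged.
count_entries_demo <- function(df, group_cols) {
  df %>%
    group_by(across({{ group_cols }})) %>%
    summarize(n = n(), .groups = "drop")
}

# Illustrative table with three distinct Source/Type combinations.
demo <- tibble(
  Source = c("Air", "Air", "Water", "Air"),
  Type   = c("red", "blue", "red", "red")
)

by_string <- count_entries_demo(demo, c("Source", "Type"))  # quoted names
by_name   <- count_entries_demo(demo, c(Source, Type))      # bare names

all.equal(by_string, by_name)
```

If the column names arrive in a variable rather than being typed in the call, wrapping them in all_of() inside across() is the form the tidyselect note shown earlier recommends.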

Adding a Calculation

A common operation on tables is to create a summary of one or more groups of the table data. Here, we create a new column that contains the mean of each group for the Norm data column.

> df2 %>% group_by(Source) %>% summarize(Norm_mean = mean(Norm)) %>% select(Source, Norm_mean)
`summarise()` ungrouping output (override with `.groups` argument)
# A tibble: 5 x 2
  Source Norm_mean
  <chr>      <dbl>
1 Air       -0.353
2 Grass      0.189
3 Ground     1.55 
4 Trees      1.18 
5 Water      0.952

To parameterize this code, we need to be able to construct a new column in a data table.

calc_mean <- function(df, select_column, group_column) {
  require(tidyverse)
  
  new_column <- paste(select_column, 'mean', sep = '_')
  
  df %>% 
    group_by(!! sym(group_column)) %>%
    summarise(!! new_column := mean(!! sym(select_column))) %>%
    select(all_of(c(group_column, new_column)))
}

I used !! and sym() to create symbols from the character variables passed to the function and then insert those symbols into the expressions. In the summarise() call I created a new column whose name comes from text. Note the use of :=. This special form of assignment is needed because R doesn't allow an expression such as !! new_column on the left side of an equal sign.
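Recent versions of rlang and dplyr also accept a glue-style string on the left of :=, which removes the need for paste() and !!. The sketch below assumes dplyr 1.0 or later; the function name calc_mean_glue and the table demo are hypothetical, and across(all_of()) and .data[[ ]] are swapped in for !! sym().

```r
library(dplyr)

# Hypothetical variant of calc_mean(): the output column name is built
# from the string in `select_column` by the glue-style "{...}" syntax.
calc_mean_glue <- function(df, select_column, group_column) {
  df %>%
    group_by(across(all_of(group_column))) %>%
    summarise(
      "{select_column}_mean" := mean(.data[[select_column]]),
      .groups = "drop"
    )
}

# Illustrative table: the Air rows average to 2, the Water row to 5.
demo <- tibble(Source = c("Air", "Air", "Water"), Norm = c(1, 3, 5))

means <- calc_mean_glue(demo, "Norm", "Source")
means
```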

This method also works.

calc_mean <- function(df, select_column, group_column) {
  require(tidyverse)
  
  new_column <- paste(select_column, 'mean', sep = '_')
  
  df %>% 
    group_by(across({{ group_column }})) %>%
    summarise(!! new_column := mean(!! sym(select_column))) %>%
    select(all_of(c(group_column, new_column)))
}

Parameterization

Parameterization of tidyverse functions can be tricky. I have just scratched the surface of the problems that I have encountered because I didn't understand the underlying details of tidy evaluation and expected tidyverse functions to operate in a manner analogous to base R. To use tidyverse effectively in functions, you have to understand the specifics of how the various methods operate. The problem is that many if not most R users are not computer scientists; they are biologists, statisticians, geologists, etc. Most users will not want to, or even care to, find out the details. They just want to compute a statistic or plot a graph. The level of detail needed to effectively use tidyverse in functions will discourage them.

The techniques described above work with R 4.0 and the dplyr library version 1.0.2.

To learn more about the details of R data structures and evaluation see Hadley Wickham's book Advanced R and the Programming with dplyr website. This page also gives some insight into advanced R programming.

