Contrasting Contrasts

Contrast is what makes photography interesting.
Conrad Hall

It is trivially easy to create a linear regression model in R, Python, Julia, and other languages if you have numerical data in a table. Consider the following example with fictitious data in R.

> df <- data.frame(y=2 * 1:100 + 1.5 + rnorm(100, 0, 1), x=1:100)
> f1 <- lm(y ~ x, data=df)
> summary(f1)

Call:
lm(formula = y ~ x, data = df)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.92791 -0.66786 -0.01113  0.75449  2.11251 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 1.942378   0.197287   9.845 2.62e-16 ***
x           1.992826   0.003392 587.561  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.979 on 98 degrees of freedom
Multiple R-squared:  0.9997,    Adjusted R-squared:  0.9997 
F-statistic: 3.452e+05 on 1 and 98 DF,  p-value: < 2.2e-16

> library(ggplot2)
> ggplot(data=df) + geom_point(mapping=aes(x=x, y=y)) + geom_line(data=data.frame(x=df$x, fit=predict(f1)), aes(x=x, y=fit), color="red", size=1.5)
> 


The red line is the fitted model. The points are the data values.

Multiple linear regression with numerical values is similar. Just model

lm(y ~ x1 + x2 + ... + xn, data = some.data.frame)

The "+" indicates that the variable that follows should be included in the regression. The "~" separates the dependent variable from the independent, modeling variables.
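As a minimal sketch (the variables x1 and x2 and the data frame df2 are invented here purely for illustration), a two-predictor model looks like this, and coef() reports the intercept plus one slope per predictor.

> df2 <- data.frame(x1=runif(100), x2=runif(100))
> df2$y <- 3 * df2$x1 - 2 * df2$x2 + rnorm(100, 0, 0.5)
> f_multi <- lm(y ~ x1 + x2, data=df2)
> coef(f_multi)   # the x1 and x2 estimates should land near 3 and -2, the values used to build y
> 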

However, it is possible to build more complicated models. What happens if we suspect that a categorical variable influences the result? A categorical variable takes on one of a limited number of possible values. For example, the variable sex could take on values of "M" or "F". How can we handle a situation like that?

Once again, R makes it easy. Consider some more fictitious data.

> dat <- data.frame(result = c(rnorm(100, 2), rnorm(50, 1), rnorm(100,0)), condition=c(rep("A", 100), rep("B", 50), rep("C", 100)))
> head(dat)
    result condition
1 1.278243         A
2 2.642262         A
3 2.401794         A
4 1.583893         A
5 1.566025         A
6 2.500336         A
> ggplot(data=dat) + geom_boxplot(mapping=aes(x=condition, y=result))
> f2 <- lm(result ~ condition, data=dat)
> summary(f2)

Call:
lm(formula = result ~ condition, data = dat)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.1653 -0.6255 -0.0002  0.6344  2.8745 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   1.9157     0.1025  18.681  < 2e-16 ***
conditionB   -1.0297     0.1776  -5.797 2.05e-08 ***
conditionC   -2.1379     0.1450 -14.742  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.025 on 247 degrees of freedom
Multiple R-squared:  0.4681,    Adjusted R-squared:  0.4638 
F-statistic: 108.7 on 2 and 247 DF,  p-value: < 2.2e-16

> 



The data frame has two columns: a numerical result and a condition variable with three possible values, A, B, or C. R fits the model without a problem, but what do the results mean? With strictly numerical variables, interpreting the coefficients is straightforward: the coefficients are the slopes of the regression lines.
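As a quick check (not part of the original session), we can pull the coefficients of the first model directly; the slope estimate should be close to 2, the value used to generate y in that example.

> coef(f1)   # the x coefficient is the slope of the fitted line, roughly 2 here
> 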

In the summary of the model, two new variables appear, conditionB and conditionC. They represent the effects of condition == "B" and condition == "C" on result. In order to understand what that means, let's look at how R handles categorical variables in a case like this.

Categorical variables are handled by adding dummy variables to the data and modeling the dependent variable using the dummy variables. When adding dummy variables, if the categorical variable has k possible values, we add k-1 dummy variables.

There are several ways to encode the dummy variables. The simplest is to set the value of the dummy variable to 1 when the condition holds and 0 otherwise. For example, conditionB would have a 1 in each row where condition == "B" and 0 in every other row.

> dat2 <- data.frame(result=dat$result)
> dat2$conditionB <- as.numeric(dat$condition == "B")
> dat2$conditionC <- as.numeric(dat$condition == "C")
> 

dat2 has three columns. Column conditionB contains a 1 in the rows where dat$condition is "B" and a 0 in all other rows. Similarly, column conditionC contains a 1 in the rows where dat$condition is "C" and a 0 in all other rows. We could have chosen any k-1 of the k condition values as dummy variables. By default, R leaves out the first level of the factor (here "A").
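We can also ask R to show the design matrix it builds internally; model.matrix() produces the same dummy columns we just created by hand, plus a column of 1's for the intercept.

> head(model.matrix(~ condition, data=dat))   # columns: (Intercept), conditionB, conditionC
> 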

We can now model the data as a numerical linear regression.

> f3 <- lm(result ~ conditionB + conditionC, data=dat2)
> summary(f3)

Call:
lm(formula = result ~ conditionB + conditionC, data = dat2)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.1653 -0.6255 -0.0002  0.6344  2.8745 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   1.9157     0.1025  18.681  < 2e-16 ***
conditionB   -1.0297     0.1776  -5.797 2.05e-08 ***
conditionC   -2.1379     0.1450 -14.742  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.025 on 247 degrees of freedom
Multiple R-squared:  0.4681,    Adjusted R-squared:  0.4638 
F-statistic: 108.7 on 2 and 247 DF,  p-value: < 2.2e-16

> 

In the results shown above, the intercept represents the effect of dat$condition == "A". That is, when conditionB == 0 and conditionC == 0, we expect result to have a value of 1.9157. The coefficients of the other variables represent the effects of those conditions relative to condition "A". These effects are called contrasts. The approach we have used is called a treatment contrast. It is also called dummy coding.
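A quick way to verify this interpretation (not part of the original session) is to compare the coefficients to the group means: with treatment contrasts, the intercept equals the mean of condition "A" and each coefficient equals that condition's mean minus the mean of "A".

> means <- tapply(dat$result, dat$condition, mean)
> means["A"]               # matches the intercept, 1.9157
> means["B"] - means["A"]  # matches the conditionB coefficient, -1.0297
> means["C"] - means["A"]  # matches the conditionC coefficient, -2.1379
> 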

The way we encoded the dummy variables is not the only method. Effect coding allows us to compare the effect of each condition to the grand mean of the data rather than to one of the other conditions. The grand mean is the mean of the means of the dependent variable, result, for each of the conditions.

To use effect coding, the k-1 condition dummy variables are encoded as before, except that rows containing the left-out condition are encoded with -1. We will use condition == "C" as the left-out condition to match what R does in this case.

> dat3 <- data.frame(result=dat$result)
> dat3$conditionA <- as.numeric(dat$condition == "A")
> dat3$conditionB <- as.numeric(dat$condition == "B")
> dat3$conditionA[dat$condition == "C"]=-1
> dat3$conditionB[dat$condition == "C"]=-1
>

> f4 <- lm(result ~ conditionA + conditionB, data=dat3)
> summary(f4)

Call:
lm(formula = result ~ conditionA + conditionB, data = dat3)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.1653 -0.6255 -0.0002  0.6344  2.8745 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.85984    0.06837  12.577   <2e-16 ***
conditionA   1.05587    0.09044  11.675   <2e-16 ***
conditionB   0.02619    0.10810   0.242    0.809    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.025 on 247 degrees of freedom
Multiple R-squared:  0.4681,    Adjusted R-squared:  0.4638 
F-statistic: 108.7 on 2 and 247 DF,  p-value: < 2.2e-16

> 

The intercept in this model is the grand mean, and the coefficients represent the deviations of the condition means from the grand mean.

> tapply(dat$result, dat$condition, mean)
         A          B          C 
 1.9157075  0.8860320 -0.2222134 
> mean(tapply(dat$result, dat$condition, mean))
[1] 0.859842
> 

Effect coding can be handled more simply in R by using the contrasts argument.

> f5 <- lm(result ~ condition, data=dat, contrasts = list(condition=contr.sum))
> summary(f5)

Call:
lm(formula = result ~ condition, data = dat, contrasts = list(condition = contr.sum))

Residuals:
    Min      1Q  Median      3Q     Max 
-3.1653 -0.6255 -0.0002  0.6344  2.8745 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.85984    0.06837  12.577   <2e-16 ***
condition1   1.05587    0.09044  11.675   <2e-16 ***
condition2   0.02619    0.10810   0.242    0.809    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.025 on 247 degrees of freedom
Multiple R-squared:  0.4681,    Adjusted R-squared:  0.4638 
F-statistic: 108.7 on 2 and 247 DF,  p-value: < 2.2e-16

>

R provides a number of different contrast types. See https://www.unc.edu/courses/2006spring/ecol/145/001/docs/lectures/lecture26.htm.
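For example, the coding matrices for three of the built-in schemes can be printed directly; with three levels, contr.treatment() gives the dummy coding used above and contr.sum() gives the effect coding.

> contr.treatment(3)   # dummy coding: level 1 is the reference
> contr.sum(3)         # effect coding: the last level is coded -1 in every column
> contr.helmert(3)     # Helmert coding: each level compared with the mean of the preceding levels
> 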
