Mixed Models Part 1

 Let's suppose we have a collection of data. It looks like this.

> head(df_test)
         y        x Group
1 3.361500 56.76933     4
2 2.779182 53.85903     2
3 1.183500 23.78973     1
4 2.475000 61.04853    11
5 3.847500 86.93463     1
6 1.665000 25.19943     3
> 

There are 1000 rows in the data table. We think x is a predictor of y. Group is an indicator variable, a factor; it combines information about who collected the data, where it was collected, techniques, etc. For simplicity, you can think of Group as identifying who collected the data.

Disclaimer: This is fake data, but it is based on some real data that I was analyzing.
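A scatter plot of y against x, colored by Group, gives a first look at the data. Here is a sketch of that plot (it assumes ggplot2 is loaded and that Group is a factor):

ggplot(df_test, aes(x = x, y = y, color = Group)) +
  geom_point(size = 0.5)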

With 21 Groups, it's a bit difficult to see the distinctions among the points from the various Groups. Displaying more than a few distinct colors in R or any other tool is difficult.

Here's the distribution of the x values within Groups.

> ggplot(df_test, aes(x = Group, y = x)) + 
    geom_boxplot() + 
    geom_jitter(width = 0.25, size = 0.5)


> print(df_test %>% group_by(Group) %>% count(), n = Inf)
# A tibble: 21 × 2
# Groups:   Group [21]
   Group     n
   <fct> <int>
 1 1       368
 2 2        40
 3 3        53
 4 4        65
 5 5         4
 6 6        15
 7 7         5
 8 8         8
 9 9        21
10 10       48
11 11       91
12 12       90
13 13        7
14 14       18
15 15       21
16 16       80
17 17        9
18 18       47
19 19        5
20 20        2
21 21        3
>  

The data seems to be dominated by a few groups, especially Group 1. In addition, the means of the x values vary between groups. A similar boxplot of the y values also shows variation among the group means.
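That plot can be drawn the same way as the one above, swapping y in for x (a sketch):

ggplot(df_test, aes(x = Group, y = y)) + 
    geom_boxplot() + 
    geom_jitter(width = 0.25, size = 0.5)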

Do differences in data collected from various groups affect predictions of y from x?


From the x-y plot, it looks like there might be a strong linear relationship between x and y, so a linear model might be appropriate. However, it looks like the variance of the data might not be constant as we move along the x axis.

We'll begin with the simplest possible model and add complications to see how Group affects the predictions.

Fixed Effects Models


A fixed effects model is a statistical model in which the model parameters are fixed or non-random quantities. In the models described below, we will try to estimate the fixed coefficients of the models. This is in contrast to a random effects model in which the parameters are random variables drawn from some, possibly unknown, distribution.

Intercept Only Model


The simplest model of y simply takes the mean of the data. In terms of linear models, this is called an intercept only model.

\[\begin{array}{l}{y_i} = {\beta _0} + {\varepsilon _i}\\{\varepsilon _i} \sim N(0,{\sigma ^2})\end{array}\]

 ${\beta _0}$ represents the single estimated value and ${\varepsilon _i}$  represents an error term.

We can write this model more compactly as

\[y \sim N({\beta _0},{\sigma ^2})\]

That is, we are assuming that y is normally distributed around a constant mean.

 This model is not very interesting. It estimates all y values with a single value, the mean of the y data.

In R terms:

lm1 <- lm(y ~ 1, df_test)
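
A quick check (assuming df_test is already loaded): the fitted intercept is just the sample mean of y.

# the intercept-only fit estimates the overall mean of y
coef(lm1)["(Intercept)"]
mean(df_test$y)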

term         estimate      std.error    statistic     p.value      Significance
(Intercept)  3.110914914   0.033022011  94.20731236   0            ***
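
Coefficient tables like the one above can be produced with something like broom's tidy() function (an assumption about tooling, not something stated in the original analysis):

library(broom)
tidy(lm1)   # returns term, estimate, std.error, statistic, and p.value columns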



A Linear Model

A more interesting model estimates y from x:

\[{y_i} = {\beta _0} + {\beta _1}{x_i} + {\varepsilon _i}\]

or in R,

> lm2 <- lm(y ~ x, df_test)
> summary(lm2)$r.squared
[1] 0.9216033
> 

term         estimate      std.error    statistic     p.value      Significance
(Intercept)  0.429480279   0.026427797  16.2510812    7.26E-53     ***
x            0.046343568   0.000427859  108.3149471   0            ***

This model fits the data as a single line with intercept given by ${\beta _0}$  and a slope ${\beta _1}$. ${\varepsilon _i}$  is a random error term. In this model, we are making certain assumptions about the data.

\[y \sim N({\beta _0} + {\beta _1}x,{\sigma ^2})\]

That is, we assume that the data is normally distributed with mean ${\beta _0} + {\beta _1}x$ and constant variance ${\sigma ^2}$. Residual errors are assumed to have mean zero and to be normally distributed with constant variance, ${\varepsilon _i} \sim N(0,{\sigma ^2})$.

Despite the simplicity of the model, the R-squared value shows a good fit. The small p-values indicate that the slope and intercept are significantly different from zero.


ggplot(df_test) + 
 geom_point(aes(x = x, y = y)) +
 geom_abline(intercept = coef(lm2)[1], slope = coef(lm2)[2])

The straight line represents the lm2 regression line.

A Q-Q plot of the residuals will show whether the assumption of normal errors around the mean is adequate.
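
One way to draw it, using base graphics (a sketch):

# Q-Q plot of the lm2 residuals against a normal distribution
qqnorm(resid(lm2))
qqline(resid(lm2))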

The deviation from the straight line at the tails shows that the assumption of normal errors is not strictly true. That doesn't make the model wrong or useless, but it's something we need to be aware of. The residuals most likely have a distribution with longer tails than normal.

The model above is called a “pooled” model by Gelman and Hill[i]. In the pooled model, group indicators such as Group are not included, and all data points are treated the same.

Does Group Affect the Intercept?

In contrast to pooling the data, we can treat the data coming from a particular Group as a group and model each group separately. The group data influences the intercept of the group's regression line: each group has its own intercept, but all groups share a common slope.

We model the data as

\[\begin{array}{l}{y_i} = {\beta _{0j}} + {\beta _1}{x_i} + {\varepsilon _i}\\y \sim N({\beta _0}_j + {\beta _1}x,{\sigma ^2})\end{array}\]


The j subscript in ${\beta _0}_j$ indicates the group number. There is a separate intercept for each group. Gelman and Hill call this the “no-pooling” model. When fitting this model, R chooses one of the groups to be the reference level and returns coefficients showing the contrast to the reference level. In our case, the reference level is Group 1. ${\beta _1}$  is the slope of the fit lines. 

> lm3 <- lm(y ~ x + Group, df_test)
> summary(lm3)$r.squared
[1] 0.9382607
> 

Adding Group results in a slightly better R-squared at the cost of more parameters.

term         estimate      std.error    statistic     p.value      Significance
(Intercept)  0.360332077   0.027405965  13.14794332   1.78E-36     ***
x            0.047131182   0.000435566  108.2066739   0            ***
Group2       -0.07538652   0.044281162  -1.702451264  0.088988577  .
Group3       0.066682535   0.038890202  1.714635865   0.086728812  .
Group4       0.064706621   0.035322174  1.83189806    0.067270603  .
Group5       0.41749091    0.132207688  3.157841396   0.001638125  **
Group6       0.482346602   0.070488133  6.842947662   1.37E-11     ***
Group7       0.018043288   0.11812715   0.152744632   0.878631169
Group8       0.294320063   0.093729708  3.140093677   0.001739619  **
Group9       -0.191411538  0.060235494  -3.177720059  0.001530941  **
Group10      -0.015922307  0.040954613  -0.388779324  0.697524021
Group11      -0.161850618  0.031305195  -5.170088195  2.84E-07     ***
Group12      0.281433594   0.030887334  9.111618311   4.48E-19     ***
Group13      0.140925663   0.100076771  1.408175558   0.159396886
Group14      -0.311566742  0.063307379  -4.921491717  1.01E-06     ***
Group15      0.103031195   0.058911077  1.74892738    0.080617274  .
Group16      0.051262508   0.032700559  1.567634019   0.117290043
Group17      -0.148061746  0.088630367  -1.670553236  0.095129913  .
Group18      -0.004371314  0.041148058  -0.106233781  0.915418655
Group19      0.026960639   0.118274662  0.227949403   0.819733223
Group20      0.083024504   0.185934553  0.446525415   0.65531661
Group21      0.426608271   0.152241561  2.802180089   0.005176093  **


Group1’s intercept is shown in the results as “(Intercept)”. The values for the other groups are contrasts to Group1. To get the intercept for another group, add its contrast to Group1’s intercept. For example, Group2’s intercept is 0.360332077 + (-0.07538652) = 0.284945557. All Groups have the same slope, shown in the table as x.
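
A sketch of how all the per-group intercepts could be assembled from the coefficients (it assumes the Group factor levels are ordered 1 through 21):

cf3 <- coef(lm3)
# the contrast is 0 for the reference level, Group1
intercepts <- cf3["(Intercept)"] + c(0, cf3[paste0("Group", levels(df_test$Group)[-1])])
names(intercepts) <- levels(df_test$Group)
intercepts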

With this model, we have a single slope and 21 intercepts, one for each group. Plotting the individual groups shows that the intercept varies with group.
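
A sketch of that plot, drawn from the lm3 fitted values:

df_test$fit3 <- predict(lm3)
ggplot(df_test, aes(x = x, color = Group)) +
  geom_point(aes(y = y), size = 0.5) +
  geom_line(aes(y = fit3))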



Do Groups Affect the Slope?

We can examine the effects of the groups on the slopes of the individual fits by including coefficients for individual slopes. The slightly more complicated model is,

\[\begin{array}{l}{y_i} = {\beta _{0j}} + {\beta _{1j}}{x_i} + {\varepsilon _i}\\y \sim N({\beta _0}_j + {\beta _{1j}}x,{\sigma ^2})\\\end{array}\]

${\beta _{1j}}$ represents the slope term for group j, i.e. the slope represented by x is conditioned by the group.


> lm4 <- lm(y ~ x + Group + Group:x, df_test)
> summary(lm4)$r.squared
[1] 0.9420782
>  

In the R formula, Group:x is called an interaction term; it allows the slope associated with x to vary with Group.

As in the preceding model, R returns estimates of the contrasts of the intercepts, and in this case also contrasts of the slopes. The intercept and slope for Group1, the reference level, are “(Intercept)” and “x” respectively.


term         estimate      std.error    statistic     p.value      Significance
(Intercept)  0.327341057   0.035265421  9.282210427   1.08E-19     ***
x            0.047736145   0.000598329  79.78237403   0            ***
Group2       -0.160763278  0.182865882  -0.879132162  0.379550074
Group3       0.436566802   0.112747557  3.872073273   0.000115236  ***
Group4       -0.048725662  0.12217049   -0.39883332   0.690104905
Group5       0.772420727   0.210121883  3.676060377   0.000250019  ***
Group6       0.818985189   0.134193341  6.103024041   1.51E-09     ***
Group7       0.543263921   0.942664681  0.576306646   0.5645433
Group8       0.301480732   0.250073417  1.205568889   0.228281442
Group9       -0.493229407  0.12981909   -3.799359604  0.000154218  ***
Group10      -0.04718856   0.175264044  -0.269242677  0.787800917
Group11      -0.314246327  0.112618026  -2.790373248  0.005369266  **
Group12      0.527963451   0.087878654  6.007869128   2.67E-09     ***
Group13      -0.297769109  0.303638702  -0.980669153  0.327003507
Group14      6.652037742   7.897121744  0.842336988   0.399809588
Group15      -1.055984376  0.646956991  -1.632232732  0.102959097
Group16      0.138299803   0.110239818  1.254535842   0.209953293
Group17      -0.6357577    0.466520362  -1.362765169  0.173276776
Group18      0.116008531   0.176642979  0.656740117   0.51150576
Group19      0.374947709   0.614983092  0.609687833   0.54221319
Group20      1.20435612    1.061705537  1.134359837   0.256927397
Group21      4.987221872   1.743917693  2.859780535   0.004331392  **
x:Group2     0.001050166   0.002516176  0.417365786   0.676504392
x:Group3     -0.008554796  0.002460093  -3.477427521  0.000529001  ***
x:Group4     0.001903607   0.002019025  0.942834988   0.346003012
x:Group5     -0.010787213  0.005177655  -2.083416742  0.037477885  *
x:Group6     -0.014218907  0.005022958  -2.83078363   0.004740559  **
x:Group7     -0.008431415  0.014878328  -0.566691007  0.571056871
x:Group8     -0.000159151  0.004020386  -0.03958611   0.968431349
x:Group9     0.012845941   0.004526545  2.837913174   0.004636831  **
x:Group10    0.000287811   0.002404708  0.119686302   0.904756746
x:Group11    0.002098228   0.001623657  1.292285511   0.19656991
x:Group12    -0.00425254   0.001426503  -2.981094186  0.002944823  **
x:Group13    0.00735169    0.004853974  1.514571391   0.130210828
x:Group14    -0.131278034  0.148891426  -0.881703112  0.378158539
x:Group15    0.024393641   0.013509973  1.8056025     0.071294263  .
x:Group16    -0.001430034  0.00164576   -0.868920355  0.385108328
x:Group17    0.007218412   0.006896262  1.04671371    0.295495653
x:Group18    -0.001860404  0.002500385  -0.744046906  0.457030552
x:Group19    -0.005075435  0.008580858  -0.591483395  0.554336117
x:Group20    -0.020374834  0.019000852  -1.072311629  0.283850132
x:Group21    -0.06229649   0.023678667  -2.630911986  0.008652347  **


The intercept for Group2 is 0.327341057 + (-0.160763278) = 0.166577779. x:Group2 represents the slope contrast for Group2. Group2’s slope is given by x + x:Group2, i.e. 0.047736145 + 0.001050166 = 0.048786311.
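
More generally, all of the per-group intercepts and slopes can be assembled from the lm4 coefficients; a sketch (again assuming factor levels 1 through 21):

cf4 <- coef(lm4)
grp <- levels(df_test$Group)
data.frame(Group     = grp,
           intercept = cf4["(Intercept)"] + c(0, cf4[paste0("Group", grp[-1])]),
           slope     = cf4["x"] + c(0, cf4[paste0("x:Group", grp[-1])]))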

Plotting the fits shows some dramatic changes in the slopes for a few of the groups. In particular, Group14’s slope runs in a different direction from that of the other groups, and its regression line stands apart from the rest. Most of the slopes shift only slightly from Group1’s.
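
Because lm4 gives every group its own intercept and slope, the same fitted lines can be drawn with per-group regressions in ggplot; a sketch:

ggplot(df_test, aes(x = x, y = y, color = Group)) +
  geom_point(size = 0.5) +
  geom_smooth(method = "lm", se = FALSE)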



Does Adding Groups Improve the Models?

We have looked at four linear models:

\[\begin{array}{l}lm1:y \sim N({\beta _0},{\sigma ^2})\\lm2:y \sim N({\beta _0} + {\beta _1}x,{\sigma ^2})\\lm3:y \sim N({\beta _{0j}} + {\beta _1}x,{\sigma ^2})\\lm4:y \sim N({\beta _{0j}} + {\beta _{1j}}x,{\sigma ^2})\end{array}\]

The models are nested, i.e. the independent variables of each simpler model are a subset of those of the more complex model. The null hypothesis is that the simpler model is adequate, i.e. the additional coefficients are zero; we reject the null hypothesis if the p-value is small, in our case less than 0.001.

 One way to compare models is with anova (analysis of variance).


> anova(lm1, lm2, lm3, lm4)
Analysis of Variance Table

Model 1: y ~ 1
Model 2: y ~ x
Model 3: y ~ x + Group
Model 4: y ~ x + Group + Group:x
  Res.Df     RSS Df Sum of Sq          F    Pr(>F)    
1    999 1089.36                                      
2    998   85.40  1   1003.96 15242.9063 < 2.2e-16 ***
3    978   67.26 20     18.15    13.7753 < 2.2e-16 ***
4    958   63.10 20      4.16     3.1571 3.849e-06 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> 

Given the small p-values, we can safely reject the null hypothesis. Despite the additional parameters they use, the more complex models are preferred over the simpler ones.

Another common method of comparing models is the likelihood ratio test. In this test, the log likelihoods of nested models are compared sequentially, and a chi-squared test of significance is performed. We reject the null hypothesis if the resulting p-values are small.


> library(lmtest)
> lrtest(lm1, lm2, lm3, lm4)
Likelihood ratio test

Model 1: y ~ 1
Model 2: y ~ x
Model 3: y ~ x + Group
Model 4: y ~ x + Group + Group:x
  #Df   LogLik Df    Chisq Pr(>Chisq)    
1   2 -1461.73                           
2   3  -188.75  1 2545.974  < 2.2e-16 ***
3  23   -69.32 20  238.861  < 2.2e-16 ***
4  43   -37.40 20   63.828  1.791e-06 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> 

The small p-values indicate that we can safely reject the null hypothesis that the simpler models are adequate.


Data and code for producing the tables and plots can be found on GitHub.


[i] Gelman, A., & Hill, J. (2007). Data Analysis Using Regression and Multilevel/Hierarchical Models. New York: Cambridge University Press.

