Hetero What?

Normality is the Great Neurosis of civilization.
Tom Robbins

Normality is death.
Theodor W. Adorno


Consider the following data:


It looks like there is a definite linear trend in the data, so a natural model might be

\[{\mathbf{y}} = {\beta _0} + {\beta _1}{\mathbf{x}} + {\mathbf{\varepsilon }}\]

The betas are numerical coefficients and ${\mathbf{\varepsilon}}$ represents an error term, i.e. the difference between the model values and the data values.

Fitting a linear model to this sort of data is almost trivial in languages like R, Python, or Julia. In R:

> fit.lm <- lm(y ~ x, df.lm$df)
> summary(fit.lm)

Call:
lm(formula = y ~ x, data = df.lm$df)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.29382 -0.04390 -0.00026  0.03665  0.39976 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 0.1046755  0.0052070   20.10   <2e-16 ***
x           0.0247682  0.0003006   82.41   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.08239 on 998 degrees of freedom
Multiple R-squared:  0.8719,    Adjusted R-squared:  0.8717 
F-statistic:  6791 on 1 and 998 DF,  p-value: < 2.2e-16

> ggplot(df.lm$df, aes(x=x, y=y)) + geom_point() + geom_abline(slope=coef(fit.lm)[2], intercept=coef(fit.lm)[1], color="red")


df.lm$df is a data frame containing the x and y values.


The red line shows the fitted model.

Behind this deceptively simple example are a number of assumptions. This page lists the ten most important assumptions we make when fitting data to a linear model, and it gives R code to test each of them. Violating any of these assumptions doesn't necessarily make our model wrong. However, making decisions based on incorrect assumptions can get us into trouble.

Let's examine just a couple of these assumptions: homoscedasticity of the residuals and normality of the residuals. Homoscedasticity of the residuals means that the errors, ${\mathbf{\varepsilon}}$ in the linear model shown above, all have the same variance. In other words, however the residuals are distributed, all of the residual values can be described as coming from a single distribution with finite variance. Normality of the residuals means that a histogram of the residuals should look like a normal curve. Combining these two assumptions means that the errors are modeled as being drawn from a normal distribution with a mean of zero and a fixed, finite variance. In other words,

\[\varepsilon  \sim N(0,{\sigma ^2})\]
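To see what this assumption buys us, here is a quick simulated contrast between constant-variance and non-constant-variance errors. This is my own illustration, not the data used in this post; the betas match the defaults used later, everything else is made up for the sketch.

# A minimal sketch (not the post's data): homoscedastic errors have a
# constant spread around the line, heteroscedastic errors do not.
set.seed(1)
x <- seq(0, 30, length.out = 500)
e.homo   <- rnorm(500, mean = 0, sd = 0.1)       # constant variance
e.hetero <- rnorm(500, mean = 0, sd = 0.02 * x)  # variance grows with x
par(mfrow = c(1, 2))
plot(x, 0.1 + 0.025 * x + e.homo,   main = "Homoscedastic")
plot(x, 0.1 + 0.025 * x + e.hetero, main = "Heteroscedastic")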

Let's test our assumptions.

Zero Mean

We can test that assumption easily in R:

> mean(resid(fit.lm))
[1] -3.525839e-18

The mean of the residuals is very close to zero, so we are safe on that count.

Normality

R has a number of tests for normality, but you can often determine non-normality by visual inspection of the data. Here's a histogram of the residuals.


> ggplot(data.frame(r=resid(fit.lm)), aes(x=r)) + 
   geom_histogram(aes(y = ..density..), fill="white", color="black") + labs(x="Residuals")


The plot looks sort of normal, but the tails appear to be long.

It's easy for the eye to be deceived when looking at a plot like this. A better way to visualize whether there is a deviation from normality is a qqplot. A qqplot plots the quantiles of a standard normal distribution against the quantiles of the residuals. If the data were perfectly normal, we would expect a straight line. Deviations from the line indicate deviations from the normal distribution.

> qqplot.data(df.lm$df, fit.lm)
> 
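qqplot.data is a helper from the linked source code. For reference, a plain base-R sketch of the same idea looks like this (my stand-in, not the author's function):

# Base-R equivalent of the normal Q-Q plot (a stand-in for qqplot.data)
qqnorm(resid(fit.lm))
qqline(resid(fit.lm), col = "red")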


The departure from the straight line at either end of the plot is indicative of a distribution with longer tails than a normal distribution.

Violations of homoscedasticity, called heteroscedasticity, can result in overestimating the goodness of fit. Heteroscedasticity is also a problem if your goal is to understand the error model rather than just fit a model to some data. However, homoscedasticity is not required for the estimates to be unbiased, consistent, and asymptotically normal.
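If the only concern were inference on the coefficients, one common remedy (not the approach taken in this post) is to keep the OLS fit but use heteroscedasticity-consistent standard errors, for example via the sandwich and lmtest packages:

# Heteroscedasticity-consistent (HC) standard errors for the OLS fit.
# A side note only; the rest of the post takes a different route.
library(sandwich)
library(lmtest)
coeftest(fit.lm, vcov. = vcovHC(fit.lm, type = "HC3"))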

Dealing with heteroscedasticity

The visualization plots show us that the residuals from our model are not normal. Unfortunately, the standard tests for heteroscedasticity don't show significant deviation from homoscedasticity.

> car::ncvTest(fit.lm)
Non-constant Variance Score Test 
Variance formula: ~ fitted.values 
Chisquare = 0.01943981    Df = 1     p = 0.889113 
>  

Given the p-value, the test says that we can't reject the null hypothesis that the variance of the residuals is constant. However, the plots show definite non-normality. The shape of the histogram might lead to the suspicion that the residuals could be modeled as a mixture of normals, possibly with at least one mixture component with a wide variance. In fact, that's how the data was generated: it is artificial data generated from a linear model with errors drawn from a mixture of three normals.

Here's how the data was generated.

generate.lm.mix.data <- function(beta = c(0.1, 0.025),
                                 mu = c(0.01, -0.02), 
                                 sigma = c(0.1, 0.05, 0.02), 
                                 lambda = c(0.5, 0.3),
                                 N=1000) {
# generate.lm.mix.data - generate a linear model with error terms drawn from
#                        a mixture of three normal distributions
#
# arguments:
# beta - a vector of linear model components (intercept, slope)
# mu - a vector of two means for the normal distributions (mu[1], mu[2])
#      The third mean is computed so that the means satisfy the constraint:
#      lambda[1]*mu[1] + lambda[2]*mu[2] + lambda[3]*mu[3] = 0
# sigma - a vector of standard deviations (sigma[1], sigma[2], sigma[3])
# lambda - a vector of two mixture parameters (lambda[1], lambda[2])
#          The third mixture parameter is computed so that all three sum to 1:
#          lambda[1] + lambda[2] + lambda[3] = 1
# N - an integer, the number of x and y values to generate
#
# return:
# a list
# df - a data frame of generated x and y values
# beta - the vector of linear model components (intercept, slope)
# mu - the vector of means of the normal distributions (mu[1], mu[2], mu[3])
# sigma - a vector of standard deviations (sigma[1], sigma[2], sigma[3])
# lambda - vector of mixture components (lambda[1], lambda[2], lambda[3])
# errors - vector of length N of draws from the three normal distributions
# components - a vector of length N indicating which distribution the error was drawn from
#
# requires:
# sigma values > 0
# lambda values > 0 and < 1
#
# this function does no error checking
#
  # calculate missing lambda and mu values
  lambda <- c(lambda[1], lambda[2], 1-(lambda[1] + lambda[2]))
  mu <- c(mu[1], mu[2], -(lambda[1] * mu[1] + lambda[2] * mu[2])/lambda[3])
  
  # sample components and calculate errors
  components <- sample(1:3, prob=c(lambda[1], lambda[2], lambda[3]),
                       size=N, replace=TRUE)
  e <- rnorm(N, mu[components], sigma[components])

  x <- seq(0, 30, length=N)
  y <- beta[1] + beta[2] * x + e
  
  return(list(df = data.frame(x = x, y  = y),
              beta = beta,
              mu = mu,
              sigma = sigma,
              lambda = lambda,
              errors = e,
              components = components))
}  
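For reference, the df.lm object used earlier would be produced by a call along these lines (the seed is my addition for reproducibility; the one actually used isn't shown in the post):

set.seed(42)                     # assumed seed, not the one used for the post's data
df.lm <- generate.lm.mix.data()  # defaults: beta = c(0.1, 0.025), N = 1000
head(df.lm$df)
table(df.lm$components)          # counts of errors drawn from each mixture component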

R provides a number of methods for dealing with linear models that have non-normal distributions of residuals. This vignette from the lmvar package describes one in detail and provides references to others.

For this example, I will use another approach: I will fit the linear model ${\mathbf{y}} = {\beta _0} + {\beta _1}{\mathbf{x}}$ together with a mixture model of the residuals. That is, I will use an expectation–maximization (EM) algorithm to fit the betas, the means and standard deviations of the individual normal distributions, and the mixture parameters.

I looked at how to fit a mixture model of normal distributions in a previous post. I won't go into the details of EM here. I'll use a slightly different approach to fitting the parameters than the standard EM for mixtures.

Expectation–maximization for residuals as mixtures of normals

Here's where I cheat a bit. In general, I wouldn't know that the residuals could be modeled effectively by a mixture of distributions. I might suspect it from the histogram above, but I wouldn't necessarily know how many components make up the mixture. In practice, I would try a variety of models and use AIC or BIC to choose among them (a sketch of this is shown below). However, in this case I know that the residuals are drawn from a mixture of three normal distributions, because I generated the data.
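Here is roughly what that model selection might look like. This is my own sketch, fitting mixtures to the OLS residuals with mixtools; the parameter count assumes k means, k standard deviations, and k-1 free mixing weights.

# Rough BIC comparison for mixtures with different numbers of components.
# normalmixEM returns the log likelihood in $loglik.
r <- resid(fit.lm)
bic <- sapply(2:4, function(k) {
  m <- mixtools::normalmixEM(r, k = k, maxit = 20000)
  npar <- 3 * k - 1                       # k means + k sds + (k-1) mixing weights
  -2 * m$loglik + npar * log(length(r))
})
names(bic) <- paste0("k=", 2:4)
bic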

I'll take a hybrid approach to the EM algorithm. For the expectation step, I'll calculate the residuals based on the current estimates of the beta parameters. I then estimate the probability that each residual value was drawn from each of the three mixture distributions, using the current estimates of the distribution means and standard deviations and the mixing parameters.

    # expectation
    r <- df$y - (beta[1] + beta[2] * df$x)
    
    # get the probability of each residual being drawn from each distribution
    p <- matrix(0, N, K)
    for(k in 1:K) {
      p[, k] <- lambda[k] * dnorm(r, mu[k], sigma[k])
    }
    p <- p / rowSums(p) # normalize

I'll take a slightly non-standard approach to the maximization step. First, I'll use the estimated probabilities for each point to get a maximum likelihood estimate of the mixing parameters, $\lambda$. Next, I will get new estimates of the linear model parameters and the parameters of the mixture distributions by maximizing the log likelihood of the mixture distribution. For practical purposes, I will minimize the negative of the log likelihood. In addition, I will constrain the component means to be centered around zero by requiring

\[\sum\limits_{k = 1}^3 {{\lambda _k}{\mu _k} = 0} \]

Estimating the mixture parameters

    # maximization
    lambda = colSums(p)/N


Here are the equality constraint function and the log likelihood function.

  eqfn <- function(x) {
    # eqfn - left-hand side of the equality constraint on the lambdas and mus
    #
    # arguments:
    # x - parameter vector passed by solnp
    #   x[3:(3+K-1)] are the means of each of the distributions
    # see the comments for LL
    #
    # returns:
    # the sum of the mixture parameters * means
    #
    q <- sum(lambda * x[3:(3+K-1)])

    return(c(q))
  }
  
  LL <- function(x) {
    # LL - log likelihood function
    #      Calculate the negative of the log probability of the residuals 
    #      given a set of parameter estimates
    # 
    # arguments:
    # x - parameter vector passed by solnp
    #   x[1:2] - the linear model coefficients (beta[1], beta[2])
    #   x[3:(3+K-1)] - the means of each of the mixture distributions
    #   x[(3+K):(3+2K-1)] - the standard deviations of the mixture distributions
    #
    # returns:
    # the negative of the sum over all data points of the log mixture likelihood
    #
    # requires:
    # lambda - the current mixture parameter estimates, taken from the enclosing environment
    #
    r <- df$y - (x[1] + x[2] * df$x) # residuals
    
    # sum the mixture values
    r2 <- 0
    for(k in 1:K) {
      r2 <- r2 + suppressWarnings(lambda[k] * dnorm(r, x[2+k], x[2+K+k]))
    }

    if(debug) 
      cat(x, -sum(log(r2)), "\n")
    
    return(-sum(log(r2)))
  }


To accomplish this, I'll use a non-linear optimization routine, solnp from the Rsolnp package, with an equality constraint and lower bounds on the values of the standard deviations.

    # constraint right-hand sides
    eqB <- c(0)  # equality constraint
    lb <- c(-1.0e+06, -1.0e+06,
            -1.0e+06, -1.0e+06, -1.0e+06,
            0, 0, 0) # lower bounds on parameters
    ub <- c(1.0e+06, 1.0e+06,
            1.0e+06, 1.0e+06, 1.0e+06,
            1.0e+06, 1.0e+06, 1.0e+06) # upper bounds

    # start from current estimates
    start <- c(beta0 = beta[1], beta1 = beta[2],
              mu1 = mu[1], mu2 = mu[2], mu3 = mu[3],
              sigma1 = sigma[1], sigma2 = sigma[2], sigma3=sigma[3])
    names(start) <- c("beta0", "beta1",
                      "mu1", "mu2", "mu3",
                      "sigma1", "sigma2", "sigma3")

    # find the parameters that maximize the log likelihood
    fit <- solnp(start, LL, 
                 eqfun = eqfn, eqB = eqB, 
                 LB = lb, UB = ub, 
                 control = list(outer.iter = maxOptIter))
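solnp returns a list whose fields include the optimal parameters and the trace of the objective (here, the negative log likelihood). How these feed back into the loop is sketched below; the field names are Rsolnp's, the usage is my own, not copied from em.lm.

    # pull the updated estimates and the objective value out of the solnp result
    beta  <- fit$pars[c("beta0", "beta1")]
    mu    <- fit$pars[c("mu1", "mu2", "mu3")]
    sigma <- fit$pars[c("sigma1", "sigma2", "sigma3")]
    negLL <- tail(fit$values, 1)    # last value of the negative log likelihood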


This differs from a more standard EM approach like the one shown here.

Using the new estimates, the process is repeated until the change in the log likelihood is below some preset value.

Getting Started

EM finds a local optimum. This particular version is sensitive to the starting values of the parameters. A small change in the initial values of $\mu$, $\sigma$, or $\lambda$ can produce a slightly different optimal solution. One approach is to systematically try a variety of starting points and report the solution with the maximum log likelihood. I tried that, but a simpler yet effective approach for this problem was to use the linear model from above for the initial estimates of the betas and to fit its residuals to a mixture model to get starting values for the other parameters. We can use R's mixtools package to fit the residuals to a mixture of normals. Restarting several times using the results of mixtools::normalmixEM as starting values produced the following solution.

> fit.mix <- mixtools::normalmixEM(resid(fit.lm), 
                                 k=3, maxit=20000)
> em.fit.lm <- em.lm(df.lm$df, plotIt=TRUE, beta=coef(fit.lm), mu=fit.mix$mu, 
                      sigma=fit.mix$sigma, 
                      lambda=fit.mix$lambda)





The colors in the scatter plot above represent the component with the maximum posterior probability.
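The coloring can be reproduced from the reported estimates along these lines. This is a sketch, not em.lm's own plotting code: it recomputes the (unnormalized) posterior probabilities and colors each point by the most probable component.

# Recompute posteriors from the fitted parameters and color by max posterior.
pars <- em.fit.lm$fit$pars
lam  <- em.fit.lm$lambda
r    <- df.lm$df$y - (pars["beta0"] + pars["beta1"] * df.lm$df$x)
post <- sapply(1:3, function(k)
  lam[k] * dnorm(r, pars[paste0("mu", k)], pars[paste0("sigma", k)]))
comp <- factor(apply(post, 1, which.max))   # normalization doesn't change which.max
ggplot(df.lm$df, aes(x = x, y = y, color = comp)) + geom_point()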

The em.fit.lm object contains the estimates of the parameters.

> em.fit.lm$fit$pars
       beta0        beta1          mu1          mu2          mu3       sigma1       sigma2 
 0.100146875  0.025072677 -0.033703944  0.011228524  0.004631718  0.024529732  0.021273649 
      sigma3 
 0.103116720 
> em.fit.lm$lambda
[1] 0.1623976 0.2413947 0.5962077
> 

Finally, here is a qqplot of the residuals vs. the mixture of three normal distributions.
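Such a plot can be constructed, for example, by simulating a large sample from the fitted mixture and comparing its quantiles to the residuals. Here's a sketch using the reported estimates (not necessarily how the figure below was produced):

# Q-Q plot of the residuals against the fitted three-component mixture.
pars <- em.fit.lm$fit$pars
lam  <- em.fit.lm$lambda
k    <- sample(1:3, 1e5, replace = TRUE, prob = lam)
sim  <- rnorm(1e5, pars[c("mu1", "mu2", "mu3")][k],
              pars[c("sigma1", "sigma2", "sigma3")][k])
r    <- df.lm$df$y - (pars["beta0"] + pars["beta1"] * df.lm$df$x)
qqplot(sim, r, xlab = "Mixture quantiles", ylab = "Residual quantiles")
abline(0, 1, col = "red")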


The source code is available here.
