J. W. Gibbs as quoted by Irving Fisher
In certain versions you can hear a woman's voice say "Reticulating Splines"
Sim City
What is a Spline Anyway?
Consider this data,
> head(women.height)
height weight
1 58.04797 115.2637
2 58.52034 115.3882
3 58.67279 115.8949
4 58.86290 116.2369
5 58.90874 117.6327
6 59.09578 119.8376
The plot shows women's weight plotted against height. The data is made up, but realistic. There's a definite relationship: as height increases, so does weight. We could build a linear model of this relationship.
The red line is a linear model weight ~ height. That is
\[weight = \alpha + \beta * height + \varepsilon (0,{\sigma ^2})\]
where ${\alpha}$ and ${\beta}$ are constants and ${\varepsilon (0,{\sigma ^2})}$ is Gaussian noise with a mean of zero and some variance.
The fit doesn't look all that satisfactory. It misses the curves in the plot. Maybe we could do better with a higher-order polynomial. This is where splines come into play.
A spline is a function defined piecewise by polynomials. The general idea is that a sequence of points, called knots, is chosen and a polynomial is fit to the data within each interval between knots. The piecewise curves meet at the knot positions.
Probably the most common form of spline used for data fitting is the cubic spline. Cubic spline regression fits cubic functions that are joined at a series of k knots. The functions are smooth because their first and second derivatives match at the knots.
In terms of regression, we are modeling the data as
\[E(Y|X) = {\beta _0} + {\beta _1}X + {\beta _2}{X^2} + {\beta _3}{X^3} + {\beta _4}(X - {a_1})_ + ^3 + {\beta _5}(X - {a_2})_ + ^3 +  \cdots  + {\beta _{k + 3}}(X - {a_k})_ + ^3\]
where ${\beta _0}...{\beta _{k + 3}}$ are coefficients, ${a_1}...{a_k}$ are the knot positions, and ${(u)_ + } = \max (u,0)$ keeps only the positive part. What this says is that the expected value (mean) of Y is modeled as a piecewise cubic polynomial whose cubic term changes at each knot.
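To see what this design matrix looks like in practice, here is a minimal sketch that builds the truncated power basis by hand for the height data, using two made-up knot positions (62 and 66 inches are arbitrary choices for illustration only):
x <- women.height$height
a <- c(62, 66)                      # hypothetical knot positions
# design matrix for a cubic spline in truncated power form
X.design <- cbind(1, x, x^2, x^3,
                  pmax(x - a[1], 0)^3,
                  pmax(x - a[2], 0)^3)
dim(X.design)                       # one column per coefficient beta_0 ... beta_{k+3}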
To make this a bit more specific, spline regression is often carried out with B-splines, also called basis splines. Any spline function can be expressed as a linear combination of B-splines. Mathematically, a spline function can be expressed as
\[{S_{n,t}}(x) = \sum\limits_{i = 0}^{knots + 3} {{\beta _i}{B_{i,n}}(x)} \]
where n is the degree of the spline (3 in the case of cubic splines). There is a recursive formula for calculating ${{B_{i,n}}(x)}$. See https://en.wikipedia.org/wiki/B-spline#Definition for details.
We won't directly calculate the basis values. Instead, we'll use the bs() function from the R package splines. bs() returns a matrix of B-spline basis functions evaluated at x.
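Before fitting anything, it's worth a quick look at what bs() returns. A minimal sketch, assuming the women.height data frame shown above is loaded:
library(splines)
B <- bs(women.height$height, df=5)  # degree 3 with df=5 -> 2 internal knots
dim(B)                              # one row per observation, 5 basis columns
attr(B, "degree")                   # 3 for a cubic spline
attr(B, "knots")                    # internal knots, chosen from quantiles of height
attr(B, "Boundary.knots")           # the range of the height data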
Given a basis matrix, we can model our data as
\[{\bf{Y}} = {\bf{B(X)\beta }} + \varepsilon (0,{\sigma ^2}{\bf{I}})\]
${\bf{B(X)}}$ is the basis matrix evaluated at each X position. ${\bf{\beta}}$ is the vector of coefficients.
A Bayesian Approach
In R, it's relatively easy to build a regression model of the women's height data using cubic splines.
> summary(w.spline.fit <- lm(weight ~ bs(height, df=5), data=women.height))
Call:
lm(formula = weight ~ bs(height, df = 5), data = women.height)
Residuals:
Min 1Q Median 3Q Max
-2.84862 -0.54325 -0.03484 0.58158 2.11255
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 113.1512 0.6908 163.790 < 2e-16 ***
bs(height, df = 5)1 8.2935 1.0517 7.886 5.54e-12 ***
bs(height, df = 5)2 26.2674 0.7501 35.019 < 2e-16 ***
bs(height, df = 5)3 24.4347 1.0861 22.499 < 2e-16 ***
bs(height, df = 5)4 47.8717 0.8294 57.717 < 2e-16 ***
bs(height, df = 5)5 48.8285 0.8576 56.938 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.9298 on 94 degrees of freedom
Multiple R-squared: 0.9947, Adjusted R-squared: 0.9945
F-statistic: 3556 on 5 and 94 DF, p-value: < 2.2e-16
bs() returns the basis matrix. df=5 (number of knots = df - degree, and the degree is 3 for a cubic spline) tells bs() to choose two internal knots at quantiles of height. The fit is better than the linear fit shown above.
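To back up that claim (a rough check, assuming the same women.height data and the w.spline.fit object above):
> w.linear.fit <- lm(weight ~ height, data=women.height)
> AIC(w.linear.fit, w.spline.fit)   # lower AIC favors the spline model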
The regression model is the maximum likelihood estimate (MLE) of the relationship between height and weight. That is, it finds the set of parameters, ${\beta}$, that maximizes the probability of the weight data (the data likelihood) given the height data.
\[likelihood = P(weight|height,\beta )\]
R gives us the parameters and some more information about how good the fit to data is, but is that the full story? For instance, is there more we can say about ${\beta}$? For example, what is our level of uncertainty about the parameters?
Bayesian regression helps answer these questions. Bayes' theorem says
\[P(\Theta |X,Y) = \frac{{P(Y|X,\Theta )P(\Theta )}}{{P(Y|X)}}\]
where in our case X is the height data and Y is the weight data. ${\Theta}$ represents the parameters. ${P(Y|X,\Theta )}$ is the likelihood. ${P(\Theta )}$ is called the prior probability. ${P(\Theta |X,Y)}$ is called the posterior probability.
In other words,
\[Posterior \propto Likelihood*Prior\]
The prior represents our initial degree of belief about the parameters ${\Theta}$ before examining the data.
What is all of this saying about our model of the height and weight data? We will assume that the errors are independent and have equal but unknown variance. We also assume that, given height, weight is normally distributed with mean ${\bf{B(X)\beta }}$ and variance ${\sigma ^2}{\bf{I}}$.
Putting this in terms of our height and weight data.
\[{\bf{weight}}|{\bf{height}},{\bf{\beta }},{\sigma ^2} \sim N({\bf{B(height)\beta }},{\sigma ^2}{\bf{I}})\]
In other words, the data is normally distributed around ${\bf{B(height)\beta }}$ with a variance of ${\sigma^2}$.
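To make that statement concrete, here is a small sketch that simulates weights from this model. The coefficient values and noise level are made up purely for illustration; they are not the fitted values.
library(splines)
beta.demo <- c(113, 8, 26, 24, 48, 49)           # hypothetical coefficients
sigma.demo <- 1                                  # hypothetical noise standard deviation
B <- cbind(1, bs(women.height$height, df=5))     # B(height) with an intercept column
mu <- as.vector(B %*% beta.demo)                 # the mean curve B(height) %*% beta
weight.sim <- rnorm(length(mu), mean=mu, sd=sigma.demo)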
In order to estimate ${\beta}$ and ${\sigma^2}$, we turn Bayes' rule around
\[P(\beta ,{\sigma ^2}|{\bf{weight}},{\bf{B(height)}}) \propto P({\bf{weight}}|{\bf{B(height)}},\beta ,{\sigma ^2})P(\beta ,{\sigma ^2})\]
${P(\beta ,{\sigma ^2})}$ is our prior belief about the parameters. We don't know much about ${\beta}$ and ${\sigma^2}$, so a common choice for a prior is the uninformative prior that is uniform on (${\beta}$, log(${\sigma}$)), or
\[P(\beta ,{\sigma ^2}) \propto P(\beta )P({\sigma ^2}) \propto P({\sigma ^2})\]
That is we're saying $P(\beta ) \propto 1$, i.e. a flat prior for ${\beta}$. In other words, we don't really have an opinion about ${\beta}$.
Using this prior, we are saying that the joint posterior of ${\beta}$ and ${\sigma^2}$ given the data is proportional to a normal distribution times a prior on ${\sigma^2}$. For this prior, we will choose a conjugate prior. A conjugate prior is a distribution that belongs to the same family of distributions as the posterior. Basically, choosing the correct form for the prior distribution makes the calculations simpler. For ${\sigma^2}$, we will use an inverse gamma distribution. How do we know which conjugate prior to choose? We could examine the expected form of the posterior and choose the proper family, or we could do what, in my experience, most people do: look up the conjugate prior in Gelman's book Bayesian Data Analysis.
The prior distribution for ${\sigma^2}$ is
\[P({\sigma ^2}) \propto IG({\alpha _0},{\beta _0})\]
The inverse gamma distribution has two parameters, shape and rate (${\alpha_0, \beta_0}$). These are called hyperparameters. They aren't parameters of the underlying model; they belong to the prior distribution. It's possible to place prior distributions on these hyperparameters as well, and those priors would have their own parameters, and so on ad infinitum. We won't go that far. We'll choose values that make the prior distribution wide enough to indicate that we have only a weak opinion about ${\sigma^2}$.
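One way to judge how much information a particular choice of ${\alpha_0}$ and ${\beta_0}$ carries is simply to draw from the prior and look at its spread. A minimal sketch, using the invgamma package that the sampler below also relies on (alpha0 = beta0 = 10 are the defaults used there):
library(invgamma)
prior.draws <- rinvgamma(10000, shape=10, rate=10)  # draws from the IG(alpha0, beta0) prior on sigma^2
quantile(prior.draws, c(0.025, 0.5, 0.975))         # rough sense of the prior's spread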
Gibbs Sampling
We can, at least with the help of R, sample from a multivariate normal distribution to get an estimate of the parameters. However, the joint posterior still looks complicated. This is where Gibbs sampling comes in. Gibbs sampling gives us a way to explore a complicated joint posterior distribution with many parameters by examining the parameters one at a time.
I won't go into the theory here, but the idea is simple. Sample a value of one parameter from its conditional distribution given the data and the current values of the other parameters. Given that value, repeat with the next parameter. Keep cycling through the parameters for a large number of iterations and you will have an estimate of the posterior distribution of each parameter.
We have to be careful that we sample from the posterior conditional distribution of the parameters. Rather than derive the posterior distributions, I'll point you to this site where the derivations can be found.
With that in mind, the Gibbs sampling process becomes
- Sample ${\sigma ^2} \sim IG(shape = {\alpha _0} + \frac{N}{2},rate = \frac{1}{2}{({\bf{weight}} - {\bf{B(height)\beta }})^T}({\bf{weight}} - {\bf{B(height)\beta }}) + {\beta _0})$
- Sample $\beta \sim {\rm N}({({\bf{B}}{({\bf{height}})^{\bf{T}}}{\bf{B}}({\bf{height}}))^{ - 1}}{\bf{B}}{({\bf{height}})^{\bf{T}}}{\bf{weight}},{\sigma ^2}({({\bf{B}}{({\bf{height}})^{\bf{T}}}{\bf{B}}({\bf{height}}))^{ - 1}}))$
- Repeat
In more general terms ${\bf{height}} = {\bf{X}}$ and ${\bf{weight}} = {\bf{y}}$.
When we sample, we iterate for a fixed burn-in period to make sure that we are sampling from the posterior distribution, i.e. to remove transient effects. Then we keep iterating and save the samples. Since the samples are drawn from the posterior distribution, we can estimate the posterior mean and distribution of each parameter from them.
${({\bf{B}}{({\bf{height}})^{\bf{T}}}{\bf{B}}({\bf{height}}))^{ - 1}}{\bf{B}}{({\bf{height}})^{\bf{T}}}{\bf{weight}}$ is the MLE of ${\beta}$. The posterior of ${\beta}$ is centered around the MLE.
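As a quick sanity check (a sketch assuming the women.height data and the bs() basis from earlier), the closed-form expression above reproduces the coefficients that lm() reported:
library(splines)
B <- cbind(1, bs(women.height$height, df=5))     # basis matrix with an intercept column
y <- women.height$weight
b.mle <- solve(t(B) %*% B) %*% t(B) %*% y        # (B'B)^-1 B'y
cbind(b.mle, coef(lm(weight ~ bs(height, df=5), data=women.height)))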
This may seem complicated, but it's actually pretty easy to do in R.
gibbs.spline <- function(data, knots=NULL, alpha0=10, beta0=10,
                         burnin=1000, samples=2000,
                         plotIt=FALSE) {
  # gibbs.spline - Gibbs sampling for Bayesian spline regression
  # arguments:
  #   data - a data frame with columns x and y
  #   knots - vector of knot positions. If NULL, knots are placed at the inner quantiles
  #   alpha0, beta0 - hyperparameters for the inverse gamma prior on sigma^2
  #   burnin, samples - iterations for burn-in and sample count
  #   plotIt - if TRUE, draw plots of the posterior beta and sigma^2
  # returns
  #   a list - b = posterior samples of the coefficients,
  #            sigma2 = posterior samples of sigma^2
  require(MASS)
  require(splines)
  require(invgamma)
  iter = burnin + samples
  # get B(X) - the basis matrix at each X position
  if (is.null(knots)) {
    B <- bs(data$x, df=5)       # cubic spline with 2 internal knots; 6 coefficients with the intercept
  }
  else {
    B <- bs(data$x, knots=knots)
  }
  N <- dim(B)[1]                # length of X
  B <- cbind(rep(1,N), B)       # add a column for the intercept
  ncol <- dim(B)[2]
  b <- matrix(0, nrow=iter, ncol=ncol)  # storage for posterior samples
  sigma2 <- rep(0, iter)
  # calculate (B'B)^-1 and the MLE of beta, then initialize the chain
  BpB_inv <- solve(t(B) %*% B)
  b_mle <- BpB_inv %*% t(B) %*% data$y
  b[1,] <- b_mle
  sigma2[1] <- rinvgamma(1, shape=alpha0, rate=beta0)
  # iterate
  for(i in 2:iter) {
    b[i,] <- mvrnorm(1, mu=b_mle, Sigma=sigma2[i-1] * BpB_inv)
    temp <- data$y - B %*% t(t(b[i,]))
    sigma2[i] <- rinvgamma(1, shape=alpha0 + N/2, rate=beta0 + (1/2)*(t(temp) %*% temp))
  }
  # don't keep the burn-in samples
  b <- b[(burnin+1):iter,]
  sigma2 <- sigma2[(burnin+1):iter]
  if(plotIt) {
    # draw some plots
    require(reshape2)
    require(ggplot2)
    b2 <- melt(b)
    names(b2) <- c("b1", "b2", "value")
    g1 <- ggplot(b2, aes(x=value)) + geom_histogram() + facet_wrap(~b2) + ggtitle('Coefficients')
    print(g1)
    g2 <- ggplot(data.frame(x=sigma2), aes(x=x)) + geom_histogram() + ggtitle('sigma^2')
    print(g2)
  }
  return(list(b=b, sigma2=sigma2))
}
This function returns a list containing a matrix of posterior samples of the coefficients, one sample per row, and a vector of posterior samples of the variance.
The plots show the posterior distributions from
w.fit <- gibbs.spline(data.frame(x=women.height$height, y=women.height$weight), plotIt=TRUE)
The posterior distributions of the coefficients look somewhat normal. We could smooth them out with more samples.
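Beyond the histograms, the samples can be summarized directly. A short sketch, assuming the w.fit object returned above:
colMeans(w.fit$b)                                   # posterior means of the coefficients
mean(w.fit$sigma2)                                  # posterior mean of sigma^2
apply(w.fit$b, 2, quantile, probs=c(0.025, 0.975))  # 95% credible intervals, one column per coefficient
quantile(w.fit$sigma2, c(0.025, 0.975))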
Prediction
Since we have samples drawn from the posterior distribution of ${\beta}$ and ${\sigma^2}$, we can use them to make predictions about weight based on a new set of heights. To do that, we calculate ${\bf{B}}({\bf{height}})$ for the new data, sample ${\beta}$ and ${\sigma^2}$ from the posterior distributions, and then sample new weights from
\[{\bf{weight}} \sim N({\bf{B}}({\bf{height}}){\bf{\beta }},{\sigma ^2}{\bf{I}})\]
This code returns predictions drawn from the posterior predictive distribution of weights.
gibbs.predict.spline <- function(X, b, sigma2,
                                 samples = 100,
                                 knots=NULL) {
  # gibbs.predict.spline - posterior prediction from the results of gibbs.spline
  # arguments:
  #   X - a vector of new data
  #   b - matrix of coefficient samples from gibbs.spline
  #   sigma2 - samples of the variance from gibbs.spline
  #   samples - number of posterior draws to make; must be less than or equal to dim(b)[1]
  #   knots - must be the same set of knots used in gibbs.spline
  # returns
  #   y - a matrix[samples, length(X)] of samples drawn from the posterior predictive of y
  require(MASS)
  require(splines)
  # get B(X); note that with knots=NULL the knots are re-chosen from the
  # quantiles of X, so pass the original knots for an exact match
  if (is.null(knots)) {
    B <- bs(X, df=5)
  }
  else {
    B <- bs(X, knots=knots)
  }
  N <- dim(B)[1]
  B <- cbind(rep(1,N), B)
  # draw a bunch of samples
  y = matrix(0, samples, length(X))
  for(i in 1:samples) {
    y[i,] = mvrnorm(1, as.vector(B %*% t(t(b[i,]))), sigma2[i] * diag(N))
  }
  return(y)
}
Given some new height data
> head(height2)
[1] 58.14645 58.22143 58.41581 58.62558 58.73675 58.85710
We can predict a new set of weight values.
> new.weights <- gibbs.predict.spline(height2, w.fit$b, w.fit$sigma2)
> ggplot(data.frame(height=height2, weight=colMeans(new.weights)), aes(x=height, y=weight))+geom_point()+ggtitle('Predicted Means of Data')
The plot shows the mean of the predicted weights.
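Because each row of new.weights is a separate draw from the posterior predictive distribution, the same matrix also gives uncertainty bands around those means. A sketch, again assuming ggplot2 is loaded:
pred.bounds <- apply(new.weights, 2, quantile, probs=c(0.025, 0.975))  # 95% interval at each new height
pred.df <- data.frame(height=height2,
                      weight=colMeans(new.weights),
                      lower=pred.bounds[1,],
                      upper=pred.bounds[2,])
ggplot(pred.df, aes(x=height, y=weight)) +
  geom_ribbon(aes(ymin=lower, ymax=upper), alpha=0.3) +
  geom_point() +
  ggtitle('Predicted Weights with 95% Intervals')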
The code is available for download from here.