I loved dinosaurs, I loved space, and I thought maybe I'd be the first paleo-astronaut.
- Bill Maris
A while ago, I wrote about using Gibbs sampling for spline regression. The motivation for that post was the paper BAYSPLINE: A New Calibration for the Alkenone Paleothermometer by Jessica E. Tierney and Martin P. Tingley. (It's behind a paywall. I'll spare you my rant about taxpayer-supported research being a profit center for private companies.) The paper has accompanying software in Python and MATLAB. In this post, I'll use the R software presented previously to reproduce the results from the Tierney and Tingley paper.
A few marine species of phytoplankton, most notably E. huxleyi, produce long-chain unsaturated methyl and ethyl n-ketones known as alkenones. Alkenones contain between 35 and 41 carbon atoms with between one and four double bonds. Alkenone-producing species respond to changes in water temperature by altering the relative proportions of the different alkenones they produce: at higher temperatures, proportionally more of the di-unsaturated (more saturated) alkenones are produced. Alkenones are preserved in marine sediment, making them a useful proxy for sea surface temperature (SST). The proxy, called ${U_{37}}^k$, is defined as the relative degree of unsaturation of di- versus tri-unsaturated C37 alkenones:
\[{U_{37}}^k = \frac{{{C_{37.2}}}}{{{C_{37.2}} + {C_{37.3}}}}\]
${U_{37}}^k$ can be used to estimate SST. A commonly used calibration, due to Müller et al., is
\[{U_{37}}^k = 0.0337T + 0.044\]
where T is the sea surface temperature in ${}^ \circ C$.
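In practice, the calibration is inverted: given a ${U_{37}}^k$ value measured from a sediment core, solve for T. A quick sketch in R (the function name and the example value here are mine):

# Invert the linear calibration: T = (UK37 - 0.044) / 0.0337
muller.sst <- function(uk37) {
  (uk37 - 0.044) / 0.0337
}
muller.sst(0.718)  # a made-up UK37 value; about 20 degrees C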
Tierney and Tingley's Python software provides a MATLAB .mat file that contains a set of SST values and the corresponding ${U_{37}}^k$ values. The file can be read with R's R.matlab package.
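Here's a minimal sketch of getting the data into a data frame. The file name and the variable names inside the .mat file are assumptions; check names(mat) against your copy of the package:

library(R.matlab)

# Load the calibration data shipped with the BAYSPLINE package.
# "bayspline.mat" and the field names are assumptions; inspect
# names(mat) to see what the file actually contains.
mat <- readMat("bayspline.mat")
df.tt <- data.frame(SST = as.vector(mat$sst),
                    UK37 = as.vector(mat$uk))
str(df.tt)

df.tt, with columns SST and UK37, is the data frame used by the code later in this post.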
This figure shows a plot of ${U_{37}}^k$ vs SST from the Tierney and Tingley package. The red line shows the predicted ${U_{37}}^k$ values from the Müller calibration.
The Müller calibration is linear. The ${U_{37}}^k$ values, however, are not quite linear: there is a bit of curvature at higher temperatures. It was this curvature that inspired Tierney and Tingley to try a form of spline regression.
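The figure can be reproduced along these lines (a sketch, assuming the df.tt data frame from above):

library(ggplot2)

# UK37 vs SST with the linear Müller calibration overlaid in red
ggplot(df.tt, aes(x = SST, y = UK37)) +
  geom_point() +
  geom_abline(slope = 0.0337, intercept = 0.044, color = "red") +
  ggtitle("UK37 vs SST with the Müller calibration")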
A Bayesian Spline Model
As pointed out in the previous post, spline regression is often modeled with B-splines, also called basis splines; a spline function can be expressed as a linear combination of B-splines. Mathematically,
\[{S_{n,t}}(x) = \sum\limits_{i = 0}^{knots + 3} {{\beta _i}{B_{i,n}}(x)} \]
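The ${B_{i,n}}$ are the basis functions, which R's splines::bs() will build for us. A quick check of the basis dimensions on a made-up SST grid (the grid and knot are illustrative):

library(splines)

# Cubic (degree 3) basis with one interior knot at 23.6.
# Without an intercept column, bs() returns degree + #knots
# basis functions -- 4 columns here; intercept = TRUE adds one more.
B <- bs(seq(0, 30, by = 0.5), knots = c(23.6))
dim(B)  # rows = data points, columns = basis functions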
The Tierney and Tingley model is then
\[\begin{gathered}
{{\mathbf{U}}_{{\mathbf{37}}}}^{\mathbf{k}} = {\mathbf{B(SST)\beta }} + \varepsilon \hfill \\
\varepsilon \sim N({\mathbf{0}},{\tau ^2}{\mathbf{I}}) \hfill \\
\end{gathered} \]
${\mathbf{\beta }}$ is a vector of model coefficients.
To estimate ${\mathbf{\beta }}$ and ${\tau ^2}$, we use a Bayesian model:
\[p({\mathbf{\beta }},{\tau ^2}|{{\mathbf{U}}_{{\mathbf{37}}}}^{\mathbf{k}},{\mathbf{B(SST)}}) \propto p({{\mathbf{U}}_{{\mathbf{37}}}}^{\mathbf{k}}|{\mathbf{B(SST)}},{\mathbf{\beta }},{\tau ^2})p({\mathbf{\beta }},{\tau ^2})\]
We assume that ${{\mathbf{U}}_{{\mathbf{37}}}}^{\mathbf{k}}$ is normally distributed given the spline basis and the model parameters.
\[p({{\mathbf{U}}_{{\mathbf{37}}}}^{\mathbf{k}}|{\mathbf{B(SST)}},{\mathbf{\beta }},{\tau ^2}) \sim N({\mathbf{B(SST)\beta }},{\tau ^2}{\mathbf{I}})\]
The priors for ${\mathbf{\beta }}$ and ${\tau ^2}$ are normal and inverse gamma.
\[\begin{gathered}
{\mathbf{\beta }} \sim N({{\mathbf{\mu }}_{\mathbf{0}}},\sigma _0^2I) \hfill \\
{\tau ^2} \sim IG({a_0},{b_0}) \hfill \\
\end{gathered} \]
The prior mean ${{\mathbf{\mu }}_{\mathbf{0}}}$ is set to the maximum likelihood estimates from an lm() fit of the data. $\sigma _0^2$, $a_0$, and $b_0$ are set to large values to keep the priors relatively uninformative.
Following Tierney and Tingley's model, we arrive at the full conditionals
\[\begin{gathered}
  {\mathbf{\beta }}|{\tau ^2} \sim N({\mathbf{V\Psi }},{\mathbf{V}}) \hfill \\
  {\mathbf{\Psi }} = {{\mathbf{\mu }}_{\mathbf{0}}}\sigma _0^{ - 2} + {\mathbf{B(SST}}{{\mathbf{)}}^{\mathbf{t}}}{{\mathbf{U}}_{{\mathbf{37}}}}^{\mathbf{k}}{\tau ^{ - 2}} \hfill \\
  {{\mathbf{V}}^{{\mathbf{ - 1}}}} = \sigma _0^{ - 2}{\mathbf{I}} + {\mathbf{B(SST}}{{\mathbf{)}}^{\mathbf{t}}}{\mathbf{B(SST)}}{\tau ^{ - 2}} \hfill \\
  {\tau ^2}|{\mathbf{\beta }} \sim IG\left( {{a_0} + \frac{N}{2},{b_0} + \frac{1}{2}{{({{\mathbf{U}}_{{\mathbf{37}}}}^{\mathbf{k}} - {\mathbf{B(SST)\beta }})}^t}({{\mathbf{U}}_{{\mathbf{37}}}}^{\mathbf{k}} - {\mathbf{B(SST)\beta }})} \right) \hfill \\
\end{gathered} \]
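For completeness, the ${\mathbf{\beta }}$ conditional is the usual conjugate-normal step: multiply the normal likelihood by the normal prior and complete the square in the exponent,
\[p({\mathbf{\beta }}|{\tau ^2}, \cdot ) \propto \exp \left( { - \frac{1}{{2{\tau ^2}}}{{\left\| {{{\mathbf{U}}_{{\mathbf{37}}}}^{\mathbf{k}} - {\mathbf{B(SST)\beta }}} \right\|}^2} - \frac{1}{{2\sigma _0^2}}{{\left\| {{\mathbf{\beta }} - {{\mathbf{\mu }}_{\mathbf{0}}}} \right\|}^2}} \right) \propto \exp \left( { - \frac{1}{2}{{({\mathbf{\beta }} - {\mathbf{V\Psi }})}^t}{{\mathbf{V}}^{{\mathbf{ - 1}}}}({\mathbf{\beta }} - {\mathbf{V\Psi }})} \right)\]
with ${\mathbf{V}}$ and ${\mathbf{\Psi }}$ as above. The ${\tau ^2}$ conditional follows the same way from the inverse gamma prior.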
Gibbs Sampling
Given all of the above, we can construct a Gibbs sampler to estimate ${\mathbf{\beta }}$ and ${\tau ^2}$ by alternately sampling ${\tau ^2}$ and then ${\mathbf{\beta }}$. Tierney and Tingley don't give us their spline basis matrix. They do state that their spline model has three knots: knots at the upper and lower limits of the data range, and a knot at SST = 23.6, approximately where the values of ${U_{37}}^k$ appear to flatten. I used the R function bs() and the given knot positions to construct a cubic spline basis. Here's the code for the sampler.
gibbs.spline.tt <- function(data, knots=NULL,
                            sigma20=10, alpha0=10, beta0=10,
                            burnin=1000, samples=2000,
                            plotIt=FALSE) {
  # gibbs.spline.tt - Gibbs sampling for Bayesian spline regression
  # arguments:
  #  data - a data frame with columns SST and UK37
  #  knots - vector of knot positions. If NULL, knots are placed at the inner quantiles
  #  sigma20 - hyperparameter for variance of coefficients
  #  alpha0, beta0 - hyperparameters for inverse gamma prior for tau^2
  #  burnin, samples - iterations for burn-in and sample count
  #  plotIt - if TRUE, draw a histogram of posterior beta and tau^2
  # returns:
  #  a list - b = posterior estimates of coefficients,
  #           tau2 = posterior estimates of tau^2
  #           knots = the input or calculated knots vector
  #           SST, UK37 = data used to generate b
  #
  # Based on Tierney & Tingley 2018 https://doi.org/10.1002/2017PA003201
  #
  require(MASS)
  require(splines)
  require(invgamma)
  iter <- burnin + samples
  # get B(X) - the basis matrix at each X position.
  if (is.null(knots)) {
    B <- bs(data$SST, df=5)   # cubic spline + 2 interior knots -> df = 5
    knots <- attr(B, "knots")
  }
  else {
    B <- bs(data$SST, knots=knots)
  }
  N <- dim(B)[1]              # length of X
  M <- dim(B)[2]              # number of beta coefficients
  Y <- t(t(data$UK37))        # UK37 as a column vector
  # get LS estimates as the prior mean
  # no intercept
  mu0 <- coef(lm(UK37 ~ bs(SST, knots=knots) - 1, data))
  b <- matrix(0, nrow=iter, ncol=M)  # storage for posteriors
  tau2 <- rep(0, iter)
  tau2[1] <- rinvgamma(1, shape=alpha0, rate=beta0)
  # start sampling
  for(i in 2:iter) {
    PSI <- (1/sigma20) * t(t(mu0)) + (1/tau2[i-1]) * (t(B) %*% Y)
    V <- ginv((1/sigma20) * diag(M) + (1/tau2[i-1]) * (t(B) %*% B))
    mu_hat <- as.vector(V %*% PSI)
    b[i,] <- mvrnorm(1, mu=mu_hat, Sigma=V)
    temp <- Y - B %*% t(t(b[i,]))
    tau2[i] <- rinvgamma(1, shape=alpha0 + N/2,
                         rate=beta0 + (1/2) * (t(temp) %*% temp))
  }
  # don't save burn-in data
  b <- b[(burnin+1):iter,]
  tau2 <- tau2[(burnin+1):iter]
  if(plotIt) {
    # draw some plots
    require(ggplot2)
    for(i in 1:M) {
      g1 <- ggplot(data.frame(x=b[,i]), aes(x=x)) +
        geom_histogram(color="black", fill="white", bins=30) +
        ggtitle(paste("b[,", i, "]", sep=""))
      print(g1)
    }
    g2 <- ggplot(data.frame(x=tau2), aes(x=x)) +
      geom_histogram(color="black", fill="white", bins=30) +
      ggtitle('tau^2')
    print(g2)
  }
  return(list(b=b,
              tau2=tau2,
              knots=knots,
              SST=data$SST,
              UK37=data$UK37))
}
Given the results from the sampler, we can plot the posterior estimates of ${U_{37}}^k$ using the means of the sampled ${\mathbf{\beta }}$ coefficients.
gibbs.plot <- function(gibbs.fit, title="", plotIt=TRUE) {
  # gibbs.plot - plot posterior prediction from results of gibbs.spline.tt
  # arguments:
  #  gibbs.fit - a list returned by gibbs.spline.tt
  #   b - sample matrix from gibbs.spline.tt
  #   tau2 - samples of variance from gibbs.spline.tt
  #   knots - knots used by gibbs.spline.tt
  #   SST, UK37 - data used to generate b and tau2
  #
  # returns:
  #  a list
  #   UK37_predict - the posterior prediction of UK37
  #   B - the basis matrix
  require(splines)
  # get B(X) - the basis matrix at each X position.
  B <- bs(gibbs.fit$SST, knots=gibbs.fit$knots)
  Y <- B %*% t(t(colMeans(gibbs.fit$b)))  # posterior mean prediction
  if(plotIt) {
    require(ggplot2)
    df <- data.frame(SST=gibbs.fit$SST, UK37=gibbs.fit$UK37, predicted=Y)
    g <- ggplot(data=df, aes(x=SST, y=UK37)) +
      geom_point(aes(col="UK37")) +
      geom_point(aes(y=predicted, col="predicted")) +
      scale_color_manual(labels = c("Posterior Prediction", "UK37"),
                         values = c("red", "black")) +
      guides(color=guide_legend(title="Data")) +
      ggtitle(title)
    print(g)
  }
  return(list(UK37_predict=Y,
              B=B))
}
Here's the result.
This figure corresponds well with figure 6a in the Tierney and Tingley paper.
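The posterior samples also carry pointwise uncertainty that the mean prediction hides. Here's a minimal sketch of a 95% credible band, assuming a fit object returned by gibbs.spline.tt() (the function and argument names here are mine):

# 95% pointwise credible band for the predicted UK37 curve.
# 'fit' is assumed to be the list returned by gibbs.spline.tt().
credible.band <- function(fit, probs = c(0.025, 0.975)) {
  B <- splines::bs(fit$SST, knots = fit$knots)
  draws <- fit$b %*% t(B)   # one row per posterior draw of B(SST) %*% beta
  apply(draws, 2, quantile, probs = probs)  # 2 x N matrix of band limits
}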
Are Our Assumptions Justified?
In all of this, we have assumed that ${U_{37}}^k$ is normally distributed around the spline prediction. If that assumption holds, the residuals from the posterior prediction should be normally distributed. This code will produce a qqnorm plot of the residuals.
qqplot.data.gibbs <- function(gibbs.fit, UK37, main='', plotIt=TRUE) {
  # generate a qqnorm plot of the residuals from the posterior prediction
  # arguments:
  #  gibbs.fit - a list returned by gibbs.spline.tt
  #   b - sample matrix from gibbs.spline.tt
  #   tau2 - samples of variance from gibbs.spline.tt
  #   knots - knots used by gibbs.spline.tt
  #   SST, UK37 - data used to generate b and tau2
  #  UK37 - predicted posterior UK37 from gibbs.plot
  #  main - optional title
  #  plotIt - display plot?
  require(ggplot2)
  vec <- gibbs.fit$UK37 - UK37
  # following four lines from base R's qqline()
  y <- quantile(vec[!is.na(vec)], c(0.25, 0.75))
  x <- qnorm(c(0.25, 0.75))
  slope <- diff(y)/diff(x)
  int <- y[1L] - slope * x[1L]
  d <- data.frame(UK37=gibbs.fit$UK37, SST=gibbs.fit$SST)
  d <- cbind(d, setNames(qqnorm(vec, plot.it=FALSE), c("Theoretical", "Sample")))
  g <- ggplot(d) +
    geom_point(aes(x=Theoretical, y=Sample)) +
    geom_abline(slope=slope, intercept=int) +
    ggtitle(main)
  if (plotIt) {
    print(g)
  }
  return(g)
}
Here's the qqplot.
If the residuals were perfectly normal, the points would fall along the straight line. It looks like the residuals are long-tailed. Perhaps the assumption of normality needs to be examined further.
Putting it all together, here's how to run the code in R:
> tt.fit <- gibbs.spline.tt(df.tt, knots=c(23.6))
> tt.predicts <- gibbs.plot(tt.fit)
> qqplot.data.gibbs(tt.fit, tt.predicts$UK37_predict)
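As a quick follow-up on the normality question, base R's shapiro.test() (which accepts at most 5000 observations) can be applied to the residuals. A sketch using the objects above:

# Formal normality test of the posterior residuals.
resid <- as.vector(tt.fit$UK37 - tt.predicts$UK37_predict)
shapiro.test(resid)  # a small p-value argues against normality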
All of this code is available here. You can get the data from here.