Gaussian Processes

The Gaussian Distribution

Gaussian distributions, better known as normal distributions, are important in all types of science. The terms normal and Gaussian distribution are used interchangeably. They are often used to represent real-valued variables whose underlying distributions are not known. The central limit theorem says that, in most situations, when independent random variables are added, their sum tends toward a normal distribution. If you don't know anything else, a normal distribution is often the default guess for the underlying distribution.

The Gaussian distribution for a single variable is defined

\[f(x|\mu ,{\sigma ^2}) = \frac{1}{{\sqrt {2\pi {\sigma ^2}} }}{e^{ - \frac{{{{(x - \mu )}^2}}}{{2{\sigma ^2}}}}}\]
where $ \mu $ is the mean of the distribution and $\sigma $ is the standard deviation. ${\sigma ^2}$ is the variance. The distribution is typically written $X \sim N(\mu ,{\sigma ^2})$.

Plotting the normal distribution shows the familiar bell curve. The standard normal distribution is a normal distribution with $\mu=0$ and $\sigma=1$.
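
For example, something like this (a minimal sketch using scipy.stats.norm) plots the standard normal density:

import numpy as np
import matplotlib.pylab as plt
from scipy.stats import norm

x = np.linspace(-4, 4, 200)
plt.plot(x, norm.pdf(x, loc=0, scale=1))  # standard normal: mu = 0, sigma = 1
plt.title('Standard Normal Distribution')
plt.xlabel('x')
plt.ylabel('f(x)')
plt.show()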


Multivariate Gaussian Distribution

We can generalize the one dimensional normal distribution to higher dimensions. Such a distribution is called a multivariate normal or multivariate Gaussian. The single dimension normal distribution describes the distribution of a single random variable; the multivariate normal describes the distribution of a random vector.

The multivariate normal distribution for a k-dimensional vector ${\mathbf{x}} = \left( {{x_1},{x_2}, \ldots ,{x_k}} \right)$ is given by

\[f({\mathbf{x}}|{\mathbf{\mu }},{\mathbf{\Sigma }}) = \frac{1}{{\sqrt {{{(2\pi )}^k}\left| {\mathbf{\Sigma }} \right|} }}{e^{ - \frac{1}{2}{{({\mathbf{x}} - {\mathbf{\mu }})}^T}{{\mathbf{\Sigma }}^{ - 1}}({\mathbf{x}} - {\mathbf{\mu }})}}\]
${\mathbf{\mu }} = {\mathbf{E}}[{\mathbf{x}}] = (E[{x_1}],E[{x_2}], \ldots ,E[{x_k}])$ is the mean vector of ${\mathbf{x}}$. ${\mathbf{\Sigma }}$ is called the covariance matrix, ${\mathbf{\Sigma }} = {\mathbf{E}}[({\mathbf{x}} - {\mathbf{\mu }}){({\mathbf{x}} - {\mathbf{\mu }})^T}] = {[\operatorname{cov} ({x_i},{x_j})]_{1 \leqslant i,j \leqslant k}}$. The covariance matrix is positive definite. ${\left| {\mathbf{\Sigma }} \right|}$ is the determinant of the covariance matrix, ${{\mathbf{\Sigma }}^{ - 1}}$ is its inverse, and $T$ indicates a vector/matrix transpose. The covariance matrix represents how the random variables vary with one another.

The plot below shows a multivariate normal with parameters ${\mu _x} = 0;{\mu _y} = 0;\Sigma = \left[ {\begin{array}{*{20}{c}} 1&0 \\ 0&1 \end{array}} \right]$.
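
A plot like that can be reproduced with scipy.stats.multivariate_normal; here's a quick sketch:

import numpy as np
import matplotlib.pylab as plt
from scipy.stats import multivariate_normal

# evaluate the bivariate density on a grid, with mu = [0, 0] and Sigma = I
x, y = np.mgrid[-3:3:0.05, -3:3:0.05]
pos = np.dstack((x, y))
rv = multivariate_normal(mean=[0, 0], cov=[[1, 0], [0, 1]])

plt.contourf(x, y, rv.pdf(pos))
plt.title('Bivariate Normal, mu = 0, Sigma = I')
plt.xlabel('x')
plt.ylabel('y')
plt.show()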



We write the expression above as ${\mathbf{X}}\sim{\mathbf{N}}({\mathbf{\mu }},{\mathbf{\Sigma }})$.

It's possible to derive the conditional distribution for the full multivariate Gaussian, but for the purposes of illustration we'll only show the bivariate (two-variable) case here.

\[p(x,y) = N\left( {\left[ {\begin{array}{*{20}{c}} {{\mu _x}} \\ {{\mu _y}} \end{array}} \right],\left[ {\begin{array}{*{20}{c}} {{\Sigma _x}}&{{\Sigma _{xy}}} \\ {\Sigma _{xy}^T}&{{\Sigma _y}} \end{array}} \right]} \right)\]

The bivariate conditional distribution is given by $p(x|y) = N({\mu _x} + {\Sigma _{xy}}\Sigma _y^{ - 1}(y - {\mu _y}),{\Sigma _x} - {\Sigma _{xy}}\Sigma _y^{ - 1}\Sigma _{xy}^T)$.

We can marginalize over one of the variables to get a normal distribution of a single variable.
\[p(x) = \int {p(x,y)\,dy}  = N({\mu _x},{\Sigma _x})\]
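
To make the conditioning and marginalization formulas concrete, here's a small numerical sketch with made-up parameters:

# made-up parameters for a bivariate normal
mu_x, mu_y = 1.0, -1.0
S_x, S_y, S_xy = 2.0, 1.0, 0.8   # variances and the covariance

y_obs = 0.5                      # suppose we observe y = 0.5

# conditional: p(x|y) = N(mu_x + S_xy S_y^-1 (y - mu_y), S_x - S_xy S_y^-1 S_xy)
cond_mean = mu_x + S_xy / S_y * (y_obs - mu_y)   # 2.2
cond_var  = S_x - S_xy / S_y * S_xy              # 1.36

# marginal: p(x) = N(mu_x, S_x) = N(1.0, 2.0)
print(cond_mean, cond_var)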

Gaussian Processes

Often, when we start to build a statistical model we have some idea about the relationships among the variables. For example, we may assume a linear relationship and fit a linear model with ordinary least squares. Things get more complicated when trying to characterize non-linear relationships. It's possible to fit a non-linear function such as a sinusoid or an exponential if you have a good idea of the relationship among the variables. Otherwise, it quickly becomes a complicated model selection problem.

An alternative is to use what is called a Bayesian non-parametric approach that directly models the underlying function. That is what Gaussian processes do. Strictly speaking, the non-parametric label is a misnomer, because the approach still uses probability distributions which have parameters; this isn't really important for our purposes. The point is that there are effectively infinitely many parameters, because the space of functions is infinite.

I'm writing this to try and get at least a vague understanding of Gaussian processes and how to implement them. Some useful resources are the Gaussian Processes Web Site, Luca Ambrogioni's Python notebook, and especially the book Gaussian Processes for Machine Learning by Rasmussen and Williams. Many of the implementation details below come from Chris Fonnesbeck's excellent description Fitting Gaussian Process Models in Python. Another good source is Kevin Murphy's book Machine Learning: A Probabilistic Perspective.

There are many implementations of Gaussian processes for Python, R, and MATLAB. For Python, there are scikit-learn, GPflow, PyMC3, and others. For R, there are GPfit, gptk, and many others. The MATLAB Statistics and Machine Learning Toolbox has functions for Gaussian process regression. I will concentrate on scikit-learn. I have had some problems installing GPflow on Windows Subsystem for Linux; maybe when I get them straightened out, I'll write more.

What is a Gaussian process? 

Gaussian processes are used in regression and classification. In the geosciences, they are called kriging and are commonly used for regression in areas such as hydrology, mining, remote sensing etc. As a classification method, they can be used for tasks such as handwritten character recognition.  I find it a little easier to approach them by thinking about regression.

In standard linear regression, we model the data with a linear equation
\[y\sim\alpha x + \beta \].
In this case, we want to estimate the slope, $\alpha$, and the intercept, $\beta$. We assume Gaussian priors for both $\alpha$ and $\beta$ and calculate the posterior distribution of the parameters.

For non-linear regression, we extend the regression form to
\[y \sim f(x)\]
where f(x) is some function of the data. We want to estimate f(x). Typically, we do this by assuming some particular parametric form for f, and estimate the parameters.
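
For example, if we had reason to believe the relationship was sinusoidal, we could estimate its parameters with non-linear least squares. Here's a rough sketch on synthetic data using scipy.optimize.curve_fit:

import numpy as np
from scipy.optimize import curve_fit

# assumed parametric form: f(x) = a * sin(b * x)
def f(x, a, b):
    return a * np.sin(b * x)

# synthetic noisy data generated from that form with a = 2.0, b = 1.5
rng = np.random.default_rng(0)
x = np.linspace(-5, 5, 50)
y = 2.0 * np.sin(1.5 * x) + rng.normal(scale=0.3, size=x.size)

# with a reasonable starting point, the fit should recover values near a = 2.0, b = 1.5
popt, pcov = curve_fit(f, x, y, p0=[1.5, 1.4])
print(popt)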

In some cases, particularly in machine learning, discovering a useful parametric function can be difficult and, in general, may impose too much of a restriction on the model. Instead, we can assign a prior distribution on the space of functions f(x). In general, that's complicated. However, if we assume a generalized multivariate normal distribution for the prior distribution over functions, things become a bit simpler. That's what a Gaussian process is.

A Gaussian process assumes that, for any finite set of points, the joint distribution $p(f({x_1}),f({x_2}), \ldots ,f({x_N}))$ is Gaussian.

Similar to how we describe a multivariate normal distribution over the data ${\mathbf{X}}\sim{\mathbf{N}}({\mathbf{\mu }},{\mathbf{\Sigma }})$, in a Gaussian process, we say that the function f(x) follows a prior distribution with a mean function, m(x), and covariance function, k(x, x').

\[f(x) \sim GP(m(x),k(x,x'))\]

The mean function determines the mean of f(x) at each data point, and $k(x,x')$ describes the covariance between values of f(x). The covariance function $k(x,x')$ is called a kernel. We will use the  terms interchangeably. It is common to set $m(x) = 0$. Since this is a generalized multivariate normal distribution, we can marginalize over any variables we don't need.

Here's a typical specification of a Gaussian process
\[m(x) = 0\]
\[k(x,x') = {\theta _1}{e^{ - \frac{{{\theta _2}}}{2}\left\| {x - x'} \right\|_2^2}}\]

The covariance function is called a squared exponential. It has two parameters. $\theta_1$ is the output variance. It determines the average distance of f(x) from its mean; it's basically a scale factor. $\theta_2$ is the inverse length scale. It determines the amount of wiggle in the function. x and x' represent two different data points, and $k(x,x')$ is calculated over all pairs of data points. The squared exponential kernel is also called the radial basis function kernel within the machine learning community.

Here's one way to calculate the squared exponential kernel.

import numpy as np

def kernel(x, y, params):
    # squared exponential; the reshapes broadcast to an (x.size, y.size) matrix of pairwise differences
    return params[0] * np.exp( -0.5 * params[1] * (x.reshape(x.size, 1) - y.reshape(1, y.size))**2)
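
As a quick sanity check (assuming the function above is in scope), the result for two length-50 inputs is a 50 x 50 matrix of pairwise covariances:

import numpy as np

x = np.linspace(-5, 5, 50)
K = kernel(x, x, [1, 2])
print(K.shape)                      # (50, 50)
print(np.allclose(np.diag(K), 1))   # k(x, x) = theta_1 = 1 on the diagonal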


There are many other kernels listed here. Choosing the proper covariance function is very important for effective machine learning. The kernel encodes our assumptions about the similarity between data points; i.e. that points with similar $\mathbf{x}$ values should have similar y values.

Gaussian Process Regression

How does all of this help us model the data? We would like to find a function that gives a somewhat adequate fit to the data. A Gaussian process defines a prior over the function space. Once we have some data, we can use Bayes rule to obtain a posterior distribution over the functions. Sounds hard since we have an infinite space of possible functions. However, we actually only need to define the function over a finite set of points ${\mathbf{x_1}}...{\mathbf{x_n}}$. We assume that the joint distribution over the functions $f({\mathbf{x_1}}...{\mathbf{x_n}})$ is normal (Gaussian). The distribution has a mean $\mu ({\mathbf{x}})$ and a covariance ${\Sigma _{ij}} = K({{\mathbf{x}}_i},{{\mathbf{x}}_j})$. The covariance is a kernel as defined above.

The kernel defines how the points co-vary, i.e. how they relate to one another. We expect the function output to be related at those points.

We have a prior that is defined as a multivariate Gaussian.

\[f({\mathbf{x}}) \sim GP(\mu ({\mathbf{x}}),K({\mathbf{x}},{\mathbf{x'}}))\]

where $\mu ({\mathbf{x}})$ is the expected value of the function and K is the covariance kernel function.

Given a finite set of data, we expect the prior probability distribution of f to be Gaussian

\[p(f|{\mathbf{X}}) = N({\mathbf{\mu }},{\mathbf{K}})\]

Since the space of functions is so flexible, it's common to set $\mu ({\mathbf{x}}) = 0$.

To make this a little more concrete, let's sample some functions from the prior. With the kernel defined above, we can sample using numpy's random.multivariate_normal method.

import numpy as np
import matplotlib.pylab as plt
import gp # gp.py contains kernel and conditional functions

xtest=np.linspace(-5, 5, 50).reshape(-1,1)
params = [1, 2]

K = gp.kernel(xtest, xtest, params)
# assume mu(x) = 0
f = np.random.multivariate_normal(np.zeros(K.shape[0]), K, size=3)

plt.plot(xtest,f.T)
plt.title('Functions Sampled From GP Prior')  
plt.xlabel('x')
plt.ylabel('f')
plt.show()

Predictions

Let's assume we have some noise-free observations at various points. We'll call these points our training set. These come from some underlying process we don't observe, except for these points. We would like to predict the function values at other points.
We will split the Gaussian process into two parts: the observed values of the function, f, and the values ${{f_*}}$ that we want to predict at test points ${{X_*}}$.

\[\left( {\begin{array}{*{20}{c}}   f \\   {{f_*}} \end{array}} \right) \sim N\left( {\left( {\begin{array}{*{20}{c}}   \mu  \\   {{\mu _*}} \end{array}} \right),\left( {\begin{array}{*{20}{c}}   K&{{K_*}} \\   {K_*^T}&{{K_{**}}} \end{array}} \right)} \right) = N\left( {\left( {\begin{array}{*{20}{c}}   \mu  \\   {{\mu _*}} \end{array}} \right),\left( {\begin{array}{*{20}{c}}   {K(X,X)}&{K(X,{X_*})} \\   {K{{(X,{X_*})}^T}}&{K({X_*},{X_*})} \end{array}} \right)} \right)\]

Similar to the bivariate conditional distribution above, we can calculate the conditional distribution of ${{f_*}}$.

 \[\begin{gathered}   p({f_*}|{X_*},X,f) = N({\mu _*},{\Sigma _*}) \\   {\mu _*} = \mu ({X_*}) + K_*^T{K^{ - 1}}(f - \mu (X)) \\   {\Sigma _*} = {K_{**}} - K_*^T{K^{ - 1}}{K_*} \\ \end{gathered} \]

 See Rasmussen and Williams chapter 2 if you want the details.

As an example, let the underlying process be a sine function. We'll choose some training points and sample from the conditional distribution. First, we need some code to calculate $\mu$ and $\Sigma$.

def conditional(x_new, x_org, y_org, params, noise = 0):
    # covariance of the training points, with optional noise on the diagonal
    K = kernel(x_org, x_org, params) + np.eye(len(x_org)) * noise
    # covariance between training and test points, and among the test points
    K_star = kernel(x_org, x_new, params)
    K_starstar = kernel(x_new, x_new, params)
    temp = (K_star.T).dot(np.linalg.inv(K))   # K_*^T K^{-1}
    mu = temp.dot(y_org)                      # K_*^T K^{-1} f
    sigma = K_starstar - temp.dot(K_star)     # K_** - K_*^T K^{-1} K_*
    return mu, sigma

Ignore the noise argument for now. Note that we are assuming $\mu ({\mathbf{x}}) = 0$. This is not the most efficient way to calculate these values, but it follows the math to make explicit what we're doing. For a more efficient method using Cholesky decomposition see Kat Bailey's blog.
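
For reference, a rough sketch of that Cholesky-based approach (following Rasmussen and Williams, algorithm 2.1; it assumes the kernel function above is in scope and is not the version used below) might look like this:

import numpy as np

def conditional_chol(x_new, x_org, y_org, params, noise=0):
    # same inputs and outputs as conditional() above, but K is factored instead of inverted
    K = kernel(x_org, x_org, params) + np.eye(len(x_org)) * noise
    K_star = kernel(x_org, x_new, params)
    K_starstar = kernel(x_new, x_new, params)
    L = np.linalg.cholesky(K)                                 # K = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_org))   # alpha = K^{-1} f
    v = np.linalg.solve(L, K_star)                            # v = L^{-1} K_*
    mu = K_star.T.dot(alpha)                                  # K_*^T K^{-1} f
    sigma = K_starstar - v.T.dot(v)                           # K_** - K_*^T K^{-1} K_*
    return mu, sigma

(scipy.linalg.solve_triangular would exploit the triangular structure of L; np.linalg.solve keeps the dependencies down.)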

To sample the test values, we use random.multivariate_normal as before.

# training points
xtrain = np.array([-4, -3, -2, -1, 1]).reshape(5,1)
ytrain = np.sin(xtrain)

# get mu and sigma from the conditional and sample some function values
mu3, sigma3 = gp.conditional(xtest, xtrain, ytrain, params)
f3 = np.random.multivariate_normal(mu3.squeeze(), sigma3, size=3)
sd = np.std(f3, axis=0) # get the std of the functions

fig, ax = plt.subplots()
ax.plot(xtest, mu3, 'r-.', label='function mean')
ax.plot(xtrain, ytrain, 'bo', label='training points')
ax.plot(xtest, f3.T )
plt.title('Functions Sampled From Conditional GP')  
plt.xlabel('x')
plt.ylabel('f')
plt.gca().fill_between(xtest.flat, mu3.squeeze()-2*sd, mu3.squeeze()+2*sd, color="#dddddd")
ax.legend(loc='upper left')
plt.show()


The shaded area is $\pm 2$ standard deviations from the function mean. Notice how the functions are squeezed to fit the training points exactly and how the uncertainty increases as we move further from the training points. The red dashed line shows the estimated function mean.

What happens when we add some noise to the underlying function?
\[y = f({\mathbf{x}}) + \varepsilon \]
where $\varepsilon  \sim {\rm N}(0,\sigma _y^2)$. That is, the noise is normally distributed with standard deviation $\sigma_y$.

The covariance of the noisy data becomes
\[{{\mathbf{K}}_{\mathbf{y}}} = {\mathbf{K}} + \sigma _y^2{\mathbf{I}}\]

We can sample from the conditional distribution as we did before with $K_y$ replacing $K$.

mu3, sigma3 = gp.conditional(xtest, xtrain, ytrain, params, noise=0.1)
f3 = np.random.multivariate_normal(mu3.squeeze(), sigma3, size=3)

Noise is a hyperparameter.


The fit is still good, but the functions are no longer squeezed to the same degree.

scikit-learn

Scikit-learn is a major machine learning library for Python. It contains a large number of learning algorithms for regression and classification, including Gaussian processes. To illustrate its use, we'll use some data from Fitting Gaussian Process Models in Python.

%run get_data.py
plt.plot(x, y, 'o')
plt.show()


Scikit-learn provides a number of kernels. A common choice is the Matérn kernel. It is a generalization of the squared exponential kernel. It has an additional parameter that controls the smoothness of the target function. The kernel is defined

\[k({x_i},{x_j}) = {\sigma ^2}\frac{1}{{\Gamma (\nu ){2^{\nu  - 1}}}}{\left( {\sqrt {2\nu } \,d\left( {\frac{{{x_i}}}{l},\frac{{{x_j}}}{l}} \right)} \right)^\nu }{K_\nu }\left( {\sqrt {2\nu } \,d\left( {\frac{{{x_i}}}{l},\frac{{{x_j}}}{l}} \right)} \right)\]

where $\nu$ is the smoothness parameter, $\Gamma$ is the gamma function, $K_\nu$ is a modified Bessel function, and $l$ is the length scale. ${d\left( {\frac{{{x_i}}}{l},\frac{{{x_j}}}{l}} \right)}$ is the distance between $x_i$ and $x_j$ scaled by the length scale. As $\nu  \to \infty$, the kernel approaches the squared exponential kernel.
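
To get a feel for $\nu$, note that scikit-learn kernel objects are callable and return a kernel matrix, so a quick comparison might look like this:

import numpy as np
from sklearn.gaussian_process.kernels import Matern, RBF

X = np.linspace(0, 1, 5).reshape(-1, 1)
for nu in [0.5, 1.5, 2.5]:
    K = Matern(length_scale=1.0, nu=nu)(X)        # kernel matrix for this nu
    print(nu, K[0, 1])
print('rbf', RBF(length_scale=1.0)(X)[0, 1])      # the squared exponential limit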

With scikit-learn, we can combine kernels easily. It's common to include a constant kernel and in the case of noisy data,  a white noise kernel. Kernels are combined with the '+' operator. The sum of two kernels is a kernel.

from sklearn import gaussian_process
from sklearn.gaussian_process.kernels import Matern, WhiteKernel, ConstantKernel
kernel = ConstantKernel() + Matern(length_scale=2, nu=3/2) + WhiteKernel(noise_level=1)

We create a regressor object with the kernel and then call its fit method.

gauss = gaussian_process.GaussianProcessRegressor(kernel=kernel)
gauss.fit(x.reshape(-1,1),y) # fit expects a 2d array for x

fit returns the fitted regressor; printing it shows the configuration used to fit the data.

GaussianProcessRegressor(alpha=1e-10, copy_X_train=True,
             kernel=1**2 + Matern(length_scale=2, nu=1.5) + WhiteKernel(noise_level=1),
             n_restarts_optimizer=0, normalize_y=False,
             optimizer='fmin_l_bfgs_b', random_state=None)

The predict method generates the predicted output for a new set of x values; with return_std=True it returns the predicted mean and standard deviation.

x_pred = np.linspace(-6, 6).reshape(-1,1)
y_pred, sigma = gauss.predict(x_pred, return_std=True)

plt.figure(figsize=(10,8))
plt.plot(x, y, 'o')
plt.plot(x_pred, y_pred, 'r-.', label='predicted mu')
plt.fill(np.concatenate([x_pred, x_pred[::-1]]),
         np.concatenate([y_pred - 2*sigma,
                        (y_pred + 2*sigma)[::-1]]),
         alpha=.5, fc='grey', ec='None', label='+/- 2*std')
plt.xlabel('x')
plt.ylabel('f(x)')
plt.xlim(-6, 6)
plt.ylim(-3, 3)
plt.legend(loc='lower left');
plt.show()

Parameters

As you may have noticed, kernels have parameters. In addition, noisy data will have a standard deviation of the noise. These parameters are one reason it's a bit of a misnomer to say that a Gaussian process is a non-parametric Bayesian approach. How do we determine the parameters? There's always guessing, but unless you have some insight into the underlying source of the data, guessing isn't likely to be effective. You could also try a grid of values and compare loss functions for each set of parameters. This is effective, but slow. A more effective approach is empirical Bayes: maximize the log marginal likelihood of the data using a non-linear optimization method.

In the scikit-learn example above, the fit method uses the L-BFGS-B optimization algorithm to find values for the parameters. These are reported by the kernel_ attribute.

gauss.kernel_
0.00316**2 + Matern(length_scale=1.11, nu=1.5) + WhiteKernel(noise_level=0.0912)
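
The fitted model also exposes the optimized log marginal likelihood, and n_restarts_optimizer can be raised to restart the optimizer from several random initializations to reduce the chance of landing in a poor local optimum. Continuing the example above:

print(gauss.log_marginal_likelihood_value_)   # log marginal likelihood at the fitted hyperparameters

# refit, restarting the optimizer from 10 random starting points
gauss = gaussian_process.GaussianProcessRegressor(kernel=kernel, n_restarts_optimizer=10)
gauss.fit(x.reshape(-1, 1), y)
print(gauss.kernel_)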


In addition, the choice of kernel is very important. The choice among models with differing hyperparameters and kernels is a problem in model selection. We won't go into the details here. See Rasmussen and Williams chapter 5 for more information.

Complexity

One drawback with Gaussian processes is that they scale poorly with increasing data size. Solving for the posterior distribution requires $O({N^3})$ operations, because it involves inverting (or factorizing) the $N \times N$ covariance matrix.

You can download the code here.
