Bayesian Optimization

Suppose you have a box that generates numbers. Inside the box is some mechanism that produces a number whenever you ask for one: you give the box an x value and it spits out a y. You can't open the box, and you don't have a description of the method used to generate the numbers. You would like to model what is going on inside the box. In particular, you would like to find the optimum (maximum or minimum) of the possible output, and you would like to get an idea of the internal generating function.

It's not hard to find the optimum and an approximation to the function if generating a number from the box is cheap. You could generate 10,000 or a million values in some range and plot the y values. Pick the maximum as the optimum, and the plot gives you a picture of the generating function.
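As a minimal sketch of that brute-force approach (assuming the box takes inputs in [0, 1], and using the objective() function defined later in this post as a stand-in for the box):

# Brute force: sample a dense grid, take the largest y, and plot the values
X_grid <- seq(0, 1, length.out = 10000)
y_grid <- objective(X_grid)          # one call to the box per grid point
X_grid[which.max(y_grid)]            # approximate location of the maximum
plot(X_grid, y_grid, type = "l")     # picture of the generating function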

What if generating those y values is expensive? For example, suppose that instead of a box, we had to go into a lab and run an experiment to get each value. We would like to find an approximate optimum and function plot at as small a cost as possible. This is where Bayesian optimization shines.

Bayesian optimization is a probabilistic global optimization strategy. It doesn't assume convexity or any particular functional form. It can be used to optimize an unknown (or at least minimally known) function with multiple local optima. It's used in AI to optimize hyperparameters.

Bayesian optimization requires two basic procedures: a surrogate model and an acquisition function. The surrogate is an approximate model of the black box function. It is designed to be sampled efficiently. The acquisition function is a method of selecting the next sample.

To make this more concrete, let's suppose we sample a small number of x values at random from a uniform distribution and extract some y values from the box. Our goal is to find an approximate probability distribution for the black box function f.

In other words, we want

\[P(f|D) = \frac{P(D|f)\,P(f)}{P(D)}\]

$D$ is the data we have seen so far.

We can simplify a bit by eliminating the normalization $P(D)$.

\[P(f|D) \propto P(D|f)P(f)\]

$P(f|D)$ is our current estimate of the probability distribution of f given the data seen so far.

Objective Function

There are a number of articles and packages for Bayesian optimization in R. For example, here, here, and here. The code that follows was inspired by this excellent blog post. The code from that post is written in Python. What follows is a recasting of the approach into R.

Here is our black box function. noise_level is the standard deviation of the Gaussian noise. We will assume that the domain of the function is [0, 1].

# Black box function with noise
#' objective - generate data for optimization
#'
#' @param x - an n x 1 matrix (or vector) of floats
#' @param noise_level - sd of the Gaussian noise
#'
#' @returns a vector of noisy values of the black box function
objective <- function(x, noise_level = 0.1) {
  x <- as.numeric(x)
  noise <- rnorm(length(x), mean = 0, sd = noise_level)
  return(sin(12 * pi * x) / (1 + x) + 2 * cos(7 * pi * x) * x^5 + noise)
}

We can plot the objective function and some mildly noisy data.

> X <- seq(0, 1, length.out = 100)
> y <- objective(X)
> y_exact <- objective(X, noise_level = 0)
> df_opt <- data.frame(x = X, y = y, y_exact = y_exact)
> ggplot(df_opt) + geom_line(aes(x = x, y = y_exact)) + geom_point(aes(x = x, y = y)) + ggtitle('Objective Function')

We see that there are multiple local optima with a maximum somewhere near x = 0.9.

Surrogate Function

The surrogate is an approximation to the objective function. It summarizes $P(f|D)$. The two most common methods used to approximate the objective are a Gaussian process and a random forest. For our example, we will use a Gaussian process from the R package GPfit.

Given a Gaussian process model and some data X, the surrogate is simple. It uses the model to predict new y values. 

# Surrogate prediction function
#' surrogate - approximate the objective, P(f | data)
#'
#' @param model - a GP model
#' @param X - an n x 1 matrix
#'
#' @returns a list of the predicted y values and standard deviation
surrogate <- function(model, X) {
  pred <- predict(model, X)   # predict from GPfit
  return(list(mean=pred$Y_hat, sd=sqrt(pred$MSE)))
}
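As a quick sanity check of the surrogate (assuming GPfit is loaded and the objective function above is defined), we can fit a GP to a handful of random points and predict on a coarse grid:

# Quick check of the surrogate (illustrative only)
X0 <- matrix(runif(5), ncol = 1)
y0 <- objective(X0)
m0 <- GP_fit(X0, y0)
s0 <- surrogate(m0, matrix(seq(0, 1, length.out = 10), ncol = 1))
round(s0$mean, 3)   # predicted means on the grid
round(s0$sd, 3)     # predictive standard deviations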

The Acquisition Function

The acquisition function selects the next x value to sample by scoring candidate points with the surrogate. This can be done in a number of ways. For example, we could use a simple random sample. In practice there are three common methods: probability of improvement (PI), expected improvement (EI), and the upper confidence bound (UCB). PI is not as efficient as the other two, but it's easy to calculate. We will use PI.

\[PI = \Phi\left(\frac{\mu - \mu_{best}}{\sigma}\right)\]

$\Phi$ is the normal cumulative distribution function, $\mu$ is the mean of the surrogate at the candidate sample, $\mu_{best}$ is the surrogate mean at the best sample so far, and $\sigma$ is the surrogate's standard deviation at the candidate.
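As a quick worked example (the numbers are made up for illustration): if the surrogate predicts $\mu = 1.2$ and $\sigma = 0.3$ at a candidate point and the best surrogate mean so far is $\mu_{best} = 1.0$, then

> pnorm((1.2 - 1.0) / 0.3)   # probability of improvement, roughly 0.75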

# Acquisition function: Probability of Improvement
#' acquisition - calculate the probability that a candidate sample is worth
#' evaluating given the current data.
#' This function scores 100 random candidates by their probability of
#' improvement (PI) and returns the most promising one.
#'
#' @param X - an n x 1 matrix of the points sampled so far
#' @param model - the current GP model
#'
#' @returns the candidate x value with the highest PI
acquisition <- function(X, model) {
  Xsamples <- matrix(runif(100), ncol=1)   # random candidate points in [0, 1]
  
  yhat <- surrogate(model, X)$mean
  best <- max(yhat)  # best surrogate mean so far
  
  pred <- surrogate(model, Xsamples) # mean y and sd of the candidate predictions
  mu <- pred$mean
  std <- pred$sd
  probs <- pnorm((mu - best) / (std + 1e-9))  # 1e-9 to avoid divide by 0
  
  ix <- which.max(probs)
  return(Xsamples[ix, 1])
}
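PI only asks how likely a candidate is to beat the current best. Expected improvement (EI), mentioned above, also weighs how large the improvement is expected to be. Here is a sketch of what an EI version of the same acquisition function might look like (not part of the original post, but a drop-in replacement under the same assumptions):

# Expected Improvement acquisition (a sketch; drop-in alternative to the PI version above)
acquisition_ei <- function(X, model) {
  Xsamples <- matrix(runif(100), ncol = 1)   # random candidate points in [0, 1]
  best <- max(surrogate(model, X)$mean)      # best surrogate mean so far
  pred <- surrogate(model, Xsamples)
  z <- (pred$mean - best) / (pred$sd + 1e-9)
  ei <- (pred$mean - best) * pnorm(z) + pred$sd * dnorm(z)   # standard EI formula
  return(Xsamples[which.max(ei), 1])
}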

Putting It All Together

To perform the optimization, we generate a small set of samples at random. We then loop for a fixed number of iterations, using the acquisition function to choose a new point, adding it to the data, and refitting the model. Finally, we use the surrogate to estimate the function mean at the new point.

In addition, we will calculate the objective for comparison and plot the fit so far.

Here's the whole thing.

# Load libraries
library(GPfit)
library(ggplot2)

# Black box function with noise
#' objective - generate data for optimization
#'
#' @param x - an n x 1 matrix (or vector) of floats
#' @param noise_level - sd of the Gaussian noise
#'
#' @returns a vector of noisy values of the black box function
objective <- function(x, noise_level = 0.1) {
  x <- as.numeric(x)
  noise <- rnorm(length(x), mean = 0, sd = noise_level)
  return(sin(12 * pi * x) / (1 + x) + 2 * cos(7 * pi * x) * x^5 + noise)
}

# Surrogate prediction function
#' surrogate - approximate the objective, P(f | data)
#'
#' @param model - a GP model
#' @param X - an n x 1 matrix
#'
#' @returns a list of the predicted y values and standard deviation
surrogate <- function(model, X) {
  pred <- predict(model, X)
  return(list(mean=pred$Y_hat, sd=sqrt(pred$MSE)))
}

# Acquisition function: Probability of Improvement
#' acquisition - calculate the probability that a candidate sample is worth
#' evaluating given the current data.
#' This function scores 100 random candidates by their probability of
#' improvement (PI) and returns the most promising one.
#'
#' @param X - an n x 1 matrix of the points sampled so far
#' @param model - the current GP model
#'
#' @returns the candidate x value with the highest PI
acquisition <- function(X, model) {
  Xsamples <- matrix(runif(100), ncol=1)   # random candidate points in [0, 1]
  
  yhat <- surrogate(model, X)$mean
  best <- max(yhat)  # best surrogate mean so far
  
  pred <- surrogate(model, Xsamples) # mean y and sd of the candidate predictions
  mu <- pred$mean
  std <- pred$sd
  probs <- pnorm((mu - best) / (std + 1e-9))  # 1e-9 to avoid divide by 0
  
  ix <- which.max(probs)
  return(Xsamples[ix, 1])
}

# Plotting function
#' plot_model - plot the current model
#'  The objective is plotted without noise for comparison
#'
#' @param X - an n x 1 matrix of sampled points
#' @param y - the observed y values for X
#' @param model - GP model
#' @param title - optional title for plot
#'
#' @returns a ggplot object for printing
plot_model <- function(X, y, model, title = NULL) {
  # get the current surrogate model
  Xsamples <- seq(0, 1, length.out=1000)
  Xsamples_matrix <- matrix(Xsamples, ncol=1)
  preds <- surrogate(model, Xsamples_matrix)$mean
  
  df <- data.frame(x=as.vector(X), y=as.vector(y)) # sampled points so far
  pred_df <- data.frame(x=Xsamples, y=preds)       # predicted model
  y_obj <- objective(Xsamples, noise_level = 0)   # noise-free objective function
  df_obj <- data.frame(x = Xsamples, y = y_obj)
  
  p <- ggplot() +
        geom_point(data=df, aes(x=x, y=y, color='Samples')) +
        geom_line(data=pred_df, aes(x=x, y=y, color='Prediction')) +
        geom_line(data = df_obj, aes(x = x, y = y, color = 'Objective')) +
        scale_color_manual(name = 'Bayesian Optimization',
                           breaks = c('Samples', 'Prediction', 'Objective'),
                           values = c('Samples' = 'blue', 'Prediction' = 'red', 'Objective' = 'green'))
  if(! is.null(title)) {
    p <- p + ggtitle(title)
  }
  
  return(p)
}

# Initial data
# set.seed(42)
X <- matrix(runif(5), ncol=1)
# y <- apply(X, 1, objective)
y <- objective(X)

# Fit initial GP model
model <- GP_fit(X, y)

# Initial plot
print(plot_model(X, y, model))

# Optimization loop
# plot the model every 10 iterations
for (i in 1:100) {
  x_next <- acquisition(X, model)
  y_next <- objective(x_next)
  X <- rbind(X, x_next)
  y <- c(y, y_next)
  model <- GP_fit(X, y)
  est <- surrogate(model, matrix(x_next, ncol=1))$mean
  cat(sprintf(">x=%.3f, f()=%.6f, actual=%.3f\n", x_next, est, y_next))
  
  if(i %% 10 == 0) {
    p <- plot_model(X, y, model, title = paste('Iteration:', i))
    # png(paste('/mnt/c/Users/bbth/OneDrive/analytic_garden/Bayes_opt/plots/plot_', i, '.png'))
    print(p)
    # dev.off()
  }
}

# Final plot
print(plot_model(X, y, model, title ='Final'))

# Best result
ix <- which.max(y)
cat(sprintf("Best Result: x=%.3f, y=%.3f\n", X[ix], y[ix]))

Here are the plots. As the acquisition function adds more points, the surrogate quickly fits the function.

Here's the final plot. The predictions fit the function well, and there are a large number of samples around the optimal peak. The final optimum is x=0.864, y=1.654.
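Because the observed y values are noisy, the reported optimum includes the noise at that point. One optional check (a sketch, not part of the original approach) is to look at where the surrogate mean peaks on a dense grid:

# Optional: locate the peak of the surrogate mean (smoother than the noisy samples)
grid <- matrix(seq(0, 1, length.out = 1000), ncol = 1)
mu_grid <- surrogate(model, grid)$mean
cat(sprintf("Surrogate optimum: x=%.3f, mean=%.3f\n", grid[which.max(mu_grid), 1], max(mu_grid)))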
