Suppose you have a box that generates numbers. Inside the box is some mechanism that produces a number whenever you ask for one: you give the box an x value and it spits out a y. You can't open the box, and you don't have a description of the method used to generate the numbers. You would like to model what is going on inside the box. In particular, you would like to find the optimum (maximum or minimum) of the possible output, and you would like to get an idea of the internal generating function.
Finding the optimum and an approximation to the function isn't hard if it's cheap to generate a number from the box. You could generate 10,000 or a million x values in some range and plot the y values. Pick the maximum as the optimum, and the plot gives you a picture of the generating function.
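A minimal sketch of that brute-force approach, assuming the box is exposed as a hypothetical function box() (not defined in this post) that takes a vector of x values and returns the corresponding y values:

# Brute-force search: sample densely, plot, and take the best point.
# box() is a stand-in for the black box.
X <- runif(10000)        # 10,000 random inputs in [0, 1]
y <- box(X)              # one (cheap) evaluation per input
plot(X, y, pch = '.')    # the scatter sketches the generating function
X[which.max(y)]          # x value of the (approximate) maximum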
What if generating those y values is expensive? For example, suppose that instead of querying a box, we had to go into a lab and run an experiment to get each value. We would like to find an approximate optimum and function plot at as small a cost as possible. This is where Bayesian optimization shines.
Bayesian optimization is a probabilistic global optimization strategy. It doesn't assume convexity or any particular functional form, so it can be used to optimize an unknown (or at least minimally known) function with multiple local optima. It is widely used in machine learning to tune hyperparameters.
Bayesian optimization requires two basic components: a surrogate model and an acquisition function. The surrogate is an approximate model of the black box function that is designed to be cheap to sample. The acquisition function selects the next point to sample. The two alternate in a loop, sketched below.
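Here is a minimal sketch of that loop, using names (objective, acquisition, GP_fit from the GPfit package) that are defined or loaded later in this post; n_iter is an illustrative placeholder:

# The basic Bayesian optimization loop (sketch; the pieces are built below)
for (i in 1:n_iter) {
    model <- GP_fit(X, y)              # refit the surrogate to all data so far
    x_next <- acquisition(X, model)    # pick the most promising next sample
    y_next <- objective(x_next)        # one expensive black box evaluation
    X <- rbind(X, x_next)              # add the new point to the data
    y <- c(y, y_next)
}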
To make this more concrete, let's suppose we sample a small number of x values at random from a uniform distribution and get the corresponding y values from the box. Our goal is to find an approximate probability distribution for the black box function f.
In other words, we want
\[P(f|D) = \frac{P(D|f)\,P(f)}{P(D)}\]

where D is the data sampled so far and f is the unknown generating function.

Objective Function
# Black box function with noise
#' objective - generate data for optimization
#'
#' @param x - an n x 1 matrix of floats
#' @param noise_level - sd of Gaussian noise
#'
#' @returns a vector of noisy objective values
objective <- function(x, noise_level = 0.1) {
    x <- as.numeric(x)
    noise <- rnorm(length(x), mean = 0, sd = noise_level)
    return(sin(12 * pi * x) / (1 + x) + 2 * cos(7 * pi * x) * x^5 + noise)
}
> X <- seq(0, 1, length.out = 100)
> y <- objective(X)
> y_exact <- objective(X, noise_level = 0)
> df_opt <- data.frame(x = X, y = y, y_exact = y_exact)
> ggplot(df_opt) +
+     geom_line(aes(x = x, y = y_exact)) +
+     geom_point(aes(x = x, y = y)) +
+     ggtitle('Objective Function')
We see that there are multiple local optima with a maximum somewhere near x = 0.9.
Surrogate Function
# Surrogate prediction function
#' surrogate - approximate the objective, P(f | data)
#'
#' @param model - a GP model
#' @param X - an n x 1 matrix
#'
#' @returns a list of the predicted y values and their standard deviations
surrogate <- function(model, X) {
    pred <- predict(model, X)    # predict from the GPfit model
    return(list(mean = pred$Y_hat, sd = sqrt(pred$MSE)))
}
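To see the surrogate in action, here's a small usage sketch: fit a GP to a handful of random points (GP_fit() comes from the GPfit package) and query the surrogate on a fresh grid. The names X0, y0, and grid are illustrative, not part of the original code:

library(GPfit)

set.seed(42)
X0 <- matrix(runif(5), ncol = 1)    # a few initial samples
y0 <- objective(X0)
model <- GP_fit(X0, y0)             # fit the GP surrogate

grid <- matrix(seq(0, 1, length.out = 10), ncol = 1)
pred <- surrogate(model, grid)
head(data.frame(x = grid, mean = pred$mean, sd = pred$sd))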
The Acquisition Function
# Acquisition function: Probability of Improvement
#' acquisition - select the next x value to sample, given the current data.
#' This function calculates the probability of improvement (PI) of candidate
#' samples and returns the most promising one.
#'
#' @param X - an n x 1 matrix of points sampled so far
#' @param model - the current GP model
#'
#' @returns the candidate x value with the highest probability of improvement
acquisition <- function(X, model) {
    Xsamples <- matrix(runif(100), ncol = 1)      # random candidate points
    yhat <- surrogate(model, X)$mean
    best <- max(yhat)                             # best so far
    pred <- surrogate(model, Xsamples)            # mean y and sd of the predictions
    mu <- pred$mean
    std <- pred$sd
    probs <- pnorm((mu - best) / (std + 1e-9))    # 1e-9 to avoid divide by 0
    ix <- which.max(probs)
    return(Xsamples[ix, 1])
}
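Probability of improvement is only one choice of acquisition function. A common alternative is expected improvement (EI), which also weighs how large the improvement is likely to be, not just how likely it is. Here's a hedged sketch of an EI version with the same interface; acquisition_ei is an illustrative name, not part of the original code:

# Acquisition function: Expected Improvement (sketch)
acquisition_ei <- function(X, model) {
    Xsamples <- matrix(runif(100), ncol = 1)    # random candidate points
    best <- max(surrogate(model, X)$mean)       # best surrogate mean so far
    pred <- surrogate(model, Xsamples)
    mu <- pred$mean
    std <- pred$sd + 1e-9                       # guard against divide by 0
    z <- (mu - best) / std
    ei <- (mu - best) * pnorm(z) + std * dnorm(z)   # closed-form EI for a Gaussian
    return(Xsamples[which.max(ei), 1])
}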
Putting It All Together
# Load libraries
library(GPfit)
library(ggplot2)

# Black box function with noise
#' objective - generate data for optimization
#'
#' @param x - an n x 1 matrix of floats
#' @param noise_level - sd of Gaussian noise
#'
#' @returns a vector of noisy objective values
objective <- function(x, noise_level = 0.1) {
    x <- as.numeric(x)
    noise <- rnorm(length(x), mean = 0, sd = noise_level)
    return(sin(12 * pi * x) / (1 + x) + 2 * cos(7 * pi * x) * x^5 + noise)
}

# Surrogate prediction function
#' surrogate - approximate the objective, P(f | data)
#'
#' @param model - a GP model
#' @param X - an n x 1 matrix
#'
#' @returns a list of the predicted y values and their standard deviations
surrogate <- function(model, X) {
    pred <- predict(model, X)
    return(list(mean = pred$Y_hat, sd = sqrt(pred$MSE)))
}

# Acquisition function: Probability of Improvement
#' acquisition - select the next x value to sample, given the current data.
#' This function calculates the probability of improvement (PI) of candidate
#' samples and returns the most promising one.
#'
#' @param X - an n x 1 matrix of points sampled so far
#' @param model - the current GP model
#'
#' @returns the candidate x value with the highest probability of improvement
acquisition <- function(X, model) {
    Xsamples <- matrix(runif(100), ncol = 1)      # random candidate points
    yhat <- surrogate(model, X)$mean
    best <- max(yhat)                             # best so far
    pred <- surrogate(model, Xsamples)            # mean y and sd of the predictions
    mu <- pred$mean
    std <- pred$sd
    probs <- pnorm((mu - best) / (std + 1e-9))    # 1e-9 to avoid divide by 0
    ix <- which.max(probs)
    return(Xsamples[ix, 1])
}

# Plotting function
#' plot_model - plot the current model
#' The objective is plotted without noise for comparison
#'
#' @param X - an n x 1 matrix
#' @param y - current y predictions
#' @param model - GP model
#' @param title - optional title for plot
#'
#' @returns a ggplot object for printing
plot_model <- function(X, y, model, title = NULL) {
    # get the current surrogate model
    Xsamples <- seq(0, 1, length.out = 1000)
    Xsamples_matrix <- matrix(Xsamples, ncol = 1)
    preds <- surrogate(model, Xsamples_matrix)$mean

    df <- data.frame(x = as.vector(X), y = as.vector(y))    # sampled points so far
    pred_df <- data.frame(x = Xsamples, y = preds)          # predicted model
    y_obj <- objective(Xsamples, noise_level = 0)           # objective function
    df_obj <- data.frame(x = Xsamples, y = y_obj)

    p <- ggplot() +
        geom_point(data = df, aes(x = x, y = y, color = 'Samples')) +
        geom_line(data = pred_df, aes(x = x, y = y, color = 'Prediction')) +
        geom_line(data = df_obj, aes(x = x, y = y, color = 'Objective')) +
        scale_color_manual(name = 'Bayesian Optimization',
                           breaks = c('Samples', 'Prediction', 'Objective'),
                           values = c('Samples' = 'blue',
                                      'Prediction' = 'red',
                                      'Objective' = 'green'))
    if(!is.null(title)) {
        p <- p + ggtitle(title)
    }
    return(p)
}

# Initial data
# set.seed(42)
X <- matrix(runif(5), ncol = 1)
# y <- apply(X, 1, objective)
y <- objective(X)

# Fit initial GP model
model <- GP_fit(X, y)

# Initial plot
print(plot_model(X, y, model))

# Optimization loop
# plot the model every 10 iterations
for (i in 1:100) {
    x_next <- acquisition(X, model)
    y_next <- objective(x_next)

    X <- rbind(X, x_next)
    y <- c(y, y_next)
    model <- GP_fit(X, y)

    est <- surrogate(model, matrix(x_next, ncol = 1))$mean
    cat(sprintf(">x=%.3f, f()=%.6f, actual=%.3f\n", x_next, est, y_next))

    if(i %% 10 == 0) {
        p <- plot_model(X, y, model, title = paste('Iteration:', i))
        # png(paste('/mnt/c/Users/bbth/OneDrive/analytic_garden/Bayes_opt/plots/plot_', i, '.png'))
        print(p)
        # dev.off()
    }
}

# Final plot
print(plot_model(X, y, model, title = 'Final'))

# Best result
ix <- which.max(y)
cat(sprintf("Best Result: x=%.3f, y=%.3f\n", X[ix], y[ix]))