# Bayesian Optimization of Black Box Functions – Part 3

## Gaussian Process Regression

I realize at this point you’re probably thinking “Why all the background information, just get to the method already!” But the good news is that Gaussian Process Regression is the majority of the Bayesian Optimization method, and once it is understood, the entire method follows shortly after.

Sidenote: I won’t be explaining everything about Gaussian Processes here, because even though they are really cool, you don’t need the entirety to properly understand Bayesian Optimization. However, if you do want to learn more about then, then I recommend watching these helpful videos, or these helpful videos from UBC. If you want some code that uses both constrained and unconstrained Gaussian Processes, as well as some of my initial implementations of Bayesian Optimization, look here.

Suppose you have a black box function, where we’ve already evaluated three points. Better yet, here:

*In Bayesian Optimization we often randomly pick two points to start with.

For this example, we want to predict the output at x=4. Looking at the already evaluated points, we can use the pattern recognition software in our brain to make some pretty good predictions as to where the output will lie. If you’re like me, you probably would put it right here:

However, we aren’t certain that’s where x=4 will evaluate to. After all, this is a prediction. So we’d like an element of uncertainty to add to this, something we can tune to get ranges like this:

This is where Gaussian Process Regression comes in. It assumes that the black box function we are predicting is a Gaussian Process, which means it is continuous, doesn’t have any sharp edges, and overall tends to look like these:

Now, back to our prediction of where the test input, x=4, will evaluate. Using Gaussian Process Regression, we end up with predictions that are quite similar to what we would expect, and this turns out to work very well! We can relate the prediction of our center point to Gaussian Distributions by thinking of it as the mean, and the given room for error on top and bottom as the mean with the variance added or subtracted. With this in mind, all that we need is a way to get from our other known inputs and outputs to a mean and variance for a new/test input.

We can do this using some linear algebra and our idea of covariance from earlier. In order to complete the equations for them however, we will need four elements:

### 1. Get a Covariance Matrix for our already-evaluated inputs

We’ve already seen a matrix when covering grid search, but here’s another example:

$\begin{bmatrix} 1 & 2 & 4\\ 2 & 1 & 3\\ 4 & 3 & 1 \\ \end{bmatrix}$

Which actually has an interesting property: If we “transpose” this matrix, like in this example,

(Credit to Wikimedia)

we end up with the same exact matrix. This is because it’s symmetric along the diagonal, which is why this variety of matrix is known as a symmetric matrix. In addition, you can see from the gif that we notate this operation with $A^T$ from the original matrix $A$

We’re going to need a matrix for this step. This is where our idea of covariance comes into play. While I explained covariance as if it was only relevant for the inputs directly next to a target input (as in with our octopus arm), we can in fact calculate the covariance of any two inputs, no matter how far apart.

If we do this for every possible pair of inputs on our graph, we can get what is known as a covariance matrix. With a covariance matrix, we use a matrix so that we can look at the coordinates of a value in the matrix to know the two inputs between which the covariance was calculated. So, if we’re doing this on our already evaluated inputs, we’d expect a matrix like so:

$\begin{bmatrix} (1, 1) & (1, 2) & (1, 6) \\ (2, 1) & (2, 2) & (2, 6) \\ (6, 1) & (6, 2) & (6, 6) \\ \end{bmatrix}$

*Coordinates added for clarity

So, we have our pairs of inputs. Now for computing the covariance! For this we’ll be using a covariance function, a function that computes the covariance between two inputs (These are also sometimes referred to as kernels). One of the most common is the following:

$cov(x_1, x_2) = e^{\left( \frac{-1}{2\ell^2} * \frac{||x_1 - x_2||^2}{1} \right) }\ or\ e^{\left( \frac{-||x_1 - x_2||^2}{2\ell^2} \right) }$

I realize this is a bit complex, so let’s break it down:

• $||x_1 - x_2||^2$ – The distance between two points on a 2D graph is $d = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}$ (If you think about this in terms of the pythagorean theorem, it makes a lot more sense). Since we know that our $x_1, x_2$ variables are vectors of coordinates (They would be vectors of length 1, even if not multivariate), we are doing the equivalent of subtracting each element from its counterpart by subtracting the entire vector from the other $x_1 - x_2$. From there, we just have to square each element, get the sum of all the results, and then get the square root. Luckily, our vector normalization $||v||$ actually does just this, e.g.: $\\ ||(2, 1, 3)|| = \sqrt{2^2 + 1^2 + 3^2} \\$ By doing this, we are able to get the distance between two inputs, regardless of the dimension. Since we square at the end, we actually get rid of our square root operation.
• $\frac{-1}{2\ell^2}$ – This could have been $\frac{-1}{4\ell^2}$ just as easy, however 2 tends to work quite well in practice. We have $\ell^2$ as an easy way to change the amount of covariance our function generates, to change how wiggly or bendy our generated points will be (I usually set this to 1).
• $e^{-...}$ – Since everything we are raising $e$ to the power of will always be $\geq 0$ (distance can’t be negative, just equal to 0 at the lowest. This is why $\ell$ is squared, so that we can’t have a negative value for it), we add a negative to the front so that we will always get a value from 0 to 1 as the output (view the graph of $e^{-x} \geq 0$ if you don’t believe me).

With all of this combined, we end up with a function that gets the distance, where the higher the distance, the lower the value we are raising $e$ to; therefore, the closer our covariance for these two inputs is to 0 – the less they care about each other. The inverse is also true, the closer they are, the closer our covariance for these two inputs is to 1 – the more they care about each other.

This function is one of the most common and basic covariance functions, so let’s move on to update the idea of our covariance matrix for our known inputs:

$K = \begin{bmatrix} cov(1, 1) & cov(1, 2) & cov(1, 6) \\ cov(2, 1) & cov(2, 2) & cov(2, 6) \\ cov(6, 1) & cov(6, 2) & cov(6, 6) \\ \end{bmatrix}$

And here is where some of the concepts I presented earlier regarding our example symmetric matrix come into play. Because of the way covariance functions work, this matrix will have several properties:

• It will be symmetric, e.g. $(2-1)^2 = (1-2)^2$
• It will have the same values along the diagonal, e.g. $6-6 = 2-2$
• Our matrix will be positive semi-definite (for more about this, look here.). This is much more complicated, and you don’t need to worry about what this means unless you start inventing your own covariance functions.

So now we have a covariance matrix for our already-known evaluations.

### 2. Get a Covariance Vector with our already-known inputs and test input

Now that you already know how to get a matrix, this step is easy, it’s just

$K_* = \begin{bmatrix} cov(1, 4) \\ cov(2, 4) \\ cov(6, 4) \\ \end{bmatrix}$

We denote as $K_*$ because our test input can also be written as $x_*$, with the subscript $*$.

### 3. Transpose $K_*$

Simple:

$K_*^T = \begin{bmatrix} cov(1, 4) \\ cov(2, 4) \\ cov(6, 4) \\ \end{bmatrix}^T = \begin{bmatrix} cov(1, 4) & cov(2, 4) & cov(6, 4) \\ \end{bmatrix}$

Sidenote: Most math libraries have these matrix/vector operations already built-in.

### 4. Get a Covariance Scalar with our test input

Scalars are just one dimensional vectors, so this is:

$K_{**} = \begin {bmatrix} cov(4, 4) \\ \end {bmatrix}$

There we go, done!

Now that our legendary quest for the four elements necessary to compute the test mean ($\mu_*$) and test variance ($\sigma_*^2$) of our test input ($x_* = 4$) is complete, let’s assemble what we’ve got.

$\begin{bmatrix} \begin{bmatrix} cov(1, 1) & cov(1, 2) & cov(1, 6) \\ cov(2, 1) & cov(2, 2) & cov(2, 6) \\ cov(6, 1) & cov(6, 2) & cov(6, 6) \end{bmatrix}& \begin{bmatrix} cov(1, 4) \\ cov(2, 4) \\ cov(6, 4) \end{bmatrix} \\ \begin{bmatrix} cov(1, 4) & cov(2, 4) & cov(6, 4) \\ \end{bmatrix} & \begin {bmatrix} cov(4, 4) \\ \end {bmatrix} \end {bmatrix}$

Which is equivalent to:

$\begin{bmatrix} K & K_* \\ K_*^T & K_{**} \\ \end{bmatrix}$

And since we’re assuming our evaluations are drawn from Gaussian Processes, as I explained back in the definition of this optimization method, we can actually write our known evaluations and predicted evaluation data as being drawn from the distribution specified by a mean and variance:

$\begin{bmatrix} f \\ f_* \end{bmatrix} \sim \mathcal{N}( \begin{bmatrix}\mu \\ \mu_* \end{bmatrix} ,\, \begin{bmatrix} K & K_* \\ K_*^T & K_{**} \\ \end{bmatrix})$

DON’T PANIC!

This formula is not as bad as it looks. There’s just a lot going on, so let’s break it down, left to right:

$\begin{bmatrix} f \\ f_* \end{bmatrix}$

This is a vector where the top element contains the evaluations of our known black box function inputs, and the bottom element is actually not known yet. Using this formula we will be getting our mean $\mu_*$ and variance $\sigma_*^2$ for this bottom element, our predicted evaluation.

$\mathcal{N}( \begin{bmatrix}\mu \\ \mu_* \end{bmatrix} ,\, \begin{bmatrix} K & K_* \\ K_*^T & K_{**} \\ \end{bmatrix})$

This is just saying that whatever is on the left of the ~, they are drawn from a distribution with the specified mean and variance. With this in mind, we see that we have

$\begin{bmatrix}\mu \\ \mu_* \end{bmatrix}$

where the mean should be. Since we are drawing two things from our distribution, our vector of known and test evaluations, we need two means to go with them.

Fortunately, we can just leave the mean as 0 for both of them. It will also make things a lot easier for calculating the full, test mean later. As for variance, we see that we have this:

$\begin{bmatrix} K & K_* \\ K_*^T & K_{**} \\ \end{bmatrix}$

Where the top left corner element is for our known evaluations, since it’s the covariance of the known inputs, and everything else is both or exclusively the test input, in the case of the bottom right corner.

It may seem pointless to have this formula at first, but because we have it set up exactly like this, we can make use of some advanced statistical equations that state this:

$For\ two\ points\ drawn\ from\ a\ multivariate\ distribution: \\$

$\mu_* = \mu_* + K_*^T K^{-1} (f-\mu) \\ \sigma_* = K_{**} - K_*^T K^{-1} K_*$

Note: the $\mu_*$ on the right hand side of the above equation is the mean which is going to be 0, which we already know. It is not the test mean we are trying to solve for, as that would not make much sense.

And substituting in the values of our equation, remembering that our means are 0:

$\mu_* = K_*^T K^{-1} f \\ \sigma_* = K_{**} - K_*^T K^{-1} K_*$

Sidenote: $K^{-1}$ is similar to getting the reciprocal of a number, except for matrices. If you want to know more about this, I suggest here.

So now we’ve got an equation for our mean and variance, as well as all the variables necessary! Using all of this, our values work out to:

$\mu_* = 1.11417128 \\ \sigma_*^2 = 0.95541772$

Which if we graph, comes to be:

Where our bars above and below the point are when we add and subtract the variance to the mean, respectively. If we tune some of the parameters of our covariance function, and repeat our mean / variance calculations across the entire domain, we end up with this:

And that’s all there is to it! The only thing we have left to do to get to full Bayesian Optimization is to add an acquisition function. An acquisition function is a function that balances mean and variance to choose the next point to evaluate, from graphs like the one above.

## Final Step: Bayesian Optimization

I use the acquisition function known as upper confidence bound. In the case we are maximizing, like in this example, we get a value for every point in our domain that we’ve computed the mean and variance for via the following function:

$= \mu_* + \kappa \sigma_*^2$

(I use $\kappa = 1.5$ )

And then we choose the index of the highest value in the resulting array/vector:

$= argmax(\mu_* + \kappa \sigma_*^2)$

Then, repeat until you have used all the evaluations you want!

Note: $\kappa$ is referred to as the confidence interval for upper confidence bound.

Another note: This is the form for a maximization problem (e.g. accuracy), however the standard deviation term would be negative for a minimization problem (e.g. cost).

## Final Note

That’s it for this series on Bayesian Optimization. Below, you can find an appendix containing some examples of the power of Bayesian Optimization, covariance and acquisition functions, as well as links to all the resources I drew on to learn all of this. If you look at my code on the Github, you will see that I started by doing this with simple python and numpy, looping through each test input in the domain to get means and variances. I then slowly upgraded as much as I could to theano (a machine learning library), and as a result changed many of my loops to vector and matrix calculations. I recommend looking back into the commit history or in the experimentation repository for the earlier versions. I also didn’t use some of the more complicated covariance and acquisition functions from the beginning.