Bayesian Optimization of Black Box Functions – Part 2

Skip the explanation and go directly to the GitHub code.

Earlier, we went over what black box optimization is, some of its inherent problems, and some conventional methods for tackling it. In this post, we’ll start building up to Gaussian Processes, a fundamental part of Bayesian Optimization. If you aren’t familiar with black box optimization yet, I strongly recommend reading the previous post in this series.

Gaussian Distributions

Bayesian Optimization gets the “Bayesian” in its name because it keeps a probabilistic model of the objective function and updates that model with Bayes’ rule as new evaluations come in, and the model we’ll build is made out of Gaussian Distributions. So first we have to know what Gaussian Distributions are. If you’ve ever taken an IQ test, the results are reported in this form, also known as a “Bell Curve” or “Normal Distribution”:

[Figure: the bell-curve distribution of IQ scores]

(Credit to Wikipedia)

As you can tell from the graph, IQ scores are on the x-axis, and the percentage of people who fall into each score range is on the y-axis. Because the average IQ score is 100, the distribution is centered on 100. We can also see that there is an equal chance of someone scoring higher than 145 (0.1%) or lower than 55 (0.1%). These distributions are extremely useful for representing data, as well as for Bayesian statistics, which is where our method of optimization lies.

There are a few things we can learn about Distributions from this graph. First off, we can see that the average score, or mean, is 100. In mathematical notation, this is referred to as

\mu = 100

We can also see that the score intervals in the graph are 15 points wide, and that the farther an interval is from the mean, the lower the chance that someone scores inside it (within one interval on the right side = 34.1%, within one interval on either side = 34.1 * 2 = 68.2%, and so on). This interval width is known as the standard deviation of our distribution, notated as

\sigma = 15

Once we know these two variables, we can generate any distribution we want. You can actually do just that using arbitrary values with this calculator.
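
If you’d rather check the percentages from the IQ example in code instead of a calculator, here’s a minimal sketch using scipy.stats.norm (this assumes you have scipy installed):

from scipy.stats import norm

# The IQ example: mean 100, standard deviation 15
iq = norm(loc=100, scale=15)

# Chance of scoring within one interval of the mean (85 to 115)
print(iq.cdf(115) - iq.cdf(85))  # ~0.682, the 68.2% from above

# Chance of scoring above 145 (three intervals above the mean)
print(1 - iq.cdf(145))  # ~0.001, the 0.1% from above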

While the standard deviation \sigma is really easy to understand when interpreting distributions, it’s actually not what we plug in when generating them computationally, or in much of the math that follows. Instead, we use something called the variance, which measures how much our samples spread out around the mean. Simple, right? Fortunately I don’t have to break my promise that we only need two variables, because the variance is actually:

\sigma ^2

Yep, it’s just the standard deviation squared. So when you plug a standard deviation into the aforementioned calculator, the calculator quietly squares it to get the variance, since the standard deviation doesn’t actually appear anywhere in the function for a distribution. You don’t need to understand the entire derivation of that function in order to understand distributions, but if you were wondering, the formula is as follows:

Given\ a\ mean\ \mu\ and\ variance\ \sigma ^2:

f(x | \mu, \sigma ^2 ) = \frac{e^{-\frac{(x-\mu)^2}{2\sigma^2}}}{\sqrt{2\sigma ^2 \pi}}

Don’t worry about how this formula is derived. I’ve put it up so you can see that we don’t use the standard deviation, just the variance and the mean. Here’s the Wikipedia page in case you want to look it up anyway.

Sidenote: since our variance is \sigma^2 , it follows by simple algebra that our standard deviation \sigma = \sqrt{\sigma^2}  is just the square root of the variance.
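
If you’d like to sanity-check the density formula yourself, here’s a quick (and entirely optional) comparison against scipy’s built-in implementation; the numbers used are arbitrary:

import numpy as np
from scipy.stats import norm

def gaussian_pdf(x, mean, variance):
    # The formula above: only the mean and the variance appear
    return np.exp(-(x - mean) ** 2 / (2 * variance)) / np.sqrt(2 * variance * np.pi)

x, mean, variance = 112.0, 100.0, 15.0 ** 2
print(gaussian_pdf(x, mean, variance))
# scipy takes the standard deviation (scale), so we pass the square root of the variance;
# both lines should print the same number (~0.0193)
print(norm.pdf(x, loc=mean, scale=np.sqrt(variance)))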

The term normal distribution will come up later, so it’s important to keep the terminology straight, and luckily it’s quite simple. Strictly speaking, “normal” and “Gaussian” are synonyms, but the distinction I’ll rely on is the following:

Standard Normal Distribution – Has mean 0 and variance 1, and no other values for these. (It should be noted that the standard deviation is also 1, because \sqrt{1^2} = 1 .) This is often just called “the normal distribution”.

Gaussian Distribution – Can have any mean and variance, which makes the Standard Normal Distribution a particular Gaussian Distribution.

Here are some graphs showing Gaussian Distributions with different values for the mean and variance:

[Figure: Gaussian distributions with different values for the mean and variance]

 

From these we can easily see the intuitive effect that changing the mean and variance has on each curve. If you want the code to try them yourself, here you go:

import numpy as np

import matplotlib
import matplotlib.pyplot as plt

# Make the plot labels large enough to read
font = {'size': 32}
matplotlib.rc('font', **font)

# A fine grid of x values to evaluate each distribution on
x = np.arange(-5, 5, .01)

# The Gaussian density from above: note that it takes the variance, not the standard deviation
dist = lambda x, mean, variance: (np.exp((-(x-mean)**2)/(2*variance)))/np.sqrt(2*variance*np.pi)

plt.plot(x, dist(x, 0, 0.2), label="mu=0.0, variance=0.2")
plt.plot(x, dist(x, 0, 1.0), label="mu=0.0, variance=1.0")
plt.plot(x, dist(x, 0, 5.0), label="mu=0.0, variance=5.0")
plt.plot(x, dist(x, 3, 1.0), label="mu=3.0, variance=1.0")

plt.xlabel("X-Axis")
plt.ylabel("Y-Axis")

plt.legend(bbox_to_anchor=(1, 1), loc=1, borderaxespad=0.)
plt.axis([-5, 5, 0, 1])
plt.show()

Again, this is optional. Now there are only two more concepts to cover before we can move on to Gaussian Process Regression, which makes up the majority of Bayesian Optimization.

Last Two Concepts:

1. Covariance

Covariance is a measure of how much two values change together. We saw earlier, with our graphs of distributions with different variances, how a higher variance spreads samples farther from the mean (and vice versa). Covariance applies a similar idea to pairs of inputs on a graph: it tells us how strongly the values at two different inputs move together. For the covariance functions typically used in Gaussian Process Regression, it is scaled to lie between 0 and 1.

It is easier to imagine with an example. Let’s say we have a graph represented by an octopus arm, where each input on the graph is a joint in the arm. If two neighboring joints have a high covariance, close to 1, their positions move together, so the arm is stiff and barely bends between them. If they have a low covariance, close to 0, each joint hardly cares about the position of its neighbors, so the arm is loose and bendy.

In other words, a covariance close to 1 means each joint pays close attention to the joints to its left and right, so it bends less; a covariance close to 0 means each joint does its own thing regardless of its neighbors, so it bends more.

An example of a real-life object whose points have a covariance close to 1 would be a pencil, since they (usually) don’t bend at all. One whose points have a covariance close to 0 would be a USB cable, since they bend a lot.

If we take the covariance of a single point with itself, we actually get the variance of the distribution at that point. More on this in part three.
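
To make this concrete with numbers, here’s a minimal sketch of one common covariance function used with Gaussian Processes, the squared-exponential kernel. We haven’t introduced it yet, so treat the function and its length_scale parameter as a preview of the idea rather than the exact choice we’ll make later in this series:

import numpy as np

def squared_exponential(x1, x2, length_scale=1.0):
    # Covariance between two inputs: 1 when they coincide,
    # decaying toward 0 as they move farther apart
    return np.exp(-(x1 - x2) ** 2 / (2 * length_scale ** 2))

print(squared_exponential(0.0, 0.0))  # 1.0  -> a joint compared with itself
print(squared_exponential(0.0, 0.5))  # ~0.88 -> nearby joints move together
print(squared_exponential(0.0, 3.0))  # ~0.01 -> distant joints barely influence each other

Here length_scale plays the role of the arm’s stiffness: the larger it is, the farther apart two inputs can be while still having a covariance near 1.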

2. Multi-Variable Functions

In the upcoming section I’m going to throw the words multi-variable / multivariate around, and in my previous post I talked about how you may have more than one input to a function. This is easy to think about when we only have two inputs to the function, as it just gives a graph like:

[Figure: surface plot of a function with two inputs]

(Credit to Wikimedia)

(This is known as a three-dimensional graph, by the way.) However, we immediately run into a wall when we have more than two inputs to a function. How are we supposed to visualize a four-dimensional graph? A thirteen-dimensional one?

It’s easy – we don’t!  (I’m sorry if you spent time trying, by the way)

Since our human brains are more or less limited to visualizing in three dimensions, there are a couple of fortunate tricks we can use to think about higher-dimensional graphs with ease (notice I said think about, not visualize). One of these is the idea of vectors as coordinates. Let’s say we have a vector, like this:

[3, 4]

We could also think of this as the coordinates (3, 4) on a 2D graph, the value 3 on the x-axis, and the value 4 on the y-axis. We can think of it like this as well:

[x_1, x_2]

Where we’ve just renamed x to x_1 , and y to x_2 . What about this vector?

[3, 4, -1, 2, 1]

or,

\begin{bmatrix}  x_1 & x_2 & x_3 & x_4 & x_5  \end{bmatrix}

These are the coordinates of a point on a 5D graph: the first four entries are the inputs and the fifth is the output value. And what’s great is we can work with this without having to picture a 5D graph, whatever that would look like.
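
As a tiny illustration, here’s a made-up function of four inputs (chosen only so that its output comes out to 1, matching the vector above) evaluated at those inputs, with the result appended to form a point on the 5D graph:

import numpy as np

def f(x):
    # A made-up four-input function, purely for illustration
    return x[0] * x[2] + x[1]

inputs = np.array([3.0, 4.0, -1.0, 2.0])  # [x1, x2, x3, x4]
point = np.append(inputs, f(inputs))      # [x1, x2, x3, x4, f(x)]
print(point)                              # [ 3.  4. -1.  2.  1.]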

In summary, the trick to visualizing graphs with more than three dimensions is to not visualize them at all, but to think about them in terms of their number of dimensions and vectors as coordinates within them.

Finally, there will be times when thinking about the entire graph is helpful. In those cases we can usually get away with picturing it in two or three dimensions: most of the time a Gaussian Distribution in five dimensions (or 13, or 69) looks and behaves like one in two or three dimensions, just with more inputs.

Now that we’ve covered all that, we will start on Gaussian Process Regression and Bayesian Optimization in the final post of this series. Thanks for reading!

PART THREE

 
