Bayesian Optimization of Black Box Functions – Part 1

Skip the explanation and go directly to the Github code.

Foreword

My goal here is to provide complete and overarching explanations and implementations of algorithms useful in Machine Learning and Artificial Intelligence Research and Development, but if you don’t care about understanding it, or already understand it, then you can view my (hopefully) well-commented code on the Github page. With that said, let’s begin!

What are Black Box Functions?

(Credit to Wikipedia)

Black box functions are very much like this picture shows, no different than a normal function, except we don’t know what the function is. So for instance, if we had a function like this:

f(x) = x^2

Then we can look at it and easily know that it’s simply going to raise any input to the power 2, for any inputs. However, if this were a black box function:

f(x) = ???

We have no idea what the operation(s) performed on the input(s) are, and therefore the function is a black box function.

What is Black Box Optimization?

Optimization of black box functions requires knowing the difference between normal functions and black box ones. Now that we know that, we can move on.

Generally in optimization, you want to find the global maximum, or global minimum (however often times a local maximum/minimum will do just fine). For instance, if we are a seller of Halloween Prop Skeletons, and we know that our sales relative to the number of skeletons produced is modeled by the function

f(x) = -(x-420)^2 + 360

Which looks like:

known_function_optimization

Really simple, right? For two-dimensional known functions, we can actually find the global maximum or minimum with complete certainty, and with this function, anyone can look at this graph and see that the company will maximize their profits by producing 420 skeletons.

Sidenote: Optimizing known functions is not as easy when it comes to functions with multiple inputs, where the dimension is greater than two (e.g. three-dimensional graphs), and uses a method known as gradient descent. However, I won’t be covering that in this post. Black Box Optimization is a problem no matter how many dimensions there are however, and Bayesian Black Box Optimization works regardless of dimension.

Black Box Optimization would be when (more realistically) the company doesn’t know an exact function for what their profit will be, so they plug in 100 values from 370-470, and end up with something like this:

unknown_function_optimization1

This time, we don’t have a known function, according to our definition of black box functions. So we could try and get a function to represent this, but the end goal is to find the maximum or minimum of the function, so we instead just go for that. Unfortunately, we can’t just look at this and conclude the best number of skeletons to produce is 420, not with complete certainty. For an example of why we can’t have complete certainty, let’s take an example function:

f(x) = ???

As shown earlier.

So we start by plugging in some values, and we then get some outputs, since that seems to be the easiest way to do things:

f(0) = 0, f(1) = 1, f(2) = 4, f(4) = 16

Hey, this looks just like our x^2 function from earlier! After all, we can look at this and see that it matches the behavior perfectly, with the points we’ve tested. But just to be sure, let’s plug in a few more, to be sure!

f(5) = 16, f(6) = 16, f(42) = 16 ...

Uh-oh, suddenly our idea that the function was x^2 has been blown out of the water. But if we had stopped testing points after our first four proved our hypothesis correct, we would have naively continued, with no idea we were completely incorrect.

One of the key problems with black box optimization is that we can never know with 100% certainty what our function’s best input is; to do so would mean testing an often-infinite number of input values. This means we have to draw the line somewhere, since we can’t test an infinite number. If we had decided that our arbitrary number of evaluations was four, and we picked these points, we would have been unaware of the true nature of this function (which by the way is actually this):

3\left(\frac{1}{1+e^{\left(-\left(200x-320\right)\right)}}\right)+\left(\frac{1}{1+e^{\left(-\left(95x-50\right)\right)}}\right)+12\left(\frac{1}{1+e^{\left(-\left(300x-950\right)\right)}}\right)

(Yes, I fooled you, but it was for your own good)

Another of the main reasons black box optimization is so tricky is the amount of time it can take to evaluate an input. If our skeleton manufacturer could only get the amount of profits once every Halloween, it would take us 100 years to get the graph we now have, at which point your body would likely be just a skeleton. These examples may seem strange, but they illustrate the pivotal problems:

Key Problems of Black Box Optimization:

1. Cost may be high – it may take a long time for every evaluation

2. Number of inputs to test may be high – there may be an enormous amount of possible inputs for our function.

Sidenote: Our second pivotal problem is often times magnified by the number of dimensions. For instance, our company may have the number of skeletons produced, number of jack o lanterns produced, and number of bags of candy corn produced as inputs, with one axis for profit.

If we only had three options and it took milliseconds to test each one, we likely wouldn’t even call it black box optimization, because the problem of black box optimization is only really prevalent when one of them is costly, e.g. long time to test and low input number could still be a problem, low time to test but large input number could also still be a problem; as well as if they are both prevalent.

 

Conventional Methods of Black Box Optimization

1. Hand – Tuning

This is simply going through and choosing our next input based on the last result or past results, done by the choice of the person tuning it.

2. Grid Search

What we did earlier with the skeleton manufacturer example is actually the method of black box optimization known as grid search. In our 2D example, we would represent the inputs we tested as:

x = [370, 371, ..., 470]

Or, in our x^2 example:

x = [0, 1, 2, 4]

With this example it’s not obvious why it would be called grid search, after all it’s just a row. But when we have multiple dimensions for inputs (as is often the case in black box optimization) , such as with:

f(x_1, x_2) = x_1^2 + x_2^2 \\  x_1 = [1, 2, 3] \\  x_2 = [5, 6]

And when we get all the possible configurations of inputs, it becomes like this:

\begin{bmatrix} (1, 5) & (1, 6) \\  (2, 5) & (2, 6) \\  (3, 5) & (3, 6) \\  \end{bmatrix}

As you can see, this looks much more like a grid, which is where the search type gets its name. So grid search is when we generate all the possible combinations of inputs, given a level of detail.

Level of detail – How precise we want our measurement, in the case of all numbers 1 – 10, if we had our level of detail = 10, we’d get 10 elements total (1, 2, 3, 4.., 10). We could have a lower number (e.g. 2 = 1, 10), or a higher number (e.g. 100 = 0.1, 0.2, 0.3, …, 10.0). 

It’s quite possibly the simplest form of black box optimization, other than just hand-tuning. However, there are a few inherent problems with it:

Problems with Grid Search

1. The search space quickly expands – The size of our “grid” is equal to getting the product of all our independent number of inputs. e.g. If we have a level of detail of 100, and have five different inputs, we end up with: 100 * 100 * 100 * 100 * 100 = 100^5 = 10^{10} = 10,000,000,000. That’s 10 billion different configurations, and it’s not an unrealistic scenario. I have this exact situation with one of my reinforcement learning bots, and I’d like to have more inputs or higher level of detail.

2. It’s not intuitive – Since our search method has no predictive ability, we aren’t gaining any knowledge of the problem. In order to do so, we are just picking the configuration that gave the best results.

3. Random Search 

This is very similar to grid search, except instead of searching through every combination in a grid, we randomly choose from each of our domains to make a random combination of inputs, test these, and repeat the number of times we specify.

This sounds crazy, but often times it works really well, because of the idea that some of our parameters have higher impact on the result than other parameters, which is almost always the case. Here’s a diagram showing this exact case and why Random Search often does better than grid search in such problems (Credit to James Bergstra and Yoshua Bengio’s paper on the topic)

scikitlearn8

This is actually quite nice, but since it randomly searches(as a non-deterministic algorithm) we can’t run the same program and get similar results every time.

While random search does a really good job most of the time, I personally don’t like it. The reason for my dislike is that I’d prefer a method that would give the equivalent results of a good random search run, and not have such a massive amount of randomness in it. We’d like one where we will get more or less the same results every run (a deterministic algorithm).

Thankfully, there are many such algorithms that achieve this, albeit at the cost of being much more complicated than hand-tuning, grid search, or random search. For now, I will be covering Bayesian Optimization in part two of this post series.

PART TWO

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s