A **sparse** matrix is one in which most of the elements are zero. Its opposite is a **dense** matrix, where most of the elements are nonzero. We can actually measure this, using sparsity and density:

**Sparsity**: Number of zero elements / number of elements

**Density**: Number of nonzero elements / number of elements

And since both of these are between 0 and 1, we can also see that sparsity + density = 1.
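As a quick sketch of both measures on a small, made-up example matrix:

```python
import numpy as np

# Hypothetical example: 7 of the 9 elements are zero, so it is sparse.
M = np.array([[0, 0, 3],
              [0, 5, 0],
              [0, 0, 0]])

sparsity = np.count_nonzero(M == 0) / M.size  # 7/9
density = np.count_nonzero(M) / M.size        # 2/9

print(sparsity + density)  # the two always sum to 1.0
```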

Sparse matrices come up all the time in CS and math, sometimes in scientific or engineering applications, but also in graph and network theory. For my uses, they seem most useful in graph theory as adjacency matrices. While adjacency matrices only have 1s and 0s, applications in science and engineering can have nonzero elements other than 1, so we will talk about both.

Often times, these sparse matrices are huge. When it comes to adjacency matrices, this is especially the case, since they grow quadratically in size w.r.t. the number of nodes in the graph, e.g.

*For n nodes, an adjacency matrix of the graph will be of shape (n, n). This would subsequently have n*^{2}* entries.*

This can also be thought of as:

*For an image of shape (n, m), where every pixel is a node, an adjacency matrix of the image / graph will be of shape (n * m, n * m), which would subsequently have n*^{2}* * m*^{2}* entries.*

So we see that adjacency matrices scale horribly in terms of space complexity. Because of this, it’s often beneficial (if not necessary) to take advantage of the sparse structure of sparse matrices in order to store them more efficiently. There are quite a few storage formats for the general case that greatly reduce storage cost, while still allowing efficient compression and decompression.

**Compression Algorithms**

**Special Cases **

**Diagonal**: If a matrix only has nonzero elements on its diagonal, it’s pretty obvious that you can just compress it by putting the diagonal elements in a vector, going from an *(n, n) matrix -> (n,) vector*
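For instance, a minimal sketch with numpy:

```python
import numpy as np

D = np.diag([1.0, 2.0, 3.0])  # a (3, 3) matrix, nonzero only on the diagonal
v = np.diag(D)                # compress: (3, 3) matrix -> (3,) vector
restored = np.diag(v)         # decompress: (3,) vector -> (3, 3) matrix

assert np.array_equal(D, restored)
```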

**Symmetric**: These happen as adjacency matrices of undirected graphs. Since the matrix mirrors itself across the diagonal, you only need to store one triangle of it – or, equivalently, store it as an adjacency list.

**Banded**: These are matrices where the nonzero entries are confined to a diagonal band: the main diagonal, plus zero or more diagonals on either side of it.

The applications of these, and examples of these, can be found on the relatively minimal wikipedia page: https://en.wikipedia.org/wiki/Band_matrix

But it’s pretty obvious how to compress these: you can just store each diagonal of the band as its own vector.

**General Cases**

Some of these offer efficient modification, while others offer efficient access and matrix operations. Each format tends to be good at one but not the other, but this isn’t too bad, considering in graph theory you usually won’t be modifying your graph once you’ve built it – or at least you won’t be modifying it that often in the case of a network structure.

There are more specifics on how to handle modification of graphs later in the wikipedia page: https://en.wikipedia.org/wiki/Sparse_matrix

Group 1 – Efficient Modification:

- DOK – Dictionary of Keys
- LIL – List of Lists
- COO – Coordinate List

Group 2 – Efficient access and matrix ops:

- CSR – Compressed Sparse Row
- CSC – Compressed Sparse Column

**Dictionary of Keys** – a dictionary or hashmap that maps (row, column) pairs to element values, only for the elements that are nonzero. While efficient for randomly generating sparse matrices (since you can just randomly generate new keys and values), it is inefficient for iterating through the values in lexicographical (i.e. alphabetical or numerical) order, as you’d expect from an unordered dictionary / hashmap structure.
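A minimal sketch with a plain Python dict (scipy’s `dok_matrix` implements the same idea; the example matrix is made up):

```python
dense = [[0, 0, 0],
         [5, 0, 8],
         [0, 3, 0]]

# Map (row, column) -> value, but only for the nonzero elements.
dok = {(i, j): val
       for i, row in enumerate(dense)
       for j, val in enumerate(row)
       if val != 0}

print(dok)                 # {(1, 0): 5, (1, 2): 8, (2, 1): 3}
print(dok.get((0, 0), 0))  # missing keys are implicit zeros -> 0
```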

**List of Lists** – Stores one list per row, with each entry holding the column index and value of a nonzero element.

This is efficient and inefficient in similar ways to DOK.
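A plain-Python sketch of the idea (scipy’s `lil_matrix` is the real implementation; the example matrix is made up):

```python
dense = [[0, 0, 0, 0],
         [5, 8, 0, 0],
         [0, 0, 3, 0],
         [0, 6, 0, 0]]

# One list per row, each entry a (column_index, value) pair for a nonzero element.
lil = [[(j, v) for j, v in enumerate(row) if v != 0] for row in dense]

print(lil)  # [[], [(0, 5), (1, 8)], [(2, 3)], [(1, 6)]]
```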

**Coordinate List** – Goes one step further than LIL and stores a single list of (row, column, value) tuples. Sorting by row index and then column index improves access times, and this is another format that is good for random construction.
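The same sketch in COO form, one (row, column, value) tuple per nonzero element, already sorted by row and then column:

```python
dense = [[0, 0, 0, 0],
         [5, 8, 0, 0],
         [0, 0, 3, 0],
         [0, 6, 0, 0]]

coo = [(i, j, v)
       for i, row in enumerate(dense)
       for j, v in enumerate(row)
       if v != 0]

print(coo)  # [(1, 0, 5), (1, 1, 8), (2, 2, 3), (3, 1, 6)]
```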

**Compressed Sparse Row – beginning of Group 2 Algorithms**

This one is a pain. Pretty sure this is the most complicated one. It’s similar to COO but compresses the row indices, which is why it’s named that way.

It makes up for its more complex structure than the Group 1 formats by allowing fast row access and fast matrix-vector multiplications (meaning you can multiply the matrix by a vector).

This is much easier explained by example. Take this matrix:

[0 0 0 0]
[5 8 0 0]
[0 0 3 0]
[0 6 0 0]

This always gives three vectors as the result, and the arbitrary names given to each of these tend to differ. However, it makes sense to think about it like this:

*A = name of matrix, gives the values*

*IA -> i is used for row indices, and I here is for the rows*

*JA -> j is used for col indices, and J here is for the columns*

So I like the names here. Let’s break down each of these vectors:

A = the values in the matrix, obtained by going left-right top-bottom through it.

IA = Starts off as just [0], and for each row *i* (zero-indexed),

*IA[i + 1] = IA[i] + number of nonzero elements in row i*

So we can see how for this matrix, we ended up with [0, 0, 2, 3, 4]:

[0], because we start with [0],

[0, 0], because first row has nothing,

[0, 0, 2], second row has 2 entries

[0, 0, 2, 3], third row has 1 entry

[0, 0, 2, 3, 4], fourth row has 1 entry.

A convenient side effect of this is that the last element in IA is always the number of nonzero elements in the matrix.

JA = the column indices of each element in A, in the order we encounter them. So 5 is in column 0, 8 in column 1, 3 in column 2, and 6 in column 1, giving JA = [0, 1, 2, 1].

While this looks kinda large for such a small matrix, it ends up saving tons of space the larger the matrix gets. However, if you’re super picky, for an (m, n) matrix you can find that it saves on memory when

*NNZ (number of non-zero elements) < ( m(n − 1) − 1 ) / 2*

*We can also compute the number of nonzero elements in a row i using IA[i+1] – IA[i].*
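The walkthrough above maps directly onto scipy’s `csr_matrix`, where A, IA, and JA go by the names `data`, `indptr`, and `indices`:

```python
import numpy as np
from scipy.sparse import csr_matrix

dense = np.array([[0, 0, 0, 0],
                  [5, 8, 0, 0],
                  [0, 0, 3, 0],
                  [0, 6, 0, 0]])

m = csr_matrix(dense)
print(m.data)     # A  -> [5 8 3 6]
print(m.indptr)   # IA -> [0 0 2 3 4]
print(m.indices)  # JA -> [0 1 2 1]

# Decompression gives back the original matrix.
assert np.array_equal(m.toarray(), dense)
```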

We can use this to regenerate the matrix like so:

A gives values

IA splits values into rows: [0], [5, 8], [3], [6]

JA puts values in appropriate locations: [0, 0, 0, 0], [5, 8, 0, 0], [0, 0, 3, 0], [0, 6, 0, 0]

Which is nice, because it means we can reconstruct our sparse matrix one row at a time. If we are doing a matrix-vector multiplication, it also means we can take that row, multiply it by the vector to get a scalar, store the result as an entry of the output (or as a new row in a new sparse matrix, or just a normal matrix), and then go on to the next row and repeat.

Obviously this is automated in libraries already, but now you understand how it works. This is what makes CSR efficient for matrix-vector multiplications, as opposed to the Group 1 formats, which would require a lot of iteration just to assemble each row.
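As a sketch of that row-at-a-time multiplication, here is a hand-rolled CSR matrix-vector product for the same example (real libraries do this in compiled code):

```python
import numpy as np

A = [5, 8, 3, 6]      # values
IA = [0, 0, 2, 3, 4]  # row pointer
JA = [0, 1, 2, 1]     # column indices
x = np.array([1, 2, 3, 4])

# y[i] only touches the nonzeros of row i; no dense row is ever built.
y = np.zeros(4)
for i in range(4):
    for k in range(IA[i], IA[i + 1]):
        y[i] += A[k] * x[JA[k]]

print(y)  # [0, 21, 9, 12]
```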

**Compressed Sparse Column (CSC / CCS)** – This is a similar idea to CSR (hence the name). Instead of scanning left-right, top-bottom as in CSR, we scan top-bottom, left-right. So, our example from before would now give:

I did mention the notation often differs – *values*, *row_ind*, and *col_ptr* are the common names for these three arrays in CSC.

We see that we get col_ptr the same way we got IA, except we count the nonzero elements in the columns instead of the rows. We can also see that row_ind is the same idea as JA, except it holds row indices instead of column indices. So it’s not much different.

The CSC format of a matrix A is equivalent to the CSR format of its transpose, A^{T}.
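We can check that equivalence with scipy: the three CSC arrays of A match the three CSR arrays of its transpose exactly.

```python
import numpy as np
from scipy.sparse import csc_matrix, csr_matrix

dense = np.array([[0, 0, 0, 0],
                  [5, 8, 0, 0],
                  [0, 0, 3, 0],
                  [0, 6, 0, 0]])

csc = csc_matrix(dense)
csr_t = csr_matrix(dense.T)

assert np.array_equal(csc.data, csr_t.data)
assert np.array_equal(csc.indptr, csr_t.indptr)
assert np.array_equal(csc.indices, csr_t.indices)
```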

There isn’t much gain in using one or the other; they both have about the same efficiency on arithmetic and matrix-vector ops, for the same reasons I described in the CSR section. MATLAB uses CSC by default, but scipy offers support for every format I’ve described here.

So yea, scipy supports all of these: https://docs.scipy.org/doc/scipy-0.19.0/reference/sparse.html

Good luck, have fun!

-DE


Best input found: 2, Global Maximum: 2

For this one, I used the hyperparameters for my policy gradient learner as inputs to the function. With the mini-batch size, learning rate, learning rate exponential decay rate, and discount factor, it had four different inputs to tune, with a level of detail of just 10 for each input. It ran each input three times and averaged the output to increase reliability, and each run took from one to five minutes.

I used the cartpole reinforcement learning environment provided by OpenAI, ran for 400 epochs, with 200 timesteps. I used the Matérn covariance function (credit to Jasper Snoek, Hugo Larochelle, and Ryan P. Adams from various universities for their paper in which they detail this covariance function), and the upper confidence bound acquisition function with a confidence interval of 1.5.

It used a mini batch size range of 1 to 100, learning rate range of 1 to 10, learning rate decay rate range of 0 to 10 (where the value is plugged into -x/epochs, I have found this tends to give nice exponential decay curves), and discount factor range of 0 to 1.

For this comparison of methods, I did the following for each: I averaged over six runs. The subplots represent the cost after each epoch, the average timesteps survived per mini-batch after each epoch, and the max timesteps survived per mini-batch after each epoch.

I ran my grid search alternative algorithm on it for as long as it wanted, which was in excess of 200 evaluations. With that, it obtained the following results with the final parameters:

I then ran my bayesian optimization bot on it for 20 evaluations, and obtained the following:

So we can see that in a tenth of the evaluations, Bayesian Optimization won out by a long shot, and achieved parameters good enough to beat the cartpole environment on OpenAI.

** – Used this for the cartpole test, and it has been best for me so far. Also keep in mind this is the form for a maximization problem (e.g. accuracy); the standard deviation term would be negative for a minimization problem (e.g. cost).*

** – the cumulative distribution function up to the given point.*

Resources & Sources

- Several videos from mathematicalmonk on youtube, from which I learned the first five of these covariance functions and about Unconstrained Gaussian Processes.
- A couple videos from the University of British Columbia’s Computer Science youtube lectures, from which I learned about Constrained Gaussian Processes, including the necessary covariance matrices/vectors and the Multivariate Gaussian Distribution theorems. Go here to learn all about the math behind constraining Gaussian Processes for regression.

- *Fast Bayesian Optimization of Machine Learning Hyperparameters on Large Datasets*, by Aaron Klein, Stefan Falkner, and Frank Hutter of the University of Freiburg, and by Simon Bartels and Philipp Hennig of the Max Planck Institute for Intelligent Systems. This is a good overarching paper detailing the method.
- *Practical Bayesian Optimization of Machine Learning Algorithms*, by Jasper Snoek, Hugo Larochelle, and Ryan P. Adams of the University of Toronto, Université de Sherbrooke, and Harvard University, respectively. I learned about the Matérn covariance function from this paper, though it also gives a good overarching explanation.

- *A Tutorial on Gaussian Processes*, by Zoubin Ghahramani of the University of Cambridge (UK). I learned more about different free parameters in covariance functions, such as lengthscales, from these slides, though they also give a more intermediate tutorial on Gaussian Processes.
- *Gaussian Processes*, by Daniel McDuff of MIT. Good information on covariance functions, covariance, and Gaussian Process regression again.
- *A Tutorial on Bayesian Optimization for Machine Learning*, by Ryan P. Adams of Harvard University. This gives a great visualization of different acquisition functions, how we use Gaussian Processes for Bayesian Optimization, as well as different types of covariance functions. It also speaks on a variant of the Matérn covariance function that I couldn’t get working, and integrating out Gaussian Process hyperparameters, which I didn’t implement.
- *fernando on Github’s Bayesian Optimization Visualization iPython Notebook*, who had a really nice test black box function, which I used at the top of this appendix.
- *Gaussian Random Vectors*, by Robert G. Gallager of MIT. Go here for anything and everything about Gaussian random vectors, with a lot of math (you’ve been warned).
- *Regression* and *Covariance Functions*, by Robert G. Gallager of MIT. Go here for anything and everything about covariance functions, with a lot of math (you’ve been warned).
- *Gaussian Processes: Covariance Functions and Classification*, by Carl Edward Rasmussen of the Max Planck Institute for Biological Cybernetics. This is another overarching slide run-through.
- *CSE515 Lecture Notes*, by Roman Garnett of Washington University in St. Louis. This quickly and succinctly covers the most common acquisition functions, and I learned them first from here.

*You can use any of my stuff (but it wouldn’t hurt to remember my name), and everything I’ve written in this series comes from these sources, or from my code that is openly available on my Github. Good luck, have fun. *

-Dark Element

I realize at this point you’re probably thinking “Why all the background information, just get to the method already!” But the good news is that Gaussian Process Regression is the majority of the Bayesian Optimization method, and once it is understood, the entire method follows shortly after.

*Sidenote: I won’t be explaining everything about Gaussian Processes here, because even though they are really cool, you don’t need all of it to properly understand Bayesian Optimization. However, if you do want to learn more about them, then I recommend watching these helpful videos, or these helpful videos from UBC. If you want some code that uses both constrained and unconstrained Gaussian Processes, as well as some of my initial implementations of Bayesian Optimization, look here.*

Suppose you have a black box function, where we’ve already evaluated three points. Better yet, here:

**In Bayesian Optimization we often randomly pick two points to start with.*

For this example, we want to predict the output at x=4. Looking at the already evaluated points, we can use the pattern recognition software in our brain to make some pretty good predictions as to where the output will lie. If you’re like me, you probably would put it right here:

However, we aren’t *certain* that’s where x=4 will evaluate to. After all, this is a prediction. So we’d like an element of *uncertainty* to add to this, something we can tune to get ranges like this:

This is where Gaussian Process Regression comes in. It assumes that the black box function we are predicting is a Gaussian Process, which means it is continuous, doesn’t have any sharp edges, and overall tends to look like these:

Now, back to our prediction of where the test input, x=4, will evaluate. Using Gaussian Process Regression, we end up with predictions that are quite similar to what we would expect, and this turns out to work very well! We can relate the prediction of our center point to Gaussian Distributions by thinking of it as the mean, and the given room for error on top and bottom as the mean with the variance added or subtracted. With this in mind, all that we need is a way to get from our other known inputs and outputs to a mean and variance for a new/test input.

We can do this using some linear algebra and our idea of covariance from earlier. In order to complete the equations for them, however, we will need four elements: the covariance matrix of our known inputs (K), the covariance vector between our known inputs and the test input (K∗), its transpose (K∗^{T}), and the covariance of the test input with itself (K∗∗).

We’ve already seen a matrix when covering grid search, but here’s another example:

Which actually has an interesting property: **If we “transpose” this matrix, like in this example,**

*(Credit to Wikimedia)*

we end up with the same exact matrix. This is because it’s symmetric along the diagonal, which is why this variety of matrix is known as a **symmetric matrix.** In addition, you can see from the gif that we notate this operation with a superscript T, so the transpose of the original matrix A is written A^{T}.

We’re going to need a matrix for this step. This is where our idea of covariance comes into play. While I explained covariance as if it was only relevant for the inputs directly next to a target input (as in with our octopus arm), we can in fact calculate the covariance of any two inputs, no matter how far apart.

If we do this for every possible pair of inputs on our graph, we can get what is known as a **covariance matrix.** With a covariance matrix, we use a matrix so that we can look at the coordinates of a value in the matrix to know the two inputs between which the covariance was calculated. So, if we’re doing this on our already evaluated inputs, we’d expect a matrix like so:

**Coordinates added for clarity*

So, we have our pairs of inputs. Now for computing the covariance! For this we’ll be using a **covariance function**, a function that computes the covariance between two inputs (these are also sometimes referred to as *kernels*). One of the most common is the squared exponential, which, with a tunable parameter l, looks like:

*k(x, x′) = exp( −‖x − x′‖*^{2}* / l )*

I realize this is a bit complex, so let’s break it down:

- *‖x − x′‖* – The distance between two points on a 2D graph is *√( (x₁ − x₁′)² + (x₂ − x₂′)² )* (if you think about this in terms of the pythagorean theorem, it makes a lot more sense). Since we know that our variables are vectors of coordinates (they would be vectors of length 1, even if not multivariate), subtracting the entire vector from the other subtracts each element from its counterpart. From there, we just have to square each element, get the sum of all the results, and then get the square root – which is exactly what the vector norm does. By doing this, we are able to get the distance between two inputs, regardless of the dimension. Since we square the norm afterwards, the square root cancels out.
- *The exponent 2* – This could just as easily have been some other power, however 2 tends to work quite well in practice. We also have *l* as an easy way to change the amount of covariance our function generates, i.e. how wiggly or bendy our generated points will be (I usually set this to 1).
- *The negative sign* – Since *‖x − x′‖*^{2}* / l* will always be ≥ 0 (distance can’t be negative, just equal to 0 at the lowest; this is why it is squared, so that we can’t have a negative value), we add a negative to the front so that we will always get a value from 0 to 1 as the output (view the graph of *e*^{−x}* if you don’t believe me).

With all of this combined, we end up with a function of the distance: the higher the distance, the more negative the exponent, and therefore the closer our covariance for these two inputs is to 0 – the less they care about each other. The inverse is also true: the closer the inputs are, the closer our covariance for them is to 1 – the more they care about each other.
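A minimal Python sketch of this covariance function (the name `cov` and the parameter name `l` are my labels, with `l = 1` as in the text):

```python
import numpy as np

def cov(x1, x2, l=1.0):
    # exp(-||x1 - x2||^2 / l): 1 for identical inputs, -> 0 as they move apart.
    d = np.linalg.norm(np.asarray(x1, dtype=float) - np.asarray(x2, dtype=float))
    return np.exp(-d ** 2 / l)

print(cov([1.0], [1.0]))  # identical inputs -> covariance 1.0
print(cov([1.0], [2.0]))  # nearby inputs   -> ~0.37
print(cov([1.0], [9.0]))  # distant inputs  -> ~0.0
```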

This function is one of the most common and basic covariance functions, so let’s move on to update the idea of our covariance matrix for our known inputs:

And here is where some of the concepts I presented earlier regarding our example **symmetric matrix** come into play. Because of the way covariance functions work, this matrix will have several properties:

- It will be symmetric, e.g. K[i][j] = K[j][i]
- It will have the same values along the diagonal, e.g. K[i][i] = k(x_i, x_i) = 1
- Our matrix will be positive semi-definite (for more about this, look here). This is much more complicated, and you don’t need to worry about what this means unless you start inventing your own covariance functions.
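We can verify all three properties numerically on a few hypothetical already-evaluated inputs (same squared-exponential covariance as above, sketched inline):

```python
import numpy as np

def cov(a, b, l=1.0):
    return np.exp(-(a - b) ** 2 / l)

xs = np.array([0.0, 1.0, 3.0])  # hypothetical evaluated inputs
K = cov(xs[:, None], xs[None, :])

assert np.allclose(K, K.T)                     # symmetric
assert np.allclose(np.diag(K), 1.0)            # same values along the diagonal
assert np.all(np.linalg.eigvalsh(K) > -1e-12)  # positive semi-definite
```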

So now we have a covariance matrix for our already-known evaluations.

Now that you already know how to get a covariance matrix, this step is easy: it’s just the vector of covariances between each known input and the test input,

*K∗ = [ k(x₁, x∗), k(x₂, x∗), …, k(xₙ, x∗) ]*

We denote it K∗ because our test input can also be written as x∗, with the subscript ∗.

Simple: it’s just K∗^{T}, the transpose of the vector we just computed.

*Sidenote: Most math libraries have these matrix/vector operations already built-in.*

Scalars are just one dimensional vectors, so this last element is just the covariance of the test input with itself: *K∗∗ = k(x∗, x∗)*

There we go, done!

Now that our legendary quest for the four elements necessary to compute the test mean (μ∗) and test variance (σ∗²) of our test input (x∗) is complete, let’s assemble what we’ve got.

Which is equivalent to:

And since we’re assuming our evaluations are drawn from Gaussian Processes, as I explained back in the definition of this optimization method, we can actually write our known evaluations and predicted evaluation data as being **drawn from the distribution specified by a mean and variance**:

*[y, y∗] ~ N( [0, 0], [[K, K∗], [K∗^{T}, K∗∗]] )*

**DON’T PANIC!**

This formula is not as bad as it looks. There’s just a lot going on, so let’s break it down, left to right:

This is a vector where the top element contains the evaluations of our known black box function inputs, and the bottom element is actually not known yet. Using this formula we will be getting our mean and variance for this bottom element, our predicted evaluation.

This is just saying that whatever is on the left of the ~ is drawn from a distribution with the specified mean and variance. With this in mind, we see that we have a vector of two zeros where the mean should be. Since we are drawing two things from our distribution, our vector of known and test evaluations, we need two means to go with them.

Fortunately, we can just leave the mean as 0 for both of them. It will also make things a lot easier when calculating the test mean later. As for the variance, we see that we have this:

Where the top left corner element is for our known evaluations, since it’s the covariance of the known inputs, and everything else is both or exclusively the test input, in the case of the bottom right corner.

**It may seem pointless to have this formula at first, **but because we have it set up exactly like this, we can make use of some advanced statistical equations (the conditioning rules for multivariate Gaussians) that state this:

*μ∗ = μ + K∗*^{T}*K*^{−1}*(y − μ)*

*σ∗*^{2}* = K∗∗ − K∗*^{T}*K*^{−1}*K∗*

*Note: the μ on the right hand side of the above equation is the mean of the known evaluations, which is going to be 0, as we already know. It is not the test mean μ∗ we are trying to solve for, as that would not make much sense.*

And substituting in the values of our equation, remembering that our means are 0:

*μ∗ = K∗*^{T}*K*^{−1}*y*

*σ∗*^{2}* = K∗∗ − K∗*^{T}*K*^{−1}*K∗*

*Sidenote: K*^{−1}*, the matrix inverse, is similar to getting the reciprocal of a number, except for matrices. If you want to know more about this, I suggest here.*

So now we’ve got an equation for our mean and variance, as well as all the variables necessary! Using all of this, our values work out to:
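Putting the final mean and variance equations into numpy, with hypothetical known evaluations (this is a sketch of the math, not the post’s exact code):

```python
import numpy as np

def cov(a, b, l=1.0):
    return np.exp(-(a - b) ** 2 / l)

X = np.array([1.0, 3.0, 5.0])  # hypothetical known inputs
y = np.array([0.5, 1.0, 0.2])  # their (made-up) evaluations
x_star = 4.0                   # the test input

K = cov(X[:, None], X[None, :])  # covariance matrix of known inputs
K_star = cov(X, x_star)          # covariances of known inputs vs. test input
K_ss = cov(x_star, x_star)       # covariance of the test input with itself

K_inv = np.linalg.inv(K)
mu_star = K_star @ K_inv @ y               # test mean (prior means are 0)
var_star = K_ss - K_star @ K_inv @ K_star  # test variance

print(mu_star, var_star)
```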

Which if we graph, comes to be:

Where our bars above and below the point are when we add and subtract the variance to the mean, respectively. If we tune some of the parameters of our covariance function, and repeat our mean / variance calculations across the entire domain, we end up with this:

And that’s all there is to it! The only thing we have left to do to get to full Bayesian Optimization is to add an **acquisition function.** An acquisition function is a function that balances mean and variance to choose the next point to evaluate, from graphs like the one above.

I use the acquisition function known as **upper confidence bound.** In the case we are maximizing, like in this example, we get a value for every point in our domain that we’ve computed the mean and variance for via the following function:

*a(x) = μ(x) + κσ(x)*

(I use κ = 1.5)

And then we choose the index of the highest value in the resulting array/vector:

Then, repeat until you have used all the evaluations you want!

*Note: κ is referred to as the confidence interval for upper confidence bound.*

*Another note: This is the form for a maximization problem (e.g. accuracy); however, the standard deviation term would be negative for a minimization problem (e.g. cost).*
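A sketch of upper confidence bound selection over a few made-up means and variances, with the κ = 1.5 confidence interval from the cartpole test:

```python
import numpy as np

# Hypothetical means and variances computed across the domain.
mu = np.array([0.2, 0.8, 0.5, 0.1])
variance = np.array([0.3, 0.01, 0.4, 0.2])
kappa = 1.5

# Maximization form; the standard deviation term flips sign for minimization.
ucb = mu + kappa * np.sqrt(variance)
next_index = int(np.argmax(ucb))

print(next_index)  # -> 2: a decent mean plus high uncertainty wins
```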

That’s it for this series on Bayesian Optimization. Below, you can find an appendix containing some examples of the power of Bayesian Optimization, covariance and acquisition functions, as well as links to all the resources I drew on to learn all of this. If you look at my code on the Github, you will see that I started by doing this with simple python and numpy, looping through each test input in the domain to get means and variances. I then slowly upgraded as much as I could to theano (a machine learning library), and as a result changed many of my loops to vector and matrix calculations. I recommend looking back into the commit history or in the experimentation repository for the earlier versions. I also didn’t use some of the more complicated covariance and acquisition functions from the beginning.

Earlier, we went over what black box optimization was, some of the inherent problems with it, and some of the conventional methods of black box optimization. In this post, we’ll be starting with Gaussian Processes, a fundamental part of Bayesian Optimization. If you aren’t familiar with black box optimization yet, then I strongly recommend looking at the previous post in this series.

Bayesian Optimization is called Bayesian Optimization because **Bayesian methods tend to use Gaussian Distributions.** So we have to know what Gaussian Distributions are. If you’ve ever taken an IQ Test, the results are given in this form, also known as a “Bell Curve” or “Normal Distribution”:

*(Credit to Wikipedia)*

As you may be able to tell from the graph, the IQ points are on the x axis, and the percentages that obtained each score are on the y axis. Because the average IQ test score is 100, our distribution is centered on the 100 score. We can also see that there is an equal chance of someone scoring higher than 145 (0.1%), or lower than 55 (0.1%). These distributions are extremely useful for representing data, as well as for such things as *Bayesian Statistics,* where our method of optimization lies.

There are a few things we can learn about Distributions from this graph. First off, we can see that the average score, or **mean**, is 100. In mathematical notation, this is referred to as μ.

We can also see that our percentage/score intervals are 15 points wide. We can see that the farther away the interval is from the mean, the lower the chance is that someone will score inside of it (within one interval on the right side = 34.1%, within one interval on either side = 34.1 * 2 = 68.2%, and so on). This interval is known as the **standard deviation** for our distribution, notated as σ.

Once we know these two variables, we can generate any distribution we want. You can actually do just that using arbitrary values with this calculator.

While the standard deviation is really easy to understand in terms of interpreting distributions, it’s actually not what we use to generate them computationally and in much of our work using them. Instead, we use something called the **variance**, which represents how much our data samples vary from the mean. Simple, right? Fortunately I don’t have to break my promise that we only need two variables, as the variance is actually σ^{2}:

Yep, it’s just the standard deviation squared. So when you plug in a value for standard deviation into the aforementioned calculator, the calculator is really quickly squaring it to get the variance, since we don’t actually use the standard deviation anywhere in the function for a distribution. You don’t need to understand the entire derivation of the function for a distribution in order to understand distributions, but if you were wondering, the formula is as follows:

*f(x) = e*^{−(x − μ)² / (2σ²)}* / √(2πσ*^{2}*)*

**Don’t worry about how we get this formula. **I’ve put it up so you can see that we don’t use the standard deviation, just the variance and mean. However, here’s the wikipedia page in case you want to look it up.

*Sidenote: it follows by algebra from the fact that our variance is σ*^{2}* that our standard deviation is σ = √(σ*^{2}*), the square root of the variance.*

The term **normal distribution **will come up later, so it’s important to know the difference, and luckily it’s quite simple. People notate these differently, but I find it simpler as the following:

**Normal Distribution** – Has mean 0 and variance 1, and no other values for these. (It should be noted that the standard deviation is also 1, because √1 = 1.) This is sometimes called the *Standard Normal Distribution*.

**Gaussian Distribution **– Can have any mean or variance, and as a result the Normal Distribution is actually a Gaussian Distribution.

Here are some graphs showing Gaussian Distributions with different values for the mean and variance:

With which we can easily see the intuitive effect that changing the value of the mean and variance has on each graph. If you want the code for these to try/do them yourself, here you go:

```python
import numpy as np
import matplotlib
import matplotlib.pyplot as plt

font = {'size': 32}
matplotlib.rc('font', **font)

x = np.arange(-5, 5, .01)
dist = lambda x, mean, variance: (np.exp((-(x - mean)**2) / (2 * variance))) / np.sqrt(2 * variance * np.pi)

plt.plot(x, dist(x, 0, 0.2), label="mu=0.0, variance=0.2")
plt.plot(x, dist(x, 0, 1.0), label="mu=0.0, variance=1.0")
plt.plot(x, dist(x, 0, 5.0), label="mu=0.0, variance=5.0")
plt.plot(x, dist(x, 3, 1.0), label="mu=3.0, variance=1.0")
plt.xlabel("X-Axis")
plt.ylabel("Y-Axis")
plt.legend(bbox_to_anchor=(1, 1), loc=1, borderaxespad=0.)
plt.axis([-5, 5, 0, 1])
plt.show()
```

Again, optional. Now there are only two more concepts to know before we can move on to Gaussian Process Regression, which is the majority of Bayesian Optimization.

Covariance is a measure of how much two values change together; for the normalized covariance functions we’ll be using here, it’s measured on a scale from 0 to 1. We saw earlier with our graphs of distributions with different variances how the higher the variance was, the farther away points would be from each other (and vice versa). The same idea goes for covariance, except in this instance it is between pairs of inputs on a graph.

It is easier to imagine it with this example. Let’s say we have a graph represented by an octopus arm, where each input on the graph is a joint in the arm. If we had a **high covariance** between two joints, then our arm would be **more rigid and less bendy**, with covariance closer to 1. If we had a **low covariance**, then our arm would be **less rigid and more bendy**, with covariance closer to 0.

This is because if our covariance is closer to 1, then each joint cares more about the position of the joints to the left and right of it, and moves together with them. Because of this, the arm bends less. If our covariance is alternatively closer to 0, then each joint is looser and cares less about the position of the joints to the left and right of it. So, it bends more.

An example of a real life object whose joints have covariance close to 1 would be a pencil, since they (usually) don’t bend at all. Alternatively, one with covariance close to 0 would be a usb cable, since they bend a lot.

If we have the covariance of a single point, we can actually get the variance of a distribution for that point. More on this in part three.

In the upcoming section I’m going to throw the words multi-variable / multivariate around, and in my previous post I talked about how you may have more than one input to a function. This is easy to think about when we only have two inputs to the function, as it just gives a graph like:

*(Credit to Wikimedia)*

(This is known as a three-dimensional graph, by the way) However we immediately run into a wall when we have more than two inputs to a function. How are we supposed to visualize a four-dimensional graph? A thirteen-dimensional graph?

**It’s easy – we don’t!** (I’m sorry if you spent time trying, by the way)

Since our human brains are more or less limited to visualizing in only three dimensions, there are a couple of fortunate tricks we can use to think about larger dimensional graphs with ease (notice I said think about, not visualize). One of these is the idea of vectors as coordinates. Let’s say we have a vector, like *v = [3, 4]*:

We could also think of this as the coordinates (3, 4) on a 2D graph, the value 3 on the x-axis, and the value 4 on the y-axis. We can think of it like this as well:

Where we just substitute x for the first element, and y for the second. What about a vector with five elements, written either as a column vector or as a tuple of coordinates?

These are coordinates for a 5D graph, and we can see how the first four represent the inputs and the fifth represents the output value, no? And what’s great is we can do this without having to think of a 5D graph, whatever that would look like.

In summary, the trick to visualizing graphs with more than three dimensions is to not visualize them at all, but to think about them just in terms of their number of dimensions, with vectors as coordinates in them.
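To sketch this idea in code (the numbers here are made up purely for illustration), a point on a five-dimensional graph is nothing more than a vector of five numbers:

```python
# A "point" on a 5D graph is just a vector of five numbers:
# the first four are the input coordinates, the fifth is the output.
point = [3.0, 4.0, 1.5, 2.0, 7.7]  # hypothetical values

inputs = point[:4]   # the four input coordinates
output = point[-1]   # the single output coordinate

print(inputs, output)  # [3.0, 4.0, 1.5, 2.0] 7.7
```

No visualization required: the vector alone carries all the information the "graph" would.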

Finally, there will be times when thinking about the entire graph is helpful, in which case we can usually get away with picturing it in two or three dimensions. Most of the time, a Gaussian distribution in five dimensions (or 13, or 69) will look and behave like one in two or three dimensions, just with four inputs instead of two or one.

Now that we’ve covered all that, we will start on Gaussian Process Regression and Bayesian Optimization in the final post of this series. Thanks for reading!


My goal here is to provide complete and overarching explanations and implementations of algorithms useful in machine learning and artificial intelligence research and development. But if you don't care about understanding it, or already understand it, then you can view my (hopefully) well-commented code on the GitHub page. With that said, let's begin!

*(Credit to Wikipedia)*

Black box functions are very much like this picture shows: no different than a normal function, except we don't know what the function is. So for instance, if we had a function like f(x) = x²:

Then we can look at it and easily know that it's simply going to raise any input to the power of 2. However, if this were a black box function:

We have no idea what the operation(s) performed on the input(s) are, and therefore the function is a black box function.
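A minimal sketch of the distinction (the function bodies here are made up for illustration): a black box function is one we can only query, never inspect.

```python
def known_function(x):
    # We can read the source and see exactly what it does: square the input.
    return x ** 2

def black_box(x):
    # Pretend this body is hidden from us. All we can do is feed in
    # inputs and observe the outputs that come back.
    return x ** 2  # (invisible to us in a real black box)

# The only way to learn anything about a black box is to query it:
print(black_box(2))  # 4
print(black_box(3))  # 9
```

From the caller's point of view, both functions are just input-output mappings; the "black box" part is purely about what we are allowed to know.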

Optimizing black box functions requires knowing the difference between normal functions and black box ones. Now that we do, we can move on.

Generally in optimization, you want to find the global maximum or global minimum (though oftentimes a local maximum/minimum will do just fine). For instance, if we are a seller of Halloween Prop Skeletons, and we know that our sales relative to the number of skeletons produced is modeled by the function

Which looks like:

Really simple, right? For two-dimensional known functions, we can actually find the global maximum or minimum with complete certainty. With this function, anyone can look at the graph and see that the company will *maximize* their profits by producing 420 skeletons.

*Sidenote: Optimizing known functions is not as easy when it comes to functions with multiple inputs, where the dimension is greater than two (e.g. three-dimensional graphs); that uses a method known as gradient descent, which I won't be covering in this post. Black box optimization, however, is a problem no matter how many dimensions there are, and Bayesian black box optimization works regardless of dimension.*

*Black box optimization* would be when (more realistically) the company doesn't know an exact function for what their profit will be, so they plug in 100 values from 370 to 470, and end up with something like this:

This time, we don't have a known function, according to our definition of black box functions. We could try to recover a function to represent the data, but since the end goal is to find the maximum or minimum of the function, we instead just go straight for that. Unfortunately, we can't just look at this and conclude with complete certainty that the best number of skeletons to produce is 420. For an example of why we can't have complete certainty, let's go back to our example function from earlier, f(x) = x².

So we start by plugging in some values, and we then get some outputs, since that seems to be the easiest way to do things:

Hey, this looks just like our function from earlier! After all, we can look at this and see that it matches the behavior perfectly, *with the points we've tested*. But just to be sure, let's plug in a few more!

Uh-oh, suddenly our idea that the function was f(x) = x² has been blown out of the water. But if we had stopped testing points after our first four "proved" our hypothesis correct, we would have naively continued, with no idea we were completely wrong.

One of the key problems with black box optimization is that we can never know with 100% certainty what our function's best input is; to do so would mean testing an often-infinite number of input values. This means we have to draw the line somewhere. If we had decided that our arbitrary number of evaluations was four, and we picked these points, we would have been unaware of the true nature of this function (which, by the way, is actually this):

(Yes, I fooled you, but it was for your own good)
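To make the trap concrete, here's a made-up function that agrees with x² at the four points x = 0, 1, 2, 3 (the extra term vanishes at exactly those points), yet differs everywhere else:

```python
def sneaky(x):
    # Agrees with x**2 at x = 0, 1, 2, 3, because the second term is
    # zero at exactly those points -- but differs everywhere else.
    return x ** 2 + x * (x - 1) * (x - 2) * (x - 3)

# Our four test points "confirm" the x**2 hypothesis...
for x in [0, 1, 2, 3]:
    print(x, sneaky(x), x ** 2)  # identical pairs

# ...but a fifth point blows it out of the water.
print(sneaky(4), 4 ** 2)  # 40 vs 16
```

Any finite set of samples can be matched by infinitely many functions, which is exactly why stopping after four evaluations proves nothing.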

Another of the main reasons black box optimization is so tricky is the amount of time it can take to evaluate an input. If our skeleton manufacturer could only measure its profits once every Halloween, it would take 100 years to get the graph we now have, at which point your body would likely be just a skeleton. These examples may seem strange, but they illustrate the two pivotal problems:

**1. Cost may be high – it may take a long time for every evaluation**

**2. Number of inputs to test may be high – there may be an enormous number of possible inputs for our function.**

*Sidenote: Our second pivotal problem is oftentimes magnified by the number of dimensions. For instance, our company may have the number of skeletons produced, the number of jack-o'-lanterns produced, and the number of bags of candy corn produced as inputs, with one axis for profit.*

If we only had three options and it took milliseconds to test each one, we likely wouldn't even call it black box optimization; the problem only really becomes prevalent when at least one of the two is costly. A long time to test with a low number of inputs can still be a problem, a short time to test with a large number of inputs can still be a problem, and of course both together certainly are.

**Hand-Tuning**

This is simply going through and choosing our next input based on the last result or past results, with each choice made by the person doing the tuning.

**Grid Search**

What we did earlier with the skeleton manufacturer example is actually the method of black box optimization known as grid search. In our 2D example, we would represent the inputs we tested as:

Or, in our example:

With this example it's not obvious why it would be called grid search; after all, it's just a row. But when we have multiple dimensions for inputs (as is often the case in black box optimization), such as with:

And when we get all the possible configurations of inputs, it becomes like this:

As you can see, this looks much more like a grid, which is where the search type gets its name. So grid search is when we generate all the possible combinations of inputs, given a level of detail.

*Level of detail – how precise we want our measurements. In the case of all numbers from 1 to 10, a level of detail of 10 would give us 10 elements total (1, 2, 3, …, 10). We could use a lower number (e.g. a level of detail of 2 gives 1, 10), or a higher number (e.g. 100 gives 0.1, 0.2, 0.3, …, 10.0).*
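A minimal grid search sketch in Python (the `profit` function and the input domains are hypothetical stand-ins, not the real skeleton data): generate every combination of inputs at some level of detail, evaluate each one, and keep the best.

```python
import itertools

def profit(skeletons, pumpkins):
    # Stand-in black box function, made up for illustration.
    # Peaks at 420 skeletons and 100 pumpkins.
    return -(skeletons - 420) ** 2 - (pumpkins - 100) ** 2

# Level of detail: how finely we sample each input's domain.
skeleton_domain = range(370, 471, 10)  # 370, 380, ..., 470 (11 values)
pumpkin_domain = range(80, 121, 10)    # 80, 90, ..., 120   (5 values)

# The "grid" is every combination of the domains: 11 * 5 = 55 evaluations.
best_inputs, best_value = None, float("-inf")
for combo in itertools.product(skeleton_domain, pumpkin_domain):
    value = profit(*combo)
    if value > best_value:
        best_inputs, best_value = combo, value

print(best_inputs)  # (420, 100)
```

Note how the number of evaluations is the product of the domain sizes, which is exactly why the search space explodes as inputs are added.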

It’s quite possibly the simplest form of black box optimization, other than just hand-tuning. However, there are a few inherent problems with it:

**1. The search space quickly expands** – The size of our "grid" is the product of the sizes of all our input domains. E.g. if we have a level of detail of 100 and five different inputs, we end up with 100⁵ = 10,000,000,000. That's 10 billion different configurations, and it's not an unrealistic scenario: I have this exact situation with one of my reinforcement learning bots, and I'd like to have even more inputs or a higher level of detail.

**2. It's not intuitive** – Since our search method has no predictive ability, we aren't gaining any knowledge of the problem as we go; we are just picking the configuration that gave the best results.

**Random Search**

This is very similar to grid search, except instead of searching through every combination in a grid, we *randomly* choose a value from each of our domains to make a random combination of inputs, test it, and repeat as many times as we specify.
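A minimal random search sketch over the same kind of made-up problem (again, `profit` is a hypothetical stand-in): sample random input combinations a fixed number of times and keep the best one.

```python
import random

def profit(skeletons, pumpkins):
    # Stand-in black box function, made up for illustration.
    return -(skeletons - 420) ** 2 - (pumpkins - 100) ** 2

n_trials = 200  # we decide the evaluation budget up front
best_inputs, best_value = None, float("-inf")
for _ in range(n_trials):
    # Randomly pick one value from each input's domain.
    combo = (random.uniform(370, 470), random.uniform(80, 120))
    value = profit(*combo)
    if value > best_value:
        best_inputs, best_value = combo, value

print(best_inputs, best_value)  # typically lands near (420, 100)
```

Unlike grid search, the budget here is decoupled from the number of inputs: 200 trials cost the same whether we have two inputs or ten.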

This sounds crazy, but oftentimes it works really well, because some of our parameters almost always have a higher impact on the result than others. Here's a diagram showing this exact case and why random search often does better than grid search on such problems (credit to James Bergstra and Yoshua Bengio's paper on the topic):

This is actually quite nice, but since it searches *randomly* (it's a non-deterministic algorithm), we can't run the same program twice and expect similar results every time.

While random search does a really good job most of the time, I personally don't like it. The reason is that I'd prefer a method that gives the equivalent results of a good random search run without such a massive amount of randomness in it: one where we get more or less the same results every run (a deterministic algorithm).

Thankfully, there are many algorithms that achieve this, albeit at the cost of being much more complicated than hand-tuning, grid search, or random search. I will be covering one of them, Bayesian Optimization, in part two of this post series.