Examples of the Power of Bayesian Black Box Optimization
1. 10 Evaluations of a sample Black Box Function
Best input found: 2, Global Maximum: 2
2. Policy Gradient Reinforcement Learning
For this one, I used the hyperparameters for my policy gradient learner as inputs to the function. With the mini-batch size, learning rate, learning rate exponential decay rate, and discount factor, it had four different inputs to tune, with a level of detail of just 10 for each input. It ran each input three times and averaged the output to increase reliability, and each run took from one to five minutes.
I used the cartpole reinforcement learning environment provided by OpenAI, ran for 400 epochs, with 200 timesteps. I used the Matern covariance function (credit to Jasper Snoek, Hugo Larochelle and Ryan P. Adams from various universities for their paper in which they detail this covariance function), and the Upper confidence bound covariance function with confidence interval of 1.5.
It used a mini batch size range of 1 to 100, learning rate range of 1 to 10, learning rate decay rate range of 0 to 10 (where the value is plugged into -x/epochs, I have found this tends to give nice exponential decay curves), and discount factor range of 0 to 1.
For this comparison of methods, I did the following for each. I averaged over six runs. The subplots represent cost after each epoch, average timesteps survived over each mini batch after each epoch, and max timesteps survived over each mini batch after each epoch
I ran my grid search alternative algorithm on it for as long as it wanted, which was an excess of 200 evaluations. With that, it obtained the following results with the final parameters:
I then ran my bayesian optimization bot on it for 20 evaluations, and obtained the following:
So we can see that in a tenth of the evaluations, Bayesian Optimization won out by a longshot, and achieved parameters good enough to beat the cartpole environment on OpenAI.
More Covariance Functions
1. Dot Product
2. Brownian Motion
3. Squared Exponential
5. Example Periodic Function
6. Matérn Covariance Function*
* = Used this for cartpole test and has been best for me so far
More Acquisition Functions
1. Upper Confidence Bound*
2. Probability of Improvement**
3. Expected Improvement
* – Used this for cartpole test and has been best for me so far with , also keep in mind this is the form for a maximization problem (e.g. accuracy), however the standard deviation term would be negative for a minimization problem (e.g. cost).
** – is the cumulative distribution function up to the point given. is our
Resources & Sources
Videos and Video Lectures
- Several videos from mathematicalmonk on youtube, from whence I learned the first five of these covariance functions and about Unconstrained Gaussian Processes.
- A couple videos from the University of British Columbia’s Computer Science youtube lectures, from whence I learned about Constrained Gaussian Processes, including the necessary covariance matrices/vectors and the Multivariate Gaussian Distribution theorems. Go here to learn all about the math behind constraining Gaussian Processes for regression.
- Fast Bayesian Optimization of Machine Learning Hyperparameters on Large Datasets, By Aaron Klein, Stefan Falkner, and Frank Hutter of the University of Freiburg, and also by Simon Bartels and Philipp Hennig of the Max Planck Institute for Intelligent Systems. This is a good overarching paper detailing the method.
- Practical Bayesian Optimization of Machine Learning Algorithms, By Jasper Snoek, Hugo Larochelle and Ryan P. Adams of the University of Toronto, Universit´e de Sherbrooke and Harvard University, respectively. I learned about the Matérn Covariance Function from this paper, though it also gives a good overarching explanation.
Presentation Slides and Miscellaneous
- A Tutorial on Gaussian Processes, By Zoubin Ghahramani of the University of Cambridge(UK). I learned more about different free parameters in covariance functions such as lengthscales from these slides, though they also give a more intermediate tutorial on Gaussian Processes.
- Gaussian Processes, By Daniel McDuff of MIT. Good information on Covariance functions, covariance, and Gaussian Process regression again.
- A Tutorial on Bayesian Optimization for Machine Learning, By Ryan P. Adams of Harvard University. This gives a great visualization of different acquisition functions, how we use Gaussian Processes for Bayesian Optimization, as well as different types of covariance functions. It also speaks on a variant of the Matérn Covariance Function that I couldn’t get working, and integrating out Gaussian Process Hyper Parameters, which I didn’t implement.
- fernando on Github’s Bayesian Optimization Visualization iPython Notebook, whom had a really nice test black box function which I used at the top of this appendix.
- Guassian Random Vectors, By Robert G. Gallager of MIT. Go here for anything and everything about Gaussian Random Vectors, with a lot of math (you’ve been warned).
- Regression, By Robert G. Gallager of MIT. Go here for anything and everything about Gaussian Regression, with a lot of math (you’ve been warned).
- Covariance Functions, By Robert G. Gallager of MIT. Go here for anything and everything about covariance functions, with a lot of math (you’ve been warned).
- Gaussian Processes Covariance Functions and Classification, By Carl Edward Rasmussen of the Max Planck Institute for Biological Cybernetics. This is another overarching slide run-through.
- CSE515 Lecture Notes, By Roman Garnett of Washington University in St. Louis. This quickly and succinctly covers the most common acquisition functions, and I learned them first from here.
You can use any of my stuff (but it wouldn’t hurt to remember my name), and everything I’ve written in this series comes from these sources, or from my code that is openly available on my Github. Good luck, have fun.