Bayesian Learning via Stochastic Gradient Langevin Dynamics
Welling, M. and Teh, Y.W. 2011. Bayesian learning via stochastic gradient Langevin dynamics. Proceedings of the 28th International Conference on Machine Learning (ICML-11), 681–688.
Conclusion
It is possible to sample from the posterior distribution using gradient information calculated from mini-batches.
Limitation and scope
The step-size has to either converge to \(0\) or be sufficiently small, such that the samples approach those of the posterior distribution.
Reasoning structure
The paper arrives at its conclusion in a constructive manner. I.e., they
- Describe an algorithm that uses gradients on mini-batches.
- Prove that the algorithm (eventually) allows you to sample from the posterior distribution, by proving sufficient conditions under which samples from their algorithm approximate the posterior samples from the Langevin Monte Carlo algorithm. This relies on the fact that
  - the injected noise \(\eta_t\) dominates the stochastic gradient noise \(h_t(\theta_t)\), so that the algorithm may well approximate Langevin Monte Carlo, and
  - the step-size \(\epsilon_t\) is sufficiently small or converges to \(0\), so that the Metropolis-Hastings rejection step in Langevin Monte Carlo may be ignored.
- Establish a criterion to ensure the algorithm has entered its posterior sampling phase.
  - This is when the injected noise \(\eta_t\) dominates the stochastic gradient noise \(h_t(\theta_t)\) in all directions.
  - The criterion relies on the batch-size \(n\) being sufficiently large for the central limit theorem to be applicable.
- Provide experimental validation of the above on a few examples where the posterior distribution has multiple modes, i.e., where sampling from the posterior has considerable utility over a point estimate.
Details
The algorithm
The paper begins by noting the equation for stochastic optimization \(\require{physics}\) \(\require{ams}\)
\begin{equation} \Delta \theta_t = \frac{\epsilon_t}{2}\Bigl( \grad{\log p(\theta_t)} + \frac{N}{n} \sum_{i=1}^{n} \grad{\log p(x_{ti}|\theta_t)} \Bigr) \end{equation}and the equation for Langevin Monte Carlo (LMC)
\begin{equation} \Delta \theta_t = \frac{\epsilon_t}{2}\Bigl( \grad{\log p(\theta_t)} + \sum_{i=1}^{N} \grad{\log p(x_{i}|\theta_t)} \Bigr) + \eta_t \end{equation}where \(\eta_t \sim N(0, \epsilon_t)\), in a form such that they are directly comparable.
Having done so, it implicitly observes (and completes) the following commutative diagram: just as stochastic gradient descent (SGD) replaces the gradient updates of gradient descent (GD) with stochastic gradients calculated over mini-batches, the paper posits (and proves) the existence of a stochastic version of Langevin Monte Carlo (LMC) that replaces gradients calculated over the whole dataset with stochastic gradients calculated over mini-batches.
Their Stochastic Gradient Langevin Dynamics algorithm is simply:
\begin{equation} \Delta \theta_t = \frac{\epsilon_t}{2}\Bigl( \grad{\log p(\theta_t)} + \frac{N}{n} \sum_{i=1}^{n} \grad{\log p(x_{ti}|\theta_t)} \Bigr) + \eta_t \end{equation}where \(\eta_t \sim N(0, \epsilon_t)\) and \(\epsilon_t\) satisfies the usual criteria for stochastic optimization:
\begin{equation} \sum_{t=1}^{\infty} \epsilon_t = \infty \mathrm{,} \quad \sum_{t=1}^{\infty} \epsilon_{t}^2 < \infty \end{equation}Or, when using preconditioning via a symmetric preconditioning matrix \(M\):
\begin{equation} \Delta \theta_t = \frac{\epsilon_t}{2} M \Bigl( \grad{\log p(\theta_t)} + \frac{N}{n} \sum_{i=1}^{n} \grad{\log p(x_{ti}|\theta_t)} \Bigr) + \eta_t \end{equation}where \(\eta_t \sim N(0, \epsilon_t M)\).
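To make the update concrete, here is a minimal NumPy sketch of the (unpreconditioned) SGLD update on a toy model of my own choosing: a unit-variance Gaussian likelihood over the mean \(\theta\) with a standard normal prior. The polynomial step-size schedule \(\epsilon_t = a(b+t)^{-\gamma}\) matches the form used in the paper's experiments, but the specific constants here are illustrative assumptions, not the paper's values.

```python
import numpy as np

def sgld_sample(x, n_iters=20000, batch_size=10, a=0.01, b=10.0, gamma=0.55, seed=0):
    """SGLD over the mean `theta` of a unit-variance Gaussian likelihood,
    with a standard normal prior. Returns the trajectory and the step sizes."""
    rng = np.random.default_rng(seed)
    N = len(x)
    theta = 0.0
    samples, steps = [], []
    for t in range(n_iters):
        eps = a * (b + t) ** (-gamma)                        # polynomial step-size schedule
        batch = rng.choice(x, size=batch_size, replace=False)
        grad_prior = -theta                                  # d/dtheta log N(theta; 0, 1)
        grad_lik = (N / batch_size) * np.sum(batch - theta)  # rescaled mini-batch gradient
        noise = rng.normal(0.0, np.sqrt(eps))                # injected noise eta_t ~ N(0, eps)
        theta += 0.5 * eps * (grad_prior + grad_lik) + noise
        samples.append(theta)
        steps.append(eps)
    return np.array(samples), np.array(steps)
```

Because this toy model is conjugate, the exact posterior is \(N\bigl(\tfrac{N\bar{x}}{N+1}, \tfrac{1}{N+1}\bigr)\), so the quality of the samples can be checked in closed form.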
Using the algorithm to sample from the posterior
The proof that the SGLD algorithm (eventually) produces samples from the posterior distribution relies on it (eventually) producing samples that approach a sequence generated by Langevin Monte Carlo (LMC), which the paper terms Langevin dynamics. It does this by first establishing when the samples approach the LMC proposals, and then establishing when the Metropolis-Hastings rejection rate approaches \(0\).
Reformulating the SGLD equation as below
\begin{equation} \Delta \theta_t = \frac{\epsilon_t}{2}\bigl( g(\theta_t) + h_t(\theta_t) \bigr) + \eta_t \end{equation}where \(g(\theta_t)\) is the gradient needed by Langevin Monte Carlo and \(h_t(\theta_t)\) is the error introduced by using stochastic gradients instead. When the injected noise \(\eta_t\) dominates the stochastic gradient noise \(h_t(\theta_t)\), the paper shows that the samples produced by SGLD approach the proposals of LMC. Additionally, when the step-size \(\epsilon_t\) is small enough, the Metropolis-Hastings acceptance probability approaches \(1\). Thus, when both these conditions are met, the samples of SGLD approach those of LMC, which is known to produce samples from the posterior distribution.
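The decomposition can be checked empirically. The sketch below (a toy 1-D Gaussian model; the model and all constants are my own illustrative choices, not from the paper) draws many mini-batch gradients and confirms that the error term \(h_t(\theta_t)\) has mean near zero but large variance:

```python
import numpy as np

# Toy model: unit-variance Gaussian likelihood over the mean theta,
# with a standard normal prior.
rng = np.random.default_rng(0)
N, n = 100, 10
x = rng.normal(2.0, 1.0, size=N)
theta = 0.5

# g(theta): the full-data gradient that Langevin Monte Carlo needs.
g = -theta + np.sum(x - theta)

# h_t(theta): the error of the rescaled mini-batch gradient relative to g.
errors = []
for _ in range(5000):
    batch = rng.choice(x, size=n, replace=False)
    g_hat = -theta + (N / n) * np.sum(batch - theta)
    errors.append(g_hat - g)

# The stochastic gradient is unbiased, so h_t averages to ~0,
# but an individual draw of h_t can be large.
print(np.mean(errors), np.std(errors))
```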
The paper establishes that when the batch-size is large enough for the central limit theorem to be applicable, and the step-size is small enough to ensure that the sampling threshold \(\alpha\) satisfies \(\alpha \ll 1\), the samples approach those of LMC.
\begin{equation} \frac{\epsilon_t N^2}{4n} \lambda_\mathrm{max} \bigl( M^\frac{1}{2} V_s M^\frac{1}{2} \bigr) = \alpha \ll 1 \end{equation}where \(M\) is a symmetric preconditioning matrix and \(V_s\) is the empirical covariance of the scores \(s\) at iteration \(t\). (As noted by Chris in personal communication, \(\mathrm{rank}(V_s) \le n\), where \(n\) is the batch-size: \(V_s\) is a sum of \(n\) rank-1 matrices, and rank is sub-additive.) The score \(s_{ti}\) of data item \(i\) at iteration \(t\) is given by
\begin{equation} s_{ti} = \grad \log p(x_{ti}|\theta_t) + \frac{1}{N} \grad \log p(\theta_t) \end{equation}An important observation that the paper makes regarding using samples from SGLD to estimate posterior statistics is that while the simple sample average is consistent, it is a higher-variance estimator. The paper notes in Section 4.2:
Observe however that because the step size decreases, the mixing rate of the Markov chain decreases as well, and the simple sample average will over-emphasize the tail end of the sequence where there is higher correlation among the samples, resulting in higher variance in the estimator. Instead we propose to use the step sizes to weight the samples.
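The weighting scheme the quote proposes is easy to state in code. Here is a sketch (the function name is mine, not the paper's) of the step-size-weighted estimator of a posterior expectation \(\mathbb{E}[f(\theta)]\):

```python
import numpy as np

def weighted_posterior_mean(samples, step_sizes, f=lambda th: th):
    """Estimate E[f(theta)] from an SGLD trajectory, weighting each sample
    by its step size rather than taking a plain sample average."""
    samples = np.asarray(samples, dtype=float)
    w = np.asarray(step_sizes, dtype=float)
    return np.sum(w * f(samples)) / np.sum(w)
```

With a decreasing schedule, this down-weights the highly correlated tail of the chain relative to the plain average; with equal step sizes it reduces to the simple sample mean.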
Historical Note
This paper was mentioned in the recent post on batch size vs. learning rate, where it was arguably misattributed as the paper that introduced the idea of the Langevin equation to the computer science world. While I don’t feel well-equipped to trace the origins of this idea through history, I’ll make an attempt regardless. Future versions of this post may update this section as I become aware of new information.
The original Langevin equation (Langevin 1908) described Brownian motion:
\begin{equation} m \dv{v}{t} = -\lambda v + \eta(t) \end{equation}A variation of this continuous form of the Langevin equation was applied to the study of non-equilibrium statistical mechanics (Graham 1973):
\begin{equation} \pdv{\phi(x, t)}{t} = - \pdv{V}{\phi(x, t)} + \eta(x, t) \end{equation}The connection between the continuous form of the Langevin equation and Monte Carlo techniques (Monte Carlo as in MCMC), in the context of computer simulation in gauge theories, was made in 1981 by Giorgio Parisi and others (Parisi and Wu 1981).
However, what is perhaps more relevant to the present topic is the discrete form of the Langevin equation. Langevin Monte Carlo (Rossky et al. 1978; Gottlieb et al. 1987) and HMC (equivalently, Hybrid or Hamiltonian Monte Carlo) were developed in the quantum physics community in the 70s and 80s (Duane 1985; Duane and Kogut 1985).
At least by 1990 (Kennedy 1990), the fact that Langevin Monte Carlo arises as a one-step approximation of HMC had become apparent in the Theoretical Physics community. Perhaps the credit for introducing this fact to the Computer Science and Machine Learning world goes to Radford Neal, who is both cited by the present paper and has a greater number of total citations per Google Scholar (Neal 2011).
Comments
Comments can be left on twitter, mastodon, as well as below, so have at it.
References:
Langevin, P. 1908. Sur la théorie du mouvement brownien. Compt. Rendus. 146, (1908), 530–533.
Graham, R. 1973. Statistical theory of instabilities in stationary nonequilibrium systems with applications to lasers and nonlinear optics. Springer Tracts in Modern Physics: Ergebnisse der exakten Naturwissenschaften, volume 66. Springer. 1–97.
Parisi, G., Wu, Y.S. and others 1981. Perturbation theory without gauge fixing. Sci. Sin. 24, 4 (1981), 483–496.
Rossky, P.J., Doll, J.D. and Friedman, H.L. 1978. Brownian dynamics as smart Monte Carlo simulation. The Journal of Chemical Physics. 69, 10 (1978), 4628–4633.
Gottlieb, S., Liu, W., Toussaint, D. and Sugar, R. 1987. Testing an exact algorithm for simulation of fermionic QCD. Physical Review D. 35, 8 (1987), 2611.
Duane, S. 1985. Stochastic quantization versus the microcanonical ensemble: Getting the best of both worlds. Nuclear Physics B. 257, (1985), 652–662.
Duane, S. and Kogut, J.B. 1985. Hybrid stochastic differential equations applied to quantum chromodynamics. Physical Review Letters. 55, 25 (1985), 2774.
Kennedy, A. 1990. The theory of hybrid stochastic algorithms. Probabilistic Methods in Quantum Field Theory and Quantum Gravity. Springer. 209–223.
Neal, R.M. and others 2011. MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte Carlo. 2, 11 (2011), 2.