Bayesian Learning via Stochastic Gradient Langevin Dynamics

Conclusion

It is possible to sample from the posterior distribution using gradient information calculated from mini-batches.

Limitation and scope

The step-size has to either converge to \(0\) or be sufficiently small.¹¹Such that the samples approach those of the posterior distribution.

Reasoning structure

The paper arrives at its conclusion in a constructive manner. I.e., they

Describe an algorithm that uses gradients on mini-batches.
Prove that the algorithm (eventually) allows you to sample from the posterior distribution.²²By proving sufficient conditions, under which samples from their algorithm approximate the posterior samples from the Langevin Monte Carlo algorithm. This relies on the fact that
- the injected noise³³ \(\eta_t\). dominates the stochastic gradient noise,⁴⁴ \(h_t(\theta_t)\). ^,⁵⁵So that the algorithm may well approximate Langevin Monte Carlo. and
- the step-size⁶⁶ \(\epsilon_t\). is sufficiently small or converges to \(0\).⁷⁷So that the Metropolis-Hastings rejection step in Langevin Monte Carlo may be ignored.
Establish criterion to ensure the algorithm has entered its posterior sampling phase.
- This is when the injected noise³³ \(\eta_t\). dominates the stochastic gradient noise⁴⁴ \(h_t(\theta_t)\). in all directions.
- The criterion relies on the batch-size⁸⁸ \(n\). being sufficiently large for the central limit theorem to be applicable.
Provide experimental validation of the above fact on a few examples where the data has multiple modes.⁹⁹I.e., where sampling from the posterior distribution has considerable utility over a point estimate.

Details

The algorithm

The paper begins by noting the equation for stochastic optimization \(\require{physics}\) \(\require{ams}\)

\begin{equation} \Delta \theta_t = \frac{\epsilon_t}{2}\Bigl( \grad{\log p(\theta_t)} + \frac{N}{n} \sum_{i=1}^{n} \grad{\log p(x_{ti}|\theta_t)} \Bigr) \end{equation}

and the equation for Langevin Monte Carlo (LMC)

\begin{equation} \Delta \theta_t = \frac{\epsilon_t}{2}\Bigl( \grad{\log p(\theta_t)} + \sum_{i=1}^{N} \grad{\log p(x_{ti}|\theta_t)} \Bigr) + \eta_t \end{equation}

where \(\eta_t \sim N(0, \epsilon_t)\), in a form such that they are directly comparable.

Having done so, it implicitly observes (and completes) the following commutative diagram:¹⁰¹⁰I.e., Similar to how stochastic gradient descent (SGD) replaces gradient updates in gradient descent (GD) with stochastic gradients calculated over mini-batches, the paper posits (and proves) the existence of a stochastic version of Langevin Monte Carlo (LMC) that replaces gradients calculated over the whole dataset with stochastic gradients calculated over mini-batches.

Their Stochastic Gradient Langevin Dynamics algorithm is simply:

\begin{equation} \Delta \theta_t = \frac{\epsilon_t}{2}\Bigl( \grad{\log p(\theta_t)} + \frac{N}{n} \sum_{i=1}^{n} \grad{\log p(x_{ti}|\theta_t)} \Bigr) + \eta_t \end{equation}

where \(\eta_t \sim N(0, \epsilon_t)\) and \(\epsilon_t\) satisfies the usual criteria for stochastic optimization:

\begin{equation} \sum_{t=1}^{\infty} \epsilon_t = \infty \mathrm{,} \quad \sum_{t=1}^{\infty} \epsilon_{t}^2 < \infty \end{equation}

Or, when using preconditioning via a symmetric preconditioning matrix \(M\):

\begin{equation} \Delta \theta_t = \frac{\epsilon_t}{2} M \Bigl( \grad{\log p(\theta_t)} + \frac{N}{n} \sum_{i=1}^{n} \grad{\log p(x_{ti}|\theta_t)} \Bigr) + \eta_t \end{equation}

where \(\eta_t \sim N(0, \epsilon_t M)\).

Using the algorithm to sample from the posterior

The proof that the SGLD algorithm (eventually) produces samples from the posterior distribution relies on it (eventually) producing samples that approach a sequence generated by Langevin Monte Carlo (LMC).¹¹¹¹What the paper terms Langevin dynamics. It does this by first establishing when the samples approach the proposals by LMC proposals, and then establishing when the Metropolis-Hastings rejection rate approaches \(0\).

Reformulating the SGLD equation as below

\begin{equation} \Delta \theta_t = \frac{\epsilon_t}{2}\bigl( g(\theta_t) + h_t(\theta_t) \bigr) + \eta_t \end{equation}

where, \(g(\theta_t)\) is the gradient needed by Langevin monte carlo and \(h_t(\theta_t)\) is the error introduced by using stochastic gradients instead. When the injected noise³³ \(\eta_t\). dominates the stochastic gradient noise⁴⁴ \(h_t(\theta_t)\). , the paper shows that the samples produced by SGLD approach the proposals of LMC. Additionally, when the step-size⁶⁶ \(\epsilon_t\). is small enough the Metropolis-Hastings acceptance probability approaches 1. Thus, when both these conditions are met the samples of SGLD approach those of LMC.¹²¹²Where it is known that LMC produces samples from the posterior distribution.

The paper establishes that when the batch-size is large enough for the central limit theorem to be applicable and the step-size is small enough to ensure that the sampling threshold¹³¹³ \(\alpha\). is \( \ll 1\) then the samples approach those of LMC.

\begin{equation} \frac{\epsilon_t N^2}{4n} \lambda_\mathrm{max} \bigl( M^\frac{1}{2} V_s M^\frac{1}{2} \bigr) = \alpha \ll 1 \end{equation}

where \(M\) is a symmetric preconditioning matrix, \(V_s\)¹⁴¹⁴As noted by Chris in personal communication, \(rank(V_s) \le n\) where \(n\) is the batch-size. This is because of the sub-additivity of rank and because \(V_s\) is made up of a sum of \(n\) rank-1 matrices. is the empirical covariance of the scores \(s\) at iteration \(t\), and the score \(s_{ti}\) of data item \(i\) at iteration \(t\) is given by

\begin{equation} s_{ti} = \grad \log p(x_{ti}|\theta_t) + \frac{1}{N} \grad \log p(\theta_t) \end{equation}

An important observation that the paper makes regarding using samples from SGLD to estimate the posterior statistics, is that while the sample average is consistent it is a higher-variance estimator. The paper notes in Section 4.2:

Observe however that because the step size decreases, the mixing rate of the Markov chain decreases as well, and the simple sample average will over-emphasize the tail end of the sequence where there is higher correlation among the samples, resulting in higher variance in the estimator. Instead we propose to use the step sizes to weight the samples.

Historical Note

This paper was mentioned in the recent post on batch size vs learning rate where it was arguably misattributed as being the paper to introduce the idea of the Langevin equation to the computer science world. While I don’t feel well-equipped to trace the origins of this idea through history, I’ll make an attempt regardless. Future versions of this post may update this section as I become aware of new information.

The original Langevin equation¹⁵¹⁵ Langevin, P. 1908. Sur la théorie du mouvement brownien. Compt. rendus. 146, (1908), 530–533. described Brownian motion:

\begin{equation} m \dv{v}{t} = -\lambda v + \eta(t) \end{equation}

A variation of this continuous form of Langevin equation was applied to the study of non-equilibrium statistical mechanics:¹⁶¹⁶ Graham, R. 1973. Statistical theory of instabilities in stationary nonequilibrium systems with applications to lasers and nonlinear optics. Springer tracts in modern physics: Ergebnisse der exakten naturwissenschaftenc; volume 66. Springer. 1–97.

\begin{equation} \pdv{\phi(x, t)}{t} = - \pdv{V}{\phi(x, t)} + \eta(x, t) \end{equation}

The connection between the continuous form of the Langevin equation and Monte Carlo¹⁷¹⁷Monte Carlo as in MCMC. techniques (in the context of computer simulation in gauge theories) was made in 1981 by Georgio Parisi and others.¹⁸¹⁸ Parisi, G., Wu, Y.S. and others 1981. Perturbation theory without gauge fixing. Sci. sin. 24, 4 (1981), 483–496.

However, what is perhaps more relevant to the present topic is the discrete form of the Langevin equation. Langevin Monte Carlo¹⁹¹⁹ Rossky, P.J., Doll, J.D. and Friedman, H.L. 1978. Brownian dynamics as smart monte carlo simulation. The journal of chemical physics. 69, 10 (1978), 4628–4633. ^,²⁰²⁰ Gottlieb, S., Liu, W., Toussaint, D. and Sugar, R. 1987. Testing an exact algorithm for simulation of fermionic qcd. Physical review d. 35, 8 (1987), 2611. and HMC²¹²¹Equivalently, Hybrid or Hamiltonian Monte Carlo. were developed in the quantum physics community in the 70s and 80s.²²²² Duane, S. 1985. Stochastic quantization versus the microcanonical ensemble: Getting the best of both worlds. Nuclear physics b. 257, (1985), 652–662. ^,²³²³ Duane, S. and Kogut, J.B. 1985. Hybrid stochastic differential equations applied to quantum chromodynamics. Physical review letters. 55, 25 (1985), 2774.

At least by 1990²⁴²⁴ Kennedy, A. 1990. The theory of hybrid stochastic algorithms. Probabilistic methods in quantum field theory and quantum gravity. Springer. 209–223. the fact that Langevin Monte Carlo arises as a one-step approximation of HMC had become apparent in the Theoretical Physics community. Perhaps the credit for introducing this fact to the Computer Science and Machine Learning world goes to Radford Neal, who is both cited by the present paper and has a greater number of total citations per Google Scholar.²⁵²⁵ Neal, R.M. and others 2011. Mcmc using hamiltonian dynamics. Handbook of markov chain monte carlo. 2, 11 (2011), 2.

Comments

Comments can be left on twitter, mastodon, as well as below, so have at it.

New post!

Paper outline: Bayesian Learning via Stochastic Gradient Langevin Dynamicshttps://t.co/9mEIeRLQKD

Reply here if you have comments.
— The Weary Travelers blog (@wearyTravlrsBlg) July 30, 2023

To view the Giscus comment thread, enable Giscus and GitHub’s JavaScript or navigate to the specific discussion on Github.