Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

Ioffe and Szegedy, ICML 2015¹¹As of writing, this paper has just over \(50,000\) citations.

If you’re training neural nets, you’re probably using either Batch Normalization, or Layer Normalization²²See our outline of Layer Normalization. Back when this was written, Vanishing and Exploding gradients were still a major problem, and solving it arguably enabled many of the advances we’ve seen since then. Given that Batch size and Learning rate are equivalent, having better control over numerical stability may also let us crank up the batch size even further, allowing even faster training³³At least, if you’re training on a whole data-center.. But apart from solving an outstanding issue, Batch Normalization is, like Residual units, an easy drop-in addition to a model that gets an immediate improvement on ImageNet⁴⁴Architecturally, it’s an easy change, but you have to do other things to take advantage. See the Experiments section., (with that improvement distributed between faster training and better accuracy).

Let’s dig in!

So that’s the paper. What did I take away from it?

Batch Normalization is one of those ideas that once you’ve heard it it’s kind of obvious. It’s not that complicated to motivate, easy to implement, and provides clear benefits. Given that, this is one of the few papers that didn’t really need to be even eight pages to get its idea across… but see the next two points.
On first reading Section \(2\), it seemed like the paper was making a straw man argument to motivate their approach by first suggesting alternatives that would not work – to either update the weight matrices directly to produce the desired effect, or to update the gradients to produce the desired effect – yet, the paper cites five other papers that do just that. Good ideas really are obvious once you’ve heard them.
It’s also not obvious until you hear it that plain whitening is not enough. You also need to learn a non-zero mean and a non-unit variance because of how they interact with the non-linearity.
In a completely different way, this paper uses the original motivation from the ResNet paper – that any unit should be able to learn the Identity function. This is yet another unglamorous good idea that is easy to violate without noticing.
This paper also hits on another theme that gets my interest – symmetries. A major gap in the current understanding of Deep Neural Nets is a complete characterization of all the equivalent ways to represent the same function¹⁷¹⁷See our outlines of Visualizing the loss landscape and Sharp Minima Can Generalize For Deep Nets for discussion on this.. Batch Normalization is interesting because it induces symmetries in a way that makes it easier for gradient descent to consider novel functions.

Final thoughts

Lately, Large Language Models like GPT-4 and Claude have been demolishing notions of what machine learning can do. It’s hard not to be impressed at their abilities to generalize patterns that few people expected were even present at all in Trillion token-scale internet natural language corpora. Yet, abilities like these were always predicted by machine learning theory, going back to the late 1990s. The limitations were on data, which was ultimately solved by TCP/IP and HTTP/HTML; on compute, which was ultimately solved by NVidia¹⁸¹⁸Or less proximally by TSMC, or even less proximally by ASML., but even then there was still one remaining problem: numerical difficulties. Without a solution to the last one, there is no hope of realizing AI through machine learning, no matter the compute power that can be brought to bear. To me, that makes it the most interesting of the three problems because you simply can’t throw money at the problem.

As we’ve seen in past paper reviews, there is no single solution to numerical difficulties. Rather, results begin to appear when the last bad idea is removed. You can have all the brilliance in the world, and all the resources, but the right wrong idea will still lead you into mediocrity, because sometimes bad ideas look really good. In hindsight, the major advances aren’t necessarily brought about by new ideas¹⁹¹⁹As we saw here, Batch Normalization solves a lot of the same problems as Adam, ReLU, Glorot Initialization and Dropout. As bad ideas are removed, good ideas tend to be rediscovered more frequently because they are all that’s left., but the seemingly insignificant, or moderately significant change that quietly dropped something harmful, leading to a new plateau where everyone will live until the next ground-breaking bad idea is removed.

Thanks for reading to the end!

Comments

Comments can be left on twitter, mastodon, as well as below, so have at it.

New post!

Paper review: The Batch Normalization paperhttps://t.co/kPwFlBppUF

Reply here if you have comments.
— The Weary Travelers blog (@wearyTravlrsBlg) October 2, 2023

To view the Giscus comment thread, enable Giscus and GitHub’s JavaScript or navigate to the specific discussion on Github.

Footnotes:

As of writing, this paper has just over \(50,000\) citations.

See our outline of Layer Normalization

At least, if you’re training on a whole data-center.

⁴

Architecturally, it’s an easy change, but you have to do other things to take advantage. See the Experiments section.

⁵

Improving predictive inference under covariate shift by weighting the log-likelihood function
Shimodaira H., J. Statistical Planning and Inference, 2008

⁶

As we’ll see, the problem that Batch Normalization tackles isn’t Covariate Shift, it’s actually Univariate Shift.
See also our outline of a paper that rebuts this notion.

⁷

Adam: A Method for Stochastic Optimization
Kingma & Ba Arxiv 2014
Technically, Adam will also provide lower objective function values in addition to faster training.

⁸

Understanding the difficulty of training deep feedforward neural networks
Glorot & Bengio AISTATS 2010

⁹

Rectified Linear Units Improve Restricted Boltzmann Machines
Nair, Hinton ICML 2010

¹⁰

Improving neural networks by preventing co-adaptation of feature detectors
Hinton et al. arxiv 2012

¹¹

Though, the paper cites several others that tried to collect a lot more than just two statistics per output unit!

¹²

The Identity function is \(f(x) = x\). The need for being able to express the Identity function was the central motivation of the the ResNet paper.

¹³

If you’ve ever used Batch Normalization you’ve probably seen a warning that any layer feeding into a Batch Normalization unit should have its bias disabled… because Batch Normalization is learning its own.

¹⁴

For the sake of technical completeness, there is one detail – for numerical stability, a small \(\epsilon\) is added to the variance estimate.

¹⁵

Going Deeper with Convolutions
Szegedy et al., CVPR 2015

¹⁶

This was before Adam became the de facto optimizer. The paper doesn’t give training details, but the Inception paper says they used “asynchronous” SGD. Surely Batch Normalization helps with Adam as well, but it makes one wonder how much.

¹⁷

See our outlines of Visualizing the loss landscape and Sharp Minima Can Generalize For Deep Nets for discussion on this.

¹⁸

Or less proximally by TSMC, or even less proximally by ASML.

¹⁹

As we saw here, Batch Normalization solves a lot of the same problems as Adam, ReLU, Glorot Initialization and Dropout. As bad ideas are removed, good ideas tend to be rediscovered more frequently because they are all that’s left.

The Weary Travelers

Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

Summary