The Weary Travelers

A blog for computer scientists


Date: 2023-10-15
Author: Chris

Why do loss functions have to be additive, anyway?

By now there are literally thousands of machine learning papers, probably hundreds of thousands, that have an equation that looks like the following: \[ \theta = \mbox{argmin}_\theta \sum_{(x, y)\in {\mathcal D}} {\mathcal L}(f(x; \theta), y), \] where \(\theta\) is a parameter vector for the model, \({\mathcal D}\) is the collection of labeled training pairs \((x, y)\), and \({\mathcal L}\) is whatever loss function you want to use. Why do they write it that way? So that they can make a statement that's as general as possible, or at any rate, a statement that doesn't contain any unnecessary specificity. Given that strong preference for generality, why aren't there thousands of papers writing, \[ \theta = \mbox{argmin}_\theta \; {\mathcal L}(f; \theta| {\mathcal D}), \] where \(\mathcal{L}\) is an even more general loss function that takes a whole dataset?
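To make the contrast concrete, here is a minimal sketch in Python. All of the names (`model`, `per_example_loss`, and so on) are illustrative placeholders rather than any particular library's API.

```python
import numpy as np

def additive_loss(model, params, dataset, per_example_loss):
    # The ubiquitous form: one loss term per (x, y) pair, summed up.
    return sum(per_example_loss(model(x, params), y) for x, y in dataset)

def dataset_loss(model, params, dataset):
    # The more general form L(f; theta | D): the loss sees the whole
    # dataset at once, so it is free to couple examples to one another.
    preds = np.array([model(x, params) for x, _ in dataset])
    targets = np.array([y for _, y in dataset])
    residuals = np.abs(preds - targets)
    # One thing the additive form cannot express: penalize only the
    # worst residual, rather than the sum of all of them.
    return residuals.max()
```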

What is an Additive loss?

Such a loss function is called Additive because it's just a sum of loss function evaluations on individual training examples. Additive loss functions don't just show up in proposed models for training, either – they underpin every theoretical learning result I've ever heard of, or at least all those that I can remember. For instance, this paper[1] says that every model trained using Gradient Descent has a kernel representation. The argument works by using a result from the Neural Tangent Kernel[2] to show that indeed every model trained in the “usual way” has a realization as a kernel. But wait! Hidden in the premises of Theorem 1 is the following:

Suppose the model \(y = f_w(x)\), with \(f\) a differentiable function of \(w\), is learned from a training set \(\{(x_i, y^*_i)\}_{i=1}^{m}\) by gradient descent with differentiable loss function \(L = \sum_i L(y^*_i, y_i)\) and learning rate \(\epsilon\).

Lo and behold, there it is! The additive loss. Is it there by reflex just because people always write that? It turns out that the proof of Theorem 1 really does use that fact, which means that the claim might not even be true without it! The fifth line of the proof says it very plainly:

Applying the additivity of the loss and the chain rule of differentiation:

Maybe, just maybe, if we optimized something else, we’d get something more expressive than just a kernel machine!

What does an Additive loss imply?

The analog of an additive loss in human learning is the quiz. That is, in a quiz (or test, exam, homework, etc.), you are asked a bunch of questions, someone grades them one by one, and then you get a score. But is this the only way to measure someone's understanding, and is maximizing your grade the only way to learn?

Imagine if lawyers cross-examining a witness had to calculate a score, and the jury decided the case based on whether the score was high enough. The only way to catch someone lying would be if they got less than \(100\%\) on their quiz. But it often happens that people, even experts, get minor details wrong; you couldn't try someone that way. Instead, what lawyers actually do is check whether the person on the witness stand has said something that contradicts something else they said previously, regardless of whether either statement was correct in some external sense. They are evaluating a loss function \({\mathcal L}(f; \theta| {\mathcal D})\) that considers only whether the witness's answers are self-consistent, so long as they are otherwise plausibly accurate. How would we design an \({\mathcal L}(f; \theta| {\mathcal D})\) that measures not only the external consistency of the function with the dataset \({\mathcal D}\), but also its internal consistency among all of the pairs of training examples? For one thing, you would have to know what it means for a model to commit to something, so that you could then catch it violating that commitment later. Explainability[3] offers one way, but that leaves open the question of how to do this for black-box models.
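As a toy illustration (my own sketch, not a method from the literature), suppose that “committing to something” just means that similar inputs should receive similar predictions. Then one candidate internal-consistency loss couples every pair of training examples; `model` and `similarity` are hypothetical placeholders, and predictions are assumed to be scalars:

```python
import numpy as np

def consistency_loss(model, params, dataset, similarity):
    # A hypothetical internal-consistency term: the more similar two
    # training inputs are, the more the model's predictions on them
    # are required to agree.
    xs = [x for x, _ in dataset]
    preds = np.array([model(x, params) for x in xs])
    total = 0.0
    for i in range(len(xs)):
        for j in range(i + 1, len(xs)):
            # Each term depends on a *pair* of examples, so this sum
            # cannot be rewritten as a sum of per-example losses.
            total += similarity(xs[i], xs[j]) * (preds[i] - preds[j]) ** 2
    return total
```

The double loop is the point: the loss is still a sum, but over pairs, so it does not fit the additive template at the top of the post.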

What’s so bad about Additive Losses?

Additive losses are everywhere. Why? They are easy to write, they are easy to code, they are guaranteed to have linear run time in the size of the dataset, and they are easy to analyze. One could even argue that machine learning has gotten pretty far with them. What more could you ask?

One thing you can ask is for them to replicate something about human experience, and as I argued above – they do! But there are also things in human experience that are not additive, or at least, that are not quizzes. Now that LLMs have more or less fulfilled, or started to fulfill, the promise of machine learning[4], we should be asking how an algorithm might capture reasoning. Reasoning is strictly disjunctive, meaning that you can reject a hypothesis if a single item is out of place, and additive losses will simply never be that. Additivity implies fungibility, which implies the ability to compensate for errors. An Additive loss function treats one whopper of an error as being the same as a hundred rounding errors, but this is not at all how we evaluate our own or other people's behavior. Non-linear loss functions try to prevent this from happening, but they will always break down in various ways.
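To see the fungibility concretely, here is a toy numerical example (mine, not from any of the papers cited): under a summed squared error, one error of size \(10\) contributes exactly as much as a hundred errors of size \(1\).

```python
import numpy as np

one_whopper = np.array([10.0] + [0.0] * 99)  # one large error
many_small = np.ones(100)                    # a hundred rounding errors

# A summed squared-error loss cannot tell the two apart:
print(np.sum(one_whopper ** 2))  # 100.0
print(np.sum(many_small ** 2))   # 100.0

# A non-additive, worst-case loss can:
print(np.max(np.abs(one_whopper)))  # 10.0
print(np.max(np.abs(many_small)))   # 1.0
```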

Conscious learning[5] is also often retrospective, meaning that we will pick a particular example of something that bothers us and do a complex, backtracking search until we find a solution that explains that instance and reconciles with our other beliefs. Reasoning even includes the process of determining how much error is fungible with zero. SGD with an Additive loss doesn't offer anything like that. Even so-called “in-context reasoning”[6] is just evidence of a model interpolating past human episodes; there is no reason to think it will ever truly extrapolate. The problem is that it has seen answers to many such questions, and replicating a behavior does not imply that any reasoning has taken place. That was the moral of the story of Clever Hans.

Alternatives

In some very specific problem domains, especially in computer vision, there are indeed examples of loss terms that intend to minimize the discrepancy between pairs of examples[7]. These examples are a bit milquetoast because they still have an Additive loss doing all of the heavy lifting, but they reflect the beginnings of a realization that there are important things that an Additive loss simply cannot express.
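Schematically, those papers optimize an objective of roughly the following shape. This is a paraphrase of the general pattern, not any paper's actual code; `per_example_loss` is a placeholder, and `consistency_loss` could be something like the pairwise sketch from earlier.

```python
def combined_loss(model, params, dataset, per_example_loss,
                  consistency_loss, lam=0.1):
    # An ordinary additive loss still does the heavy lifting, with a
    # pairwise-consistency term bolted on and weighted by a
    # hyperparameter lam. In the constrained formulation of Zhang and
    # Yan, lam plays the role of the Lagrange multiplier.
    additive = sum(per_example_loss(model(x, params), y)
                   for x, y in dataset)
    return additive + lam * consistency_loss(model, params, dataset)
```

But what else can a non-additive loss function do?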

Anything that is additive is effectively computing an average[8]. This means that Additive loss functions are training models to be correct on average. Surely, if the average goes to \(0\), then any other measure must also be controlled, right? Well, not quite. For one thing, if it doesn't get all the way there, then you don't have that kind of control. And for another, this is the training error we're talking about, not the test error, which almost never goes to \(0\). If you had to bank on the average case, the median case, or the worst case[9], would you be comfortable putting all of your eggs in the average-case basket all of the time? In real life, the cost of wrong predictions can be non-uniform, and sometimes you may care about, e.g., the average False Positive and the worst False Negative case, or vice versa, or something else.

For that matter, how would you even minimize the median training residual, rather than the average? Let's start by noticing that the median and the worst case are both quantiles. One way to be non-additive, then, is to make a histogram of the residuals and compute the gradient with respect to the quantile you care about. How would you do that? One way would be to compute a false residual for each example at or above the quantile you care about – one that would only move it down to the next lower bin – and then compute gradients for the false residuals.
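Here is one possible reading of that recipe as a PyTorch sketch. The text above leaves the details open, so the binning and the construction of the false targets are my own assumptions; `model` is assumed to map a batch of inputs to one scalar prediction per example.

```python
import torch

def quantile_descent_step(model, xs, ys, q=0.5, lr=1e-2, num_bins=20):
    # One descent step on the q-th quantile of the residuals.
    preds = model(xs)
    residuals = (preds - ys).detach()
    abs_res = residuals.abs()

    # Histogram the absolute residuals, and find the bin edge just
    # below the quantile we care about.
    edges = torch.linspace(0.0, abs_res.max().item(), num_bins + 1)
    threshold = torch.quantile(abs_res, q)
    below = edges[edges < threshold]
    lower_edge = below.max() if below.numel() > 0 else edges[0]

    # "False targets": each example at or above the quantile is only
    # asked to shrink its residual down to the lower bin edge, rather
    # than all the way to zero.
    shrink = torch.clamp(abs_res - lower_edge, min=0.0)
    false_targets = preds.detach() - torch.sign(residuals) * shrink
    mask = (abs_res >= threshold).float()

    # An ordinary squared loss against the false targets. Which
    # examples get perturbed, and by how much, depends on the whole
    # batch of residuals, which is what makes this non-additive.
    loss = (mask * (preds - false_targets) ** 2).sum()
    loss.backward()
    with torch.no_grad():
        for p in model.parameters():
            if p.grad is not None:
                p -= lr * p.grad
                p.grad.zero_()
    return loss.item()
```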

One might ask whether having corrective perturbations for each example is still Additive, just with different labels than the ground-truth labels. The difference is that the perturbation you need to make for \(f(x_i)\) is not a function of \(x_i\) alone. In other words, yes, this loss adds something up, and there's one term per example, but it can't be described by the equation at the top.

Perhaps the combination of using the entire set of residuals to a) select a particular subset of examples to correctively perturb, and b) gradient-descend those examples towards something other than the literal ground-truth labels – something that's a function of the whole training set – is materially distinct from an Additive loss. Will that yield a model of reasoning? Probably not. This histogram-based loss is meant to show that there is at least one non-trivial example of a Non-Additive loss that has a plausible justification and could be implemented in a straightforward way. And at least it would be a departure from the local neighborhood of Additive losses.

That’s it. That’s the idea.

A wild conjecture

Here's another really wild idea – we already know that if you optimize the average case over your training set, then the worst case is tractable to analyze. That's how machine learning theory has worked from the beginning. But suppose it were to turn out that if you optimize the worst case over your training set, then the average case becomes tractable to analyze. After all, the \(\ell_1\) and \(\ell_\infty\) norms are duals of each other.

We’re not solving problems here, just asking questions. If you have any other ideas, let us know in the comments.

Comments

Comments can be left on Twitter or Mastodon, as well as below, so have at it.


Footnotes:

[1]

Every Model Learned by Gradient Descent Is Approximately a Kernel Machine
Pedro Domingos, arXiv 2020
Pedro Domingos does interesting research, but we don’t endorse many of his other opinions…

[2]

Neural Tangent Kernel: Convergence and Generalization in Neural Networks
Jacot et al., NeurIPS 2018
This paper will be the subject of a future post.

[3]

Explainability is the property that each prediction made by the model can be causally ascribed to a certain feature or set of features in the input. This implies that the model has rules, even if it isn’t “rules based” per se.

[4]

See the final thoughts section of our Batch Normalization review for more on this.

[5]

As opposed to habituation, which probably really is captured by Stochastic Gradient Descent.

[6]

In-context reasoning is where you give an LLM such as GPT-3.5 a set of premises and a question that requires some amount of reasoning to answer. If it succeeds, then we can conclude that all of the information needed to derive the answer was contained in its context window. But did it “reason”?

[7]

A quick Google Scholar search turned up the following:
See More for Scene: Pairwise Consistency Learning for Scene Classification
Chen et al., NeurIPS 2021

Improving Representation Consistency with Pairwise Loss for Masked Face Recognition
Qian et al. ICCV workshops 2021

On the Value of Pairwise Constraints in Classification and Consistency
Zhang and Yan, ICML 2007
This one does something slightly different – rather than incorporating consistency as a loss term, it incorporates it as constraints. This has a very similar effect, and for the right Lagrange multiplier they can be made equivalent.

[8]

Sums and averages contain essentially the same information because they are only different by a known scalar factor.

[9]

I have to point out that the average / worst case we’re talking about here is the average / worst over your training set. In machine learning theory, there are worst case bounds that are taken over all training sets, assuming you’re planning to control the average case on each, and that is another matter altogether.