The Weary Travelers

A blog for computer scientists


Date: 2023-11-19
Author: Chris

Sense and nonsense can be arbitrarily close together: The Last Inch Problem

Continuing the theme from our earlier post on additive loss functions, this week we’re again revisiting the foundational assumptions of modern Statistical Learning.

The main idea of Statistical Learning is that we can use sampling alone to discover homogeneous regions in input space. That is, given an example in a classification task, the inductive bias is that there must be some ball or other region around it within which all other points have the same label. Given an embedding vector for a word token, all other points nearby must represent the same token. Given a latent embedding vector… you get the idea.
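
A one-nearest-neighbor classifier is perhaps the most literal embodiment of this bias: whatever ball around a query point reaches its nearest labeled neighbor is assumed to be pure. Here is a minimal sketch, with toy data invented purely for illustration:

```python
import numpy as np

def nearest_neighbor_predict(X_train, y_train, x):
    """Predict the label of x as the label of its nearest training example.
    This is the local-homogeneity bias taken literally: the ball around x
    that reaches its nearest neighbor is assumed to be label-pure."""
    distances = np.linalg.norm(X_train - x, axis=1)
    return y_train[np.argmin(distances)]

# Toy 2-D data: two Gaussian clusters. The assumption works between the
# cluster centers and silently fails wherever the true classes interleave.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (50, 2)), rng.normal(3.0, 1.0, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
print(nearest_neighbor_predict(X, y, np.array([2.9, 3.1])))  # most likely 1
```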

The sampling part is the idea that we can use statistics to discover those regions, which is a bit like using statistics to prove that a coin is fair. That is, you can't. Proving a coin is unfair is the forte of statistics, but it simply does not afford the ability to prove that a coin truly has \(50-50\) odds, because its axioms don't allow for that[1]. Yet flipping a coin is often used in high-stakes public events, including Sporting Events and even some Political outcomes.
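
To see the asymmetry concretely, here is a minimal sketch of a two-sided test of fairness; the flip counts are invented for illustration. A small p-value lets you reject fairness, but a large one never establishes it:

```python
import math

def two_sided_p_value(heads: int, flips: int, p: float = 0.5) -> float:
    """Normal-approximation p-value for the null hypothesis that the coin
    lands heads with probability p."""
    mean = flips * p
    sd = math.sqrt(flips * p * (1.0 - p))
    z = abs(heads - mean) / sd
    return math.erfc(z / math.sqrt(2.0))  # two-sided tail probability

# Hypothetical: 10,000 flips of a coin with a slight, unknown bias.
heads = 5180
print(f"p-value = {two_sided_p_value(heads, 10_000):.4f}")  # small: reject "fair"
# A large p-value would NOT prove fairness; it would only fail to rule
# fairness out. At best, statistics can say the coin is probably close
# enough to fair, which is the Probably Approximately Correct flavor of
# guarantee, not a proof of true 50-50 odds.
```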

But notice that no one ever bothers to check the actual coin used in such a tossing process[2], because the reasoning for that decision never had anything to do with statistics in the first place. That reasoning relies on an altogether different set of metaphysical axioms[3].

Clearly, there have to be limits to the homogeneity principle, and furthermore there’s absolutely no reason to think that the homogeneity radius around every observation is the same, or isotropic, or even non-zero. But that’s why statistics is so useful, isn’t it? We already know that we won’t have local homogeneity around each point, so why not search for the regions that are as close to purity as possible?

The problem is that in interesting tasks, the cost of being wrong can be arbitrarily high in a neighborhood that's arbitrarily small. In some cases, the cost of being wrong will be higher than the cost of achieving the desired statistical purity that would give one sufficient confidence to ignore it. In other words, Sense and Nonsense can be arbitrarily close[4].

This fact, and our awareness of it, is reflected in some human habits. For instance, when communicating with another person, even, or especially, when they are hostile, the most useful example is often precisely the one that lies on a boundary, meaning its radius is \(0\)[5]. More generally, when someone realizes they're on a precipice, suddenly they start choosing their steps very carefully.

In practice there are many tasks wherein the difference between sense and nonsense can be arbitrarily close to 0; we'll walk through a few in the Examples below.

Ultimately, it depends on whether you can tolerate even a single speck of sand in your machine. Some machines are built for it; others fail catastrophically. Machine learning is a bit weird in that for most real-world tasks, if you want to be Probably Approximately Correct, then statistics is all you need – you can tolerate sand, and with what appears to be an exponentially lower-bounded amount of work, you can shake most of the grit out of your model. But you'll never get a Swiss watch that way.

This idea can be formulated somewhat more precisely as the Last Inch Problem:

Using statistical / empirical risk minimization alone, directly on example-wise error, will result in a situation wherein each successive order of magnitude by which the error is reduced will tend to require at least as much effort as the previous one, and often more.

In other words, getting the error from \(0.01\) to \(0.001\) is often much easier than getting the error from \(0.001\) to \(0.0001\), and it only gets worse from there.
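
For a rough sense of why, consider the standard realizable PAC sample-complexity bound for a finite hypothesis class, \(m \ge \frac{1}{\epsilon}\left(\ln|H| + \ln\frac{1}{\delta}\right)\). The sketch below is purely illustrative; the hypothesis-class size and confidence level are made-up values:

```python
import math

# Realizable PAC bound for a finite hypothesis class H:
#   m >= (1/eps) * (ln|H| + ln(1/delta))
# samples suffice to reach error <= eps with probability >= 1 - delta.
# H_SIZE and DELTA are made-up values chosen only for illustration.
H_SIZE = 10 ** 6
DELTA = 0.01

def sufficient_samples(eps: float) -> int:
    """Samples sufficient to reach error eps under the bound above."""
    return math.ceil((math.log(H_SIZE) + math.log(1.0 / DELTA)) / eps)

for eps in (1e-2, 1e-3, 1e-4, 1e-5):
    print(f"target error {eps:g}: ~{sufficient_samples(eps):,} samples")
# Each additional order of magnitude of accuracy costs roughly 10x the
# data, plus the compute to process it: the Last Inch, in numbers.
```

Even under this optimistic bound, the data requirement scales like \(1/\epsilon\), and in practice the empirical learning curve is usually worse.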

Examples

What are some examples of this phenomenon?

Autonomous vehicles

The first one that comes to mind, and perhaps the most prominent, is Self-driving cars. All the way back in 2007, the DARPA Urban Challenge showed that while machine learning wasn't ready yet, it would be soon. After the wave of GPU-driven Deep Learning results from 2012 or so onward, suddenly it was time to ask: are we there yet? Most of the major self-driving organizations began at about that time.

So are we there yet? It's 2023, and while there are some vehicles on the road, they're not exactly trusted yet. Cruise, one of the oldest, recently had a scandal wherein it was discovered that many of their vehicles were covertly being driven by remote safety drivers, and now it seems they're taking their cars off the road. Uber sold its self-driving research unit to Aurora in 2020, and in 2022 it started operating a self-driving service in Las Vegas through a different provider altogether, with safety drivers present. Just this week, a newcomer, Vay, started operating driverless cars in Las Vegas, but it's likely only a matter of time before the accidents start to happen. Elon Musk has famously been insisting since about 2016 that FSD is just a few months away.

The point is that this is exactly the rate of progress one would expect, given the Last Inch Problem. And if you think the Last Inch is hard, just wait until you see the Last Millimeter…

LLMs

Unless you're living under a rock, you've probably been using ChatGPT for tedious, low-stakes tasks, such as drafting emails or getting a quick summary of a topic that you'd like to know about. If you've used it long enough, you've probably reached a question it doesn't know the answer to, and instead of answering, it simply repeats the premises of the question, without admitting that it doesn't know. Or perhaps you've seen it produce a Hallucination[6], and if you've pondered the implications of that, then you've probably started checking everything it says, thoroughly[7]. GPT-4, a vastly more expensive model, has been definitively shown to do that… less.

And there it is, another Last Inch problem. OpenAI is known to be working on a successor, GPT-5, but little is known about the timeline or expected capabilities. Clearly they want it to be more than just an incremental improvement, given what it will likely cost to train, but for that to happen, they will have to solve the Last Inch problem, and that will almost certainly require rethinking how to train Deep Nets.

As with autonomous vehicles, there has to be a level of accuracy that can be considered “good enough”, so it's still an open question whether a solution may be reached through sheer stubbornness and piles of cash. Claude is designed for legal and business questions, and indeed it does appear to be competitive with humans on many tasks[8]:

our latest model scored \(76.5\%\) on the multiple choice section of the Bar exam, up from \(73.0\%\) with Claude 1.3. When compared to college students applying to graduate school, Claude 2 scores above the \(90^{\text{th}}\) percentile on the GRE reading and writing exams, and similarly to the median applicant on quantitative reasoning.

To be clear, I am not suggesting LLMs are no good for anything – quite the opposite. But here's the thing – you have to check their work, and to do that you have to know something about the work yourself. In other words, you have to be the safety driver.

What to do about it

So the Last Inch problem is pretty tough. For now, the Big Money is betting that they can just slog their way through it until what they have is good enough. For that matter, if there were an Efficient or Effective solution to the problem[9], then many of the fears of societal disruption would suddenly come to a head. Drivers and admin assistants are just the tip of the iceberg – individually, those are two of the largest job categories at risk, but many other categories of work would also be within range. So in the short term, this problem affords an Avoidance strategy, and perhaps it's a blessing if it plays out slowly enough for Society to adapt[10].

As for myself, I think there are likely to be positive unknown unknowns – LLMs and other intelligent “agents” will become a powerful tool for whoever uses them, but they will ultimately need someone to decide what questions need answering, and figuring out the right questions will always be a harder problem than answering them.

That’s it. That’s the idea.

Comments

Comments can be left on twitter and mastodon, as well as below, so have at it.


Footnotes:

[1] Statistics can prove that a given coin is probably fair enough, and this is related to the Probably approximately correct theory of learning.

[2] At least, not by flipping it \(10,000\) times.

[3] What those might be will be a topic for the future. Today, we're only interested in making a negative point. That said, something like the Scientific Method might be appropriate. That is, there has to be a way to entertain a model that makes homogeneous assertions, as long as it remains useful, and then reject it later. Decision Trees do something like that, but they rarely have a back-tracking mechanism for revising earlier splits. Secondly, they operate directly on the observed domain rather than a latent/derived domain. Thirdly, they probably aren't sufficient for general intelligence, because for that one does need statistical reasoning for some things.

[4] This is a similar, but more specific, claim than the one made by Black Swan theory – we are going a step further in saying that in some cases a “Black Swan” can be found arbitrarily close to a white one. OTOH, perhaps the rare, high-cost, difficult-to-predict events we are thinking of don't quite meet the technical definition of “Black Swans”.

[5] Hostile situations are interesting because those are the times when communicating costs is especially important.

[6] Or perhaps technically it was a Confabulation.

[7] If you're not, shame on you.

[8] Note that quantitative reasoning has some of the sharpest teeth among Last Inch problems.

[9] For the record, I strongly suspect there will be no easy solution as long as additive loss functions are the norm. Any kind of Consistency loss, i.e. a loss term that looks at pairs or tuples of examples, might be a step in the right direction. Simple Regularization probably won't be sufficient, but the right kind of Structural Regularization might implicitly have a non-additive effect.

[10] There is an Accelerationist counterpoint, which seems rational on the surface. But it keeps some rather questionable company, some of whom have entirely unsavory ideals.