Lawrence R. De Geest

Everything is a line

At least from close range. Walk around the Earth and you walk around a sphere, but each individual step is a line, calculating the gradient of the planet. Watch a baseball shoot out of the ballpark and your gaze follows a line. The list goes on. Locally, the world is linear.

Local linearisation was on my brain the other night. I was teaching a Causal Machine Learning course at the University of San Francisco, and we got to talking about how all of stats is a linear model. Slightly dramatic take, but then again, a logistic regression is a linear model passed through a link function,¹ and neural networks are linear models passed through a series of link or “activation” functions.² Cut through the flesh and bone of ChatGPT and inside you will find the beating heart of a linear model.³
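To make that concrete, here is a tiny NumPy sketch of the claim (the weights are made up, nothing is trained): a logistic regression is a linear map pushed through a sigmoid link, and a small neural network is the same move, stacked.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))            # a toy design matrix
w, b = rng.normal(size=3), 0.1         # made-up coefficients, nothing is trained

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Logistic regression: a linear model, X @ w + b, pushed through a link function
p_logit = sigmoid(X @ w + b)

# A tiny neural network: the same move, stacked (linear map, activation, repeat)
W1, b1 = rng.normal(size=(3, 4)), np.zeros(4)
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)
hidden = np.maximum(0, X @ W1 + b1)    # linear map, then ReLU "activation"
p_net = sigmoid(hidden @ W2 + b2)      # another linear map, then another link
```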

Then I got home and realized I had forgotten an even older and even more obvious example of local linearisation: the Taylor Series Approximation, which arrived on the scene over 200 years before the Neural Network. Take some unwieldy, non-linear function. You can approximate it by cutting it up into little bits and connecting those bits with lines.

More formally, those little bits in the Taylor Approximation are polynomials, stemming from the function’s derivatives evaluated at an arbitrary point:

\[f(x) = f(a) + f'(a)(x - a) + \frac{f''(a)}{2!}(x - a)^2 + \frac{f'''(a)}{3!}(x - a)^3 + \cdots\]

The “order” of an expansion is just the number of derivative terms you keep. (First order keeps one derivative, second order keeps two, and so on.) So as long as you can keep taking derivatives, you can get better and better approximations of that non-linear function. For example, here is a Taylor Approximation of a sine wave (code):
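A bare-bones sketch of the idea (not the linked code): expand $\sin$ around $a = 0$ and watch the maximum error shrink as the order grows.

```python
import numpy as np
from math import factorial

def taylor_sin(x, order, a=0.0):
    """Taylor expansion of sin around a, truncated at the given order."""
    # The derivatives of sin cycle: sin, cos, -sin, -cos, sin, ...
    derivs = [np.sin(a), np.cos(a), -np.sin(a), -np.cos(a)]
    return sum(derivs[k % 4] * (x - a) ** k / factorial(k) for k in range(order + 1))

x = np.linspace(-np.pi, np.pi, 200)
for order in (1, 3, 5, 7):
    err = np.max(np.abs(taylor_sin(x, order) - np.sin(x)))
    print(f"order {order}: max error on [-pi, pi] = {err:.4f}")
```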

Very cool, very cool. Then I realized something else: Taylor Approximations and Neural Networks are connected. (No pun intended.) Neural Networks use backpropagation to estimate parameters (i.e., to “train”), backpropagation feeds its gradients to gradient descent, and gradient descent is just a first-order Taylor Approximation.

Say our function is $f(x)$, and it’s tough to optimize because it’s so non-linear. In gradient descent we take little steps along the negative of the gradient, the vector of derivatives, the function’s “map” to the optimum. Consider a point $x$, a nearby point $x'$, and let $\Delta x = x' - x$. We can do a first-order Taylor Approximation around $x$:

\[f(x + \Delta x) \approx f(x) + f'(x) \Delta x\]

but depending on the size of $\Delta x$ this expansion may or may not help us get to the optimum. How big or small should we set $\Delta x$? Extremely small steps are inefficient (imagine walking up a hill one toe at a time), so we would like to take steps as big as possible. But the linear approximation only holds close to $x$, so we add a penalty term $\vert \vert \Delta x \vert \vert^2$ (the squared L2 norm) to punish steps that are too big, and a weighting term $\frac{1}{2 \alpha}$ to regulate the penalty. That gives us an objective function:

\[\text{argmin}_{\Delta x} \quad f(x) + f'(x) \Delta x + \frac{1}{2 \alpha} \vert \vert \Delta x \vert \vert^2\]

and if we take the first-order condition, $f'(x) + \frac{1}{\alpha} \Delta x = 0$, and solve, we get

\[\Delta x = -\alpha f'(x)\]

which is simply gradient descent (with $\alpha$ the step-size, a tunable parameter)!
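In code, that update rule is a short loop. A minimal sketch, with a made-up non-linear function, starting point, and step size (none of which come from anything above):

```python
# Gradient descent as a repeated first-order Taylor step: delta_x = -alpha * f'(x)
def f(x):
    return x**4 - 3 * x**2 + x         # an illustrative non-linear function

def f_prime(x):
    return 4 * x**3 - 6 * x + 1

x, alpha = 2.0, 0.05                   # starting point and step size (tunable)
for _ in range(100):
    x = x - alpha * f_prime(x)         # take the step delta_x = -alpha * f'(x)

print(f"gradient descent lands at x = {x:.4f}, f(x) = {f(x):.4f}")
```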

Continuing this line of thinking (pun intended), if you take a second-order approximation and re-write the regularized objective function

\[\text{argmin}_{\Delta x} \quad f(x) + f'(x) \Delta x + \frac{1}{2} f''(x) \Delta x^2 + \frac{1}{2 \alpha} \vert \vert \Delta x \vert \vert^2\]

and solve, you end up with

\[\Delta x = - \frac{f'(x)}{f''(x) + \frac{1}{\alpha}}\]

which is a regularized form of Newton’s Method.
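And the same toy problem with the regularized Newton step, again just a sketch:

```python
# Regularized Newton: delta_x = -f'(x) / (f''(x) + 1/alpha), same toy function as above
def f(x):
    return x**4 - 3 * x**2 + x

def f_prime(x):
    return 4 * x**3 - 6 * x + 1

def f_double_prime(x):
    return 12 * x**2 - 6

x, alpha = 2.0, 0.5                    # starting point and regularization weight
for _ in range(50):
    x = x - f_prime(x) / (f_double_prime(x) + 1 / alpha)

print(f"regularized Newton lands at x = {x:.4f}, f(x) = {f(x):.4f}")
```

The extra $\frac{1}{\alpha}$ in the denominator is what the regularization buys you: it damps the step when the curvature $f''(x)$ is small, where a plain Newton step $-\frac{f'(x)}{f''(x)}$ would blow up.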

There is something comforting about the idea that life is locally linear. People always say you gotta take things one step at a time.

  1. Ditto Poisson regression and other Generalized Linear Models. 

  2. Hilariously, to me at least, the Wikipedia page has a short section on “Linear Neural Networks” and attributes the first one in recorded history to Gauss – because it’s just linear regression. 

  3. Daniel Roelfs and Jonas Kristoffer Lindeløv wrote great posts about the ubiquity of the linear model. And there is a terrific Cross Validated post explaining how local linearisation drives the Cramér-Rao Lower Bound, which essentially draws a limit to what we can discover and know about the world, as nicely argued by Michael R. Powers.