The third "option" is numeric differentiation, but for N latent parameter inputs this requires (N+1) forward evaluations: N of the function f(x1,x2,..., xi + delta, ..., xN) and 1 reference evaluation at f(x1, ..., xN). Picking a smaller delta makes it closer to a real gradient assuming infinite precision, but in practice there will be irregular rounding near the pseudo "infinitesimal" values of real world floats; alternatively take delta big enough, but then its no longer the theoretical gradient.
So symbolic differentiation was destined to fail due to ever increasing symbolic expression length (the chain rule).
Numeric differentiation was destined to fail due to imprecise gradient computation and huge amounts (N+1, many billions for current models) of forward passes to get a single (!) gradient.
AD gives the theoretically correct result with a single forward and backward pass (as opposed to N+1 passes), without requiring billions of passes, or lots of storage to store strings of formulas.
I simply do not agree that you are making a real distinction and I think comments like "If you don't understand why, you don't understand AD" are rude.
AD is just simple application of the pushbacks/pullforwards from differential geometry that are just the chain rule. It is important to distinguish between a mathematical concept and a particular algorithm/computation for implementing it. The symbolic manipulation with an 'exponentially growing nested function' is a particular way of applying the chain rule, but it is not the only way.
The problem you describe with symbolic differentiation (exponential growth of expressions) is not inherent to symbolic differentiation itself, but to a particular naïve implementation. If you represent computations as DAGs and apply common subexpression elimination, the blow-up you mention can be avoided. In fact, forward- and reverse-mode AD can be viewed as particular algorithmic choices for evaluating the same derivative information that symbolic differentiation encodes. If you represent your function as a DAG and propagate pushforwards/pullbacks, you’ve already avoided swell
Some ask: "Isn't backpropagation just the chain rule of Leibniz (1676) [LEI07-10] & L'Hopital (1696)?" No, it is the efficient way of applying the chain rule to big networks with differentiable nodes—see Sec. XII of [T22][DLH]). (There are also many inefficient ways of doing this.) It was not published until 1970 [BP1].
The article says that but it's overcomplicating to the point of being actually wrong. You could, I suppose, argue that the big innovation is the application of vectorization to the chain rule (by virtue of the matmul-based architecture of your usual feedforward network) which is a true combination of two mathematical technologies. But it feels like this and indeed most "innovations" in ML is only considered as such due to brainrot derived from trying to take maximal credit for minimal work (i.e., IP).
I have a question that's bothered me for quite a while now. In 2018, Michael Jordan (UC Berkeley) wrote a rather interesting essay - https://medium.com/@mijordan3/artificial-intelligence-the-re... (Artificial Intelligence — The Revolution Hasn’t Happened Yet)
In it, he stated the following:
> Indeed, the famous “backpropagation” algorithm that was rediscovered by David Rumelhart in the early 1980s, and which is now viewed as being at the core of the so-called “AI revolution,” first arose in the field of control theory in the 1950s and 1960s. One of its early applications was to optimize the thrusts of the Apollo spaceships as they headed towards the moon.
I was wondering whether anyone could point me to the paper or piece of work he was referring to. There are many citations in Schmidhuber’s piece, and in my previous attempts I've gotten lost in papers.
Perhaps this:
Henry J. Kelley (1960). Gradient Theory of Optimal Flight Paths.
[1] https://claude.ai/public/artifacts/8e1dfe2b-69b0-4f2c-88f5-0...
Thanks! This might be it. I looked up Henry J. Kelley on Wikipedia, and in the notes I found a citation to this paper from Stuart Dreyfus (Berkeley): "Artificial Neural Networks, Back Propagation and the Kelley-Bryson Gradient Procedure" (https://gwern.net/doc/ai/nn/1990-dreyfus.pdf).
I am still going through it, but the latter is quite interesting!
Count another in the win column for the USA's heavy investment into basic sciences during the space race.
So sad to see the current state. Hopefully we can turn it around.
It is in Applied Optimal Control by Bryson and Ho (1969). Yann LeCun acknowledges this in his 1989 paper on backpropagation: https://new.math.uiuc.edu/MathMLseminar/seminarPapers/LeCunB....
> "Since his first work on the subject, the author has found that A. Bryson and Y.-C. Ho [Bryson and Ho, 1969] described the backpropagation algorithm using Lagrange formalism. Although their description was, of course, within the framework of optimal control rather than machine learning, the resulting procedure is identical to backpropagation."
See Widnall's overview here which discusses some of the ground that crosses over with what has come to be known as backpropagation:
The Minimum-Time Thrust-Vector Control Law in the Apollo Lunar-Module Autopilot (1970)
https://www.sciencedirect.com/science/article/pii/S147466701...
I found this, maybe it helps: https://gwern.net/doc/ai/nn/1986-rumelhart-2.pdf
Apologies - I should have been clear. I was not referring to Rumelhart et al., but to pieces of work that point to "optimizing the thrusts of the Apollo spaceships" using backprop.
Kelley 1960 (the gradient/adjoint flight‑path paper) https://perceptrondemo.com
AIAA 65‑701 (1965) “optimum thrust programming” for lunar transfers via steepest descent (Apollo‑era) https://arc.aiaa.org/doi/abs/10.2514/6.1965-701
Meditch 1964 (optimal thrust programming for lunar landing) https://openmdao.github.io/dymos/examples/moon_landing/moon_...
Smith 1967 & Colunga 1970 (explicit Apollo‑type trajectory/re‑entry optimization using adjoint gradients) https://ntrs.nasa.gov/citations/19670015714
One thing AI has been great for, recently, has been search for obscure or indirect references like this, that might be one step removed from any specific thing you're searching for, or if you have a tip-of-the-tongue search where you might have forgotten a phrase, or know you're using the wrong wording.
It's cool that you can trace the work of these rocket scientists all the way to the state of the art AI.
I don't know if there is a particular paper exactly, but Ben Recht has a discussion of the relationship between techniques in optimal control that became prominent in the 60's, and backpropagation:
https://archives.argmin.net/2016/05/18/mates-of-costate/
Rumelhart et al wrote "Parallel Distributed Processing"; there's a chapter where he proves that the backprop algorithm maximizes "harmony", which is simply a different formulation of error minimization.
I remember reading this book enthusiastically back in the mid 90s. I don't recall struggling with the proof; it was fairly straightforward. (I was in my senior year of high school at the time.)
They're probably talking about Kalman Filters (1961) and LMS filters (1960).
To be fair, any multivariable regulator or filter (estimator) that has a quadratic component (LQR/LQE) will naturally yield a solution similar to backpropagation when an iterative algorithm is used to optimize its cost or error function through a differentiable tangent space.
So yeah, this was what I was thinking for a while. What about a more nonlinear estimator? Intuitively seems similar to me.
I believe the reason it works in nonlinear cases is that the derivative is “naturally linear” (to calculate the derivative, you are considering ever smaller regions where the cost function is approximately linear - exactly “how nonlinear” the cost function is elsewhere doesn’t play a role).
that makes a lot of sense actually. thank you.
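To make the LQR connection above concrete, here is a minimal sketch (all matrices, dimensions, and step sizes are made-up toy values, not from any cited work) of computing the gradient of a finite-horizon quadratic cost through linear dynamics with a backward costate sweep, then doing plain gradient descent on the controls. The backward recursion is structurally the same thing backpropagation does through a network's layers.

```python
import numpy as np

# Toy finite-horizon problem: x_{t+1} = A x_t + B u_t,
# J = sum_t (x_t' Q x_t + u_t' R u_t) + x_T' Qf x_T.
rng = np.random.default_rng(0)
n, m, T = 3, 2, 5                          # state dim, control dim, horizon
A = 0.3 * rng.normal(size=(n, n))
B = rng.normal(size=(n, m))
Q, R, Qf = np.eye(n), 0.1 * np.eye(m), np.eye(n)
x0 = rng.normal(size=n)
u = rng.normal(size=(T, m))                # open-loop controls to optimize

def cost_and_grad(u):
    # Forward pass: simulate the dynamics and record every state.
    xs = [x0]
    for t in range(T):
        xs.append(A @ xs[-1] + B @ u[t])
    J = sum(x @ Q @ x + ut @ R @ ut for x, ut in zip(xs[:-1], u)) + xs[-1] @ Qf @ xs[-1]
    # Backward pass: costate lam_t = dJ/dx_t, swept from the final state back,
    # much like backprop sweeping from the loss back through the layers.
    lam = 2 * Qf @ xs[-1]
    grad = np.zeros_like(u)
    for t in reversed(range(T)):
        grad[t] = 2 * R @ u[t] + B.T @ lam   # dJ/du_t
        lam = 2 * Q @ xs[t] + A.T @ lam      # dJ/dx_t
    return J, grad

# Plain gradient descent on the control sequence.
for _ in range(300):
    J, g = cost_and_grad(u)
    u = u - 0.02 * g
print("final cost:", J)
```

The backward sweep is exactly the "mates of costate" idea discussed in the linked Recht post: adjoint variables carry the sensitivity of the cost backwards through the dynamics, just as gradients are carried backwards through layers.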
it's rude to show people your llm output
Wow, this is the first time I'm hearing such a thing. For clarity:
I pasted the output so a ton of people wouldn't repeat the same question to ChatGPT and burn a ton of CO2 to get the same answer.
I didn't paste the query since I didn't find it interesting.
And I didn't fact check because I didn't have the time. I was walking and had a few seconds to just do this on my phone.
Not sure how this was rude, I certainly didn't intend it to be...
The 'it's rude to show your ai(llm) output' is a reference to this: https://distantprovince.by/posts/its-rude-to-show-ai-output-...
What is "this" exactly? Is it a well-known author or website? Or otherwise a reference that one should be familiar with? It looks like a random blog to me... with an opinion declared as fact that's quite easy to refute.
It got 321 points here: https://news.ycombinator.com/item?id=44617172
“This” is also the novel Blindsight.
Why?
Because it is terribly low-effort. People are here for interesting and insightful discussions with other humans. If they were interested in unverified LLM output… they would ask an LLM?
Who cares if it is low effort? I got lots of upvotes for my link to Claude about this, and pncnmnp seems happy. The downvoted comment from ChatGPT was maybe a bit spammy?
> Who cares if it is low effort?
It's a weird thing to wonder after so many people expressed their dislike of the upthread low-effort comment with a down vote (and then another voiced a more explicit opinion). The point is that a reader may want to know that the text they're reading is something a human took the time to write themselves. That fact is what makes it valuable.
> pncnmnp seems happy
They just haven't commented. There is no reason to attribute this specific motive to that fact.
> The point is that a reader may want to know that the text they're reading is something a human took the time to write themselves.
The reader may also simply want information that helps them.
> They just haven't commented.
Yes, they did.
> The reader may also simply want information that helps them.
The reader will generally want at least a cursory verification that it is information that helps, which dataflow didn't try to do.
Especially when you're looking for specific documents and you don't check if the documents are real. (dataflow's third one doesn't appear to be.)
This I agree with completely.
Yours was a little bit more useful; you essentially used the LLM as a search engine to find a real article, right?
Directly posting the random text generated by the LLM is more annoying. I mean, they didn't even vouch for it or verify that it was right.
I don't think it's rude, it saves me from having to come up with my own prompt and wade through the back and forth to get useful insight from the LLMs, also saves me from spending my tokens.
Also, I quite love it when people clearly demarcate which part of their content came from an LLM and specify which model.
The little citation carries a huge amount of useful information.
The folks who don't like AI should like it too, as they can easily filter the content.
> ... first arose in the field of control theory in the 1950s and 1960s. One of its early applications was to optimize the thrusts of the Apollo spaceships as they headed towards the moon.
I think "its" refers to control theory, not backpropagation.
Whatever the facts, the OP comes across as sour grapes. The author, Jürgen Schmidhuber, believes Hopfield and Hinton did not deserve their Nobel Prize in Physics, and that Hinton, Bengio, and LeCun did not deserve their Turing Award. Evidently, many other scientists disagree, because both awards were granted in consultation with the scientific community. Schmidhuber's own work was, in fact, cited by the Nobel Prize committee as background information for the 2024 Nobel.[a] Only future generations of scientists, looking at the past more objectively, will be able to settle these disputes.
[a] https://www.nobelprize.org/uploads/2024/11/advanced-physicsp...
For what it's worth, it's a very mainstream opinion in the physics community that Hinton did not at all deserve a Nobel Prize in Physics for his work. But that's because his work wasn't impactful at all to the physics community.
I think Hinton himself has made that observation.
In a recent talk he made a quip that he had to change some slides because if you have a Nobel prize in physics you should at least get the units right.
Honest person would have rejected it and protected the prize's honour
That's a joke, right? Turning down community recognition and a million dollars to make an unclear statement about which category the prize was awarded in?
The argument had to do with honesty while your justification is that money and popularity are worth more than being honest.
Now perhaps Hinton does deserve the award, but certainly it should not be because of the reasons you cite: money and popularity.
I said nothing of the sort. Being "honest" does not mean you have to give a middle finger to a panel that nominated you. The point of that Nobel was clearly recognition for their achievement; the category choice was mainly irrelevant.
No, being honest does not mean that, and had you said that, I'd have no basis upon which to object to your comment.
You refuted an argument about being honest about accepting an award on the basis that the award pays a lot of money and grants one a great deal of popularity.
If your argument didn't involve money and popularity, then why did you choose those two specific criteria as the justification for accepting this award?
I want to be clear, I am not claiming that Dr. Hinton accepted the award in a dishonest manner or that he did it for money, I am simply refuting your position that money is a valid reason to disregard honesty for accepting a prestigious award.
So we agree; you aren't claiming the award was accepted in a dishonest manner, and I never claimed anything about honesty being an issue. I simply found the idea of Hinton rejecting the award for the "honour of the Nobel [choice of category]" to be a silly idea.
Your indignation here seems a bit unwarranted. You definitely did bring the arguments about recognition and money into play and then basically gaslit someone for responding to them. You could have simply said that you misspoke.
It's up to the committee to protect that honour
Yeah, I agree with that. When I first saw the announcement, my immediate thought was something like "huh, the Nobel guys sure want to make sure they've given an award for something related to AI which they can describe as foundational." However, I do think Hinton, Bengio, and LeCun deserved their Turing Award.
> wasn't impactful at all to the physics community
There are two reasons why Hinton got the prize.
A good majority of modern physics research depends on ML in some aspect. Look at the list of talks at any physics conference and count the number of talks that mention ML in the title.
And the 'physics community' has not produced any fundamental physics for a while. Look at the last several years of physics Nobel Prizes. The last ten years of prizes fall into two categories: engineering breakthroughs, and confirming important predictions. Both are important, but lacking fundamental physics breakthroughs, they are not clearly ahead in impact compared to ML.
I’m confused why confirming important predictions is considered less impactful than ML in physics. Isn’t experimental confirmation exactly what’s required for a Nobel Prize?
Experimental confirmation of X makes X great physics and X worthy of a nobel prize, not the engineering setup needed for the experimental confirmation.
The setup by itself can also be a general technique that is useful beyond confirming one thing (for example, LIGO). But then, ML is itself a more general technique that has enabled a lot more new physics than one new experiment.
I would couple the experiment and the theory together and treat them both as deserving of the prize, but I'm not sure how it works in practice. As for the general technique of ML: sure, it's important, but it seems to me that it's a tool that can be used in physics, and the specific implementation/use case is the actual thing that's noteworthy, not the general tool. I wouldn't consider a new mathematical theorem by itself to be physics and deserving of a physics prize; I view general ML the same way.
Ideally this would be coupled like you say, but often in physics these are increasingly further apart, often by several decades.
And a large number of predictions being made now are unlikely to be ever confirmed.
At least among friends who are studying physics at university, many have had some kind of ML model as part of their thesis project, like an ML model to estimate early universe background radiation. Whether that's actually useful for the field is another question.
I think the unspoken claim here is that the North American scientific establishment takes credit from other sources and elevates certain personas instead of the true innovators who are overlooked. Arguing that the establishment doesn't agree with this idea is kinda pointless.
> Evidently, many other scientists disagree, because both awards were granted in consultation with the scientific community.
That's not a good argument. They do in fact sometimes give awards with which the "scientific community" disagrees. Schmidhuber actually gave object level arguments on why the official justification for the Turing award contained substantial errors.
granted, this goes way back before the Nobel and isn't limited to the trio above. JS is known for frequently and publicly challenging anyone who presents anything on neural networks. He was pestering Ian Goodfellow about who did GANs first in a NEURIPS tutorial in 2016, amongst others
Didn't click the article, came straight to the comments thinking "I bet it's Schmidhuber being salty."
Some things never change.
Do yourself a big favour and read the article before commenting, perhaps?
Hint: Schmidhuber has amassed solid evidence over years of digging.
Who didn't? Depending on exactly how you interpret the notion of "inventing backpropagation" it's been invented, forgotten, re-invented, forgotten again, re-re-invented, etc, about 7 or 8 times. And no, I don't have specific citations in front of me, but I will say that a lot of interesting bits about the history of the development of neural networks (including backpropagation) can be found in the book Talking Nets: An Oral History of Neural Networks[1].
[1]: https://www.amazon.com/Talking-Nets-History-Neural-Networks/...
I think the move towards GPU-based computing is probably more significant - the constraints put in place by GPU programming (no branching, try not to update tensors in place, etc.) sync up with the constraints put in place by differentiable programming.
Once people had a sufficiently compelling reason to write differentiable code, the frameworks around differentiable programming (theano, tensorflow, torch, JAX) picked up a lot of steam.
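For readers who have not used one of those frameworks, this is roughly what "writing differentiable code" looks like in JAX (the tiny model, data, and names below are made-up for illustration): a pure function with no in-place updates or data-dependent branching, from which the framework derives the gradient, i.e. backprop, automatically.

```python
import jax
import jax.numpy as jnp

# A pure, branch-free loss function: no in-place tensor updates, so the
# framework can trace it and build the reverse-mode gradient for us.
def loss(w, X, y):
    pred = jnp.tanh(X @ w)              # tiny one-layer "model"
    return jnp.mean((pred - y) ** 2)

X = jnp.ones((4, 3))                    # made-up data, only the shapes matter
y = jnp.zeros(4)
w = jnp.array([0.1, -0.2, 0.3])

grad_loss = jax.grad(loss)              # reverse-mode AD, i.e. backprop
print(grad_loss(w, X, y))               # gradient of the loss w.r.t. w
```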
don't undergrad adaptive filters count?
https://en.wikipedia.org/wiki/Adaptive_filter
doesn't need a differentiation of the forward term, but if you squint it looks pretty close
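For context, the LMS update being alluded to looks something like the sketch below (the signal, tap count, and step size are invented). As noted, no derivative of a forward model is taken explicitly, yet the update mu * error * window is exactly a stochastic gradient step on the squared error of a single linear layer, which is why it looks so close if you squint.

```python
import numpy as np

def lms_filter(x, d, num_taps=4, mu=0.01):
    """Least-mean-squares adaptive filter: adapt weights w so that the
    filtered input tracks the desired signal d."""
    w = np.zeros(num_taps)
    y = np.zeros(len(x))
    for n in range(num_taps, len(x)):
        window = x[n - num_taps:n][::-1]   # most recent samples first
        y[n] = w @ window                  # filter output
        e = d[n] - y[n]                    # error against the desired signal
        w += mu * e * window               # LMS update = one SGD step on e^2
    return w, y

# Made-up test: recover a clean sine from a noisy observation of it.
t = np.arange(2000)
d = np.sin(0.05 * t)                       # desired signal
x = d + 0.3 * np.random.randn(len(t))      # noisy observed signal
w, y = lms_filter(x, d)
print("learned taps:", w)
```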
How do you not have the citations in front of you? They are all in the article? I don't expect that any relevant (re)invention of backprop is missing there. Or, if you really know some reinvention of backprop that is not mentioned here, tell Jürgen Schmidhuber, he is actually very curious to learn about other such instances that he is not aware of yet.
> They are all in the article?
Maybe they are. I'm not here to do a deep research project that involves reading every citation in that article. If it makes you feel better, pretend that what I said was instead:
"I don't have all the relevant citations stored in my short-term memory right this second and I am not interested in writing a lengthy thesis to satisfy pedantic navel-gazers on HN."
> Or, if you really know some reinvention of backprop that is not mentioned here,
WTF are you on about? I never made any such claim, or anything remotely close to it.
I thought that is what you meant when you said "a lot of interesting bits about the history of the development of neural networks (including backpropagation) can be found in the book Talking Nets", that there is some relevant reference to backprop which is missing from the linked article.
I don't really understand your negativity here, and what you are reading into my comment. I never asked you to do a research project? I just thought you might know some other references which are not in the article. If you don't, fine.
Note that I don't expect that any relevant reference is missing here. Schmidhuber always tries to be very careful to completely and exhaustively cite everything there is on a topic. That is why I was doubly curious about the possibility that something is missing, and what it could be.
> I thought that is what you meant when you said "a lot of interesting bits about the history of the development of neural networks (including backpropagation) can be found in the book Talking Nets", that there is some relevant reference to backprop which is missing from the linked article.
Nah, I wasn't trying to imply that that book had anything more than the article, at least in regards to the backprop question specifically. Just pointing it out as one more good resource for this kind of historical perspective.
> I don't really understand your negativity here, and what you are reading into my comment. I never asked you to do a research project? I just thought you might know some other references which are not in the article. If you don't, fine.
No worries. I may be reacting more to a general HN meme than to you in particular. There's a certain brand of pedantry and obsessive nit-picking that is all too common here IMO. It grates on my nerves, so if I ever seem a little salty, it's probably because I thought somebody was doing that thing. It's all good. My apologies for the argumentative tone earlier.
> Schmidhuber always tries to be very careful to completely and exhaustively cite everything there is on a topic.
Agreed. That's one reason I don't get why people are always busting on Jurgen. For the most part, it seems that he can back up the claims he makes, and then some. I've heard plenty of people complain about him, but I'm not sure any of them have ever been able to state any particular sense in which he is actually wrong about anything. :-)
As it is stated, I always thought it came from formulations like Euler-Lagrange procedures in mechanics used in numeric methods for differential geometry. In fact when I recreated the algorithm as an exercise it immediately reminded me of gradient descent for kinematics, with the Jacobian calculation for each layer similar to an iterative pose calculation in generalized coordinates. I never thought it was something "novel".
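A minimal sketch of that analogy, with made-up link lengths and target: iteratively adjusting a two-link arm's joint angles by gradient descent on the end-effector error. The chain-rule factor is the arm's Jacobian, playing the same role as the per-layer Jacobians in backprop.

```python
import numpy as np

# Two-link arm: end-effector position as a function of the joint angles.
L1, L2 = 1.0, 0.8

def fwd(theta):
    t1, t2 = theta
    return np.array([L1 * np.cos(t1) + L2 * np.cos(t1 + t2),
                     L1 * np.sin(t1) + L2 * np.sin(t1 + t2)])

def jacobian(theta):
    t1, t2 = theta
    return np.array([[-L1 * np.sin(t1) - L2 * np.sin(t1 + t2), -L2 * np.sin(t1 + t2)],
                     [ L1 * np.cos(t1) + L2 * np.cos(t1 + t2),  L2 * np.cos(t1 + t2)]])

target = np.array([1.2, 0.6])
theta = np.array([0.3, 0.3])
for _ in range(2000):
    err = fwd(theta) - target
    # Chain rule: d(|err|^2)/dtheta = 2 J^T err -- the "backprop" step.
    theta = theta - 0.05 * 2 * jacobian(theta).T @ err
print("final end-effector error:", np.linalg.norm(fwd(theta) - target))
```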
My favorite take on this is that yes, in fact it is just the chain rule. The usual argument goes that automatic and symbolic differentiation are fundamentally different, so anything particularly old (pre-computers, for example) doesn't count as inventing back prop. But here's my favorite take on equivalences between AD and symbolic diff [0]. I wish there wasn't such importance placed on who invented it for stuff like this. Clearly, someone codifying backprop wasn't a bottleneck in making ML progress, so why's it get so much attention?
[0] https://emilien.ca/Notes/Notes/notes/1904.02990v4.pdf
The chain rule was explored by Gottfried Wilhelm Leibniz and Isaac Newton in the 17th century. Either of them would have ”invented” backpropagation in an instant. It’s obvious.
Funny enough. For me it was the other way around. I always knew how to compute the chain rule. But really only understood what the chain rule means when I read up on what back propagation was.
That's essentially it. Learning what the chain rule does, and learning what it can be used for, and how to apply it.
Neither is really an invention; they are discoveries. If anything, the chain rule leans slightly more towards invention than backprop.
I understand the need for attribution as a means to track the means and validity of discovery, but I intensely dislike it when people act like it is a deed of ownership of an idea.
You don't think the people who invented the chain rule understood what it means?
Obviously, Newton and Leibniz and many other Mathematicians (and other people) understood the chain rule before back propagation. But unfortunately I am very far from a Newton or Leibniz, so it took me a lot longer to grasp why the chain rule is the way it is. And back propagation just made it click for me. I was really just talking about me personally.
What insight did you gain from back propagation that you didn't have from just the formula of the chain rule?
What clicked for me was drawing the chain rule as a graph. When I was in school I just applied the chain rule without thinking about it. I really didn't mean this to be some deep insight or anything. Just an anecdotal comment.
Ah makes sense, I was thinking there was some deeper insight I was missing. Thanks!
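To make the "chain rule as a graph" picture concrete (the particular functions here are arbitrary): evaluate the composition forwards, keep each node's local derivative, then multiply the local derivatives while walking back from the output. That walk is all backprop does.

```python
import math

# y = f(g(h(x))) with h(x) = x^2, g(u) = sin(u), f(v) = exp(v).
x = 1.3

# Forward pass: evaluate each node and remember its local derivative.
h, dh_dx = x ** 2, 2 * x
g, dg_dh = math.sin(h), math.cos(h)
f, df_dg = math.exp(g), math.exp(g)

# Backward pass: walk the graph from the output back to the input,
# multiplying the local derivatives together.
dy_dg = 1.0 * df_dg
dy_dh = dy_dg * dg_dh
dy_dx = dy_dh * dh_dx

# Same number as the closed-form chain-rule expression.
assert abs(dy_dx - math.exp(math.sin(x ** 2)) * math.cos(x ** 2) * 2 * x) < 1e-12
print(dy_dx)
```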
It's not at all obvious, as the article points out it was assumed for 40 years that backpropagation was not an efficient approach for training neural networks.
It still is not an efficient approach.
> BP's modern version (also called the reverse mode of automatic differentiation)
So... Automatic integration?
Proportional, integrative, derivative. A PID loop sure sounds like what they're talking about.
Reverse mode automatic differentiation is not integration. It's still differentiation, just a different method of calculating the derivative than the one you'd think to do by hand. It basically applies the chain rule in the opposite order from what is intuitive to people.
It has a lot more overhead than regular forward mode autodiff because you need to cache values from running the function and refer back to them in reverse order, but the advantage is that for functions with many inputs and very few outputs (e.g. the classic example is calculating the gradient of a scalar function in a high-dimensional space, as in gradient descent), it is algorithmically more efficient and requires only one pass through the primal function.
On the other hand, traditional forwards mode derivatives are most efficient for functions with very few inputs, but many outputs. It's essentially a duality relationship.
I don't think most people think to do either direction by hand; it's all just matrix multiplication, you can multiply them in whatever order makes it easier.
I'm just talking about the general algorithm to write down the derivative of `f(g(h(x)))` using the chain rule.
For vector valued functions, the naive way you would learn in a vector calculus class corresponds to forward mode AD.
Forward mode automatic differentiation creates a formula for each scalar derivative. If you have a billion parameters you have to calculate each derivative from scratch.
As the name implies, the calculation is done forward.
Reverse mode automatic differentiation starts from the root of the symbolic expression and calculates the derivative for each subexpression simultaneously.
The difference between the two is like the difference between calculating the Fibonacci sequence recursively without memoization and calculating it iteratively. You avoid doing redundant work over and over again.
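A toy sketch of those two comments together (operator coverage is deliberately minimal, and a real implementation would process nodes in topological order): the forward pass caches each intermediate value on a tape of parent links, and a single reverse sweep reuses those cached values to give every input its derivative, rather than rebuilding shared subexpressions once per input.

```python
class Var:
    """Minimal reverse-mode AD value: the forward pass caches, for each node,
    its parents and the local derivative w.r.t. each parent (the "tape");
    backward() then sweeps once from the output back to the inputs."""
    def __init__(self, value, parents=()):
        self.value, self.grad, self.parents = value, 0.0, parents

    def __add__(self, other):
        return Var(self.value + other.value, [(self, 1.0), (other, 1.0)])

    def __mul__(self, other):
        return Var(self.value * other.value,
                   [(self, other.value), (other, self.value)])

    def backward(self):
        self.grad = 1.0
        stack = [self]
        while stack:
            node = stack.pop()
            for parent, local in node.parents:
                parent.grad += local * node.grad   # chain rule on cached values
                stack.append(parent)

x, y = Var(2.0), Var(3.0)
z = x * y + x * x           # z = x*y + x^2
z.backward()
print(x.grad, y.grad)       # dz/dx = y + 2x = 7.0, dz/dy = x = 2.0
```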
There are large bodies of work for optimization of state space control theory that I strongly suspect as a lot of crossover for AI, and at least has very similar mathematical structure.
e.g. optimization of state space control coefficients looks something like training a LLM matrix...
There is indeed a lot of crossover, and a lot of neural networks can be written in a state space form. The optimal control problem should be equivalent to training the weights, as you mention.
However, from what I have seen, this isn't really a useful way of reframing the problem. The optimal control problem is at least as hard, if not harder, than the original problem of training the neural network, and the latter has mature and performant software for doing it efficiently. That's not to say there isn't good software for optimal control, but it's a more general problem and therefore off-the-shelf solvers can't leverage the network structure very well.
Some researchers have made interesting theoretical connections like in neural ODEs, but even there the practicality is limited.
Yes, in most cases the reduction of supervised learning to optimal control is not interesting.
We can also reduce supervised learning to reinforcement learning, but that doesn't mean we should use RL algorithms to do supervised learning.
We can also reduce sorting a list of integers to SAT, but that doesn't mean we should use a SAT solver to sort lists of integers.
The real essence of the piece is that Leibniz did not schmidhuber [0] Seppo Linnainmaa (probably because he was dead at the time). Actually it is a nice piece and I was really happy to get my expectations fulfilled when reading to the very end.
[0] https://www.urbandictionary.com/define.php?term=schmidhubere...
dear lord he's in urban dictionary now?!
Calling the implementation of chain rule "inventing" is most of the problem here.
I've always found it rather crazy that the power of backpropagation and artificial neural networks was doubted by AI researchers for so long. It's really only since the early 2010s that researchers started to take the field seriously. This is despite the core algorithm (backpropagation) being known for decades.
I remember when I learnt about artificial neural networks at university in the late 00s my professors were really sceptical of them, rightly explaining that they become hard to train as you added more hidden layers.
See, what makes backpropagation and artificial neural networks work are all of the small optimisations and algorithm improvements that were added on top of backpropagation. Without these improvements it's too computationally inefficient to be practical and you have to contend with issues like exploding gradients.
I think Geoffrey Hinton has noted a few times that for people like him who have been working on artificial neural networks for years it's quite surprising that today neural networks just work, because for years it was so hard to get them to do anything. In this sense, while backpropagation is the foundational algorithm, it's not sufficient on its own. It was the many improvements made on top of backpropagation that actually made artificial neural networks work and take off in the 2010s, when some of the core components of modern neural networks started to fall into place.
I remember when I first learnt about neural networks I thought maybe coupling them with some kind of evolutionary approach might be what was needed to make them work. I had absolutely no idea what I was doing of course, but I spent so many nights experimenting with neural networks. I just loved the idea of an artificial "neural network" being able to learn a new problem and spit out an answer. The biggest regret of my life was coming out of university and going into web development because there were basically no AI jobs back then, and no such thing as an AI startup. If you wanted to do AI back then you basically had to be a researcher which didn't interest me at the time.
> I remember when I first learnt about neural networks I thought maybe coupling them with some kind of evolutionary approach might be what was needed to make them work.
I did this in an artificial life simulation. It was pretty fun to see the creatures change from pure random bouncing around to movement that helped them get food and move away from something eating them.
My naive vision was all kinds of advanced movement, like hiding around corners for prey, but it never got close to something like that.
As I worked the evolutionary parameters I began to realize more and more that the process of evolving specific advanced traits requires lots of time and (I think) environmental complexity and compartmentalization of groups of creatures.
There are lots of simple/dumb capabilities that help with survival, and they are much, much easier to acquire than a more advanced capability like being aware of other creatures and tracking their movement on the other side of an obstacle.
Apart from backpropagation, the biggest improvements were probably changes in network architecture. Standard feed-forward MLPs are fairly inefficient. Then there were architectures like CNNs, LSTMs, and Transformers. There were also improvements in activation functions and in the gradient descent method (e.g. AdamW), but I'm not sure whether these had as substantial an impact as CNNs or Transformers. Another factor was training on GPUs.
See also: The Backstory of Backpropagation - https://yuxi.ml/essays/posts/backstory-of-backpropagation/
The only surprise here is that Schmidhuber himself didn't claim to invent it lol
Relevant:
Annotated History of Modern AI and Deep Learning - https://people.idsia.ch/~juergen/deep-learning-history.html
Japanese scientists were pioneers of AI, yet they’re being written out of its history - https://theconversation.com/japanese-scientists-were-pioneer...
Good ideas are never invented. They are always rediscovered.
Can we back propagate credit?
TIL that the same Shun'ichi Amari who founded information geometry also made early advances to gradient descent.
Today You Learn that the same Shun'ichi Amari who founded information geometry also made early advances to autodifferentiation.
Iterative gradient descent is much, much older, and was recognized immediately once the gradient itself was defined (regardless of how one computes that gradient).
AD vs { symbolic differentiation, numeric finite "differentiation" } is about the insight of how to compute a numeric gradient efficiently, in terms of both memory (space) and compute (time) requirements.
this fight has become legendary and infamous, and also pops up on HN every 2-3 years
Isn't it just kinda a natural thing once you have the chain rule?
Reverse mode differentiation? No, it can't be that natural, since it took until 1970 to be proposed. But it is also, in a sense, basic (which you could also guess, since it was introduced in an MSc thesis).
yes
When I worked on neural networks, I was taught David Rumelhart.
Despite the common refrain about how different symbolic differentiation and AD are, they are actually the same thing.
Not at all.
There are two main modes of AD: forward mode (optimal when the function being differentiated has more outputs than latent parameter inputs) and reverse mode (optimal when it has more latent parameter inputs than outputs). If you don't understand why, you don't understand AD.
If you understand AD, you'd know why, but then you'd also see a huge difference from symbolic differentiation. In symbolic differentiation, the input is an expression or DAG, and the intermediate variables computed along the way are themselves symbolic expressions (typically built in reverse order, as taught in high school or university, so the expression grows with each deeper nested function, and only at the end are the input coordinates substituted into the final expression to obtain the gradient). In both forward and reverse mode, the intermediate variables are numeric values, not symbolic expressions.
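To illustrate "numeric variables, not symbolic expressions", here is a toy forward-mode AD in plain Python using dual numbers (my own sketch; the Dual class and function names are invented for illustration):

```python
import math

# Forward-mode AD with dual numbers: every intermediate variable is a pair of
# floats (value, derivative), never a symbolic expression.

class Dual:
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot
    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val + other.val, self.dot + other.dot)
    __radd__ = __add__
    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val * other.val,
                    self.dot * other.val + self.val * other.dot)
    __rmul__ = __mul__

def sin(x):
    return Dual(math.sin(x.val), math.cos(x.val) * x.dot)

def f(x, y):
    return sin(x * y) + x * x   # an ordinary numeric program

# d f / d x at (x, y) = (1.5, 2.0): seed x with derivative 1, y with 0
print(f(Dual(1.5, 1.0), Dual(2.0, 0.0)).dot)
# d f / d y: seed y instead; one pass per input, which is why forward mode
# suits functions with few inputs and many outputs
print(f(Dual(1.5, 0.0), Dual(2.0, 1.0)).dot)
```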
The third "option" is numeric differentiation, but for N latent parameter inputs this requires (N+1) forward evaluations: N of the function f(x1,x2,..., xi + delta, ..., xN) and 1 reference evaluation at f(x1, ..., xN). Picking a smaller delta makes it closer to a real gradient assuming infinite precision, but in practice there will be irregular rounding near the pseudo "infinitesimal" values of real world floats; alternatively take delta big enough, but then its no longer the theoretical gradient.
So symbolic differentiation was destined to fail due to ever-increasing symbolic expression length (the chain rule).
Numeric differentiation was destined to fail due to imprecise gradient computation and the huge number of forward passes (N+1, many billions for current models) needed to get a single (!) gradient.
AD gives the theoretically correct result with a single forward and backward pass (as opposed to N+1 passes), without requiring billions of passes, or lots of storage to store strings of formulas.
I simply do not agree that you are making a real distinction and I think comments like "If you don't understand why, you don't understand AD" are rude.
AD is just a simple application of the pushforwards/pullbacks from differential geometry, which are just the chain rule. It is important to distinguish between a mathematical concept and a particular algorithm/computation for implementing it. The symbolic manipulation with an 'exponentially growing nested function' is a particular way of applying the chain rule, but it is not the only way.
The problem you describe with symbolic differentiation (exponential growth of expressions) is not inherent to symbolic differentiation itself, but to a particular naïve implementation. If you represent computations as DAGs and apply common subexpression elimination, the blow-up you mention can be avoided. In fact, forward- and reverse-mode AD can be viewed as particular algorithmic choices for evaluating the same derivative information that symbolic differentiation encodes. If you represent your function as a DAG and propagate pushforwards/pullbacks, you've already avoided expression swell.
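For what it's worth, here is a small sympy sketch of that point (my own, purely illustrative): the naively printed derivative of a nested expression swells, but common subexpression elimination over that same derivative recovers the shared structure.

```python
import sympy as sp

# Symbolic differentiation of a nested expression: the printed derivative swells,
# but CSE over the same derivative recovers the shared (DAG-like) structure.
x = sp.symbols('x')
expr = x
for _ in range(6):
    expr = sp.sin(expr) + expr**2            # deeply nested function of x

d = sp.diff(expr, x)
print("ops in naive derivative:", sp.count_ops(d))

replacements, reduced = sp.cse(d)            # common subexpression elimination
total = sp.count_ops(reduced[0]) + sum(sp.count_ops(rhs) for _, rhs in replacements)
print("ops after CSE:", total)
```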
https://emilien.ca/Notes/Notes/notes/1904.02990v4.pdf
It's always Schmidhuber
Funny that Hinton is not mentioned. Like, how childish can the author be?
What do you mean? This popular paper is cited:
[RUM] DE Rumelhart, GE Hinton, RJ Williams (1985). Learning Internal Representations by Error Propagation.
I tried to verify this, and it isn't true. This is one of the first footnotes:
[HIN] J. Schmidhuber (AI Blog, 2020). Critique of Honda Prize for Dr. Hinton. Science must not allow corporate PR to distort the academic record.
this fight has become legendary and infamous
Great article and research!
It's just an application of the chain rule. It's not interesting to ask who invented it.
From the article:
Some ask: "Isn't backpropagation just the chain rule of Leibniz (1676) [LEI07-10] & L'Hopital (1696)?" No, it is the efficient way of applying the chain rule to big networks with differentiable nodes (see Sec. XII of [T22][DLH]). (There are also many inefficient ways of doing this.) It was not published until 1970 [BP1].
The article says that, but it's overcomplicating to the point of being actually wrong. You could, I suppose, argue that the big innovation is the application of vectorization to the chain rule (by virtue of the matmul-based architecture of your usual feedforward network), which is a true combination of two mathematical technologies. But it feels like this, and indeed most "innovations" in ML, are only considered as such due to brainrot derived from trying to take maximal credit for minimal work (i.e., IP).
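If it helps to see the "vectorised chain rule" concretely, here is a minimal numpy sketch of my own (a single dense layer with a squared-error loss, nothing from the article): the backward pass is just the chain rule arranged as matrix products.

```python
import numpy as np

# Backprop through one dense layer, y = W @ x + b, loss L = 0.5 * ||y - t||^2.
# The vectorised chain rule: dL/dW = (dL/dy) x^T, dL/dx = W^T (dL/dy).
rng = np.random.default_rng(2)
W = rng.standard_normal((3, 4))
b = rng.standard_normal(3)
x = rng.standard_normal(4)
t = rng.standard_normal(3)

y = W @ x + b
dL_dy = y - t                      # derivative of the squared-error loss w.r.t. y
dL_dW = np.outer(dL_dy, x)         # chain rule, arranged as an outer product
dL_db = dL_dy
dL_dx = W.T @ dL_dy                # what gets passed to the previous layer

# Quick numeric check of one entry of dL/dW
eps = 1e-6
W2 = W.copy(); W2[0, 0] += eps
L  = 0.5 * np.sum((W  @ x + b - t) ** 2)
L2 = 0.5 * np.sum((W2 @ x + b - t) ** 2)
print(dL_dW[0, 0], (L2 - L) / eps)   # should agree to several digits
```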
The real metric is whether anyone remembers it in 100 years. Any other discussion just comes off as petty.
You got it right: Leibniz!
SEE ALSO: https://www.stonewright.ai/2023/05/01/picking-apart-the-orig...