Backpropagation and its Alternatives
-By Ravichandra, Computer Vision Researcher, @Sally Robotics.
Since the idea of training neural networks with backpropagation was introduced in 1986, learning via neural networks has never looked back. Training networks suddenly became an efficient process, enabling enormous accomplishments. Backpropagation became the industry standard, and the frameworks built on it (TensorFlow, PyTorch) are used by everyone. But as we tried to train bigger and deeper networks with it, some of its drawbacks came into focus. The main ones are:
1. Vanishing Gradients: Sigmoid and tanh non-linearities tend to saturate and entirely stop learning. This happens when the neuron outputs are at the extremes, causing the gradients to be 0 or close to 0, i.e. vanish. With ReLUs, if the neurons get clamped to 0, then the weights will get zero gradients, leading to what is called the “dead ReLU” problem.
2. Exploding Gradients: In deep networks or RNNs, error gradients can accumulate during an update and result in very large gradients, which in turn produce very large updates to the network weights. In the extreme, the weight values can become so large that they overflow and turn into NaN values.
3. Computationally Expensive: Computing gradients layer by layer is undeniably an expensive process. It makes one wonder whether there is a better way to optimize the loss function.
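The first drawback is easy to demonstrate numerically. The sketch below (plain NumPy, with illustrative random pre-activations) multiplies one sigmoid-derivative factor per layer, just as the chain rule does; since the sigmoid derivative never exceeds 0.25, the product shrinks geometrically with depth:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)

# The sigmoid derivative peaks at 0.25, so the chain rule contributes
# one factor <= 0.25 per layer and the gradient vanishes with depth.
rng = np.random.default_rng(0)
grad = 1.0
for depth in range(30):
    pre_activation = rng.normal()          # illustrative pre-activation value
    grad *= sigmoid_grad(pre_activation)   # each factor is at most 0.25

print(grad)  # astronomically small after only 30 layers
```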
While these remain the main drawbacks, there are others, such as the need to choose hyperparameters, which classify more as annoyances than drawbacks. So, what did we do about it? We designed better network architectures to sidestep these problems. The ResNet architecture (https://arxiv.org/abs/1512.03385) is the best example, where skip connections were introduced in order to avoid vanishing gradients.
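A minimal sketch of the skip-connection idea, assuming a simple fully-connected block in NumPy (the layer sizes and weight scales are illustrative, not taken from the ResNet paper):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def residual_block(x, W1, W2):
    """A ResNet-style block: output = relu(F(x) + x).

    The identity skip connection gives gradients a direct path back
    through the block, which mitigates vanishing gradients in very
    deep networks."""
    out = relu(x @ W1)      # first transformation
    out = out @ W2          # second transformation (pre-activation)
    return relu(out + x)    # skip connection: add the input back

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))           # a batch of 4 activations, width 8
W1 = rng.normal(size=(8, 8)) * 0.1
W2 = rng.normal(size=(8, 8)) * 0.1
y = residual_block(x, W1, W2)
print(y.shape)  # (4, 8) -- same width, so blocks can be stacked
```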
Backpropagation: A Flawed Approach?
Many famous researchers such as Geoffrey Hinton and Yoshua Bengio have expressed their concerns over backprop not being the ideal approach. There exists no evidence that our brain performs backpropagation, and if this is the case, how then can we achieve complete artificial intelligence if we aren’t modelling networks after our own biological minds? The question of a more biologically plausible algorithm still remains and this idea is further explored in this paper: https://arxiv.org/abs/1502.04156.
Hinton’s concerns are more with the existence of another way to learn than by forcing solutions to use supervised data, i.e. unsupervised learning. He is questioning whether the methods he has pioneered and championed over the years will accomplish the original goal for neural networks, specifically, autonomous learning machines. Despite the marked progress of the last few years, we still haven’t cracked the question of how the human brain self-organizes in the absence of fixed external feedback using remarkably sparse data. Cracking this could lead to a common algorithm for supervised, unsupervised, and reinforcement learning alike.
In science, you can say things that seem crazy, but in the long run, they can turn out to be right. We can get really good evidence, and in the end, the community will come around. - Geoffrey Hinton
1. Difference Target Propagation
Backpropagation relies on infinitesimal changes (partial derivatives) in order to perform credit assignment. This can become a serious issue for deeper and more non-linear functions, for example, in the extreme case of non-linearity where the relation between parameters and cost is actually discrete. Citing this as the motivation, this algorithm was developed as an alternative to backpropagation. The main idea is to compute targets, rather than gradients, at each layer. In a way it is just like backpropagation, but much faster. So how is it implemented? Here is a short overview:
a. Formulating Targets
b. Assigning a proper target to each layer
c. Difference Target Propagation
d. Training an auto-encoder with difference target propagation
While this is a highly abstract, high-level description of the method, I have intentionally kept it brief, as delving into the mathematics and equations would unnecessarily lengthen the article. The exact mathematics can be found in this paper: https://arxiv.org/pdf/1412.7525.pdf.
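To make step (c) concrete, here is a minimal NumPy sketch of the core difference target propagation update, target_prev = h_prev + g(target_next) - g(h_next), where g is a learned approximate inverse of the layer. The layer functions, sizes, and top-layer target below are illustrative assumptions, not the paper's exact setup:

```python
import numpy as np

def forward(h, W):
    # hypothetical forward mapping f_i of one layer
    return np.tanh(h @ W)

def inverse(h, V):
    # hypothetical learned (approximate) inverse g_i of that layer
    return np.tanh(h @ V)

def propagate_target(h_prev, h_next, target_next, V):
    """Difference target propagation for one layer:
    target_prev = h_prev + g(target_next) - g(h_next).
    The difference g(target_next) - g(h_next) cancels the reconstruction
    error of the imperfect inverse g: if target_next == h_next, the
    target for the layer below is exactly its current activation."""
    return h_prev + inverse(target_next, V) - inverse(h_next, V)

rng = np.random.default_rng(0)
h_prev = rng.normal(size=(1, 16))
W = rng.normal(size=(16, 16)) * 0.1       # forward weights
V = rng.normal(size=(16, 16)) * 0.1       # inverse-model weights

h_next = forward(h_prev, W)
# e.g. a small corrective step at the top layer stands in for the target
target_next = h_next - 0.1 * rng.normal(size=h_next.shape)
target_prev = propagate_target(h_prev, h_next, target_next, V)
# each layer then trains locally, moving forward(h_prev, W) toward target_next
print(target_prev.shape)  # (1, 16)
```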
2. The HSIC Bottleneck (Hilbert-Schmidt Independence Criterion)
The approach is to train the network by using an approximation of the information bottleneck instead of backpropagation.
The above figure gives an overview of how training is done using HSIC. The HSIC-trained network, Figure (a), is a standard feedforward network trained using the HSIC IB objective, resulting in hidden representations at the last layer that can be trained rapidly. Figure (b) shows the σ-combined network, where each branch (an HSIC-net) is trained with a specific σ.
In the next step, a substitute for the mutual information between hidden representations and labels is found and maximized. This simultaneously minimizes the mutual dependency between hidden representations and the inputs. Thus, each hidden representation from the HSIC-trained network may contain different information obtained by optimizing the HSIC bottleneck objective at a particular scale. Then the aggregator sums the hidden representations to form an output representation. Further information can be found here https://arxiv.org/pdf/1908.01580v1.pdf.
What then are the advantages? The authors claim this facilitates parallel processing, requires significantly fewer operations, and does not suffer from vanishing or exploding gradients. Moreover, it seems more biologically plausible than backpropagation. Upon testing, HSIC-trained networks performed comparably to backpropagation on the MNIST and CIFAR-10 datasets.
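At the heart of the method is the HSIC itself, a kernel-based measure of statistical dependence. Below is a NumPy sketch of the standard biased empirical estimator, HSIC = trace(KHLH)/(m-1)^2, with Gaussian kernels; the sample data and bandwidth are illustrative, and the real method plugs this estimator into the bottleneck objective rather than using it standalone:

```python
import numpy as np

def gaussian_kernel(X, sigma):
    # pairwise squared distances, then the Gaussian (RBF) kernel matrix
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-d2 / (2.0 * sigma**2))

def hsic(X, Y, sigma=1.0):
    """Biased empirical HSIC estimate: trace(K H L H) / (m - 1)^2,
    where H = I - (1/m) 11^T centers the kernel matrices."""
    m = X.shape[0]
    H = np.eye(m) - np.ones((m, m)) / m
    K = gaussian_kernel(X, sigma)
    L = gaussian_kernel(Y, sigma)
    return np.trace(K @ H @ L @ H) / (m - 1) ** 2

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 4))
dependent = X + 0.1 * rng.normal(size=X.shape)   # strongly dependent on X
independent = rng.normal(size=(64, 4))           # independent of X

# statistical dependence shows up as a larger HSIC value
print(hsic(X, dependent), hsic(X, independent))
```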
3. Online Alternating Minimization with Auxiliary Variables
The main contribution of this work is a novel online (stochastic/mini-batch) alternating minimization (AM) approach for training deep neural networks, together with the first theoretical convergence guarantees for AM in stochastic settings. It tackles the optimization problem by breaking gradient chains with auxiliary variables. The work builds upon previously proposed offline methods that break the nested objective into easier-to-solve local subproblems by inserting auxiliary variables corresponding to the activations of each layer. More can be read about this here: https://arxiv.org/pdf/1806.09077.pdf.
So, what is the key takeaway? We can avoid computing gradient chains altogether, which means no more vanishing gradients, no barriers to parallelizing across layers, and no difficulty handling non-differentiable nonlinearities.
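To illustrate how auxiliary variables break the gradient chain, here is a toy NumPy sketch of alternating minimization on a two-layer linear network: an auxiliary matrix A stands in for the hidden activations, and every update is a local least-squares solve, with no gradients chained through layers. The penalty weight rho, the sizes, and the linear layers are illustrative simplifications; the actual method handles nonlinearities and online mini-batch updates:

```python
import numpy as np

def ridge_solve(A, B, lam=1e-3):
    """Closed-form least-squares solve for W in A @ W ~= B, with L2 regularization."""
    d = A.shape[1]
    return np.linalg.solve(A.T @ A + lam * np.eye(d), A.T @ B)

rng = np.random.default_rng(0)
X = rng.normal(size=(128, 10))
true_W = rng.normal(size=(10, 1))
Y = X @ true_W + 0.01 * rng.normal(size=(128, 1))   # illustrative regression data

# Auxiliary variable A replaces the hidden activations, so each weight
# update below is a local subproblem -- no gradient chain anywhere.
W1 = rng.normal(size=(10, 5)) * 0.1
W2 = rng.normal(size=(5, 1)) * 0.1
A = X @ W1                                          # initialize auxiliaries

rho = 1.0  # penalty tying the auxiliaries to the first layer's output
for step in range(20):
    W2 = ridge_solve(A, Y)                          # fit output layer to auxiliaries
    # minimize ||A W2 - Y||^2 + rho ||A - X W1||^2 over A (closed form)
    A = np.linalg.solve(W2 @ W2.T + rho * np.eye(5),
                        (W2 @ Y.T + rho * (X @ W1).T)).T
    W1 = ridge_solve(X, A)                          # fit first layer to auxiliaries

loss = np.mean((A @ W2 - Y) ** 2)
print(loss)  # small: the alternating solves fit the data without backprop
```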
4. Decoupled Neural Interfaces Using Synthetic Gradients
This gives neural networks a way to communicate, that is, to learn to send messages between themselves, in a decoupled, scalable manner, paving the way for multiple neural networks to communicate with each other and for improving the long-term temporal dependencies of recurrent networks.
A legitimate question is to ask how much computational complexity these synthetic gradient models add; perhaps you would need a synthetic gradient model architecture as complex as the network itself. Quite surprisingly, the synthetic gradient models can be very simple. For feed-forward nets, it was found that even a single linear layer works well as a synthetic gradient model. Consequently, it is very easy to train and produces synthetic gradients rapidly. More details can be read here: https://arxiv.org/pdf/1608.05343.pdf.
Built by researchers at DeepMind, this method enjoys significant gains from the increased time horizon that the DNI-enabled RNNs are able to model, as well as faster convergence compared to backprop. Synthetic gradients FTW!
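To see why even a linear layer can work as a synthetic gradient model, consider a toy NumPy sketch in which the true gradient happens to be affine in the activations. A linear model learns to predict that gradient, and the "layer" updates from the prediction alone, decoupled from the rest of the computation. The quadratic loss, sizes, and learning rates are illustrative assumptions, not the paper's setup:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: activation h, loss L = 0.5 * ||h - t||^2, so the true
# gradient dL/dh = (h - t) is affine in h. A single linear layer
# (W_sg, b_sg) can therefore represent it exactly.
t = rng.normal(size=(1, 8))          # hypothetical target activation
W_sg = np.zeros((8, 8))              # linear synthetic gradient model
b_sg = np.zeros((1, 8))

h = rng.normal(size=(1, 8))
lr, sg_lr = 0.1, 0.05
for step in range(200):
    pred_grad = h @ W_sg + b_sg      # synthetic gradient for h
    true_grad = h - t                # the (occasionally available) true gradient
    # regress the synthetic gradient model toward the true gradient
    err = pred_grad - true_grad
    W_sg -= sg_lr * h.T @ err
    b_sg -= sg_lr * err
    # the layer updates using only the *predicted* gradient,
    # decoupled from the rest of the network
    h = h - lr * pred_grad

loss = 0.5 * np.sum((h - t) ** 2)
print(loss)  # near zero: h converged using synthetic gradients alone
```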
Machines will follow a path that mirrors the evolution of humans. Ultimately, however, self-aware, self-improving machines will evolve beyond humans’ ability to control or even understand them. - Ray Kurzweil
We have discussed the drawbacks of backpropagation, possible flaws in the general approach as voiced by prominent researchers, and finally some good alternatives. Ultimately, none of these can be called "better than backprop", because all they do is achieve competitive results. This can be attributed to the limited research done so far, which leads us to expect a lot of progress and development in the near future. So the key takeaway is this: for an algorithm to dethrone backprop, it has to solve the vanishing and exploding gradient problems, be computationally cheaper, converge faster, preferably require fewer hyperparameters, and, most importantly, be biologically plausible. That would ensure we are advancing in the direction of recreating the human mind, allowing us to tap into its unparalleled potential. Until then, all we can do is experiment and continue further research!