A Shallow Perspective on Deep Neural Nets
Deep neural networks are usually drawn like this — stacked layer by layer.
But we can redraw the same computation with all the blocks running in parallel… and the topology now looks shallow!
In this parallel view, every block sits on a direct path from input to output that is just one hop long, regardless of the total depth of the network. These short routes for gradient flow help explain how residual connections mitigate the vanishing-gradient problem, stabilizing and speeding up training.
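To make the one-hop claim concrete, here is a minimal sketch (assuming PyTorch, with plain Linear layers standing in for attention/MLP blocks, purely for illustration). It builds the residual stack sequentially, then recovers the same output as the input plus one additive branch contribution per block, and checks that the input gradient stays well-behaved:

```python
import torch
import torch.nn as nn

# Toy residual stack: each "block" is just a Linear layer standing in
# for an attention or MLP sub-block.
torch.manual_seed(0)
dim, depth = 16, 8
blocks = nn.ModuleList(nn.Linear(dim, dim) for _ in range(depth))

x = torch.randn(1, dim, requires_grad=True)

# Stacked (sequential) view: h_{k+1} = h_k + f_k(h_k)
h = x
for f in blocks:
    h = h + f(h)

# Parallel view: the same output is the input plus one additive branch
# contribution per block, so every block reaches the output in one hop.
stream, branch_sum = x, torch.zeros_like(x)
for f in blocks:
    branch = f(stream)
    branch_sum = branch_sum + branch
    stream = stream + branch
assert torch.allclose(h, x + branch_sum, atol=1e-5)

# Because the identity path is never interrupted, d(output)/d(input)
# contains an identity term and does not collapse toward zero with depth.
(grad,) = torch.autograd.grad(h.sum(), x)
print(grad.abs().mean())
```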
It’s worth noting that deep neural networks weren’t always like this. Early ResNets [1] placed ReLUs, and early Transformers [2] placed LayerNorms, directly on the residual path between additions, both of which “pollute” the residual stream and interfere with the backward gradient flow. The fact that modern variants across language modeling and computer vision (the Pre-LN Transformer [3], ResNet-v2 [4], ConvNeXt [5]) have all converged on this cleaner, more elegant design is hardly a coincidence.
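The difference comes down to where the normalization (or activation) sits relative to the residual addition. The sketch below (a simplified illustration, not the exact implementations from the papers; it uses a single Linear layer as the branch) contrasts the two placements:

```python
import torch
import torch.nn as nn

class PostLNBlock(nn.Module):
    """Post-LN (original Transformer style): the norm is applied after the
    residual add, so it sits on the identity path and transforms the
    residual stream at every layer."""
    def __init__(self, dim):
        super().__init__()
        self.ff = nn.Linear(dim, dim)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):
        return self.norm(x + self.ff(x))


class PreLNBlock(nn.Module):
    """Pre-LN (modern style): the norm lives inside the branch, so the
    identity path from input to output is left untouched and gradients
    flow straight through the additions."""
    def __init__(self, dim):
        super().__init__()
        self.ff = nn.Linear(dim, dim)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):
        return x + self.ff(self.norm(x))
```

In the Post-LN block, the backward pass has to go through a LayerNorm at every layer, whereas in the Pre-LN block the additions form an uninterrupted chain back to the input, which is what makes the parallel redrawing above possible.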
Of course, frontier labs are now experimenting with mechanisms like mHC and attention residuals to address the limitations of the current paradigm, such as exploding activations. Does this parallel-view visualization extend to those architectures?
References
1. He et al., Deep Residual Learning for Image Recognition, CVPR 2016.
2. Vaswani et al., Attention Is All You Need, NeurIPS 2017.
3. Xiong et al., On Layer Normalization in the Transformer Architecture, ICML 2020.
4. He et al., Identity Mappings in Deep Residual Networks, ECCV 2016.
5. Liu et al., A ConvNet for the 2020s, CVPR 2022.