Demystifying the Lottery Ticket Hypothesis in Deep Learning

Why lottery tickets are the next big thing in training neural networks


Published in Towards Data Science · 4 min read · Mar 3, 2022

Training neural networks is expensive. OpenAI’s GPT-3 has been estimated to cost $4.6M to train, even using the lowest-cost cloud GPUs on the market. It’s no wonder that Frankle and Carbin’s 2019 Lottery Ticket Hypothesis started a gold rush in research, drawing attention from top academic minds and tech giants like Facebook and Microsoft. In the paper, they demonstrate the existence of winning (lottery) tickets: subnetworks of a neural network that can be trained to perform as well as the original network at a much smaller size. In this post, I’ll cover how this works, why it is revolutionary, and the state of the research.

Traditional wisdom says that neural networks are best pruned after training, not at the start. By pruning weights, neurons, or other components, the resulting network is smaller, faster, and consumes fewer resources during inference. When done right, accuracy is unaffected while the network shrinks to a fraction of its original size.
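
For intuition, post-training pruning can be as simple as the sketch below (PyTorch; the 80% pruning fraction, the global threshold, and the function name are illustrative, not from any particular paper):

```python
import torch

def magnitude_prune(model, prune_frac=0.8):
    """Zero out the smallest-magnitude weights of a trained model (illustrative sketch)."""
    # Find a single global threshold over all weight matrices.
    all_weights = torch.cat([p.detach().abs().flatten()
                             for p in model.parameters() if p.dim() > 1])
    threshold = torch.quantile(all_weights, prune_frac)
    masks = {}
    for name, p in model.named_parameters():
        if p.dim() > 1:                      # prune weight matrices, leave biases alone
            masks[name] = (p.detach().abs() > threshold).float()
            p.data.mul_(masks[name])         # zero out the pruned weights
    return masks                             # reuse the masks to keep pruned weights at zero
```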

Flipping that wisdom on its head, we can ask: could we have pruned the network before training and achieved the same result? In other words, was the information in the pruned components necessary for the network to learn, even if it is not needed to represent what was learned?

The Lottery Ticket Hypothesis focuses on pruning weights and offers empirical evidence that certain pruned subnetworks can be trained from the start to achieve performance similar to the entire network. How? Iterative Magnitude Pruning (IMP).

When experiments like this were tried historically, the pruned network’s weights were reinitialized randomly, and performance dropped off quickly.

The key difference here is that the surviving weights were returned to their original initialization. When trained, the results matched the original network’s performance in the same training time, even at high levels of pruning.
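
Putting the pieces together, one round of IMP looks roughly like the sketch below. This is a minimal illustration, not the paper’s code: `train_fn(model, masks)` is an assumed helper that trains the masked network to completion while keeping masked weights at zero, and the 20% per-round pruning rate is illustrative.

```python
import copy
import torch

def iterative_magnitude_pruning(model, train_fn, rounds=5, prune_frac=0.2):
    """Sketch of IMP: train, prune the smallest surviving weights, rewind to init, repeat."""
    init_state = copy.deepcopy(model.state_dict())    # theta_0: the original initialization
    masks = {n: torch.ones_like(p) for n, p in model.named_parameters() if p.dim() > 1}

    for _ in range(rounds):
        train_fn(model, masks)                         # step 1: train the (masked) network
        for name, p in model.named_parameters():       # step 2: prune the smallest surviving weights
            if name in masks:
                surviving = p.detach().abs()[masks[name].bool()]
                k = max(1, int(prune_frac * surviving.numel()))
                threshold = surviving.kthvalue(k).values
                masks[name] *= (p.detach().abs() > threshold).float()
        model.load_state_dict(init_state)              # step 3: rewind survivors to their initial values
        for name, p in model.named_parameters():       # step 4: re-apply the mask so pruned weights stay zero
            if name in masks:
                p.data.mul_(masks[name])
    return model, masks
```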


This suggests that lottery tickets exist as the intersection of a specific subnetwork and its initial weights. They are “winning the lottery,” so to speak: that combination of architecture and initialization performs as well as the entire network. Does this hold for bigger models?

For bigger models, it does not hold with the same approach. To study sensitivity to noise, Frankle and Carbin duplicated the pruned networks and trained the copies on differently ordered data. IMP succeeds where linear mode connectivity exists: a rare phenomenon in which the differently trained copies converge to the same linearly connected region of the loss landscape, effectively the same local minimum. For small networks, this happens naturally. For large networks, it does not. So what to do?
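
Linear mode connectivity itself is simple to test: interpolate between the weights of two trained copies and check whether the loss stays flat along the line. Here is a minimal sketch, assuming an `eval_loss(model)` helper that returns test loss:

```python
import copy
import torch

def linear_path_losses(model_a, model_b, eval_loss, steps=11):
    """Evaluate loss along the straight line between two trained copies of a network.
    If the loss never rises much above the endpoints, the copies are linearly mode connected."""
    state_a, state_b = model_a.state_dict(), model_b.state_dict()
    probe = copy.deepcopy(model_a)
    losses = []
    for alpha in torch.linspace(0.0, 1.0, steps):
        mixed = {k: ((1 - alpha) * state_a[k] + alpha * state_b[k])
                 if state_a[k].is_floating_point() else state_a[k]   # skip integer buffers
                 for k in state_a}
        probe.load_state_dict(mixed)
        losses.append(eval_loss(probe))
    return losses
```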

Starting with a smaller learning rate and increasing it over time (warmup) makes IMP work for large models, because it reduces sensitivity to the noise of early training. The other finding is that rewinding the pruned network’s weights to their values at an early training iteration, rather than all the way back to initialization, also works: for example, the weights at iteration 10 of a 1,000-iteration training run.
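
In code, the change is small: instead of snapshotting the weights at initialization, snapshot them a few steps into training and rewind to that snapshot in later rounds. A sketch, where `train_loader`, `train_step`, and the rewind step of 10 are assumed for illustration:

```python
import copy

REWIND_STEP = 10                                 # illustrative rewind point in a ~1,000-step run

rewind_state = None
for step, batch in enumerate(train_loader):       # train_loader / train_step are assumed helpers
    train_step(model, batch)
    if step == REWIND_STEP:
        rewind_state = copy.deepcopy(model.state_dict())

# In later IMP rounds, load `rewind_state` instead of the original initialization:
# model.load_state_dict(rewind_state)
```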

These results have held steady across architectures as different as Transformers, LSTMs, and CNNs, as well as reinforcement learning agents.

While the paper demonstrated that these lottery tickets exist, it does not provide a practical way to identify them ahead of time: IMP only finds a ticket by repeatedly training the full network. Hence the gold rush into understanding their properties and whether they can be identified before training. They are also inspiring work on heuristics for pruning early, since current heuristics focus on pruning after training.

One Ticket to Win Them All (2019) shows that lottery tickets encode information that generalizes across datasets and optimizers. The authors successfully transfer lottery tickets between networks trained on different datasets, for example transferring a ticket found on a smaller image dataset to a network trained on ImageNet.

A key indicator was the relative size of the training data for the two networks: if the lottery ticket came from a network trained on a larger dataset than the destination network’s, it performed better; otherwise, it performed similarly or worse.


Drawing Early-Bird Tickets (2019): This paper shows that lottery tickets can be found early in training. At regular intervals during training, a pruning mask is computed. If the distance between the current mask and the previous one (measured by Hamming distance) falls below a threshold, training stops and the network is pruned.
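
A simplified sketch of that stopping criterion is below. It uses weight-magnitude masks purely for brevity (the paper computes its masks differently), and the 0.1 threshold is illustrative:

```python
import torch

def pruning_mask(model, prune_frac=0.5):
    """Illustrative binary mask keeping the largest-magnitude weights."""
    weights = torch.cat([p.detach().abs().flatten()
                         for p in model.parameters() if p.dim() > 1])
    threshold = torch.quantile(weights, prune_frac)
    return torch.cat([(p.detach().abs() > threshold).flatten()
                      for p in model.parameters() if p.dim() > 1])

def mask_distance(mask_a, mask_b):
    """Normalized Hamming distance between two binary pruning masks."""
    return (mask_a != mask_b).float().mean().item()

# Inside the training loop: stop early once consecutive masks barely change.
# if mask_distance(previous_mask, pruning_mask(model)) < 0.1:
#     ...prune the network now and train only the ticket...
```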

Pruning Neural Networks Without Any Data by Iteratively Conserving Synaptic Flow (2020): This paper prunes at initialization using no data at all, and outperforms existing state-of-the-art pruning-at-initialization algorithms. The technique focuses on maximizing critical compression, the highest level of pruning that can occur without harming performance. To get there, the authors aim to prevent entire layers from being pruned away. The method scores each weight by the synaptic flow it carries and re-evaluates the scores every time it prunes, which keeps any single layer from collapsing.
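
Here is a minimal sketch of the SynFlow saliency computation, assuming a standard feed-forward PyTorch model and an illustrative `input_shape`; the multi-round pruning schedule that re-evaluates scores and prevents layer collapse is omitted:

```python
import torch

def synflow_scores(model, input_shape):
    """Data-free SynFlow-style saliency: one all-ones forward/backward pass."""
    # Temporarily replace every parameter by its absolute value (signs restored below).
    signs = {n: p.detach().sign() for n, p in model.named_parameters()}
    for _, p in model.named_parameters():
        p.data.abs_()

    model.zero_grad()
    ones = torch.ones(1, *input_shape)       # an all-ones "input": no training data needed
    model(ones).sum().backward()             # R = sum of outputs; its gradients measure synaptic flow

    scores = {n: (p.grad * p.detach()).abs()
              for n, p in model.named_parameters() if p.grad is not None}

    for n, p in model.named_parameters():    # restore the original signs
        p.data.mul_(signs[n])
    return scores                            # prune the lowest-scoring weights, a little per round
```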

The existence of small subnetworks in neural architectures that can be trained to perform as well as the entire neural network is opening a world of possibilities for efficient training. In the process, researchers are learning a lot about how neural networks learn and what is necessary for learning. And who knows? One day soon we may be able to prune our networks before training, saving time, compute, and energy.
