Compression of Neural Machine Translation Models via Pruning
Authors: Abigail See, Minh-Thang Luong, Christopher D. Manning (Stanford)
Venue: arXiv
This paper applies pruning techniques to an encoder-decoder neural machine translation model: a deep, multi-layer recurrent architecture with LSTM hidden units. The paper tries several pruning schemes, but finds the most effective to be simply pruning the weights of least magnitude across the whole model. While the techniques are mostly carried over from pruning work on CNNs and other networks, the paper does note some interesting artifacts of pruning.
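As a concrete illustration, here is a minimal sketch of that class-blind magnitude pruning, written in PyTorch (which is my choice, not the paper's; the function name magnitude_prune is my own). All weight matrices are pooled together and the smallest-magnitude fraction of weights is zeroed, regardless of which layer (embedding, LSTM, or softmax) they belong to.

```python
import torch
import torch.nn as nn

def magnitude_prune(model: nn.Module, sparsity: float) -> dict:
    """Zero out the `sparsity` fraction of smallest-magnitude weights.

    Returns a dict of binary masks so the zeros can be kept fixed
    during any subsequent retraining.
    """
    # Collect every weight tensor (biases are typically left untouched).
    weight_params = [(name, p) for name, p in model.named_parameters()
                     if "weight" in name]

    # Class-blind pruning: one global threshold over all weights combined.
    all_magnitudes = torch.cat([p.detach().abs().flatten()
                                for _, p in weight_params])
    k = int(sparsity * all_magnitudes.numel())
    if k > 0:
        threshold = all_magnitudes.kthvalue(k).values
    else:
        threshold = all_magnitudes.new_tensor(float("-inf"))

    masks = {}
    with torch.no_grad():
        for name, p in weight_params:
            mask = (p.abs() > threshold).float()
            p.mul_(mask)          # zero the pruned weights in place
            masks[name] = mask    # keep the mask to freeze zeros later
    return masks
```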
Firstly, deeper layers are more sensitive to pruning than earlier ones. In other words, the deeper units are of more importance, and even their low-magnitude weights matter. Additionally, the authors find that the pruned-and-retrained sparse models can even outperform the originals, and attribute this to a regularizing, "generalizing" effect of pruning: training-set performance decreases while validation-set performance actually increases. Finally, they note that the sparsity structure discovered by pruning cannot simply be applied from the start. In other words, results are far better when you begin with the original dense model, train it completely, and then iteratively prune and retrain, as in the sketch below.
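To make that last point concrete, here is a hedged sketch of the prune-then-retrain recipe, reusing the magnitude_prune helper sketched above. The names train_one_epoch, train_data, and the sparsity schedule are hypothetical placeholders for whatever training loop, data, and schedule the model was originally trained with.

```python
import torch

def prune_and_retrain(model, train_one_epoch, train_data,
                      sparsity_schedule=(0.5, 0.7, 0.8, 0.9),
                      retrain_epochs=1):
    """Iteratively prune a fully trained model and retrain after each cut."""
    for sparsity in sparsity_schedule:
        # Prune the fully trained model, not a randomly initialized one.
        masks = magnitude_prune(model, sparsity)

        # Retrain while keeping the pruned weights pinned at exactly zero.
        for _ in range(retrain_epochs):
            train_one_epoch(model, train_data)
            with torch.no_grad():
                for name, p in model.named_parameters():
                    if name in masks:
                        p.mul_(masks[name])  # re-zero any revived weights
    return model
```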