Notes on "Compression of Neural Machine Translation Models via Pruning"

The Problems of NMT Models

[Figure: NMT encoder-decoder example. Source language input: "I am a student"; target language input: "- Je suis étudiant"; target language output: "Je suis étudiant -".]
  1. Over-parameterization
  2. Long running time
  3. Overfitting
  4. Large storage size

The Redundancies of NMT Models

Most important: higher layers, attention weights, and softmax weights.

Most redundant: lower layers and embedding weights.

Traditional Solutions

Optimal Brain Damage (OBD) and Optimal Brain Surgeon (OBS)
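For reference (my own summary, not part of the original note): OBD ranks weights by a second-order saliency estimated with a diagonal Hessian approximation,

\[
s_i \approx \tfrac{1}{2} H_{ii}\, w_i^2 ,
\]

where \(H_{ii}\) is the corresponding diagonal entry of the Hessian of the loss and \(w_i\) the weight; the lowest-saliency weights are removed. OBS drops the diagonal assumption, using the full inverse Hessian and adjusting the remaining weights after each deletion.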

Recent Ways

Magnitude-based pruning with iterative retraining yielded strong results for convolutional neural networks (CNNs) performing visual tasks.
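A minimal sketch of such a prune-retrain loop, written against a toy numpy matrix rather than a real CNN (the "retraining" step is only a placeholder masked update, not an actual training routine):

```python
import numpy as np

def magnitude_prune(weights, fraction):
    """Zero out the `fraction` of entries with smallest absolute value."""
    flat = np.abs(weights).ravel()
    k = int(fraction * flat.size)
    if k == 0:
        return weights, np.ones(weights.shape, dtype=bool)
    threshold = np.partition(flat, k - 1)[k - 1]
    mask = np.abs(weights) > threshold
    return weights * mask, mask

rng = np.random.default_rng(0)
W = rng.normal(size=(256, 256))          # stand-in for one weight matrix
mask = np.ones(W.shape, dtype=bool)

for step in range(3):                    # a few prune/retrain rounds
    W, mask = magnitude_prune(W, fraction=0.2 * (step + 1))
    # Placeholder "retraining": a small update masked so that
    # pruned weights stay at zero.
    W = (W - 0.01 * rng.normal(size=W.shape)) * mask
    print(f"round {step}: sparsity = {1 - mask.mean():.2f}")
```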

Other approaches prune whole neurons, either with sparsity-inducing regularizers or by 'wiring together' pairs of neurons with similar input weights.

These approaches are much more constrained than weight-pruning schemes; they necessitate finding entire zero rows of weight matrices, or near-identical pairs of rows, in order to prune a single neuron.

Weight-pruning approaches, by contrast, allow weights to be pruned freely and independently of one another.

There are many other compression techniques for neural networks:

  1. approaches based on low-rank approximations of weight matrices (see the sketch after this list);
  2. weight sharing via hash functions.
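As a quick illustration of the low-rank idea in item 1 (a numpy sketch with an arbitrary rank, not the method of any particular paper): a weight matrix is replaced by two thin factors obtained from a truncated SVD, which cuts both storage and multiplication cost.

```python
import numpy as np

def low_rank_approx(W, rank):
    """Factor W (m x n) into U_r (m x rank) and V_r (rank x n)."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    U_r = U[:, :rank] * s[:rank]   # fold singular values into the left factor
    V_r = Vt[:rank, :]
    return U_r, V_r

W = np.random.default_rng(1).normal(size=(512, 512))
U_r, V_r = low_rank_approx(W, rank=64)
# Storage drops from 512*512 to 2*512*64 parameters; U_r @ V_r approximates W.
print(np.linalg.norm(W - U_r @ V_r) / np.linalg.norm(W))
```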

Understanding NMT Weights

Weight Subgroups in LSTM

Details of the LSTM:

\[
\left(\begin{array}{c} i \\ f \\ o \\ \hat{h} \end{array}\right)
=
\left(\begin{array}{c} \operatorname{sigm} \\ \operatorname{sigm} \\ \operatorname{sigm} \\ \tanh \end{array}\right)
T_{4n, 2n}
\left(\begin{array}{c} h_{t}^{l-1} \\ h_{t-1}^{l} \end{array}\right)
\]

At layer \(l\) and time step \(t\), the LSTM produces \(\left(h_{t}^{l}, c_{t}^{l}\right)\) from its inputs \(h_{t}^{l-1}\), \(h_{t-1}^{l}\) and \(c_{t-1}^{l}\):

\[
\begin{array}{l}
c_{t}^{l} = f \circ c_{t-1}^{l} + i \circ \hat{h} \\
h_{t}^{l} = o \circ \tanh\left(c_{t}^{l}\right)
\end{array}
\]

\(T_{4n, 2n}\) is the \(4n \times 2n\) matrix that holds the layer's parameters.
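The two equations above translate almost line for line into code. The following is a minimal numpy sketch of a single LSTM step (shapes, initialization, and the absence of a bias term are my own simplifications):

```python
import numpy as np

def sigm(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(h_below, h_prev, c_prev, T):
    """One step of an LSTM layer; T is the 4n x 2n matrix T_{4n,2n}."""
    n = h_prev.shape[0]
    z = T @ np.concatenate([h_below, h_prev])        # shape (4n,)
    i, f, o = sigm(z[:n]), sigm(z[n:2*n]), sigm(z[2*n:3*n])
    h_hat = np.tanh(z[3*n:])
    c = f * c_prev + i * h_hat                       # c_t^l
    h = o * np.tanh(c)                               # h_t^l
    return h, c

n = 8
rng = np.random.default_rng(2)
T = rng.normal(size=(4 * n, 2 * n))                  # holds the weight subgroups
h, c = lstm_step(rng.normal(size=n), np.zeros(n), np.zeros(n), T)
```

Each block of \(n\) rows of \(T_{4n, 2n}\) belongs to one gate (\(i\), \(f\), \(o\), \(\hat{h}\)) and each block of \(n\) columns to one input (\(h_{t}^{l-1}\) or \(h_{t-1}^{l}\)); these blocks are the weight subgroups referred to in the heading.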

[Figure: full NMT architecture for the same example (source "I am a student", target "Je suis étudiant"). Layers, bottom to top: one-hot vectors (length V), word embeddings (length n), hidden layer 1 (length n), hidden layer 2 (length n), attention hidden layer (length n), scores (length V), one-hot output vectors (length V). The decoder starts from initial (zero) states and uses a context vector (one for each target word, length n).]

Pruning Schemes

Suppose we wish to prune x% of the total parameters in the model. How should we distribute the pruning over the different weight classes?

  1. Class-blind: Take all parameters, sort them by magnitude and prune the \(x\%\) with smallest magnitude, regardless of weight class.
  2. Class-uniform: Within each class, sort the weights by magnitude and prune the \(x\%\) with smallest magnitude.

With class-uniform pruning, the overall performance loss is caused disproportionately by a few classes: target layer 4, attention, and softmax weights. This suggests that higher layers are more important than lower layers, and that the attention and softmax weights are crucial.
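The two schemes are easy to contrast in code. Below is a small numpy sketch over a dictionary of weight classes (the class names and sizes are toy stand-ins, chosen only to echo the classes discussed above):

```python
import numpy as np

def class_blind_prune(classes, x):
    """Prune the fraction x of smallest-magnitude weights across ALL classes."""
    all_mags = np.concatenate([np.abs(w).ravel() for w in classes.values()])
    k = int(x * all_mags.size)
    thr = np.partition(all_mags, k - 1)[k - 1] if k else -np.inf
    return {name: w * (np.abs(w) > thr) for name, w in classes.items()}

def class_uniform_prune(classes, x):
    """Prune the fraction x of smallest-magnitude weights WITHIN each class."""
    pruned = {}
    for name, w in classes.items():
        k = int(x * w.size)
        thr = np.partition(np.abs(w).ravel(), k - 1)[k - 1] if k else -np.inf
        pruned[name] = w * (np.abs(w) > thr)
    return pruned

rng = np.random.default_rng(3)
classes = {                                   # toy weight classes
    "source_layer_1": rng.normal(scale=0.1, size=(100, 100)),
    "target_layer_4": rng.normal(scale=0.5, size=(100, 100)),
    "softmax":        rng.normal(scale=0.3, size=(100, 50)),
}
blind = class_blind_prune(classes, x=0.5)
uniform = class_uniform_prune(classes, x=0.5)
# Class-blind removes far more from the low-magnitude class ("source_layer_1");
# class-uniform removes exactly 50% from every class.
```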

Original post: https://www.cnblogs.com/wevolf/p/12105538.html