Notes on "Compression of Neural Machine Translation Models via Pruning"

The Problems of NMT Models

[Figure: NMT encoder-decoder example. Source language input: "I am a student"; target language input: "- Je suis étudiant"; target language output: "Je suis étudiant -".]
  1. Over-parameterization
  2. Long running time
  3. Overfitting
  4. Large storage size

The Redundancies of NMT Models

Most important: higher layers, attention weights, and softmax weights.

Most redundant: lower layers and embedding weights.

Traditional Solutions

Optimal Brain Damage (OBD) and Optimal Brain Surgeon (OBS)
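For reference (my own summary, not part of the original note): OBD ranks weights by a second-order saliency estimated with a diagonal Hessian approximation,

\[
s_i \approx \tfrac{1}{2} H_{ii}\, w_i^2 ,
\]

where \(H_{ii}\) is the corresponding diagonal entry of the Hessian of the loss and \(w_i\) the weight; the lowest-saliency weights are removed. OBS drops the diagonal assumption, using the full inverse Hessian and adjusting the remaining weights after each deletion.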

Recent Ways

Magnitude-based pruning with iterative retraining yielded strong results for convolutional neural networks (CNNs) performing visual tasks.
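A minimal sketch of such a prune-retrain loop, written against a toy numpy matrix rather than a real CNN (the "retraining" step is only a placeholder masked update, not an actual training routine):

```python
import numpy as np

def magnitude_prune(weights, fraction):
    """Zero out the `fraction` of entries with smallest absolute value."""
    flat = np.abs(weights).ravel()
    k = int(fraction * flat.size)
    if k == 0:
        return weights, np.ones(weights.shape, dtype=bool)
    threshold = np.partition(flat, k - 1)[k - 1]
    mask = np.abs(weights) > threshold
    return weights * mask, mask

rng = np.random.default_rng(0)
W = rng.normal(size=(256, 256))          # stand-in for one weight matrix
mask = np.ones(W.shape, dtype=bool)

for step in range(3):                    # a few prune/retrain rounds
    W, mask = magnitude_prune(W, fraction=0.2 * (step + 1))
    # Placeholder "retraining": a small update masked so that
    # pruned weights stay at zero.
    W = (W - 0.01 * rng.normal(size=W.shape)) * mask
    print(f"round {step}: sparsity = {1 - mask.mean():.2f}")
```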

Other approaches prune whole neurons, either with sparsity-inducing regularizers or by 'wiring together' pairs of neurons with similar input weights.

These approaches are much more constrained than weight-pruning schemes; they necessitate finding entire zero rows of weight matrices, or near-identical pairs of rows, in order to prune a single neuron.

Weight-pruning approaches, by contrast, allow weights to be pruned freely and independently of one another.

There are many other compression techniques for neural networks:

  1. approaches based on low-rank approximations of weight matrices (see the sketch after this list);
  2. weight sharing via hash functions.
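As a quick illustration of the low-rank idea in item 1 (a numpy sketch with an arbitrary rank, not the method of any particular paper): a weight matrix is replaced by two thin factors obtained from a truncated SVD, which cuts both storage and multiplication cost.

```python
import numpy as np

def low_rank_approx(W, rank):
    """Factor W (m x n) into U_r (m x rank) and V_r (rank x n)."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    U_r = U[:, :rank] * s[:rank]   # fold singular values into the left factor
    V_r = Vt[:rank, :]
    return U_r, V_r

W = np.random.default_rng(1).normal(size=(512, 512))
U_r, V_r = low_rank_approx(W, rank=64)
# Storage drops from 512*512 to 2*512*64 parameters; U_r @ V_r approximates W.
print(np.linalg.norm(W - U_r @ V_r) / np.linalg.norm(W))
```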

Understanding NMT Weights

Weight Subgroups in LSTM

Details of the LSTM:

\[
\left(\begin{array}{c} i \\ f \\ o \\ \hat{h} \end{array}\right)
=
\left(\begin{array}{c} \operatorname{sigm} \\ \operatorname{sigm} \\ \operatorname{sigm} \\ \tanh \end{array}\right)
T_{4n, 2n}
\left(\begin{array}{c} h_{t}^{l-1} \\ h_{t-1}^{l} \end{array}\right)
\]

At layer \(l\) and time step \(t\), the LSTM produces \(\left(h_{t}^{l}, c_{t}^{l}\right)\) from its inputs \(h_{t}^{l-1}\), \(h_{t-1}^{l}\) and \(c_{t-1}^{l}\):

\[
\begin{array}{l}
c_{t}^{l} = f \circ c_{t-1}^{l} + i \circ \hat{h} \\
h_{t}^{l} = o \circ \tanh\left(c_{t}^{l}\right)
\end{array}
\]

\(T_{4n, 2n}\) is the \(4n \times 2n\) matrix that holds the layer's parameters.
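The two equations above translate almost line for line into code. The following is a minimal numpy sketch of a single LSTM step (shapes, initialization, and the absence of a bias term are my own simplifications):

```python
import numpy as np

def sigm(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(h_below, h_prev, c_prev, T):
    """One step of an LSTM layer; T is the 4n x 2n matrix T_{4n,2n}."""
    n = h_prev.shape[0]
    z = T @ np.concatenate([h_below, h_prev])        # shape (4n,)
    i, f, o = sigm(z[:n]), sigm(z[n:2*n]), sigm(z[2*n:3*n])
    h_hat = np.tanh(z[3*n:])
    c = f * c_prev + i * h_hat                       # c_t^l
    h = o * np.tanh(c)                               # h_t^l
    return h, c

n = 8
rng = np.random.default_rng(2)
T = rng.normal(size=(4 * n, 2 * n))                  # holds the weight subgroups
h, c = lstm_step(rng.normal(size=n), np.zeros(n), np.zeros(n), T)
```

Each block of \(n\) rows of \(T_{4n, 2n}\) belongs to one gate (\(i\), \(f\), \(o\), \(\hat{h}\)) and each block of \(n\) columns to one input (\(h_{t}^{l-1}\) or \(h_{t-1}^{l}\)); these blocks are the weight subgroups referred to in the heading.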

[Figure: full NMT architecture for the same example (source "I am a student", target "Je suis étudiant"). Layers, bottom to top: one-hot vectors (length V), word embeddings (length n), hidden layer 1 (length n), hidden layer 2 (length n), attention hidden layer (length n), scores (length V), one-hot output vectors (length V). The decoder starts from initial (zero) states and uses a context vector (one for each target word, length n).]

Pruning Schemes

Suppose we wish to prune x% of the total parameters in the model. How should we distribute the pruning over the different weight classes?

  1. Class-blind: Take all parameters, sort them by magnitude and prune the \(x\%\) with smallest magnitude, regardless of weight class.
  2. Class-uniform: Within each class, sort the weights by magnitude and prune the \(x\%\) with smallest magnitude.

With class-uniform pruning, the overall performance loss is caused disproportionately by a few classes: target layer 4, attention, and softmax weights. This suggests that higher layers are more important than lower layers, and that the attention and softmax weights are crucial.
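The two schemes are easy to contrast in code. Below is a small numpy sketch over a dictionary of weight classes (the class names and sizes are toy stand-ins, chosen only to echo the classes discussed above):

```python
import numpy as np

def class_blind_prune(classes, x):
    """Prune the fraction x of smallest-magnitude weights across ALL classes."""
    all_mags = np.concatenate([np.abs(w).ravel() for w in classes.values()])
    k = int(x * all_mags.size)
    thr = np.partition(all_mags, k - 1)[k - 1] if k else -np.inf
    return {name: w * (np.abs(w) > thr) for name, w in classes.items()}

def class_uniform_prune(classes, x):
    """Prune the fraction x of smallest-magnitude weights WITHIN each class."""
    pruned = {}
    for name, w in classes.items():
        k = int(x * w.size)
        thr = np.partition(np.abs(w).ravel(), k - 1)[k - 1] if k else -np.inf
        pruned[name] = w * (np.abs(w) > thr)
    return pruned

rng = np.random.default_rng(3)
classes = {                                   # toy weight classes
    "source_layer_1": rng.normal(scale=0.1, size=(100, 100)),
    "target_layer_4": rng.normal(scale=0.5, size=(100, 100)),
    "softmax":        rng.normal(scale=0.3, size=(100, 50)),
}
blind = class_blind_prune(classes, x=0.5)
uniform = class_uniform_prune(classes, x=0.5)
# Class-blind removes far more from the low-magnitude class ("source_layer_1");
# class-uniform removes exactly 50% from every class.
```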

Original post: https://www.cnblogs.com/wevolf/p/12105538.html