Hyperparameter tuning tips

For any model, you can tune it from the following aspects:

1. Initialize the weights and biases (this works well; it usually gives a 1-2% improvement)

 Point 1 (CNN):

from torch.nn import init
import numpy as np

for conv in self.convs1:
    init.xavier_normal(conv.weight, gain=np.sqrt(2.0))   # Xavier (normal) initialization of the weights
    # init.normal(conv.weight, mean=0, std=0.1)           # alternative: Gaussian initialization
    # init.constant(conv.bias, 0.1)                       # initialize the bias to 0.1

Point 2 (LSTM):

(1) Bias vectors are initialized to zero, except the bias b_f of the LSTM forget gate, which is initialized to 1.0 (see the paper End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF). For the weights, either a Gaussian or a uniform distribution works. For a detailed discussion, see the blog post "Deep Learning 之 参数初始化" (on parameter initialization).

(2) A simpler setting is to fix the weights to 0.1 and the biases to 0, for example with init.constant_, as in the sketch below.
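As a minimal sketch of option (2), here is constant initialization applied to a stand-in nn.Linear layer; the layer and its sizes are placeholders of my own, not part of the original code.

import torch.nn as nn
from torch.nn import init

layer = nn.Linear(200, 2)              # stand-in layer, sizes chosen arbitrarily
init.constant_(layer.weight, 0.1)      # weights -> 0.1 (init.constant in older PyTorch)
init.constant_(layer.bias, 0)          # biases  -> 0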

# nn.LSTM initialization following point (1): Xavier weights, forget-gate bias 1
init.xavier_normal(self.lstm.all_weights[0][0], gain=np.sqrt(2.0))   # weight_ih_l0
self.lstm.all_weights[0][3].data[20:40].fill_(1)    # forget gate bias -> 1
self.lstm.all_weights[0][3].data[0:20].fill_(0)     # input gate bias  -> 0
self.lstm.all_weights[0][3].data[40:80].fill_(0)    # cell and output gate biases -> 0

Note: for the packaged nn.LSTM, the all_weights interface is used to initialize its parameters as a whole; the gate parameters cannot be defined one by one, and with hidden_size = 20 the forget gate corresponds to indices 20-39 (PyTorch stores the gates in the order input, forget, cell, output). If you use an LSTMCell instead, you can modify each individual parameter you care about, as in the sketch below.
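As a rough illustration of that per-parameter access, here is a minimal sketch using nn.LSTMCell; the input_size of 100 is an arbitrary placeholder, and the slicing follows PyTorch's gate layout (input | forget | cell | output). It reproduces the scheme from point (1): Xavier weights, forget-gate bias 1, all other biases 0.

import torch.nn as nn
from torch.nn import init

hidden_size = 20                                              # matches the 20-39 forget-gate slice above
cell = nn.LSTMCell(input_size=100, hidden_size=hidden_size)   # input_size chosen arbitrarily

# Weights: Xavier (normal) initialization; xavier_normal_ is the in-place name in newer PyTorch
init.xavier_normal_(cell.weight_ih)
init.xavier_normal_(cell.weight_hh)

# Biases are laid out gate by gate: [input | forget | cell | output], each slice of length hidden_size
cell.bias_ih.data.fill_(0)
cell.bias_hh.data.fill_(0)
cell.bias_hh.data[hidden_size:2 * hidden_size].fill_(1)       # forget gate bias -> 1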

2. Clip gradients: constrain the gradient updates of the weights to a bounded range, which prevents exploding gradients at individual nodes.

import torch.nn.functional as F
from torch.nn import utils

optimizer.zero_grad()
logit = model(feature)
loss = F.cross_entropy(logit, target)
loss.backward()
# clip gradients: rescale them so that their overall norm is at most 5
utils.clip_grad_norm(model.parameters(), 5)   # renamed clip_grad_norm_ in newer PyTorch
optimizer.step()

3. L2 regularization

The L2 value, also called the penalty term, is there to prevent overfitting. PyTorch exposes it directly through the optimizer's weight_decay argument; a typical setting is 1e-8.

optimizer = torch.optim.Adam(model.parameters(), lr=args.lr, weight_decay=0.01)   # weight_decay is the L2 penalty coefficient

4. Batch normalization: if set up correctly, it reportedly speeds up convergence considerably and has a clearly visible effect.

For BatchNorm2d(x) the input has shape (batch_size, channel, height, width) and x must equal the channel size, i.e. the size of dimension 1: a mean and variance are computed, and one normalization performed, separately for each channel. BatchNorm1d behaves the same way: x corresponds to the size of dimension 1, so if your features do not sit on dimension 1 you have to transpose first, as in the following example.

import torch
import torch.nn as nn
from torch.autograd import Variable

m = nn.BatchNorm1d(2)
input = Variable(torch.randn(2, 10))
input = Variable(torch.transpose(input.data, 0, 1))   # now (10, 2): the 2 features sit on dimension 1
print(input)
output = m(input)
print(output)

 Point 1 (CNN):

    def __init__(self, args):
        super(CNN, self).__init__()
        self.bn = nn.BatchNorm2d(1)

    def forward(self, x):
        a = []
        for conv in self.convs1:
            xx = conv(x)                                    # Variable [torch.FloatTensor of size 16x200x35x1]
            xx = Variable(torch.transpose(xx.data, 2, 3))   # -> 16x200x1x35
            xx = Variable(torch.transpose(xx.data, 1, 2))   # -> 16x1x200x35, so dimension 1 is the single BN channel
            xx = self.bn(xx)
            xx = F.relu(xx)
            xx = xx.squeeze(1)
            a.append(xx)

Point 2 (LSTM):

class BiLSTM(nn.Module):
    def __init__(self, args):
        super(BiLSTM, self).__init__()
        self.bn1 = nn.BatchNorm1d(2 * self.hidden_size)   # features = forward + backward hidden states

    def forward(self, sentence):
        out = self.bn1(out)                # out: (batch_size, 2 * hidden_size)
        out = F.tanh(out)
        y = self.hidden2label(out)

Result: neither of the two settings above improved accuracy.

Point 3 (BN-LSTM):

See the paper Recurrent Batch Normalization; the PyTorch framework does not provide this out of the box, so you have to implement it yourself. A rough sketch is given below.
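For orientation only, here is a minimal sketch of what such a cell could look like, loosely following the idea of the paper: batch normalization applied separately to the input-to-hidden and hidden-to-hidden projections, and to the cell state before the output gate. The class name BNLSTMCell and the use of a plain nn.BatchNorm1d are my own simplifications; the paper keeps separate BN statistics per time step and initializes the BN gain to 0.1, neither of which is done here.

import torch
import torch.nn as nn

class BNLSTMCell(nn.Module):
    """Sketch of a batch-normalized LSTM cell (simplified from Recurrent Batch Normalization)."""
    def __init__(self, input_size, hidden_size):
        super(BNLSTMCell, self).__init__()
        self.hidden_size = hidden_size
        self.weight_ih = nn.Linear(input_size, 4 * hidden_size, bias=False)
        self.weight_hh = nn.Linear(hidden_size, 4 * hidden_size, bias=False)
        self.bias = nn.Parameter(torch.zeros(4 * hidden_size))
        # Separate BN for the two projections and for the cell state
        self.bn_ih = nn.BatchNorm1d(4 * hidden_size)
        self.bn_hh = nn.BatchNorm1d(4 * hidden_size)
        self.bn_c = nn.BatchNorm1d(hidden_size)

    def forward(self, x, state):
        h, c = state                                    # each of shape (batch_size, hidden_size)
        gates = self.bn_ih(self.weight_ih(x)) + self.bn_hh(self.weight_hh(h)) + self.bias
        i, f, g, o = gates.chunk(4, dim=1)              # gate order: input, forget, cell, output
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(self.bn_c(c))
        return h, c

# Usage sketch: unroll over time yourself
# h = c = torch.zeros(batch_size, hidden_size)
# for t in range(seq_len):
#     h, c = cell(x[:, t, :], (h, c))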

Original post: https://www.cnblogs.com/Joyce-song94/p/7347775.html