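PyTorch: Fine-tuning a pretrained BERT model for Chinese text classification

My laptop is too weak to run this, so the code below was run on Google Colab.

Corpus link: https://pan.baidu.com/s/1YxGGYmeByuAlRdAVov_ZLg
Extraction code: tzao

neg.txt and pos.txt each contain 5,000 hotel reviews, one review per line.

Install the transformers library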

!pip install transformers

Import packages and set hyperparameters

import numpy as np
import random
import torch
import matplotlib.pyplot as plt
from torch.nn.utils import clip_grad_norm_
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler
# AdamW is taken from torch.optim here; the transformers.AdamW class is deprecated.
from torch.optim import AdamW
from transformers import BertTokenizer, BertForSequenceClassification
from transformers import get_linear_schedule_with_warmup

SEED = 123
BATCH_SIZE = 16
LEARNING_RATE = 2e-5
WEIGHT_DECAY = 1e-2
EPSILON = 1e-8

# Fix the random seeds so that runs are repeatable
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)

1. Data preprocessing

1.1 Reading the files

def readfile(filename):
    with open(filename, encoding="utf-8") as f:
        content = f.readlines()
        return content

pos_text, neg_text = readfile('hotel/pos.txt'), readfile('hotel/neg.txt')
sentences = pos_text + neg_text

# Set the labels: 1 for positive reviews, 0 for negative reviews
pos_targets = np.ones((len(pos_text)))
neg_targets = np.zeros((len(neg_text)))
targets = np.concatenate((pos_targets, neg_targets), axis=0).reshape(-1, 1)   # (10000, 1)
total_targets = torch.tensor(targets)

Tip: calling readfile raised UnicodeDecodeError: 'utf-8' codec can't decode byte 0xbe in position 0

Fix: open the txt files in Notepad++, click Encoding in the toolbar, and convert them to UTF-8.
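If converting the files by hand is inconvenient, the same problem can be worked around in Python. This is a minimal sketch, assuming the files were saved in a GBK-family encoding (a guess based on the 0xbe byte; the encoding of your copy may differ). The helper name read_lines_with_fallback and its list of candidate encodings are made up for this illustration.

```python
def read_lines_with_fallback(filename, encodings=("utf-8", "gb18030")):
    """Try each encoding in turn and return the lines from the first one that decodes.

    The candidate encodings are an assumption: gb18030 is a superset of
    GBK/GB2312, which are common for Chinese text files saved on Windows.
    """
    last_error = None
    for enc in encodings:
        try:
            with open(filename, encoding=enc) as f:
                return f.readlines()
        except UnicodeDecodeError as err:
            last_error = err  # remember the failure and try the next encoding
    raise last_error
```

It could be used as a drop-in replacement for readfile, for example read_lines_with_fallback('hotel/pos.txt').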

1.2 Encoding with BertTokenizer: turning each sentence into numbers

tokenizer = BertTokenizer.from_pretrained('bert-base-chinese', cache_dir="E:/transformer_file/")
print(pos_text[2])
print(tokenizer.tokenize(pos_text[2]))
print(tokenizer.encode(pos_text[2]))
print(tokenizer.convert_ids_to_tokens(tokenizer.encode(pos_text[2])))

不错，下次还考虑入住。交通也方便，在餐厅吃的也不错。

['不', '错', '，', '下', '次', '还', '考', '虑', '入', '住', '。', '交', '通', '也', '方', '便', '，', '在', '餐', '厅', '吃', '的', '也', '不', '错', '。']

[101, 679, 7231, 8024, 678, 3613, 6820, 5440, 5991, 1057, 857, 511, 769, 6858, 738, 3175, 912, 8024, 1762, 7623, 1324, 1391, 4638, 738, 679, 7231, 511, 102]

['[CLS]', '不', '错', '，', '下', '次', '还', '考', '虑', '入', '住', '。', '交', '通', '也', '方', '便', '，', '在', '餐', '厅', '吃', '的', '也', '不', '错', '。', '[SEP]']

To make every sentence the same length, a little extra processing is needed:

# Convert each sentence to ids (truncate above 126, pad below 126; with the two special tokens at the start and end, the total length is 128)
def convert_text_to_token(tokenizer, sentence, limit_size=126):

    tokens = tokenizer.encode(sentence[:limit_size])  # truncate directly (by characters, before tokenizing)
    if len(tokens) < limit_size + 2:                  # pad (the index of [PAD] is 0)
        tokens.extend([0] * (limit_size + 2 - len(tokens)))
    return tokens

input_ids = [convert_text_to_token(tokenizer, sen) for sen in sentences]

input_tokens = torch.tensor(input_ids)
print(input_tokens.shape)                    # torch.Size([10000, 128])

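1.3 attention_masks: within a text, the mask is 0 for PAD tokens and 1 otherwise

# Build the masks
def attention_masks(input_ids):
    atten_masks = []
    for seq in input_ids:
        seq_mask = [float(i>0) for i in seq]   # id 0 is [PAD]; every other id is a real token
        atten_masks.append(seq_mask)
    return atten_masks

atten_masks = attention_masks(input_ids)
attention_tokens = torch.tensor(atten_masks)

The input_ids and atten_masks built here serve the same purpose as the input_ids and attention_mask returned by the .encode_plus function mentioned in the previous post. token_type_ids are irrelevant to this task: they are meant for tasks in which each training example contains two sentences (such as question answering).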
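The padding and masking steps can also be done at the token level, so that truncation counts tokens rather than characters and the closing [SEP] is always kept. The following is a minimal sketch in plain Python. pad_and_mask is a hypothetical helper, and the ids 101, 102 and 0 for [CLS], [SEP] and [PAD] are taken from the bert-base-chinese output shown earlier. As far as I know, recent versions of transformers can also do this inside the tokenizer call itself via its padding, truncation and max_length arguments.

```python
def pad_and_mask(token_ids, max_len=128, pad_id=0, sep_id=102):
    """Truncate or pad a list of token ids (already wrapped in [CLS] ... [SEP])
    to exactly max_len, and build the matching attention mask."""
    if len(token_ids) > max_len:
        # Keep the first max_len - 1 ids and put [SEP] back at the end.
        token_ids = token_ids[:max_len - 1] + [sep_id]
    n_real = len(token_ids)
    padded = token_ids + [pad_id] * (max_len - n_real)
    mask = [1] * n_real + [0] * (max_len - n_real)
    return padded, mask


ids = [101, 679, 7231, 102]          # [CLS] 不 错 [SEP]
padded, mask = pad_and_mask(ids, max_len=6)
print(padded)  # [101, 679, 7231, 102, 0, 0]
print(mask)    # [1, 1, 1, 1, 0, 0]
```

Building the mask from the sequence length, rather than from a test on the id value, also keeps it independent of which id the padding token happens to have.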

1.4 Splitting into training and test sets

Both split calls must use the same random_state and test_size, so that train_inputs and train_masks stay aligned one-to-one.

from sklearn.model_selection import train_test_split
train_inputs, test_inputs, train_labels, test_labels = train_test_split(input_tokens, total_targets, random_state=666, test_size=0.2)
train_masks, test_masks, _, _ = train_test_split(attention_tokens, input_tokens, random_state=666, test_size=0.2)
print(train_inputs.shape, test_inputs.shape)      # torch.Size([8000, 128]) torch.Size([2000, 128])
print(train_masks.shape)                          # torch.Size([8000, 128]), same shape as train_inputs

print(train_inputs[0])
print(train_masks[0])

tensor([ 101, 2769, 6370, 4638, 3221, 10189, 1039, 4638, 117, 852, 2769, 6230, 2533, 8821, 1039, 4638, 7599, 3419, 3291, 1962, 671, 763, 117, 3300, 671, 2476, 1377, 809, 1288, 1309, 4638, 3763, 1355, 119, 2456, 6379, 1920, 2157, 6370, 3249, 6858, 7313, 106, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

tensor([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])
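One way to avoid having to keep two split calls in sync: train_test_split accepts any number of arrays and splits them all with the same shuffle. Below is a small sketch with toy lists standing in for input_tokens, attention_tokens and total_targets (the toy data is invented for illustration).

```python
from sklearn.model_selection import train_test_split

# Toy stand-ins: 10 "sentences", each row tagged with its own index so
# that alignment is easy to check.
inputs = [[i, i, i] for i in range(10)]
masks = [[i, 0, 0] for i in range(10)]
labels = [i for i in range(10)]

(train_inputs, test_inputs,
 train_masks, test_masks,
 train_labels, test_labels) = train_test_split(
    inputs, masks, labels, random_state=666, test_size=0.2)

# Every training row still lines up with its own mask and label.
for x, m, y in zip(train_inputs, train_masks, train_labels):
    assert x[0] == m[0] == y

print(len(train_inputs), len(test_inputs))  # 8 2
```

With the real data, the equivalent call would pass input_tokens, attention_tokens and total_targets in a single train_test_split call.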

1.5 Creating DataLoaders to fetch one batch of data at a time

TensorDataset packs tensors together, much like zip in Python. It indexes every tensor along its first dimension, so all tensors must have the same size in that dimension, and every argument passed to TensorDataset must be a tensor.

RandomSampler samples the dataset in random order.

SequentialSampler samples the dataset in order.

train_data = TensorDataset(train_inputs, train_masks, train_labels)
train_sampler = RandomSampler(train_data)
train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=BATCH_SIZE)

test_data = TensorDataset(test_inputs, test_masks, test_labels)
test_sampler = SequentialSampler(test_data)
test_dataloader = DataLoader(test_data, sampler=test_sampler, batch_size=BATCH_SIZE)

Take a look at the contents of train_dataloader:

for i, (train, mask, label) in enumerate(train_dataloader):
    print(train.shape, mask.shape, label.shape)               # torch.Size([16, 128]) torch.Size([16, 128]) torch.Size([16, 1])
    break
print('len(train_dataloader)=', len(train_dataloader))        # 500

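2. Creating the model and optimizer

Create the model

model = BertForSequenceClassification.from_pretrained("bert-base-chinese", num_labels = 2)     # num_labels=2: two classes, positive and negative reviews
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

Define the optimizer

The eps parameter is a term added to the denominator of the update to improve numerical stability (1e-8 here, which is also the default in torch.optim.AdamW).

optimizer = AdamW(model.parameters(), lr = LEARNING_RATE, eps = EPSILON)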
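To see where eps enters, here is the AdamW update in its usual textbook form. The notation is mine rather than taken from any library source: g_t is the gradient, \hat{m}_t and \hat{v}_t are the bias-corrected first and second moment estimates, \eta is the learning rate and \lambda the weight decay.

```latex
\begin{aligned}
m_t &= \beta_1 m_{t-1} + (1-\beta_1)\, g_t \\
v_t &= \beta_2 v_{t-1} + (1-\beta_2)\, g_t^2 \\
\hat{m}_t &= \frac{m_t}{1-\beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1-\beta_2^t} \\
\theta_t &= \theta_{t-1} - \eta \left( \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} + \lambda\, \theta_{t-1} \right)
\end{aligned}
```

Without \epsilon, a parameter whose \hat{v}_t is close to zero would make the step divide by a value near zero.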

A more common way to write it: no weight decay is applied to the bias and LayerNorm.weight parameters

no_decay = ['bias', 'LayerNorm.weight']
optimizer_grouped_parameters = [
        {'params': [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)], 'weight_decay': WEIGHT_DECAY},
        {'params': [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)], 'weight_decay': 0.0}
]
optimizer = AdamW(optimizer_grouped_parameters, lr = LEARNING_RATE, eps = EPSILON)
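The grouping relies on plain substring matching against parameter names. Here is a small sketch of how that filter behaves, using a hand-picked sample of names written in the style of BertForSequenceClassification parameters (an illustrative sample, not the full list from the model).

```python
no_decay = ['bias', 'LayerNorm.weight']

# A few parameter names in the style used by BertForSequenceClassification
# (a hand-picked sample for illustration, not the full list).
names = [
    'bert.embeddings.word_embeddings.weight',
    'bert.embeddings.LayerNorm.weight',
    'bert.embeddings.LayerNorm.bias',
    'bert.encoder.layer.0.attention.self.query.weight',
    'bert.encoder.layer.0.attention.self.query.bias',
    'classifier.weight',
    'classifier.bias',
]

# Same filter as in the optimizer setup, applied to names only.
decayed = [n for n in names if not any(nd in n for nd in no_decay)]
not_decayed = [n for n in names if any(nd in n for nd in no_decay)]

print(decayed)       # the three ordinary weight matrices
print(not_decayed)   # every bias, plus LayerNorm.weight
```

Note that LayerNorm.bias needs no entry of its own, because it already matches 'bias'.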

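Learning-rate warmup: training starts from a small learning rate. Note that with num_warmup_steps = 0, as set below, there is no warmup phase; the learning rate simply decays linearly from LEARNING_RATE to 0 over the course of training.

epochs = 2
# Number of training steps: [number of batches] x [number of epochs].
total_steps = len(train_dataloader) * epochs

# Set up the learning rate scheduler.
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps = 0, num_training_steps = total_steps)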
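To make the shape of the schedule concrete, here is a plain-Python sketch of the multiplier that a linear schedule with warmup applies to the base learning rate. It reflects my understanding of the schedule, not the library's source code; linear_warmup_factor is a name invented for this example, and the warmup value of 100 is only there for comparison.

```python
def linear_warmup_factor(step, num_warmup_steps, num_training_steps):
    """Multiplier applied to the base learning rate at a given step:
    linear ramp from 0 to 1 during warmup, then linear decay to 0."""
    if step < num_warmup_steps:
        return step / max(1, num_warmup_steps)
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


total_steps = 1000   # 500 batches x 2 epochs, as in this post
base_lr = 2e-5

for warmup in (0, 100):
    lrs = [base_lr * linear_warmup_factor(s, warmup, total_steps)
           for s in (0, 50, 100, 550, 1000)]
    print(warmup, lrs)
```

With warmup set to 0 the first step already uses the full learning rate; with 100 warmup steps it starts at 0 and reaches the full rate at step 100.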

3. Training and evaluating the model

3.1 Model accuracy

def binary_acc(preds, labels):      # preds.shape=(16, 2) labels.shape=torch.Size([16, 1])
    correct = torch.eq(torch.max(preds, dim=1)[1], labels.flatten()).float()      # both arguments of eq have shape torch.Size([16])
    acc = correct.sum().item() / len(correct)
    return acc
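The labels.flatten() call matters: the labels arrive with shape (batch, 1), and comparing them directly against predictions of shape (batch,) would broadcast to a (batch, batch) matrix instead of an element-wise comparison. A small sketch with made-up scores and labels:

```python
import torch

preds = torch.tensor([[2.0, -1.0],    # predicted class 0
                      [-0.5, 0.5],    # predicted class 1
                      [1.5, 0.3],     # predicted class 0
                      [-2.0, 3.0]])   # predicted class 1
labels = torch.tensor([[0], [1], [1], [1]])       # shape (4, 1), like the labels from the DataLoader

pred_classes = preds.argmax(dim=1)                # same indices as torch.max(preds, dim=1)[1]

good = torch.eq(pred_classes, labels.flatten())   # shape (4,)
bad = torch.eq(pred_classes, labels)              # (4,) vs (4, 1) broadcasts to (4, 4)

print(good.shape, bad.shape)
print(good.float().mean().item())                 # 0.75
```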

3.2 Timing the model run

import time
import datetime
def format_time(elapsed):
    elapsed_rounded = int(round((elapsed)))
    return str(datetime.timedelta(seconds=elapsed_rounded))   # returns the time as hh:mm:ss

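3.3 Training the model

Arguments passed to the model must be tensors.

nn.utils.clip_grad_norm_(parameters, max_norm, norm_type=2) clips the gradients so that they cannot explode during training.

Its inputs are (the network's parameters, the maximum gradient norm, the norm type = 2); the default is the L2 norm.

Tip: this method is only used during training, not during testing.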
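Norm clipping rescales all gradients together rather than clamping each value on its own. The following plain-Python sketch shows the idea on a flat list of numbers. clip_by_total_norm is a name invented for this example, and the small eps in the denominator is my assumption about how the division is kept safe, not a quote from the PyTorch source.

```python
import math

def clip_by_total_norm(grads, max_norm, eps=1e-6):
    """Rescale a flat list of gradient values so their L2 norm is at most max_norm.

    All values are multiplied by the same factor, so the direction of the
    gradient is preserved; individual entries are not clamped.
    """
    total_norm = math.sqrt(sum(g * g for g in grads))
    scale = max_norm / (total_norm + eps)
    if scale < 1.0:
        grads = [g * scale for g in grads]
    return grads, total_norm


clipped, norm_before = clip_by_total_norm([3.0, 4.0], max_norm=1.0)
print(norm_before)   # 5.0
print(clipped)       # roughly [0.6, 0.8]
```

Gradients whose total norm is already below max_norm are returned unchanged.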

def train(model, optimizer):
    t0 = time.time()
    avg_loss, avg_acc = [],[]

    model.train()
    for step, batch in enumerate(train_dataloader):

        # Print the elapsed time every 40 batches.
        if step % 40 == 0 and not step == 0:
            elapsed = format_time(time.time() - t0)
            print('  Batch {:>5,}  of  {:>5,}.    Elapsed: {:}.'.format(step, len(train_dataloader), elapsed))

        b_input_ids, b_input_mask, b_labels = batch[0].long().to(device), batch[1].long().to(device), batch[2].long().to(device)

        output = model(b_input_ids, token_type_ids=None, attention_mask=b_input_mask, labels=b_labels)
        loss, logits = output[0], output[1]

        avg_loss.append(loss.item())

        acc = binary_acc(logits, b_labels)
        avg_acc.append(acc)

        optimizer.zero_grad()
        loss.backward()
        clip_grad_norm_(model.parameters(), 1.0)      # rescale the gradients so their total norm is at most 1.0, to prevent exploding gradients
        optimizer.step()              # update the model parameters
        scheduler.step()              # update the learning rate

    avg_acc = np.array(avg_acc).mean()
    avg_loss = np.array(avg_loss).mean()
    return avg_loss, avg_acc

Here output has the following form (a tuple: element 0 is the loss, and element 1 holds the logits, i.e. unnormalized scores, for the negative and positive class of each sample in the batch):

(tensor(0.0210, device='cuda:0', grad_fn=<NllLossBackward>), 
tensor([[-2.9815,  2.6931],
        [-3.2380,  3.1935],
        [-3.0775,  3.0713],
        [ 3.0191, -2.3689],
        [ 3.1146, -2.7957],
        [ 3.7798, -2.7410],
        [-0.3273,  0.8227],
        [ 2.5012, -1.5535],
        [-3.0231,  3.0162],
        [ 3.4146, -2.5582],
        [ 3.3104, -2.2134],
        [ 3.3776, -2.5190],
        [-2.6513,  2.5108],
        [-3.3691,  2.9516],
        [ 3.2397, -2.0473],
        [-2.8622,  2.7395]], device='cuda:0', grad_fn=<AddmmBackward>))
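The values in the second element are raw scores (logits), not probabilities. A softmax over each row turns them into probabilities. Here is a minimal plain-Python sketch applied to two rows copied from the printout above; the softmax helper is written out by hand for illustration.

```python
import math

def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    m = max(logits)                           # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Two rows copied from the logits printed above: [negative, positive]
for row in ([-2.9815, 2.6931], [-0.3273, 0.8227]):
    probs = softmax(row)
    print(row, '->', [round(p, 4) for p in probs])
```

Because softmax preserves the ordering of the scores, taking the argmax of the logits gives the same predicted class as taking the argmax of the probabilities.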

3.4 Evaluating the model

No labels are passed in when the model is called here.

def evaluate(model):
    avg_acc = []
    model.eval()         # switch to evaluation mode

    with torch.no_grad():
        for batch in test_dataloader:
            b_input_ids, b_input_mask, b_labels = batch[0].long().to(device), batch[1].long().to(device), batch[2].long().to(device)

            output = model(b_input_ids, token_type_ids=None, attention_mask=b_input_mask)

            acc = binary_acc(output[0], b_labels)
            avg_acc.append(acc)
    avg_acc = np.array(avg_acc).mean()
    return avg_acc

Here output has the following form (a tuple whose element 0 holds the logits for the negative and positive class of each sample in the batch):

(tensor([[ 3.8217, -2.7516],
        [ 2.7585, -2.0853],
        [-2.9317,  2.9092],
        [-3.3724,  3.2597],
        [-2.8692,  2.6741],
        [-3.2784,  2.9276],
        [ 3.4946, -2.8895],
        [ 3.7855, -2.8623],
        [-2.2249,  2.4336],
        [-2.4257,  2.4606],
        [ 3.3996, -2.5760],
        [-3.1986,  3.0841],
        [ 3.6883, -2.9492],
        [ 3.2883, -2.3600],
        [ 2.6723, -2.0778],
        [-3.1868,  3.1106]], device='cuda:0'),)

3.5 Running training and evaluation

for epoch in range(epochs):

    train_loss, train_acc = train(model, optimizer)
    print('epoch={}, training accuracy={}, loss={}'.format(epoch, train_acc, train_loss))
    test_acc = evaluate(model)
    print("epoch={}, test accuracy={}".format(epoch, test_acc))

The results are as follows:

  Batch    40  of    500.    Elapsed: 0:00:14.
  Batch    80  of    500.    Elapsed: 0:00:28.
  Batch   120  of    500.    Elapsed: 0:00:42.
  Batch   160  of    500.    Elapsed: 0:00:57.
  Batch   200  of    500.    Elapsed: 0:01:12.
  Batch   240  of    500.    Elapsed: 0:01:26.
  Batch   280  of    500.    Elapsed: 0:01:41.
  Batch   320  of    500.    Elapsed: 0:01:56.
  Batch   360  of    500.    Elapsed: 0:02:11.
  Batch   400  of    500.    Elapsed: 0:02:26.
  Batch   440  of    500.    Elapsed: 0:02:42.
  Batch   480  of    500.    Elapsed: 0:02:57.
epoch=0, training accuracy=0.9015, loss=0.2549531048182398
epoch=0, test accuracy=0.9285
  Batch    40  of    500.    Elapsed: 0:00:16.
  Batch    80  of    500.    Elapsed: 0:00:31.
  Batch   120  of    500.    Elapsed: 0:00:47.
  Batch   160  of    500.    Elapsed: 0:01:03.
  Batch   200  of    500.    Elapsed: 0:01:18.
  Batch   240  of    500.    Elapsed: 0:01:34.
  Batch   280  of    500.    Elapsed: 0:01:50.
  Batch   320  of    500.    Elapsed: 0:02:06.
  Batch   360  of    500.    Elapsed: 0:02:22.
  Batch   400  of    500.    Elapsed: 0:02:37.
  Batch   440  of    500.    Elapsed: 0:02:53.
  Batch   480  of    500.    Elapsed: 0:03:09.
epoch=1, training accuracy=0.9595, loss=0.14357946291333065
epoch=1, test accuracy=0.939

4. Prediction

def predict(sen):

    model.eval()                                                       # make sure dropout is disabled
    input_id = convert_text_to_token(tokenizer, sen)
    input_token =  torch.tensor(input_id).long().to(device)            # torch.Size([128])

    atten_mask = [float(i>0) for i in input_id]
    attention_token = torch.tensor(atten_mask).long().to(device)       # torch.Size([128])

    output = model(input_token.view(1, -1), token_type_ids=None, attention_mask=attention_token.view(1, -1))     # torch.Size([128]) -> torch.Size([1, 128]), otherwise an error is raised
    print(output[0])

    return torch.max(output[0], dim=1)[1]

# "The hotel is hard to find, the surroundings are not great, the soundproofing is poor; I won't come again."
label = predict('酒店位置难找，环境不太好，隔音差，下次不会再来的。')
print('positive' if label==1 else 'negative')

# "The hotel is okay, the reception staff are friendly, it is clean and fairly spacious; the one drawback is that there is no window."
label = predict('酒店还可以，接待人员很热情，卫生合格，空间也比较大，不足的地方就是没有窗户')
print('positive' if label==1 else 'negative')

# "Nothing about the service was less than attentive, and there was no detail they had not thought of."
label = predict('"服务各方面没有不周到的地方, 各方面没有没想到的细节"')
print('positive' if label==1 else 'negative')

tensor([[ 3.5719, -2.7315]], device='cuda:0', grad_fn=<AddmmBackward>)

negative

tensor([[-2.7998, 2.8675]], device='cuda:0', grad_fn=<AddmmBackward>)

positive

tensor([[-1.9614, 1.5925]], device='cuda:0', grad_fn=<AddmmBackward>)

positive

Performance is decent: even the third sentence, with its somewhat odd double negatives, is classified correctly.

Reference: https://blog.csdn.net/Code_Tookie/article/details/104944888?utm_medium=distribute.pc_relevant.none-task-blog-BlogCommendFromMachineLearnPai2-2.channel_param&depth_1-utm_source=distribute.pc_relevant.none-task-blog-BlogCommendFromMachineLearnPai2-2.channel_param