A Detailed Summary of the Text Classification Workflow (Keras)

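I. Background

Deep learning work requires preprocessing and data conversion before a model can be trained. These notes record the steps and methods for later use and lookup. Following the order of the workflow, they cover the basics: processing the dataset, converting labels, vectorizing text, building the model, and adding evaluation metrics.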

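II. Contents

2.1 Reading the data

Data is usually read directly with pandas. The main thing to watch for is the encoding: files often turn out to be in an encoding that cannot be decoded. The notes below summarize the common cases and how to handle them so the problem can be located quickly; a Zhihu article on the differences between the encodings is a useful companion reference.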

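1. Encoding issues
	The most common encoding is UTF-8, but some files cannot be decoded with it.
	Character set coverage: GB2312 < GBK < GB18030; see the encoding comparison for details.
2. Handling special cases
	Open the file in Notepad++ to check its byte encoding, then read it with that encoding.
These steps resolve most character-encoding problems.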
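
As a minimal sketch of the fallback idea, the function below tries a list of candidate encodings in order and returns the first one that decodes the raw bytes. The name `decode_with_fallback` and the candidate list are illustrative assumptions, not part of the original notes. The encoding it reports can then be passed to pandas through the `encoding` argument of `read_csv`.

```python
def decode_with_fallback(raw: bytes, encodings=("utf-8", "gb18030")):
    """Return (text, encoding) for the first candidate encoding that decodes raw."""
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            # This candidate does not fit the bytes; try the next one
            continue
    raise ValueError("none of the candidate encodings could decode the data")


# Bytes as a GBK-writing tool would produce them: UTF-8 fails, GB18030 succeeds
sample = "文本分类".encode("gbk")
text, used = decode_with_fallback(sample)
print(used, text)
```

Putting the widest Chinese character set (GB18030) last keeps UTF-8 as the preferred choice while still covering GB2312 and GBK files.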

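2.2 Cleaning the data

Data cleaning is usually done with the apply method in pandas. The function below removes unwanted characters (modeled on the filtering in the Tokenizer source code); other cleaning steps can be added directly to the same function.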

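def data_detail(text: str) -> str:
    # Same character set as the Tokenizer default filters: punctuation plus tab and newline
    filters = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'
    for ch in filters:
        # str.replace returns a new string, so the result must be assigned back
        text = text.replace(ch, "")
    return text

# pf is the DataFrame read in section 2.1
pf.text = pf.text.apply(data_detail)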
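
The same cleaning can be done in a single pass with a translation table instead of one `replace` call per character. This is only a sketch of an alternative; `FILTERS`, `DELETE_TABLE` and `clean_text` are illustrative names that do not appear in the original notes.

```python
# Same character set as the default filters of the Keras Tokenizer
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'
# Translation table that maps every filtered character to None, i.e. deletes it
DELETE_TABLE = str.maketrans("", "", FILTERS)


def clean_text(text: str) -> str:
    """Remove all filtered characters in a single pass over the string."""
    return text.translate(DELETE_TABLE)


print(clean_text("Hello, world! (Keras) text-classification.\n"))
```

With pandas it would be applied the same way as above, for example `pf.text.apply(clean_text)`.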

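2.3 Text vectorization

Text data must be vectorized before training; image data and other numeric data skip this step and can be fed to the model directly. Vectorization means fitting a text converter and then serializing each text into the corresponding sequence of ids. This can be done with the built-in tools, or by building a dictionary yourself and converting the result to arrays with numpy, after which the data is ready for training.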

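from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Build a converter that keeps the MAX_NUM_WORDS most frequent words
tokenizer = Tokenizer(num_words=MAX_NUM_WORDS)
# Fit it on the corpus
tokenizer.fit_on_texts(text_c)
# Convert the texts to id sequences
# By default, words not seen during fitting get no id and are silently dropped. To keep them, pass oov_token when creating the Tokenizer; unseen words are then replaced by that token.
sequences = tokenizer.texts_to_sequences(texts)
# Pad to a uniform length. "post" pads at the end (the default pads at the front); the fill value defaults to 0.
# maxlen is the sequence length (MAX_LEN, 400 in section 2.6), not the vocabulary size
data = pad_sequences(sequences, maxlen=MAX_LEN, padding="post")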
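
To make the "build your own dictionary" route concrete, here is a rough pure-Python sketch of the three steps: counting words, mapping them to ids, and padding. It is not the Keras implementation; the helper names, whitespace tokenization, and truncation at the end of over-long sequences are all assumptions made for illustration.

```python
from collections import Counter


def build_vocab(texts, num_words=None):
    """Assign ids starting at 1 to the most frequent words (0 is reserved for padding)."""
    counts = Counter(word for text in texts for word in text.split())
    return {word: idx for idx, (word, _) in enumerate(counts.most_common(num_words), start=1)}


def texts_to_ids(texts, vocab):
    """Replace each known word by its id; unknown words are dropped."""
    return [[vocab[w] for w in text.split() if w in vocab] for text in texts]


def pad_post(sequences, maxlen, value=0):
    """Cut each sequence to maxlen and fill the tail with value."""
    return [seq[:maxlen] + [value] * (maxlen - len(seq)) for seq in sequences]


vocab = build_vocab(["a b a", "b c"])
ids = texts_to_ids(["a c d"], vocab)
padded = pad_post(ids, maxlen=4)
print(vocab, ids, padded)
```

The padded lists can be turned into a training array with `numpy.array(padded, dtype="int32")`.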

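2.4 One-hot encoding the labels

Labels must also be encoded for training, so they are converted to one-hot vectors. There are two cases. First, if the labels are already numeric and do not need to be renumbered, to_categorical can be applied directly. Second, if the labels are non-numeric, or must be renumbered as consecutive ids (LabelEncoder numbers them from 0), fit a LabelEncoder() to map them to integer ids and then apply to_categorical. The same encoder can later convert ids back to the original labels. Usage is shown below: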

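from tensorflow.keras.utils import to_categorical

# Note: this works the same way for multi-class labels; to_categorical handles them directly.
"""
Numeric labels
label = to_categorical(x)
"""
"""
String labels, or labels that must be consecutive
"""
# Label encoder
from sklearn import preprocessing
# Create the label encoder
label_coder = preprocessing.LabelEncoder()
# Fit
label_coder.fit(labels)
# Convert to integer ids
result_tr = label_coder.transform(labels)
# one-hot
labels = to_categorical(result_tr)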
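
What the encoder and `to_categorical` do together can be sketched in plain Python. The helpers below are hypothetical stand-ins written for illustration, not the scikit-learn or Keras code; the one assumption carried over from LabelEncoder is that classes are numbered in sorted order.

```python
def fit_label_index(labels):
    """Map each distinct label to an integer id, in sorted order."""
    classes = sorted(set(labels))
    return classes, {label: idx for idx, label in enumerate(classes)}


def one_hot(ids, num_classes):
    """Turn each id into a vector with a single 1.0 at position id."""
    return [[1.0 if j == i else 0.0 for j in range(num_classes)] for i in ids]


def decode(rows, classes):
    """Invert the encoding: position of the largest value -> original label."""
    return [classes[row.index(max(row))] for row in rows]


raw_labels = ["sports", "finance", "sports", "tech"]
classes, index = fit_label_index(raw_labels)
ids = [index[label] for label in raw_labels]
encoded = one_hot(ids, len(classes))
print(classes, ids, decode(encoded, classes))
```

The `decode` step mirrors what happens at prediction time: take the argmax of each output row and map it back through the fitted classes.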

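2.5 Splitting the data

Before training, the data needs to be split and, optionally, reshuffled. Process both the labels and the data first, then split or reshuffle.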

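# Imports for splitting and reshuffling
import random
import numpy as np
from sklearn.model_selection import train_test_split

# Split the data
# The split returns four parts: training data, test data, training labels, test labels. random_state fixes the shuffle, so every run produces the same split and results stay comparable.
train_x, test_x, train_y, test_y = train_test_split(data, labels, random_state=42)

# Reshuffle
# Reshuffle the training set so that sample order does not influence the result.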
def none_shuffle(x_1, y_1, state_random=42):
    _array = list(zip(x_1, y_1))
    random.seed(state_random)
    random.shuffle(_array)
    x_, y_ = zip(*_array)
    return np.array(x_, dtype="int32"), np.array(y_, dtype="float32")
train_x, train_y = none_shuffle(train_x, train_y)
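
The effect of a fixed seed can be seen in a small standalone sketch. It is not scikit-learn's algorithm: `split_indices` is a hypothetical helper that shuffles index positions with a seeded generator and cuts them at a given ratio, which is enough to show why the same seed always yields the same split.

```python
import random


def split_indices(n_samples, test_size=0.25, seed=42):
    """Shuffle the indices 0..n_samples-1 with a fixed seed and cut them into train/test parts."""
    indices = list(range(n_samples))
    # A private Random instance keeps the global random state untouched
    random.Random(seed).shuffle(indices)
    n_test = int(round(n_samples * test_size))
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = split_indices(8)
print(train_idx, test_idx)
```

Indexing the data and the labels with the same index lists keeps every sample paired with its label.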

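2.6 Building the model

With the data prepared, the next step is building the model. There are two common ways. The first uses Sequential and stacks layers with add. The second uses Model, specifying the inputs (via Input) and the outputs explicitly. The differences: a Sequential model offers a prediction method that returns classes directly; it is simple but less extensible, and cannot express multi-model or otherwise complex architectures. With Model, the position of the largest output value has to be found with np.argmax, but layers are wired one by one, so it extends well to stacked and complex models.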

"""
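When using Keras, pay attention to how it is imported: objects imported from keras and from tensorflow.keras cannot be mixed, and doing so sometimes fails with an error saying a layer is not recognized.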
"""
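from tensorflow.keras.models import Sequential, Model
from tensorflow.keras.layers import Input, Embedding, Bidirectional, LSTM, Flatten, Dropout, Dense

# Option 1: build the model with Sequential
model = Sequential()
# The first argument of Embedding is the vocabulary size (input_dim); it must exceed the largest token id
model.add(Embedding(400, 256))
model.add(Bidirectional(LSTM(256, return_sequences=True)))
# FM is not a built-in Keras layer and Attention is used here with a single input,
# so both are presumably custom layers defined elsewhere
model.add(Attention())
model.add(FM(256))
model.add(Flatten())
model.add(Dropout(0.2))
model.add(Dense(len(label_coder.classes_), activation='softmax'))

# Option 2: build the model with Model
MAX_LEN = 400
input_layer = Input(shape=(MAX_LEN,))
layer = Embedding(input_dim=len(id2char), output_dim=256)(input_layer)
layer = Bidirectional(LSTM(256, return_sequences=True))(layer)
layer = Flatten()(layer)
output_layer = Dense(len(df_train['label'].unique()), activation='softmax')(layer)
model = Model(inputs=input_layer, outputs=output_layer)
model.summary()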

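2.7 Training the model

Training requires choosing what to optimize and what to monitor, including the loss function and the evaluation metrics.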

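# Choosing the loss function
# categorical_crossentropy: multi-class classification, paired with a softmax activation
# binary_crossentropy: binary classification, paired with a sigmoid activation

# Compile the model. A model must be compiled before it can be trained; the settings below control and tune training.
# optimizer: the optimizer, given either as a string or as an instance from keras.optimizers.
# callbacks: passed to fit rather than compile; keras.callbacks provides ModelCheckpoint, EarlyStopping, TensorBoard, ReduceLROnPlateau and others for finer control over the training process.
# metrics: the metrics reported during training. accuracy is built in; the recall and precision metrics used here are custom functions defined below (recent tf.keras versions also ship Precision and Recall in keras.metrics).
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy', recall_threshold(0.5), precision_threshold(0.5)])

# For multi-input models, pass the inputs together as one list, in the order the model's inputs were declared. validation_data specifies the validation set.
# Single input
history = model.fit(x_train, y_train,
                    batch_size=256,
                    epochs=100,
                    validation_data=(x_test, y_test),
                    verbose=1,
                    callbacks=callbacks)
# Multiple inputs
# The three inputs below correspond, in order, to the inputs argument of Model, which is also a list. The other arguments are the same as for single-input training.
history = model.fit([x_train, wordnet_train, kg_train], y_train)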
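
The pairing of loss and activation is easier to see with numbers. The following is a small arithmetic sketch in plain Python, not the Keras implementation; the function names mirror the Keras loss names only for readability, and the `eps` clipping value is an assumption.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def categorical_crossentropy(y_true, y_prob, eps=1e-7):
    """-sum(y * log(p)) over the classes of one sample; y_true is one-hot."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_true, y_prob))


def binary_crossentropy(y_true, p, eps=1e-7):
    """-(y * log(p) + (1 - y) * log(1 - p)) for one sigmoid output p."""
    p = min(max(p, eps), 1 - eps)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))


probs = softmax([2.0, 1.0, 0.1])
print(probs)
print(categorical_crossentropy([1, 0, 0], probs))  # true class has the highest score
print(categorical_crossentropy([0, 0, 1], probs))  # true class has the lowest score
```

Because the one-hot label zeroes out every term except the true class, the loss is just the negative log of the probability the softmax assigns to that class.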

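The custom functions for metrics follow: F1_macro (which, as its own comment notes, computes the Matthews correlation), recall, and precision.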

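from tensorflow.keras import backend as K

def F1_macro(y_true, y_pred):
    # Despite the name, this computes the Matthews correlation coefficient over the batch, not F1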
    y_pred_pos = K.round(K.clip(y_pred, 0, 1))
    y_pred_neg = 1 - y_pred_pos
    y_pos = K.round(K.clip(y_true, 0, 1))
    y_neg = 1 - y_pos
    tp = K.sum(y_pos * y_pred_pos)
    tn = K.sum(y_neg * y_pred_neg)
    fp = K.sum(y_neg * y_pred_pos)
    fn = K.sum(y_pos * y_pred_neg)
    numerator = (tp * tn - fp * fn)
    denominator = K.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return numerator / (denominator + K.epsilon())
    
def recall_threshold(threshold = 0.5):
    def recall(y_true, y_pred):
        """Recall metric.
        Computes the recall over the whole batch using threshold_value.
        """
        threshold_value = threshold
        # Adaptation of the "round()" used before to get the predictions. Clipping to make sure that the predicted raw values are between 0 and 1.
        y_pred = K.cast(K.greater(K.clip(y_pred, 0, 1), threshold_value), K.floatx())
        # Compute the number of true positives. Rounding in prevention to make sure we have an integer.
        true_positives = K.round(K.sum(K.clip(y_true * y_pred, 0, 1)))
        # Compute the number of positive targets.
        possible_positives = K.sum(K.clip(y_true, 0, 1))
        recall_ratio = true_positives / (possible_positives + K.epsilon())
        return recall_ratio
    return recall

def precision_threshold(threshold=0.5):
    def precision(y_true, y_pred):
        """Precision metric.
        Computes the precision over the whole batch using threshold_value.
        """
        threshold_value = threshold
        # Adaptation of the "round()" used before to get the predictions. Clipping to make sure that the predicted raw values are between 0 and 1.
        y_pred = K.cast(K.greater(K.clip(y_pred, 0, 1), threshold_value), K.floatx())
        # Compute the number of true positives. Rounding in prevention to make sure we have an integer.
        true_positives = K.round(K.sum(K.clip(y_true * y_pred, 0, 1)))
        # count the predicted positives
        predicted_positives = K.sum(y_pred)
        # Get the precision ratio
        precision_ratio = true_positives / (predicted_positives + K.epsilon())
        return precision_ratio
    return precision
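
To check what the thresholded metrics above report, the same counts can be worked through by hand on a tiny batch. This is a plain-Python sketch for a single binary output, with hypothetical helper names; it also shows an actual F1 computed from precision and recall.

```python
def binary_counts(y_true, y_prob, threshold=0.5):
    """Count true positives, false positives and false negatives after thresholding."""
    y_pred = [1 if p > threshold else 0 for p in y_prob]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, fn


def precision_recall_f1(y_true, y_prob, threshold=0.5):
    tp, fp, fn = binary_counts(y_true, y_prob, threshold)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


y_true = [1, 0, 1, 1, 0]
y_prob = [0.9, 0.6, 0.4, 0.8, 0.1]
print(binary_counts(y_true, y_prob))
print(precision_recall_f1(y_true, y_prob))
```

Raising the threshold usually trades recall for precision, which is why the Keras versions above take the threshold as a parameter.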

The callbacks are defined as follows.

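import time
from tensorflow.keras.callbacks import ModelCheckpoint, EarlyStopping, ReduceLROnPlateau

# main_model_dir is assumed to be the root output directory, defined elsewhere
model_dir = main_model_dir + time.strftime('%Y-%m-%d %H-%M-%S') + "/"
# Newer Keras versions log the metric as val_accuracy rather than val_acc; use the key that appears in history.history
model_file = model_dir + "{epoch:02d}-val_acc-{val_acc:.2f}-val_loss-{val_loss:.2f}.hdf5"
# Save the best model
checkpoint = ModelCheckpoint(
    model_file,
    monitor='val_acc',
    save_best_only=True)

# Stop early
early_stopping = EarlyStopping(
    monitor='val_loss',
    patience=5,
    verbose=1,
    restore_best_weights=True)

# Reduce the learning rate
reduce_lr = ReduceLROnPlateau(
    monitor='val_loss',
    factor=0.5,
    patience=5,
    verbose=1)

callbacks = [checkpoint, reduce_lr, early_stopping]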
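
The role of `patience` can be illustrated with a simplified stand-in for the early-stopping rule: stop once the monitored loss has failed to improve for `patience` consecutive epochs. The function below is a hypothetical sketch, not the Keras callback; it ignores options such as `min_delta` and works on a recorded list of validation losses.

```python
def early_stop_epoch(val_losses, patience):
    """Return the 0-based epoch at which training would stop, or None if it never stops."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            # Improvement: remember it and reset the waiting counter
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None


print(early_stop_epoch([1.0, 0.8, 0.9, 0.85, 0.95], patience=2))
print(early_stop_epoch([1.0, 0.9, 0.8, 0.7], patience=2))
```

In the first call the best loss (0.8) is reached at epoch 1 and the next two epochs do not beat it, so training stops at epoch 3; in the second the loss keeps improving and the rule never fires.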

2.8 Prediction

Prediction likewise comes in two forms, one for Sequential and one for Model. For multi-input models, pass the test inputs together as a single list.

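# Sequential
"""
A Sequential model can call predict_classes, which directly returns the predicted class ids, i.e. the encoded labels.
predict_classes was removed in TensorFlow 2.6; on newer versions use the argmax approach shown for Model.
"""
y_pre = model.predict_classes(test_x)

# Model
"""
predict returns an n-dimensional vector for each sample; argmax then gives the position of the largest value, which is the predicted label id.
"""
predict = model.predict(x_test, verbose=1, batch_size=40)
sc = np.argmax(predict, axis=1)

# Map the ids back to the original labels with the label encoder.
show = label_coder.inverse_transform(y_pre)
print(show[:5])

# Multiple inputs; converting the result works the same way as for a single input
scores = model.predict([x_test, wordnet_test, kg_test], verbose=1, batch_size=40)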

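2.9 Analyzing the results

Analysis of classification results mainly covers evaluating the results, analyzing the training curves, and producing a classification report.

Evaluating the results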

# Evaluation
"""
First check which values evaluate returns, then pass in the test set to get the overall results.
"""
print(model.metrics_names)
# Single input
score, acc, recall, precision = model.evaluate(x_test, y_test, batch_size=128)
print("\nTest loss score: %.4f, accuracy: %.4f, recall: %.4f, precision: %.4f" % (score, acc, recall, precision))

# Multiple inputs
"""
As with training and prediction, pass the test inputs together as a single list.
"""
score = model.evaluate([x_test, wordnet_test, kg_test], y_test, batch_size=BATCH_SIZE)
print("ACCURACY:", score[1])
print("LOSS:", score[0])

Analyzing the training curves
The training results are kept in history, from which the training curves can be plotted and then analyzed.

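# List the keys recorded in history
print(history.history.keys())
# Plot the curves
"""
loss and val_loss should decrease, meaning the error is shrinking
acc and val_acc should increase, meaning the results are improving
Other patterns need to be analyzed case by case
"""
plot_performance(history=history)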

Plotting code

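import matplotlib.pyplot as plt
# date_time(1), used in the plot titles, is assumed to be a helper defined elsewhere that returns a timestamp string
def plot_performance(history=None, figure_directory=None, ylim_pad=(0, 0)):
    xlabel = 'Epoch'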
    legends = ['Training', 'Validation']
    plt.figure(figsize=(20, 5))
    y1 = history.history['acc']
    y2 = history.history['val_acc']
    min_y = min(min(y1), min(y2))-ylim_pad[0]
    max_y = max(max(y1), max(y2))+ylim_pad[0]
    plt.subplot(121)
    plt.plot(y1)
    plt.plot(y2)
    plt.title('Model Accuracy\n' + date_time(1), fontsize=17)
    plt.xlabel(xlabel, fontsize=15)
    plt.ylabel('Accuracy', fontsize=15)
    plt.ylim(min_y, max_y)
    plt.legend(legends, loc='upper left')
    plt.grid()
    y1 = history.history['loss']
    y2 = history.history['val_loss']
    min_y = min(min(y1), min(y2))-ylim_pad[1]
    max_y = max(max(y1), max(y2))+ylim_pad[1]
    plt.subplot(122)
    plt.plot(y1)
    plt.plot(y2)
    plt.title('Model Loss\n' + date_time(1), fontsize=17)
    plt.xlabel(xlabel, fontsize=15)
    plt.ylabel('Loss', fontsize=15)
    plt.ylim(min_y, max_y)
    plt.legend(legends, loc='upper left')
    plt.grid()
    if figure_directory:
        plt.savefig(figure_directory+"/history")
    plt.show()

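Classification report
The evaluation above gives overall numbers; the classification report shows the precision of each individual class along with other per-class details.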

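from sklearn.metrics import f1_score, accuracy_score, confusion_matrix, classification_report, recall_score, precision_score
# x_ture_t is assumed to hold the true class ids of the test set; sc holds the predicted class ids from section 2.8
re_sult = classification_report(x_ture_t, sc, zero_division=1)
print(re_sult)

# average selects how the per-class scores are combined (micro, macro, binary, ...); see the scikit-learn source for details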
print(f"recall_score:{recall_score(x_ture_t, sc, average='macro')}")
print(f"f1_score:{f1_score(x_ture_t, sc, average='macro')}")
print(f"accuracy_score:{accuracy_score(x_ture_t, sc)}")
print(f"precision_score:{precision_score(x_ture_t, sc, average='macro')}")
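
The difference between the averaging modes is easiest to see on a small example. The sketch below computes recall per class and then combines it in two ways; the helper names are made up for illustration, and unlike scikit-learn it only considers classes that occur in the true labels.

```python
from collections import Counter


def per_class_recall(y_true, y_pred):
    """Recall of each class: correctly predicted samples / samples of that class."""
    total = Counter(y_true)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {c: correct[c] / total[c] for c in sorted(total)}


def macro_recall(y_true, y_pred):
    """Unweighted mean of the per-class recalls: every class counts equally."""
    recalls = per_class_recall(y_true, y_pred)
    return sum(recalls.values()) / len(recalls)


def micro_recall(y_true, y_pred):
    """Pool all samples first; for single-label classification this equals accuracy."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 1, 2]
print(per_class_recall(y_true, y_pred))
print(macro_recall(y_true, y_pred), micro_recall(y_true, y_pred))
```

Macro averaging gives small classes the same weight as large ones, so it is the more revealing number when the classes are imbalanced.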

2.10 Saving the model

A model can be saved in two ways: the full model, or the weights only.

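# Full model: architecture, weights and optimizer state
model.save("xxx.h5")
# Weights only; use a different file name, otherwise this overwrites the file saved above
model.save_weights("xxx.h5")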

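III. Summary

The main purpose of these notes is easy lookup and reinforcing what was learned. The methods are general-purpose; keeping them in electronic form makes them easy to find and reuse directly later.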