ML -- Text Data Processing

Natural Language Processing (NLP) has long been one of the important branches of artificial intelligence; it studies how humans and computers can communicate effectively in natural language. This article covers one of its fundamentals: how to process text data.

The main topics are:

  • Feature extraction from text data
  • Word segmentation methods for Chinese text
  • Optimizing text data with the n-Gram model
  • Improving feature extraction with the tf-idf model
  • Removing stop words (Stopwords)

I. Text feature extraction, Chinese word segmentation, and the bag-of-words model

1. Extracting text features with CountVectorizer

Data features can be roughly divided into two kinds: continuous features, which represent numeric values, and categorical features, which indicate the class a sample belongs to. In natural language processing, however, we deal with a third type of data: text.

Text data is usually stored in the computer as strings (String). Unlike English, Chinese has no spaces between words to serve as boundaries, so we have to segment Chinese text into words before we can process it.

Take the English sentence "The quick brown fox jumps over a lazy dog"; its Chinese translation is "那只敏捷的棕色狐狸跳过了一只懒惰的狗".

from sklearn.feature_extraction.text import CountVectorizer

vect=CountVectorizer()

# Fit CountVectorizer on the text data
en=['The quick brown fox jumps over a lazy dog']
vect.fit(en)

print('Number of words: {}'.format(len(vect.vocabulary_)))
print('Vocabulary: {}'.format(vect.vocabulary_))
Number of words: 8
Vocabulary: {'the': 7, 'quick': 6, 'brown': 0, 'fox': 2, 'jumps': 3, 'over': 5, 'lazy': 4, 'dog': 1}

[Result analysis] The program did not count the article "a". Because "a" is only one letter long, CountVectorizer's default tokenizer, which only matches words of two or more characters, does not treat it as a word.
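
If single-letter words like "a" matter for your task, the default can be overridden through the token_pattern parameter. A minimal sketch (the regex is simply the one-or-more-characters variant of scikit-learn's default pattern; it is our own addition, not from the original text):

# Keep single-character tokens by loosening the default token pattern
vect_single=CountVectorizer(token_pattern=r'(?u)\b\w+\b')
vect_single.fit(en)

print(len(vect_single.vocabulary_))  # should now report 9 words, with 'a' included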

Now let's look at what happens with Chinese.

# Experiment with a Chinese sentence
cn=['那只敏捷的棕色狐狸跳过了一只懒惰的狗']

# Fit on the Chinese text data
vect.fit(cn)

print('Number of words: {}'.format(len(vect.vocabulary_)))
print('Vocabulary: {}'.format(vect.vocabulary_))
Number of words: 1
Vocabulary: {'那只敏捷的棕色狐狸跳过了一只懒惰的狗': 0}

[Result analysis] The program could not segment the Chinese sentence and treated the whole sentence as a single word. This is because, unlike English, where spaces act as natural delimiters between words, Chinese has none.

2. Segmenting Chinese text with a word segmentation tool

We use the jieba module to segment the Chinese sentence.

import jieba

cn=jieba.cut('那只敏捷的棕色狐狸跳过了一个只懒惰的狗')

# Use spaces as the boundaries between words
cn=[' '.join(cn)]

print(cn)
Building prefix dict from the default dictionary ...
Dumping model to file cache C:\Users\lenovo\AppData\Local\Temp\jieba.cache
Loading model cost 1.013 seconds.
Prefix dict has been built succesfully.


['那 只 敏捷 的 棕色 狐狸 跳过 了 一个 只 懒惰 的 狗']

With the help of the jieba module, we have segmented this Chinese sentence, inserting a space between words as a boundary.

Now we use CountVectorizer to extract features from it again.

# Vectorize the segmented Chinese text with CountVectorizer
vect.fit(cn)

print('Number of words: {}'.format(len(vect.vocabulary_)))
print('Vocabulary: {}'.format(vect.vocabulary_))
Number of words: 6
Vocabulary: {'敏捷': 2, '棕色': 3, '狐狸': 4, '跳过': 5, '一个': 0, '懒惰': 1}

3. Converting text data to an array with the bag-of-words model

CountVectorizer has encoded each word as an integer from 0 to 5. With that done, we can represent this text as a sparse matrix.

# Build the bag-of-words model
bag_of_words=vect.transform(cn)

print('Bag-of-words features:\n{}'.format(repr(bag_of_words)))
Bag-of-words features:
<1x6 sparse matrix of type '<class 'numpy.int64'>'
	with 6 stored elements in Compressed Sparse Row format>

[Result analysis] The original sentence has been converted into a 1-row, 6-column sparse matrix of 64-bit integers, holding 6 stored elements.

Let's see what those 6 elements are:

# Print the dense representation of the bag-of-words model
print('Dense bag-of-words representation:\n{}'.format(bag_of_words.toarray()))
Dense bag-of-words representation:
[[1 1 1 1 1 1]]

[Result analysis] These are the occurrence counts in this sentence of the 6 words produced by the segmentation tool. For example, the first element of the array is 1, meaning the word '一个' (mapped to index 0 in the vocabulary above) appears once in the sentence; the second element is 1, meaning '懒惰' also appears once.
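
To line each count up with its word, we can zip the feature names against the dense array (get_feature_names returns the words in column order; this small loop is our own addition, not from the original):

# Pair each feature name with its count, in column order
for word,count in zip(vect.get_feature_names(),bag_of_words.toarray()[0]):
    print(word,count)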

Now let's try a different sentence and see how the result changes, for example "懒惰的狐狸不如敏捷的狐狸敏捷,敏捷的狐狸不如懒惰的狐狸懒惰" ("The lazy fox is less agile than the agile fox, and the agile fox is less lazy than the lazy fox").

cn_1=jieba.cut('懒惰的狐狸不如敏捷的狐狸敏捷,敏捷的狐狸不如懒惰的狐狸懒惰')

# Separate the words with spaces
cn2=[' '.join(cn_1)]

print(cn2)
['懒惰 的 狐狸 不如 敏捷 的 狐狸 敏捷 , 敏捷 的 狐狸 不如 懒惰 的 狐狸 懒惰']
vect.fit(cn2)

print('Number of words: {}'.format(len(vect.vocabulary_)))
print('Vocabulary: {}'.format(vect.vocabulary_))
Number of words: 4
Vocabulary: {'懒惰': 1, '狐狸': 3, '不如': 0, '敏捷': 2}

Next, we transform this sentence with CountVectorizer:

# Build a new bag-of-words model
new_bag=vect.transform(cn2)

print('Bag-of-words features:\n{}'.format(repr(new_bag)))
print('Dense bag-of-words representation:\n{}'.format(new_bag.toarray()))
Bag-of-words features:
<1x4 sparse matrix of type '<class 'numpy.int64'>'
	with 4 stored elements in Compressed Sparse Row format>
Dense bag-of-words representation:
[[2 3 3 4]]

II. Further optimization of text data processing

Here we learn how to improve the bag-of-words model with the n-Gram algorithm, how to process text data with the tf-idf algorithm, and how to remove stop words from text data.

1. Improving the bag-of-words model with n-Gram

Although the bag-of-words model simplifies natural language in a way that is convenient for machine learning, it has an obvious weakness: it treats a sentence as a simple collection of words, so word order is ignored. As a result, two sentences that contain the same words in different orders can look completely identical to the machine.

# An arbitrary sentence: "The Taoist saw the monk kiss the nun's lips"
joke=jieba.cut('道士看见和尚亲吻了尼姑的嘴唇')

# Insert spaces
joke=[' '.join(joke)]

vect.fit(joke)
joke_feature=vect.transform(joke)

print('Feature representation of this sentence:\n{}'.format(joke_feature.toarray()))
Feature representation of this sentence:
[[1 1 1 1 1 1]]

Next we scramble the word order into "尼姑看见道士的嘴唇亲吻了和尚" ("The nun saw the Taoist's lips kiss the monk").

joke2=jieba.cut('尼姑看见道士的嘴唇亲吻了和尚')

joke2=[' '.join(joke2)]

joke2_feature=vect.transform(joke2)

print('Feature representation of this sentence:\n{}'.format(joke2_feature.toarray()))
Feature representation of this sentence:
[[1 1 1 1 1 1]]

[Result analysis] The two results are exactly the same. In other words, these two sentences mean completely different things, yet to the machine they look identical.

To solve this, we can adjust CountVectorizer's ngram_range parameter. By way of introduction, n-Gram is a language model commonly used in large-vocabulary continuous text or speech recognition; it converts text data using the co-occurrence information of adjacent words, where n is an integer. When n is 2, the model is called a bigram, meaning it pairs each two adjacent words; when n is 3, it is called a trigram and groups each three adjacent words.

# Change CountVectorizer's ngram_range parameter
vect=CountVectorizer(ngram_range=(2,2))

cv=vect.fit(joke)
joke_feature=cv.transform(joke)

print('Vocabulary after adjusting the n-Gram parameter: {}'.format(cv.get_feature_names()))
print('New feature representation: {}'.format(joke_feature.toarray()))
Vocabulary after adjusting the n-Gram parameter: ['亲吻 尼姑', '和尚 亲吻', '尼姑 嘴唇', '看见 和尚', '道士 看见']
New feature representation: [[1 1 1 1 1]]

Now let's try the other sentence again: "尼姑看见道士的嘴唇亲吻了和尚"

joke2=jieba.cut('尼姑看见道士的嘴唇亲吻了和尚')

joke2=[' '.join(joke2)]

joke2_feature=vect.transform(joke2)

print('Feature representation of this sentence:\n{}'.format(joke2_feature.toarray()))
Feature representation of this sentence:
[[0 0 0 0 0]]
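
[Result analysis] With ngram_range=(2,2), none of the word pairs in the second sentence match the five pairs learned from the first, so every feature is 0: the model can now tell the two sentences apart. If you also wanted to keep single words, ngram_range=(1,2) would count unigrams and bigrams together.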

2. Processing text data with the tf-idf model

tf-idf stands for "term frequency-inverse document frequency". It is a measure of how important a word is to a particular document in a corpus: if a word occurs very frequently in one document but rarely in the others, tf-idf regards it as a good discriminator for that document and gives it a high weight.
Step 1: compute the term frequency

Step 2: compute the inverse document frequency

Step 3: compute TF-IDF
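
Written out, the textbook definitions of these three steps are:

tf(t, d) = number of times term t occurs in document d
idf(t) = ln(N / df(t)), where N is the number of documents and df(t) is the number of documents containing t
tf-idf(t, d) = tf(t, d) × idf(t)

Note that scikit-learn deviates slightly from the textbook form: with smooth_idf=False, as used below, TfidfTransformer computes idf(t) = ln(N / df(t)) + 1, and it then L2-normalizes each row vector.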

In scikit-learn, two classes use the tf-idf method: TfidfTransformer, which transforms the feature matrix that CountVectorizer extracts from text, and TfidfVectorizer, which is used in the same way as CountVectorizer.
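
In fact, TfidfVectorizer is equivalent to a CountVectorizer followed by a TfidfTransformer. A minimal sketch of that equivalence (the toy documents are our own):

from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer

docs=['the quick brown fox', 'the lazy dog']

# CountVectorizer followed by TfidfTransformer...
pipe=make_pipeline(CountVectorizer(), TfidfTransformer())
A=pipe.fit_transform(docs)

# ...gives the same matrix as TfidfVectorizer in one step
B=TfidfVectorizer().fit_transform(docs)
print((A!=B).nnz==0)  # True: the two sparse matrices are identical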

To further introduce the usage of TfidfVectorizer and how it differs from CountVectorizer, we will use a relatively complex dataset, a classic case for natural language processing: the IMDB movie review dataset (see: IMDB dataset download).

To cut down the data loading time and make the demonstration easier to follow, we take 50 positive and 50 negative reviews from each of the train and test folders and save them in a new folder named Imdblite.
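
load_files assigns one class label per subfolder, so the folder layout matters. The layout assumed by the code below (inferred from it, not spelled out in the original) is:

Imdblite/
    train/
        pos/    one .txt file per positive review
        neg/    one .txt file per negative review
    test/
        pos/
        neg/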

from sklearn.datasets import load_files

train_set=load_files('Imdblite/train')

X_train,y_train=train_set.data,train_set.target

print('Number of files in the training set: {}'.format(len(X_train)))
print('A random sample:', X_train[22])
Number of files in the training set: 100
A random sample: b"All I could think of while watching this movie was B-grade slop. Many have spoken about it's redeeming quality is how this film portrays such a realistic representation of the effects of drugs and an individual and their subsequent spiral into a self perpetuation state of unfortunate events. Yet really, the techniques used (as many have already mentioned) were overused and thus unconvincing and irrelevant to the film as a whole.<br /><br />As far as the plot is concerned, it was lacklustre, unimaginative, implausible and convoluted. You can read most other reports on this film and they will say pretty much the same as I would.<br /><br />Granted some of the actors and actresses are attractive but when confronted with such boring action... looks can only carry a film so far. The action is poor and intermittent: a few punches thrown here and there, and a final gunfight towards the end. Nothing really to write home about.<br /><br />As others have said, 'BAD' movies are great to watch for the very reason that they are 'bad', you revel in that fact. This film, however, is a void. It's nothing.<br /><br />Furthermore, if one is really in need of an educational movie to scare people away from drug use then I would seriously recommend any number of other movies out there that board such issues in a much more effective way. 'Requiem For A Dream', 'Trainspotting', 'Fear and Loathing in Las Vegas' and 'Candy' are just a few examples. Though one should also check out some more lighthearted films on the same subject like 'Go' (overall, both serious and funny) and 'Halfbaked'.<br /><br />On a final note, the one possibly redeeming line in this movie, delivered by Vinnie Jones was stolen from 'Lock, Stock and Two Smokling Barrels'. To think that a bit of that great movie has been tainted by 'Loaded' is vile.<br /><br />Overall, I strongly suggest that you save you money and your time by NOT seeing this movie."

To keep the "<br />" tags in the text from affecting the machine learning model, we replace them with spaces.

# Remove the <br /> tags from the text
X_train=[doc.replace(b'<br />',b' ') for doc in X_train]
print('A random sample:', X_train[22])
A random sample: b"All I could think of while watching this movie was B-grade slop. Many have spoken about it's redeeming quality is how this film portrays such a realistic representation of the effects of drugs and an individual and their subsequent spiral into a self perpetuation state of unfortunate events. Yet really, the techniques used (as many have already mentioned) were overused and thus unconvincing and irrelevant to the film as a whole.  As far as the plot is concerned, it was lacklustre, unimaginative, implausible and convoluted. You can read most other reports on this film and they will say pretty much the same as I would.  Granted some of the actors and actresses are attractive but when confronted with such boring action... looks can only carry a film so far. The action is poor and intermittent: a few punches thrown here and there, and a final gunfight towards the end. Nothing really to write home about.  As others have said, 'BAD' movies are great to watch for the very reason that they are 'bad', you revel in that fact. This film, however, is a void. It's nothing.  Furthermore, if one is really in need of an educational movie to scare people away from drug use then I would seriously recommend any number of other movies out there that board such issues in a much more effective way. 'Requiem For A Dream', 'Trainspotting', 'Fear and Loathing in Las Vegas' and 'Candy' are just a few examples. Though one should also check out some more lighthearted films on the same subject like 'Go' (overall, both serious and funny) and 'Halfbaked'.  On a final note, the one possibly redeeming line in this movie, delivered by Vinnie Jones was stolen from 'Lock, Stock and Two Smokling Barrels'. To think that a bit of that great movie has been tainted by 'Loaded' is vile.  Overall, I strongly suggest that you save you money and your time by NOT seeing this movie."
# Load the test set
test=load_files('Imdblite/test/')

X_test,y_test=test.data,test.target

# Remove the <br /> tags from the text
X_test=[doc.replace(b'<br />',b' ') for doc in X_test]

len(X_test)
100

Next we extract features from the text data, starting with the CountVectorizer we learned about earlier.

# Fit CountVectorizer on the training data
vect=CountVectorizer().fit(X_train)

X_train_vect=vect.transform(X_train)

print('Number of features in the training set: {}'.format(len(vect.get_feature_names())))
print('Last 10 features of the training set: {}'.format(vect.get_feature_names()[-10:]))
Number of features in the training set: 3941
Last 10 features of the training set: ['young', 'your', 'yourself', 'yuppie', 'zappa', 'zero', 'zombie', 'zoom', 'zooms', 'zsigmond']

We now use a supervised learning algorithm to obtain a cross-validation score.

from sklearn.svm import LinearSVC

from sklearn.model_selection import cross_val_score

scores=cross_val_score(LinearSVC(),X_train_vect,y_train)

print('Mean model score: {:.3f}'.format(scores.mean()))
Mean model score: 0.778


E:\Anaconda\envs\mytensorflow\lib\site-packages\sklearn\model_selection\_split.py:2053: FutureWarning: You should specify a value for 'cv' instead of relying on the default value. The default value will change from 3 to 5 in version 0.22.
  warnings.warn(CV_WARNING, FutureWarning)
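
(This FutureWarning only means we did not pass the cv argument: this scikit-learn version defaults to 3 cross-validation folds, and version 0.22 will change the default to 5. Passing it explicitly, e.g. cross_val_score(LinearSVC(), X_train_vect, y_train, cv=3), keeps the result stable and silences the warning.)
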
# Convert the test set to vectors
X_test_vect=vect.transform(X_test)

# Fit a linear SVC on the training data
clf=LinearSVC().fit(X_train_vect,y_train)

print('Test set model score: {}'.format(clf.score(X_test_vect,y_test)))
Test set model score: 0.58

Hoping to improve the model's performance a bit, we next try processing the data with the tf-idf algorithm.

# Import the tfidf transformation tool
from sklearn.feature_extraction.text import TfidfTransformer

# Transform the training and test sets with tfidf
tfidf=TfidfTransformer(smooth_idf=False)
tfidf.fit(X_train_vect)

X_train_tfidf=tfidf.transform(X_train_vect)
X_test_tfidf=tfidf.transform(X_test_vect)

print('Features before tfidf processing:\n',X_train_vect[:5,:5].toarray())
print('Features after tfidf processing:\n',X_train_tfidf[:5,:5].toarray())
Features before tfidf processing:
 [[0 0 0 0 0]
 [0 0 0 0 0]
 [0 1 0 0 0]
 [0 0 0 0 0]
 [0 0 0 0 0]]
Features after tfidf processing:
 [[ 0.          0.          0.          0.          0.        ]
 [ 0.          0.          0.          0.          0.        ]
 [ 0.          0.13862307  0.          0.          0.        ]
 [ 0.          0.          0.          0.          0.        ]
 [ 0.          0.          0.          0.          0.        ]]
# Retrain the linear SVC model
clf=LinearSVC().fit(X_train_tfidf,y_train)

# Cross-validate on the new data
scores2=cross_val_score(LinearSVC(),X_train_tfidf,y_train)

print('Cross-validation score on the tfidf-processed training set: {:.3f}'.format(scores2.mean()))
print('Score on the tfidf-processed test set: {:.3f}'.format(clf.score(X_test_tfidf,y_test)))
Cross-validation score on the tfidf-processed training set: 0.778
Score on the tfidf-processed test set: 0.580


E:\Anaconda\envs\mytensorflow\lib\site-packages\sklearn\model_selection\_split.py:2053: FutureWarning: You should specify a value for 'cv' instead of relying on the default value. The default value will change from 3 to 5 in version 0.22.
  warnings.warn(CV_WARNING, FutureWarning)

3. Removing stop words from the text

In natural language processing there is a concept called "stop words" (Stopwords): words that are filtered out during text processing because they occur very frequently yet carry little real meaning.

For more about stop words, see: stop words

scikit-learn, which we are using, also has a built-in English stop word list containing 318 common stop words.

# Import the built-in stop word list
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

print('Number of stop words:',len(ENGLISH_STOP_WORDS))

# Print the first 20 and last 20 stop words
print('First 20 and last 20:\n',list(ENGLISH_STOP_WORDS)[:20],list(ENGLISH_STOP_WORDS)[-20:])
Number of stop words: 318
First 20 and last 20:
 ['meanwhile', 'five', 'were', 'because', 'during', 'must', 'eight', 'becomes', 'serious', 'has', 'twenty', 'nowhere', 'amongst', 'himself', 'beyond', 'other', 'take', 'now', 'hundred', 'third'] ['thereby', 'otherwise', 'co', 'find', 'never', 'by', 'a', 'everything', 'on', 'its', 'very', 'wherein', 'each', 'ltd', 'onto', 'see', 'whoever', 'being', 'enough', 'he']
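
The stop_words parameter also accepts any custom list, so corpus-specific noise words can be layered on top of the built-in set. A small sketch (the two extra words are hypothetical examples, not from the original):

# Extend the built-in set with our own stop words
my_stop_words=list(ENGLISH_STOP_WORDS.union({'movie','film'}))
vect_custom=CountVectorizer(stop_words=my_stop_words)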

Next we try removing stop words from the slimmed-down IMDB review dataset and see whether that improves the model's score.

# Import the Tfidf model
from sklearn.feature_extraction.text import TfidfVectorizer

# Enable the English stop word parameter
tfidf=TfidfVectorizer(smooth_idf=False,stop_words='english')

tfidf.fit(X_train)

# Convert the training set text to vectors
X_train_tfidf=tfidf.transform(X_train)

scores3=cross_val_score(LinearSVC(),X_train_tfidf,y_train)
clf.fit(X_train_tfidf,y_train)

# Convert the test set to vectors
X_test_tfidf=tfidf.transform(X_test)

print('Cross-validation mean score after removing stop words: {:.3f}'.format(scores3.mean()))
print('Test set model score after removing stop words: {:.3f}'.format(clf.score(X_test_tfidf,y_test)))
Cross-validation mean score after removing stop words: 0.890
Test set model score after removing stop words: 0.670


E:\Anaconda\envs\mytensorflow\lib\site-packages\sklearn\model_selection\_split.py:2053: FutureWarning: You should specify a value for 'cv' instead of relying on the default value. The default value will change from 3 to 5 in version 0.22.
  warnings.warn(CV_WARNING, FutureWarning)
Original post: https://www.cnblogs.com/LQ6H/p/10432508.html