Text Feature Extraction with Python (Part 1)

#     Text feature extraction: the bag-of-words model. Friday, February 26, 2016
# 1. Bag-of-words representation

In [9]:

# sklearn's CountVectorizer class tokenizes documents and counts how often each token occurs; the code is as follows
from sklearn.feature_extraction.text import CountVectorizer
corpus=['UNC played Duke in basketball','Duke lost the basketball game','I ate a sandwich']
vectorizer=CountVectorizer()
# toarray() returns a plain NumPy array (todense() would return the legacy np.matrix type)
corpusTotoken=vectorizer.fit_transform(corpus).toarray()
corpusTotoken
# one row per document, one column per vocabulary entry
#[[0, 1, 1, 0, 1, 0, 1, 0, 0, 1],
# [0, 1, 1, 1, 0, 1, 0, 0, 1, 0],
# [1, 0, 0, 0, 0, 0, 0, 1, 0, 0]]
vectorizer.vocabulary_
#{'ate': 0,
# 'basketball': 1,
# 'duke': 2,
# 'game': 3,
# 'in': 4,
# 'lost': 5,
# 'played': 6,
# 'sandwich': 7,
# 'the': 8,
# 'unc': 9}
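
To make the counting step concrete, here is a minimal standard-library sketch of what a bag-of-words vectorizer does: lowercase, tokenize, build a sorted vocabulary, then count. The regular expression is assumed to mirror CountVectorizer's default of keeping tokens of two or more word characters (which is why 'I' and 'a' vanish); the helper names `tokenize` and `to_vector` are made up for this illustration and are not part of scikit-learn.

```python
import re
from collections import Counter

corpus = [
    'UNC played Duke in basketball',
    'Duke lost the basketball game',
    'I ate a sandwich',
]

# Assumption: tokens are runs of two or more word characters, after lowercasing.
TOKEN = re.compile(r'\b\w\w+\b')

def tokenize(doc):
    """Lowercase the document and split it into tokens."""
    return TOKEN.findall(doc.lower())

# Vocabulary: every distinct token, sorted, mapped to a column index.
all_tokens = {token for doc in corpus for token in tokenize(doc)}
vocabulary = {term: index for index, term in enumerate(sorted(all_tokens))}

def to_vector(doc):
    """Count each vocabulary term in the document, in column order."""
    counts = Counter(tokenize(doc))
    return [counts[term] for term in vocabulary]

vectors = [to_vector(doc) for doc in corpus]
for row in vectors:
    print(row)
```

Each row has one slot per vocabulary term, so word order inside a document is discarded and only the counts remain.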

In [14]:

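# 2. Compute the Euclidean distance between document vectors with sklearn's euclidean_distances; the code is as follows:
from sklearn.metrics.pairwise import euclidean_distances
# use toarray() rather than todense(): recent scikit-learn releases reject np.matrix input
counts=vectorizer.fit_transform(corpus).toarray()
for x,y in [[0,1],[0,2],[1,2]]:
    # counts[[x]] keeps the row two-dimensional (shape 1 x n), as euclidean_distances expects
    dist=euclidean_distances(counts[[x]],counts[[y]])
    print('Distance between document {} and document {}: {}'.format(x,y,dist))

#Distance between document 0 and document 1: [[2.44948974]]
#Distance between document 0 and document 2: [[2.64575131]]
#Distance between document 1 and document 2: [[2.64575131]]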
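
The distances can be checked by hand: the Euclidean distance between two vectors is the square root of the sum of squared coordinate differences. The sketch below uses only the standard library; the three count vectors are written out for the 10-word vocabulary of the corpus above, and the helper name `euclidean` is made up for this illustration.

```python
import math

def euclidean(u, v):
    """Square root of the summed squared differences between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Count vectors over: ate, basketball, duke, game, in, lost, played, sandwich, the, unc
doc0 = [0, 1, 1, 0, 1, 0, 1, 0, 0, 1]  # 'UNC played Duke in basketball'
doc1 = [0, 1, 1, 1, 0, 1, 0, 0, 1, 0]  # 'Duke lost the basketball game'
doc2 = [1, 0, 0, 0, 0, 0, 0, 1, 0, 0]  # 'I ate a sandwich'

d01 = euclidean(doc0, doc1)  # the vectors differ in 6 positions, so sqrt(6)
d02 = euclidean(doc0, doc2)  # the vectors differ in 7 positions, so sqrt(7)
d12 = euclidean(doc1, doc2)  # the vectors differ in 7 positions, so sqrt(7)
print(d01, d02, d12)
```

The two basketball documents share 'basketball' and 'duke', so they end up closer to each other than either is to the sandwich document.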

In [17]:

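# 3. Stop-word filtering. Stop words are usually function words (such as the, a, in) that hold a sentence together but contribute little meaning of their own. CountVectorizer can filter them out through its stop_words parameter. The default is None (no filtering); passing 'english' selects the built-in English stop-word list. The code is as follows
vectorizer=CountVectorizer(stop_words='english')
print(vectorizer.fit_transform(corpus).todense())
#[[0 1 1 0 0 1 0 1]
# [0 1 1 1 1 0 0 0]
# [1 0 0 0 0 0 1 0]]
print(vectorizer.vocabulary_)
#{'duke': 2, 'basketball': 1, 'lost': 4, 'played': 5, 'game': 3, 'sandwich': 6, 'unc': 7, 'ate': 0}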
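
The effect of the filter can be sketched without scikit-learn. The stop list below is a tiny hand-picked set, an assumption for this example only and far shorter than the library's built-in English list; `content_tokens` is likewise a made-up helper name.

```python
import re

# Assumption: a toy stop list, just large enough for this corpus.
STOP_WORDS = {'in', 'the', 'a', 'i'}

def content_tokens(doc):
    """Tokenize as before, then drop every token found in the stop list."""
    tokens = re.findall(r'\b\w\w+\b', doc.lower())
    return [token for token in tokens if token not in STOP_WORDS]

corpus = [
    'UNC played Duke in basketball',
    'Duke lost the basketball game',
    'I ate a sandwich',
]

filtered = [content_tokens(doc) for doc in corpus]
filtered_vocabulary = sorted({token for doc in filtered for token in doc})
print(filtered_vocabulary)
```

Dropping 'in' and 'the' shrinks the vocabulary from 10 terms to 8, which matches the width of the matrix printed above.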

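# 4. Stemming and lemmatization. A feature vector often contains several forms of the same word; for example, jumping and jumps are both forms of jump. Stemming and lemmatization reduce the tense and derivational variants of a word to a common base form. Python's NLTK (Natural Language ToolKit) library can be used for this.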
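
As a rough illustration of the stemming half of this idea, the sketch below strips a few common suffixes. It is a deliberately naive toy, not the Porter algorithm or any NLTK routine, and the name `toy_stem` and its suffix list are assumptions made for this example.

```python
def toy_stem(word, suffixes=('ing', 'ed', 's')):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

words = ['jumping', 'jumps', 'jumped', 'jump']
stems = {word: toy_stem(word) for word in words}
print(stems)

# A rule like this cannot tell the noun 'gathering' from the verb form:
print(toy_stem('gathering'))
```

Because a stemmer only looks at the spelling, it always cuts 'gathering' down to 'gather'. A lemmatizer consults a dictionary and a part-of-speech tag, so it can leave the noun intact.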

In [28]:

import nltk
nltk.download()  # opens the interactive downloader; the lemmatizer below needs the 'wordnet' corpus

showing info http://www.nltk.org/nltk_data/

Out[28]:

True

In [26]:

from nltk.stem.wordnet import WordNetLemmatizer
lemm=WordNetLemmatizer()

In [29]:

# the part-of-speech tag ('v' = verb, 'n' = noun) is an argument of lemmatize(), not of print()
print(lemm.lemmatize('gathering','v'))
print(lemm.lemmatize('gathering','n'))

#gather

#gathering