TF-IDF processing with scikit-learn

After reading the blog post at https://www.cnblogs.com/pinard/p/6693230.html, I tried it out myself.

The first method is the combination of CountVectorizer and TfidfTransformer. The code is below:

from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'This is the first document.',
    'This is the second document.',
    'And the third one',
    'Is this the first document?',
    'I come to American to travel'
]

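# CountVectorizer has to be instantiated before fit_transform can be called
cv = CountVectorizer()
words = cv.fit_transform(corpus)  # sparse matrix of word counts
tfidf = TfidfTransformer().fit_transform(words)

# get_feature_names_out() needs scikit-learn >= 1.0; older versions use get_feature_names()
print(cv.get_feature_names_out())
print(words.toarray())
print(tfidf)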

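CountVectorizer is a class with many methods. For example, to print every word in the vocabulary, use get_feature_names_out() (get_feature_names() in older scikit-learn versions).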

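Then fit_transform() builds the word-count vectors, which can be printed as a matrix. Each row of the matrix holds the word counts for one text (here, one sentence), and each column stands for one word, in the order given by the feature-name list.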
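To make the row and column layout concrete, here is a minimal sketch that pairs each nonzero count with its word. It uses the fitted vectorizer's vocabulary_ attribute (a word-to-column-index dictionary); the helper name index_to_word is my own choice for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'This is the first document.',
    'This is the second document.',
    'And the third one',
    'Is this the first document?',
    'I come to American to travel'
]

cv = CountVectorizer()
counts = cv.fit_transform(corpus).toarray()

# vocabulary_ maps word -> column index; invert it to look up words by column
index_to_word = {idx: word for word, idx in cv.vocabulary_.items()}

# For every document (row), list only the words that actually occur in it
for row_no, row in enumerate(counts):
    present = {index_to_word[j]: int(c) for j, c in enumerate(row) if c > 0}
    print(row_no, present)
```

Note that the one-letter word "I" does not show up, because the default tokenizer only keeps tokens of two or more characters.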

The TF-IDF result is below:

(0, 10) 0.440270504199
(0, 5) 0.440270504199
(0, 8) 0.370369427837
(0, 4) 0.530388665338
(0, 3) 0.440270504199
(1, 10) 0.410399746731
(1, 5) 0.410399746731
(1, 8) 0.345241204967
(1, 3) 0.410399746731
(1, 7) 0.612800664198
(2, 8) 0.309317493592
(2, 1) 0.549036334
(2, 9) 0.549036334
(2, 6) 0.549036334
(3, 10) 0.440270504199
(3, 5) 0.440270504199
(3, 8) 0.370369427837
(3, 4) 0.530388665338
(3, 3) 0.440270504199
(4, 2) 0.377964473009
(4, 11) 0.755928946018
(4, 0) 0.377964473009
(4, 12) 0.377964473009

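In each (index1, index2) pair, index1 is the index of the sentence or document, and index2 is the index of the word in the vocabulary built from the whole corpus. The number that follows is the TF-IDF value computed for that word in that document.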
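The index pairs are easier to read once the word index is turned back into the word. The sketch below is one possible way to do it, assuming the sparse result is converted to COO format so that row, column, and value can be iterated together. The names entries and lookup are illustrative only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    'This is the first document.',
    'This is the second document.',
    'And the third one',
    'Is this the first document?',
    'I come to American to travel'
]

vec = TfidfVectorizer()
tfidf = vec.fit_transform(corpus)

# Invert the word -> index mapping learned during fitting
index_to_word = {idx: word for word, idx in vec.vocabulary_.items()}

# COO format exposes parallel arrays: row (document), col (word), data (value)
coo = tfidf.tocoo()
entries = [(int(d), index_to_word[int(w)], float(v))
           for d, w, v in zip(coo.row, coo.col, coo.data)]

for doc_no, word, value in entries:
    print(doc_no, word, round(value, 6))

# Dictionary keyed by (document index, word) for direct lookups
lookup = {(d, w): v for d, w, v in entries}
```

For instance, lookup[(4, 'to')] corresponds to the (4, 11) entry in the output listed above.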

The second method is to use TfidfVectorizer() directly:

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    'This is the first document.',
    'This is the second document.',
    'And the third one',
    'Is this the first document?',
    'I come to American to travel'
]

tfidf = TfidfVectorizer().fit_transform(corpus)
print(tfidf)

The result is the same.
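Rather than comparing the printed output by eye, the two results can be checked numerically. This is a small sketch that assumes default parameters for all three classes; the variable names two_step and one_step are mine.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfTransformer, TfidfVectorizer
)

corpus = [
    'This is the first document.',
    'This is the second document.',
    'And the third one',
    'Is this the first document?',
    'I come to American to travel'
]

# Method 1: count words first, then reweight the counts with TF-IDF
two_step = TfidfTransformer().fit_transform(CountVectorizer().fit_transform(corpus))

# Method 2: do both steps in a single estimator
one_step = TfidfVectorizer().fit_transform(corpus)

# Compare the dense versions element by element
same = np.allclose(two_step.toarray(), one_step.toarray())
print(same)  # expected to print True with default settings
```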

Data processing also offers transform(). How does it differ from fit_transform()?

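fit_transform() simply combines fit() and transform().

fit_transform() first fits the estimator to the data, learning whatever the conversion needs, and then converts that same data.

transform() only applies a conversion that has already been learned. For a scaler this means centering and scaling (standardization); for the vectorizers above it means turning text into counts or TF-IDF weights.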
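A minimal sketch of this relationship, using TfidfVectorizer as the example estimator: calling fit() and then transform() on the same data should give the same matrix as a single fit_transform() call. The variable names are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    'This is the first document.',
    'This is the second document.',
    'And the third one',
    'Is this the first document?',
    'I come to American to travel'
]

# One call: learn the vocabulary and IDF weights, then convert the corpus
combined = TfidfVectorizer().fit_transform(corpus)

# Two calls: fit() only learns, transform() only converts
vec = TfidfVectorizer()
vec.fit(corpus)
separate = vec.transform(corpus)

matches = np.allclose(combined.toarray(), separate.toarray())
print(matches)
```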

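Put another way:

On the same data, a transform() call on an already fitted estimator can be swapped for fit_transform() and give the same result, but fit_transform() on an unfitted estimator cannot be swapped for transform(),

because the two differ by one fit() step. On new data such as a test set, keep transform() so that what was learned from the training data is reused; fit_transform() there would refit on the new data.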
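The sketch below illustrates why the missing fit() matters. It assumes one made-up new sentence (new_docs is hypothetical, not from the original post) and shows two things: transform() on an unfitted vectorizer raises NotFittedError, and fit_transform() on new data rebuilds the vocabulary instead of reusing the one from the corpus.

```python
from sklearn.exceptions import NotFittedError
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    'This is the first document.',
    'This is the second document.',
    'And the third one',
    'Is this the first document?',
    'I come to American to travel'
]
new_docs = ['This is a new document.']  # hypothetical unseen text

# transform() without a prior fit() has nothing to apply, so it fails
try:
    TfidfVectorizer().transform(new_docs)
    raised = False
except NotFittedError:
    raised = True

# Fitted on the corpus: new text is mapped onto the corpus vocabulary
fitted = TfidfVectorizer().fit(corpus)
reused = fitted.transform(new_docs)

# Refitting on the new text alone builds a different, smaller vocabulary
refit = TfidfVectorizer().fit_transform(new_docs)

print(raised, reused.shape, refit.shape)
```

The column counts differ because reused keeps the 13 columns learned from the corpus, while refit only knows the words in the new sentence.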

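So what does fit() do?

I found a blog post on this: http://blog.csdn.net/quiet_girl/article/details/72517053

fit() learns the conversion rule from the data (a vocabulary and IDF weights here, or a mean and standard deviation for a scaler), and transform() then applies that rule. Once an estimator has been fitted, later data does not need fit() again; transform() is enough.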
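To see what the "rule" looks like in this case, the sketch below inspects what fit() stores: CountVectorizer keeps a vocabulary_ dictionary, and TfidfTransformer keeps an idf_ array. As a check, it recomputes the IDF weights by hand with the smoothed formula that scikit-learn documents for its default smooth_idf=True, namely ln((1 + n) / (1 + df)) + 1. The variable names are my own.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

corpus = [
    'This is the first document.',
    'This is the second document.',
    'And the third one',
    'Is this the first document?',
    'I come to American to travel'
]

cv = CountVectorizer()
counts = cv.fit_transform(corpus)          # fit() part learns cv.vocabulary_
tt = TfidfTransformer().fit(counts)        # fit() learns tt.idf_

n_docs = counts.shape[0]
# Document frequency: in how many documents each word occurs at least once
df = (counts.toarray() > 0).sum(axis=0)

# Smoothed IDF as documented for the default smooth_idf=True
expected_idf = np.log((1 + n_docs) / (1 + df)) + 1

print(sorted(cv.vocabulary_.items(), key=lambda item: item[1]))
print(np.allclose(tt.idf_, expected_idf))
```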