Classification Based on Probability Theory: Naive Bayes — Using Naive Bayes for Document Classification

Preface

The k-nearest neighbors algorithm and decision trees discussed earlier are classifiers that give a definite answer. The classifier discussed today cannot fully determine which class a data instance belongs to; it can only give the probability that the instance belongs to a given class.

Yingying's note: the kind of problem Naive Bayes solves is like today's probability of rain — you have to decide whether to take an umbrella based on that probability.

Note: from this chapter on, complete programs are no longer provided, only the code block for each algorithm.

Requirements

Take the big social media platforms as an example: they routinely block certain key words. We want to build a fast filter that flags a post as inappropriate if it uses negative or insulting language.

Steps

1. Prepare the data

# bayes.py
from numpy import *    # zeros(), ones(), log() and array() are used by the training code below

def loadDataSet():
    postingList=[['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
                 ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
                 ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
                 ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
                 ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
                 ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
    classVec = [0,1,0,1,0,1]    #1 is abusive, 0 not
    return postingList,classVec

def createVocabList(dataSet):
    vocabSet = set([])  #create empty set
    for document in dataSet:
        vocabSet = vocabSet | set(document) #union of the two sets
    return list(vocabSet)

def setOfWords2Vec(vocabList, inputSet):
    returnVec = [0]*len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] = 1
        else: print "the word: %s is not in my Vocabulary!" % word
    return returnVec
The function loadDataSet() creates some example data. postingList is a list of tokenized posts, and classVec is the list of class labels.

The function createVocabList(dataSet) builds a list of the unique words that appear in the documents — the vocabulary.
The function setOfWords2Vec(vocabList, inputSet) first creates a vector of the same length as the vocabulary with every element set to 0. It then walks through all the words in the document and, for each word that is in the vocabulary, sets the corresponding element of the output vector to 1.

Open the IDE and get a feel for the three functions we just wrote:
>>> import bayes
>>> listOPosts,listClasses = bayes.loadDataSet()
>>> listOPosts
[['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'], ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'], ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'], ['stop', 'posting', 'stupid', 'worthless', 'garbage'], ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'], ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
>>> listClasses
[0, 1, 0, 1, 0, 1]
>>> myVocabList = bayes.createVocabList(listOPosts)
>>> myVocabList
['cute', 'love', 'help', 'garbage', 'quit', 'I', 'problems', 'is', 'park', 'stop', 'flea', 'dalmation', 'licks', 'food', 'not', 'him', 'buying', 'posting', 'has', 'worthless', 'ate', 'to', 'maybe', 'please', 'dog', 'how', 'stupid', 'so', 'take', 'mr', 'steak', 'my']

Notice that no word appears twice in the vocabulary.

>>> bayes.setOfWords2Vec(myVocabList,listOPosts[0])
[0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1]
Comparing the returned vector with the vocabulary and the first post shows that each 1 marks a vocabulary word that occurs in the post:

myVocabList
['cute', 'love', 'help', 'garbage', 'quit', 'I', 'problems', 'is', 'park',
'stop', 'flea', 'dalmation', 'licks', 'food', 'not', 'him', 'buying',
'posting', 'has', 'worthless', 'ate', 'to', 'maybe', 'please', 'dog',
'how', 'stupid', 'so', 'take', 'mr', 'steak', 'my']

listOPosts[0]

['my', 'dog', 'has', 'flea', 'problems', 'help', 'please']

2. Train the algorithm
def trainNB0(trainMatrix,trainCategory):
    numTrainDocs = len(trainMatrix) #6
    numWords = len(trainMatrix[0])  #32
    pAbusive = sum(trainCategory)/float(numTrainDocs)   #3/6.0
    p0Num = zeros(numWords); p1Num = zeros(numWords)      #change to ones()
    p0Denom = 0.0; p1Denom = 0.0                        #change to 2.0
    for i in range(numTrainDocs):   # i = 0 .. 5
        if trainCategory[i] == 1:
            p1Num += trainMatrix[i]
            p1Denom += sum(trainMatrix[i])
        else:
            p0Num += trainMatrix[i]
            p0Denom += sum(trainMatrix[i])
    p1Vect = (p1Num/p1Denom)          #change to log()
    p0Vect = (p0Num/p0Denom)          #change to log()
    return p0Vect,p1Vect,pAbusive

The inputs to trainNB0 look like this:

trainCategory
[0, 1, 0, 1, 0, 1]

trainMat
[[0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1],
 [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0],
 [1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1],
 [0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1],
 [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0]]

In the console, trainMat is built from the posts and the classifier is trained:

>>> trainMat=[]
>>> for postinDoc in listOPosts:
...     trainMat.append(bayes.setOfWords2Vec(myVocabList,postinDoc))
...

>>> p0v,p1v,pab=bayes.trainNB0(trainMat,listClasses)

>>> p0v
array([ 0.04166667,  0.04166667,  0.04166667,  0.        ,  0.        ,
        0.04166667,  0.04166667,  0.04166667,  0.        ,  0.04166667,
        0.04166667,  0.04166667,  0.04166667,  0.        ,  0.        ,
        0.08333333,  0.        ,  0.        ,  0.04166667,  0.        ,
        0.04166667,  0.04166667,  0.        ,  0.04166667,  0.04166667,
        0.04166667,  0.        ,  0.04166667,  0.        ,  0.04166667,
        0.04166667,  0.125     ])
>>> p1v
array([ 0.        ,  0.        ,  0.        ,  0.05263158,  0.05263158,
        0.        ,  0.        ,  0.        ,  0.05263158,  0.05263158,
        0.        ,  0.        ,  0.        ,  0.05263158,  0.05263158,
        0.05263158,  0.05263158,  0.05263158,  0.        ,  0.10526316,
        0.        ,  0.05263158,  0.05263158,  0.        ,  0.10526316,
        0.        ,  0.15789474,  0.        ,  0.05263158,  0.        ,
        0.        ,  0.        ])

pab = 0.5 means the probability that a document is abusive is 0.5. Six posts went in, three of which are abusive, so the probability of an abusive post is 3/6 = 0.5.
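This is exactly what trainNB0 computes as pAbusive = sum(trainCategory)/float(numTrainDocs); a minimal sketch of the same arithmetic:

listClasses = [0, 1, 0, 1, 0, 1]                   # labels from loadDataSet()
pAb = sum(listClasses) / float(len(listClasses))   # 3 abusive posts out of 6
print(pAb)                                         # 0.5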

Yingying's note: the data processing above can be seen as taking

[['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],

['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],

['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],

['stop', 'posting', 'stupid', 'worthless', 'garbage'],

['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],

['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]

and splitting its entries into two groups according to the given labels [0, 1, 0, 1, 0, 1].

The first group, labeled 0, is

[['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],

['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],

['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him']]

Count how many times each vocabulary word appears across these posts, then divide by the total number of words in this group, 24.

(One way to think about "how often it appears": every time you see a word, you look it up in the vocabulary; if it is there, you add a tally mark, so the count grows with each successful lookup.)

array([ 1., 1., 1., 0., 0., 1., 1., 1., 0., 1., 1., 1., 1.,
        0., 0., 2., 0., 0., 1., 0., 1., 1., 0., 1., 1., 1.,
        0., 1., 0., 1., 1., 3.])

Similarly, for the abusive posts labeled 1:

[['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],

['stop', 'posting', 'stupid', 'worthless', 'garbage'],

 ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]

Looking these up in the vocabulary gives

array([ 0., 0., 0., 1., 1., 0., 0., 0., 1., 1., 0., 0., 0.,
        1., 1., 1., 1., 1., 0., 2., 0., 1., 1., 0., 2., 0.,
        3., 0., 1., 0., 0., 0.])

Again, each word's count is divided by the total number of words in this group, 19.

Looked at this way, the idea becomes much clearer.
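As a check on this reasoning, here is a minimal sketch that reproduces the two count vectors and the denominators 24 and 19 directly with NumPy; it assumes trainMat and listClasses from the session above, and the variable names are only for illustration:

from numpy import array

trainArr = array(trainMat)                  # 6 x 32 matrix of 0/1 word vectors
labels = array(listClasses)                 # [0, 1, 0, 1, 0, 1]

p0Num = trainArr[labels == 0].sum(axis=0)   # per-word counts over the normal posts
p1Num = trainArr[labels == 1].sum(axis=0)   # per-word counts over the abusive posts

p0Denom = p0Num.sum()                       # 24 words in the normal posts
p1Denom = p1Num.sum()                       # 19 words in the abusive posts

print(p0Num / float(p0Denom))               # matches p0v, e.g. 'my': 3/24 = 0.125
print(p1Num / float(p1Denom))               # matches p1v, e.g. 'stupid': 3/19 = 0.1578...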

To make this work in practice, we initialize every word count to 1 and both denominators to 2, so that a word never seen in one class does not get probability 0 and wipe out the whole product; and to simplify the computation (and avoid underflow from multiplying many small probabilities), we work with log(p):

    p0Num = ones(numWords); p1Num = ones(numWords)       #change to ones()
    p0Denom = 2.0; p1Denom = 2.0                          #change to 2.0
    p1Vect = log(p1Num/p1Denom)          #change to log()
    p0Vect = log(p0Num/p0Denom)          #change to log()
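A small sketch of the two problems these changes guard against, with made-up numbers purely for illustration:

from numpy import log, prod

word_probs = [0.05, 0.04, 0.0, 0.08]   # one word never seen in this class
print(prod(word_probs))                # 0.0 -- the single zero erases all other evidence

small = [0.01] * 200                   # many small but non-zero probabilities
print(prod(small))                     # 0.0 -- underflows in floating point (true value is 1e-400)
print(sum(log(p) for p in small))      # about -921.03, still perfectly representable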

The Naive Bayes classification function

def classifyNB(vec2Classify, p0Vec, p1Vec, pClass1):
    p1 = sum(vec2Classify * p1Vec) + log(pClass1)    #element-wise mult
    p0 = sum(vec2Classify * p0Vec) + log(1.0 - pClass1)
    if p1 > p0:
        return 1
    else:
        return 0

def testingNB():
    listOPosts,listClasses = loadDataSet()
    myVocabList = createVocabList(listOPosts)
    trainMat=[]
    for postinDoc in listOPosts:
        trainMat.append(setOfWords2Vec(myVocabList, postinDoc))
    p0V,p1V,pAb = trainNB0(array(trainMat),array(listClasses))
    testEntry = ['love', 'my', 'dalmation']
    thisDoc = array(setOfWords2Vec(myVocabList, testEntry))
    print testEntry,'classified as: ',classifyNB(thisDoc,p0V,p1V,pAb)
    testEntry = ['stupid', 'garbage']
    thisDoc = array(setOfWords2Vec(myVocabList, testEntry))
    print testEntry,'classified as: ',classifyNB(thisDoc,p0V,p1V,pAb)
>>> reload(bayes)
<module 'bayes' from 'D:\Python27\bayes.pyc'>
>>> bayes.testingNB()
['love', 'my', 'dalmation'] classified as:  0
['stupid', 'garbage'] classified as:  1
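To connect this output back to the formulas, here is a minimal sketch that traces the second test post by hand, assuming bayes.py contains the modified trainNB0 (ones / 2.0 / log): the score for each class is the sum of the log-probabilities of the words present plus the log of the class prior, and the larger score wins.

from numpy import array, log
from bayes import loadDataSet, createVocabList, setOfWords2Vec, trainNB0

listOPosts, listClasses = loadDataSet()
myVocabList = createVocabList(listOPosts)
trainMat = [setOfWords2Vec(myVocabList, post) for post in listOPosts]
p0V, p1V, pAb = trainNB0(array(trainMat), array(listClasses))

thisDoc = array(setOfWords2Vec(myVocabList, ['stupid', 'garbage']))
p1 = sum(thisDoc * p1V) + log(pAb)          # log-score for the abusive class
p0 = sum(thisDoc * p0V) + log(1.0 - pAb)    # log-score for the normal class
print(p1 > p0)                              # True -> classified as 1, as above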

The bag-of-words document model

def bagOfWords2VecMN(vocabList, inputSet):
    returnVec = [0]*len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] += 1
    return returnVec

This is almost identical to setOfWords2Vec(); the only difference is that every time a word is encountered, the corresponding value in the vector is incremented instead of just being set to 1.
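A quick sketch of the difference, using a tiny made-up vocabulary and a post with a repeated word (both are purely illustrative):

from bayes import setOfWords2Vec, bagOfWords2VecMN

vocab = ['dog', 'stupid', 'worthless']       # illustrative three-word vocabulary
post = ['stupid', 'dog', 'stupid']           # 'stupid' occurs twice

print(setOfWords2Vec(vocab, post))           # [1, 1, 0] -- presence/absence only
print(bagOfWords2VecMN(vocab, post))         # [1, 2, 0] -- actual occurrence counts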

Original post: https://www.cnblogs.com/xiaoyingying/p/7515889.html