Classification Based on Probability Theory: Naive Bayes — Using Naive Bayes for Document Classification

Preface

The k-nearest neighbors algorithm and decision trees discussed earlier are classifiers that give a definite answer. The classifier discussed today cannot fully determine which class a data instance belongs to; it can only give the probability that the instance belongs to a given class.

Yingying's note: the kind of problem Naive Bayes solves is like today's probability of rain — you have to decide whether to take an umbrella based on that probability.

Note: from this chapter on, complete programs are no longer provided, only the code block for each algorithm.

Requirements

Take the big social media platforms as an example: they routinely block certain key words. We want to build a fast filter that flags a post as inappropriate if it uses negative or insulting language.

Steps

1. Prepare the data

# bayes.py
from numpy import *    # zeros(), ones(), log() and array() are used by the training code below

def loadDataSet():
    postingList=[['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
                 ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
                 ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
                 ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
                 ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
                 ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
    classVec = [0,1,0,1,0,1]    #1 is abusive, 0 not
    return postingList,classVec

def createVocabList(dataSet):
    vocabSet = set([])  #create empty set
    for document in dataSet:
        vocabSet = vocabSet | set(document) #union of the two sets
    return list(vocabSet)

def setOfWords2Vec(vocabList, inputSet):
    returnVec = [0]*len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] = 1
        else: print "the word: %s is not in my Vocabulary!" % word
    return returnVec
The function loadDataSet() creates some example data. postingList is a list of tokenized posts, and classVec is the list of class labels.

The function createVocabList(dataSet) builds a list of the unique words that appear in the documents — the vocabulary.
The function setOfWords2Vec(vocabList, inputSet) first creates a vector of the same length as the vocabulary with every element set to 0. It then walks through all the words in the document and, for each word that is in the vocabulary, sets the corresponding element of the output vector to 1.

Open the IDE and get a feel for the three functions we just wrote:
>>> import bayes
>>> listOPosts,listClasses = bayes.loadDataSet()
>>> listOPosts
[['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'], ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'], ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'], ['stop', 'posting', 'stupid', 'worthless', 'garbage'], ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'], ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
>>> listClasses
[0, 1, 0, 1, 0, 1]
>>> myVocabList = bayes.createVocabList(listOPosts)
>>> myVocabList
['cute', 'love', 'help', 'garbage', 'quit', 'I', 'problems', 'is', 'park', 'stop', 'flea', 'dalmation', 'licks', 'food', 'not', 'him', 'buying', 'posting', 'has', 'worthless', 'ate', 'to', 'maybe', 'please', 'dog', 'how', 'stupid', 'so', 'take', 'mr', 'steak', 'my']

Notice that no word appears twice in the vocabulary.

>>> bayes.setOfWords2Vec(myVocabList,listOPosts[0])
[0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1]
Comparing the returned vector with the vocabulary and the first post shows that each 1 marks a vocabulary word that occurs in the post:

myVocabList
['cute', 'love', 'help', 'garbage', 'quit', 'I', 'problems', 'is', 'park',
'stop', 'flea', 'dalmation', 'licks', 'food', 'not', 'him', 'buying',
'posting', 'has', 'worthless', 'ate', 'to', 'maybe', 'please', 'dog',
'how', 'stupid', 'so', 'take', 'mr', 'steak', 'my']

listOPosts[0]

['my', 'dog', 'has', 'flea', 'problems', 'help', 'please']

2. Train the algorithm
def trainNB0(trainMatrix,trainCategory):
    numTrainDocs = len(trainMatrix) #6
    numWords = len(trainMatrix[0])  #32
    pAbusive = sum(trainCategory)/float(numTrainDocs)   #3/6.0
    p0Num = zeros(numWords); p1Num = zeros(numWords)      #change to ones()
    p0Denom = 0.0; p1Denom = 0.0                        #change to 2.0
    for i in range(numTrainDocs):   # i = 0 .. 5
        if trainCategory[i] == 1:
            p1Num += trainMatrix[i]
            p1Denom += sum(trainMatrix[i])
        else:
            p0Num += trainMatrix[i]
            p0Denom += sum(trainMatrix[i])
    p1Vect = (p1Num/p1Denom)          #change to log()
    p0Vect = (p0Num/p0Denom)          #change to log()
    return p0Vect,p1Vect,pAbusive

The inputs to trainNB0 look like this:

trainCategory
[0, 1, 0, 1, 0, 1]

trainMat
[[0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1],
 [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0],
 [1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1],
 [0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1],
 [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0]]

In the console, trainMat is built from the posts and the classifier is trained:

>>> trainMat=[]
>>> for postinDoc in listOPosts:
...     trainMat.append(bayes.setOfWords2Vec(myVocabList,postinDoc))
...

>>> p0v,p1v,pab=bayes.trainNB0(trainMat,listClasses)

>>> p0v
array([ 0.04166667,  0.04166667,  0.04166667,  0.        ,  0.        ,
        0.04166667,  0.04166667,  0.04166667,  0.        ,  0.04166667,
        0.04166667,  0.04166667,  0.04166667,  0.        ,  0.        ,
        0.08333333,  0.        ,  0.        ,  0.04166667,  0.        ,
        0.04166667,  0.04166667,  0.        ,  0.04166667,  0.04166667,
        0.04166667,  0.        ,  0.04166667,  0.        ,  0.04166667,
        0.04166667,  0.125     ])
>>> p1v
array([ 0.        ,  0.        ,  0.        ,  0.05263158,  0.05263158,
        0.        ,  0.        ,  0.        ,  0.05263158,  0.05263158,
        0.        ,  0.        ,  0.        ,  0.05263158,  0.05263158,
        0.05263158,  0.05263158,  0.05263158,  0.        ,  0.10526316,
        0.        ,  0.05263158,  0.05263158,  0.        ,  0.10526316,
        0.        ,  0.15789474,  0.        ,  0.05263158,  0.        ,
        0.        ,  0.        ])

pab = 0.5 means the probability that a document is abusive is 0.5. Six posts went in, three of which are abusive, so the probability of an abusive post is 3/6 = 0.5.
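This is exactly what trainNB0 computes as pAbusive = sum(trainCategory)/float(numTrainDocs); a minimal sketch of the same arithmetic:

listClasses = [0, 1, 0, 1, 0, 1]                   # labels from loadDataSet()
pAb = sum(listClasses) / float(len(listClasses))   # 3 abusive posts out of 6
print(pAb)                                         # 0.5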

Yingying's note: the data processing above can be seen as taking

[['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],

['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],

['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],

['stop', 'posting', 'stupid', 'worthless', 'garbage'],

['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],

['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]

and splitting its entries into two groups according to the given labels [0, 1, 0, 1, 0, 1].

The first group, labeled 0, is

[['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],

['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],

['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him']]

Count how many times each vocabulary word appears across these posts, then divide by the total number of words in this group, 24.

(One way to think about "how often it appears": every time you see a word, you look it up in the vocabulary; if it is there, you add a tally mark, so the count grows with each successful lookup.)

array([ 1., 1., 1., 0., 0., 1., 1., 1., 0., 1., 1., 1., 1.,
        0., 0., 2., 0., 0., 1., 0., 1., 1., 0., 1., 1., 1.,
        0., 1., 0., 1., 1., 3.])

Similarly, for the abusive posts labeled 1:

[['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],

['stop', 'posting', 'stupid', 'worthless', 'garbage'],

 ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]

Looking these up in the vocabulary gives

array([ 0., 0., 0., 1., 1., 0., 0., 0., 1., 1., 0., 0., 0.,
        1., 1., 1., 1., 1., 0., 2., 0., 1., 1., 0., 2., 0.,
        3., 0., 1., 0., 0., 0.])

Again, each word's count is divided by the total number of words in this group, 19.

Looked at this way, the idea becomes much clearer.
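As a check on this reasoning, here is a minimal sketch that reproduces the two count vectors and the denominators 24 and 19 directly with NumPy; it assumes trainMat and listClasses from the session above, and the variable names are only for illustration:

from numpy import array

trainArr = array(trainMat)                  # 6 x 32 matrix of 0/1 word vectors
labels = array(listClasses)                 # [0, 1, 0, 1, 0, 1]

p0Num = trainArr[labels == 0].sum(axis=0)   # per-word counts over the normal posts
p1Num = trainArr[labels == 1].sum(axis=0)   # per-word counts over the abusive posts

p0Denom = p0Num.sum()                       # 24 words in the normal posts
p1Denom = p1Num.sum()                       # 19 words in the abusive posts

print(p0Num / float(p0Denom))               # matches p0v, e.g. 'my': 3/24 = 0.125
print(p1Num / float(p1Denom))               # matches p1v, e.g. 'stupid': 3/19 = 0.1578...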

To make this work in practice, we initialize every word count to 1 and both denominators to 2, so that a word never seen in one class does not get probability 0 and wipe out the whole product; and to simplify the computation (and avoid underflow from multiplying many small probabilities), we work with log(p):

    p0Num = ones(numWords); p1Num = ones(numWords)       #change to ones()
    p0Denom = 2.0; p1Denom = 2.0                          #change to 2.0
    p1Vect = log(p1Num/p1Denom)          #change to log()
    p0Vect = log(p0Num/p0Denom)          #change to log()
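A small sketch of the two problems these changes guard against, with made-up numbers purely for illustration:

from numpy import log, prod

word_probs = [0.05, 0.04, 0.0, 0.08]   # one word never seen in this class
print(prod(word_probs))                # 0.0 -- the single zero erases all other evidence

small = [0.01] * 200                   # many small but non-zero probabilities
print(prod(small))                     # 0.0 -- underflows in floating point (true value is 1e-400)
print(sum(log(p) for p in small))      # about -921.03, still perfectly representable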

The Naive Bayes classification function

def classifyNB(vec2Classify, p0Vec, p1Vec, pClass1):
    p1 = sum(vec2Classify * p1Vec) + log(pClass1)    #element-wise mult
    p0 = sum(vec2Classify * p0Vec) + log(1.0 - pClass1)
    if p1 > p0:
        return 1
    else:
        return 0

def testingNB():
    listOPosts,listClasses = loadDataSet()
    myVocabList = createVocabList(listOPosts)
    trainMat=[]
    for postinDoc in listOPosts:
        trainMat.append(setOfWords2Vec(myVocabList, postinDoc))
    p0V,p1V,pAb = trainNB0(array(trainMat),array(listClasses))
    testEntry = ['love', 'my', 'dalmation']
    thisDoc = array(setOfWords2Vec(myVocabList, testEntry))
    print testEntry,'classified as: ',classifyNB(thisDoc,p0V,p1V,pAb)
    testEntry = ['stupid', 'garbage']
    thisDoc = array(setOfWords2Vec(myVocabList, testEntry))
    print testEntry,'classified as: ',classifyNB(thisDoc,p0V,p1V,pAb)
>>> reload(bayes)
<module 'bayes' from 'D:\Python27\bayes.pyc'>
>>> bayes.testingNB()
['love', 'my', 'dalmation'] classified as:  0
['stupid', 'garbage'] classified as:  1
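To connect this output back to the formulas, here is a minimal sketch that traces the second test post by hand, assuming bayes.py contains the modified trainNB0 (ones / 2.0 / log): the score for each class is the sum of the log-probabilities of the words present plus the log of the class prior, and the larger score wins.

from numpy import array, log
from bayes import loadDataSet, createVocabList, setOfWords2Vec, trainNB0

listOPosts, listClasses = loadDataSet()
myVocabList = createVocabList(listOPosts)
trainMat = [setOfWords2Vec(myVocabList, post) for post in listOPosts]
p0V, p1V, pAb = trainNB0(array(trainMat), array(listClasses))

thisDoc = array(setOfWords2Vec(myVocabList, ['stupid', 'garbage']))
p1 = sum(thisDoc * p1V) + log(pAb)          # log-score for the abusive class
p0 = sum(thisDoc * p0V) + log(1.0 - pAb)    # log-score for the normal class
print(p1 > p0)                              # True -> classified as 1, as above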

The bag-of-words document model

def bagOfWords2VecMN(vocabList, inputSet):
    returnVec = [0]*len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] += 1
    return returnVec

This is almost identical to setOfWords2Vec(); the only difference is that every time a word is encountered, the corresponding value in the vector is incremented instead of just being set to 1.
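A quick sketch of the difference, using a tiny made-up vocabulary and a post with a repeated word (both are purely illustrative):

from bayes import setOfWords2Vec, bagOfWords2VecMN

vocab = ['dog', 'stupid', 'worthless']       # illustrative three-word vocabulary
post = ['stupid', 'dog', 'stupid']           # 'stupid' occurs twice

print(setOfWords2Vec(vocab, post))           # [1, 1, 0] -- presence/absence only
print(bagOfWords2VecMN(vocab, post))         # [1, 2, 0] -- actual occurrence counts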

Original post: https://www.cnblogs.com/xiaoyingying/p/7515889.html