西瓜书4.4 基于基尼指数选择划分的决策树 预剪枝与后剪枝

当然是基于上一次决策树的基础上来修改。

首先把数据集和属性集调整一下,删去连续属性,并且划分为训练集和测试集,划分方式与书上相同,后面的函数中连续属性相关的以及信息增益的计算也都可以删掉了。

然后用pycharm里的快捷替换字符的功能(选中后ctrl+r)把所有的8改成6(也就是标记值的位置)。

写计算对应属性的基尼指数的函数,依照基尼指数的定义编写就行,遍历一遍当前节点的样本集然后计数即可,代码如下:

    def giniv(self,order):
        s=0#属性order的基尼指数
        for i in range(num[order]):
            att=0#属性值等于i时的总样本数
            tru=0#属性值等于i时的样本中标记为真的个数
            for j in range(len(self.sample)):
                 if(D[self.sample[j]][order]==i):
                    att+=1
                    if(D[self.sample[j]][6]==1):
                         tru+=1
            if(att==0):
                tru=0
            else:
                tru/=att
            s=s+att/len(self.sample)*(1-tru*tru-(1-tru)*(1-tru))
        return s

最佳划分属性的选择函数也可以简化,只要分别计算每个属性的基尼指数入栈然后对比取最大即可:

    def best_divide(self):
        gini=[]
        for i in range(len(self.attri_set)):
            order = self.attri_set[i]  # 计算第order个属性的基尼指数
            gini.append([self.giniv(order),order])
        maxx = 0
        print(gini)
        for i in range(len(self.attri_set)):
            if(gini[i][0]<gini[maxx][0]):
                maxx=i
        self.root = 1
        self.attribu=gini[maxx][1]
        return gini[maxx]

然后就是做预剪枝的部分,由于预剪枝比较的是正确率,那么比较正确个数也无不可,还能规避可能出现的除法导致的问题,思路就是计算父节点的测试集正确个数以及子节点中测试集正确个数之和,如果父节点的大于等于子节点的之和,那就从tree列表中pop出子节点,将父节点的节点属性设置为叶节点,划分属性设置为‘好瓜与否’,因此这一过程要在父节点生成完子节点后进行,也就是插入在build_son函数的末尾。而为了方便地实现对测试集的统计,在类属性中增加训练集dt,每次产生子节点时依照得到的划分属性将父节点的测试集分配给子节点,那么预剪枝函数只需要实现对本节点测试集正确数的计数以及对父节点、子节点正确数个数的统计和比较两种功能即可,将这两个功能分到两个函数中实现:

    def pre_compare(self):
        k=tree[self.father].coun_labe()
        s1=0
        for i in range(len(self.dt)):
            if(Dt[self.dt[i]][6]==k):
                s1+=1
        return s1

这是用于返回测试集正确率的

    def pre_cut(self):
        s1=self.pre_compare()
        s2=0
        for i in range(len(self.leaf)):
            s2+=tree[self.leaf[i][0]].pre_compare()
        if(s1>=s2):
            for i in range(num[self.attribu]):
                tree.pop()
            self.attribu=6
            self.root=2
        return

这是用于完成预剪枝的剩下部分工作的。

微妙的是依照我现在的如果是、否样本数量相同默认为是的统计前提,这个决策树最终会被剪到只剩一个源节点。

完整代码如下:

D = [[2, 1, 2, 3, 3, 1, 1], 
     [3, 1, 1, 3, 3, 1, 1], 
     [3, 1, 2, 3, 3, 1, 1], 
     [2, 2, 2, 3, 2, 0, 1], 
     [3, 2, 2, 2, 2, 0, 1], 
     [2, 3, 3, 3, 1, 0, 0], 
     [1, 2, 1, 2, 3, 1, 0], 
     [3, 2, 2, 3, 2, 0, 0], 
     [1, 1, 2, 1, 1, 1, 0], 
     [2, 1, 1, 2, 2, 1, 0]]

Dt=[[2, 1, 1, 3, 3, 1, 1], 
    [1, 1, 2, 3, 3, 1, 1], 
    [3, 2, 2, 3, 2, 1, 1], 
    [3, 2, 1, 2, 2, 1, 0], 
    [1, 3, 3, 1, 1, 1, 0], 
    [1, 1, 2, 1, 1, 0, 0], 
    [2, 2, 2, 2, 3, 1, 0]]

Y = ['色泽', '根蒂', '敲声', '纹理', '脐部', '触感','好瓜与否']
num = [3, 3, 3, 3, 3, 2]
tree = []


class node():
    def __init__(self, root, Dd, Aa, father, nu, floor,Dt):
        self.root = root  # 树根为0,中间为1,叶子为2
        self.num = nu  # 在树列表中的序号
        self.sample = Dd  # 样本集
        self.attribu = 6  # 叶节点属性
        self.attri_set = Aa  # 属性集
        self.danum = self.coun_labe()  # 连续属性划分时的分类值,叶节点的类别值
        self.leaf = []  # 子节点
        self.father = father  # 父节点
        self.floor = floor
        self.dt=Dt
        return

    def chos_chek(self,order):
        mark=-1
        kind=-1
        if(self.root==2):
            mark=self.danum
            kind=0
        else:
            k=self.danum
            if(self.attribu<5):
                k-=1
            mark=self.leaf[k]
            kind=1
        return [mark,kind]

    def pre_compare(self):
        k=tree[self.father].coun_labe()
        s1=0
        for i in range(len(self.dt)):
            if(Dt[self.dt[i]][6]==k):
                s1+=1
        return s1

    def pre_cut(self):
        s1=self.pre_compare()
        s2=0
        for i in range(len(self.leaf)):
            s2+=tree[self.leaf[i][0]].pre_compare()
        if(s1>=s2):
            for i in range(num[self.attribu]):
                tree.pop()
            self.attribu=6
            self.root=2
        return
    def all_label_same(self):
        k = D[self.sample[0]][6]
        mark = 1
        for i in range(len(self.sample)):
            if (D[self.sample[i]][6] != k):
                mark = 0
                break
        if (mark == 1):
            self.danum = k
            self.root = 2
        return mark

    def all_same(self):
        k = D[self.sample[0]]
        mark = 1
        for i in range(len(self.sample)):
            if (k != D[self.sample[i]]):
                mark = 0
        return mark

    def coun_labe(self):  # 计算样本中数量较多的类别
        s = 0
        mark = 0
        for i in range(len(self.sample)):
            s = s + D[self.sample[i]][6]
        if (s >= (len(self.sample) / 2)):
            mark = 1
        return mark

    def all_b_att_same(self):  # 全空或全同
        mark = 0
        if (len(self.attri_set) == 0):
            mark = 1
            self.danum = tree[self.father].coun_labe
            self.root = 2
        elif (self.all_same()):
            mark = 1
            self.danum = self.coun_labe()
            self.root = 2
        return mark

    def giniv(self,order):
        s=0#属性order的基尼指数
        for i in range(num[order]):
            att=0#属性值等于i时的总样本数
            tru=0#属性值等于i时的样本中标记为真的个数
            for j in range(len(self.sample)):
                 if(D[self.sample[j]][order]==i):
                    att+=1
                    if(D[self.sample[j]][6]==1):
                         tru+=1
            if(att==0):
                tru=0
            else:
                tru/=att
            s=s+att/len(self.sample)*(1-tru*tru-(1-tru)*(1-tru))
        return s

    def best_divide(self):
        gini=[]
        for i in range(len(self.attri_set)):
            order = self.attri_set[i]  # 计算第order个属性的基尼指数
            gini.append([self.giniv(order),order])
        maxx = 0
        print(gini)
        for i in range(len(self.attri_set)):
            if(gini[i][0]<gini[maxx][0]):
                maxx=i
        self.root = 1
        self.attribu=gini[maxx][1]
        return gini[maxx]

    def build_son(self):
        mark = 0
        if (len(self.sample) == 0 or self.sample == []):
            mark = 0
            self.root = 2
            self.danum = tree[self.father].coun_labe()
            # print('空')
        elif (self.all_label_same()):
            mark = 0
            # print('全同类')
        elif (self.all_b_att_same()):
            mark = 0
            # print('属性集空或全部属性相同')
        else:
            mark = 1
            bd = self.best_divide()
            # print(bd)
            pass_a = []
            for i in range(len(self.attri_set)):
                if (self.attri_set[i] != bd[1]):
                    pass_a.append(self.attri_set[i])
            for i in range(num[bd[1]]):
                k = i
                pass_d = []
                pass_dt=[]
                if (bd[1] != 5):
                    k += 1
                for j in range(len(self.sample)):
                    if (D[self.sample[j]][bd[1]] == k):
                        pass_d.append(self.sample[j])
                for j in range(len(self.dt)):
                    if(Dt[self.sample[j]][bd[1]]==k):
                        pass_dt.append(self.dt[j])
                lon = len(tree)
                self.leaf.append([lon, k])
                tree.append(node(1, pass_d, pass_a, self.num, lon, self.floor + 1,pass_dt))
            self.pre_cut()
            # for i in range(len(self.leaf)):
            #  tree[self.leaf[i][0]].build_son()
        return mark

    def prin(self):
        print('root:', self.root, ' 划分属性(节点属性):', Y[self.attribu], ' ', self.danum, ' floor:', self.floor)


star = node(0, [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], [0, 1, 2, 3, 4, 5], 0, 0, 0,[0, 1, 2, 3, 4, 5, 6])
tree.append(star)
star.build_son()
tail = 0

while (tail != (len(tree) - 1)):
    las = len(tree) - 1
    k = tail
    for i in range(k + 1, las + 1):
        tree[i].prin()
        tree[i].build_son()

flm = tree[len(tree) - 1].danum + 1
for i in range(flm):
    for j in range(len(tree)):
        if (tree[j].danum == i):
            tree[j].prin()
    print('
')

然后是后剪枝。我们可以看到后剪枝与预剪枝并没有本质上的过程区别,换句话说就是,函数基本可以照抄预剪枝。

具体的照抄方法就是在pre_cut()函数中加入一个mark标量用于返回是否进行了剪枝

然后在主函数中进行后剪枝,也就是从后到前,遍历tree数组,首先判断当前节点是否是叶节点,如果是的话储存其父节点的子节点个数k,获取(同时也执行了)父节点的pre_cut(),如果返回值为真(已经剪枝),那就把遍历的指针tail往前挪k-1次,然后无论剪枝与否,tail都往前挪1次,直到tail到0为止。

主函数中增加部分如下:

while(tail>0):
    if(tree[tail].root==2):
        k=num[tree[tree[tail].father].attribu]
        if(tree[tree[tail].father].pre_cut):
            tail=tail-k+1
    tail-=1

全部代码:

D = [[2, 1, 2, 3, 3, 1, 1], 
     [3, 1, 1, 3, 3, 1, 1], 
     [3, 1, 2, 3, 3, 1, 1], 
     [2, 2, 2, 3, 2, 0, 1], 
     [3, 2, 2, 2, 2, 0, 1], 
     [2, 3, 3, 3, 1, 0, 0], 
     [1, 2, 1, 2, 3, 1, 0], 
     [3, 2, 2, 3, 2, 0, 0], 
     [1, 1, 2, 1, 1, 1, 0], 
     [2, 1, 1, 2, 2, 1, 0]]

Dt=[[2, 1, 1, 3, 3, 1, 1], 
    [1, 1, 2, 3, 3, 1, 1], 
    [3, 2, 2, 3, 2, 1, 1], 
    [3, 2, 1, 2, 2, 1, 0], 
    [1, 3, 3, 1, 1, 1, 0], 
    [1, 1, 2, 1, 1, 0, 0], 
    [2, 2, 2, 2, 3, 1, 0]]

Y = ['色泽', '根蒂', '敲声', '纹理', '脐部', '触感','好瓜与否']
num = [3, 3, 3, 3, 3, 2]
tree = []


class node():
    def __init__(self, root, Dd, Aa, father, nu, floor,Dt):
        self.root = root  # 树根为0,中间为1,叶子为2
        self.num = nu  # 在树列表中的序号
        self.sample = Dd  # 样本集
        self.attribu = 6  # 叶节点属性
        self.attri_set = Aa  # 属性集
        self.danum = self.coun_labe()  # 连续属性划分时的分类值,叶节点的类别值
        self.leaf = []  # 子节点
        self.father = father  # 父节点
        self.floor = floor
        self.dt=Dt
        return

    def chos_chek(self,order):
        mark=-1
        kind=-1
        if(self.root==2):
            mark=self.danum
            kind=0
        else:
            k=self.danum
            if(self.attribu<5):
                k-=1
            mark=self.leaf[k]
            kind=1
        return [mark,kind]

    def pre_compare(self):
        k=tree[self.father].coun_labe()
        s1=0
        for i in range(len(self.dt)):
            if(Dt[self.dt[i]][6]==k):
                s1+=1
        return s1

    def pre_cut(self):
        s1=self.pre_compare()
        s2=0
        mark=0
        for i in range(len(self.leaf)):
            s2+=tree[self.leaf[i][0]].pre_compare()
        if(s1>=s2):
            mark=1
            for i in range(num[self.attribu]):
                tree.pop()
            self.attribu=6
            self.root=2
        return mark

    def all_label_same(self):
        k = D[self.sample[0]][6]
        mark = 1
        for i in range(len(self.sample)):
            if (D[self.sample[i]][6] != k):
                mark = 0
                break
        if (mark == 1):
            self.danum = k
            self.root = 2
        return mark

    def all_same(self):
        k = D[self.sample[0]]
        mark = 1
        for i in range(len(self.sample)):
            if (k != D[self.sample[i]]):
                mark = 0
        return mark

    def coun_labe(self):  # 计算样本中数量较多的类别
        s = 0
        mark = 0
        for i in range(len(self.sample)):
            s = s + D[self.sample[i]][6]
        if (s >= (len(self.sample) / 2)):
            mark = 1
        return mark

    def all_b_att_same(self):  # 全空或全同
        mark = 0
        if (len(self.attri_set) == 0):
            mark = 1
            self.danum = tree[self.father].coun_labe
            self.root = 2
        elif (self.all_same()):
            mark = 1
            self.danum = self.coun_labe()
            self.root = 2
        return mark

    def giniv(self,order):
        s=0#属性order的基尼指数
        for i in range(num[order]):
            att=0#属性值等于i时的总样本数
            tru=0#属性值等于i时的样本中标记为真的个数
            for j in range(len(self.sample)):
                 if(D[self.sample[j]][order]==i):
                    att+=1
                    if(D[self.sample[j]][6]==1):
                         tru+=1
            if(att==0):
                tru=0
            else:
                tru/=att
            s=s+att/len(self.sample)*(1-tru*tru-(1-tru)*(1-tru))
        return s

    def best_divide(self):
        gini=[]
        for i in range(len(self.attri_set)):
            order = self.attri_set[i]  # 计算第order个属性的基尼指数
            gini.append([self.giniv(order),order])
        maxx = 0
        #print(gini)
        for i in range(len(self.attri_set)):
            if(gini[i][0]<gini[maxx][0]):
                maxx=i
        self.root = 1
        self.attribu=gini[maxx][1]
        return gini[maxx]

    def build_son(self):
        mark = 0
        if (len(self.sample) == 0 or self.sample == []):
            mark = 0
            self.root = 2
            self.danum = tree[self.father].coun_labe()
            # print('空')
        elif (self.all_label_same()):
            mark = 0
            # print('全同类')
        elif (self.all_b_att_same()):
            mark = 0
            # print('属性集空或全部属性相同')
        else:
            mark = 1
            bd = self.best_divide()
            # print(bd)
            pass_a = []
            for i in range(len(self.attri_set)):
                if (self.attri_set[i] != bd[1]):
                    pass_a.append(self.attri_set[i])
            for i in range(num[bd[1]]):
                k = i
                pass_d = []
                pass_dt=[]
                if (bd[1] != 5):
                    k += 1
                for j in range(len(self.sample)):
                    if (D[self.sample[j]][bd[1]] == k):
                        pass_d.append(self.sample[j])
                for j in range(len(self.dt)):
                    if(Dt[self.sample[j]][bd[1]]==k):
                        pass_dt.append(self.dt[j])
                lon = len(tree)
                self.leaf.append([lon, k])
                tree.append(node(1, pass_d, pass_a, self.num, lon, self.floor + 1,pass_dt))
            # for i in range(len(self.leaf)):
            #  tree[self.leaf[i][0]].build_son()
        return mark

    def prin(self):
        print('root:', self.root, ' 划分属性(节点属性):', Y[self.attribu], ' ', self.danum, ' floor:', self.floor)


star = node(0, [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], [0, 1, 2, 3, 4, 5], 0, 0, 0,[0, 1, 2, 3, 4, 5, 6])
tree.append(star)
star.build_son()
tail = 0

while (tail != (len(tree) - 1)):
    las = len(tree) - 1
    k = tail
    for i in range(k + 1, las + 1):
        tree[i].prin()
        tree[i].build_son()

while(tail>0):
    if(tree[tail].root==2):
        k=num[tree[tree[tail].father].attribu]
        if(tree[tree[tail].father].pre_cut):
            tail=tail-k+1
    tail-=1

flm = tree[len(tree) - 1].danum + 1
for i in range(flm):
    for j in range(len(tree)):
        if (tree[j].danum == i):
            tree[j].prin()
    print('
')

由于结果的不易验证,暂且就先这样了(躺)。如果有错误欢迎指正交流。

原文地址:https://www.cnblogs.com/forever3329/p/14099475.html