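Data Analysis Series: Condensed Highlights (Part 3)

Data Analysis (Part 3)

Before analyzing the UCI data, it helps to review a few decision tree concepts.

  • A recommended blog post on decision trees:
    http://www.cnblogs.com/yonghao/p/5061873.html
  • Basic characteristics of a decision tree (DT)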

    • DT is a supervised learning method, so it needs labeled data

    • It runs as a single process, so it is not a good fit for giant datasets

    • PS: It works quite well on small, clean datasets

  • Characteristics of the UCI credit approval data set

    • 690 data entries, a relatively small dataset

    • 15 attributes, which is also quite small

    • missing values affect only about 5% of the data

    • 2 classes

  • Comparing these two lists, we can expect DT to work well for our dataset
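The properties listed above can be checked directly once a file has been loaded. The sketch below is only illustrative: `summarize` is a hypothetical helper that is not part of the skeleton, and the three rows are made-up stand-ins rather than real entries from the data set. It assumes that each row ends with the class label and that missing fields appear as "?".

```python
def summarize(rows, missing_marker="?"):
    """Return basic facts about a table whose last column is the class label."""
    n_rows = len(rows)
    # Every column except the last one is an attribute.
    n_attributes = len(rows[0]) - 1 if rows else 0
    # Count rows that contain at least one missing field.
    rows_with_missing = sum(1 for row in rows if missing_marker in row)
    classes = sorted({row[-1] for row in rows})
    return {
        "rows": n_rows,
        "attributes": n_attributes,
        "missing_share": rows_with_missing / n_rows if n_rows else 0.0,
        "classes": classes,
    }


# Made-up rows for illustration only (3 attributes + label).
sample = [
    ["a", "58.67", "t", "+"],
    ["b", "?", "f", "-"],
    ["a", "24.50", "t", "+"],
]
summary = summarize(sample)
print(summary)
```

Running the same helper on the real file should show whether the row count, attribute count, and number of classes match the figures quoted above.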

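With that in mind, we can try to implement a decision tree in code. Using the skeleton provided by our instructor, Mr. Duan, write your own code in the following steps: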

  • Copy and paste your code to function readfile(file_name) under the comment # Your code here.

  • Make sure your input and output match what I described in the docstring

  • Make a minor improvement to handle missing data: use the string "missing" to represent it. Note that in the raw file a missing value is written as "?".

  • Implement is_missing(value), class_counts(rows), is_numeric(value) as directed in the docstring
  • Implement class Determine. This object represents a node of our DT.
    • It has 2 inputs and a method.

    • We can think of it as the question we are asking at each node.

  • Implement the method partition(rows, question) as described in the docstring
    • Use the Determine class to partition data into 2 groups

  • Implement the method gini(rows) as described in the docstring
    • Here is the formula for Gini impurity: $G = 1 - \sum_{i=1}^{n} p_i^2$

      • where $n$ is the number of classes

      • $p_i$ is the percentage of the given class $i$

  • Implement the method info_gain(left, right, current_uncertainty) as described in the docstring
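    • Here is the formula for Information Gain: $IG = G_{\text{current}} - p_{\text{left}} \, G(\text{left}) - p_{\text{right}} \, G(\text{right})$

      • where $G_{\text{current}}$ is current_uncertainty

      • $p_{\text{left}}$ is the percentage/probability of the left branch, same story for $p_{\text{right}}$

      • $G(\text{left})$ and $G(\text{right})$ are the Gini impurities of the two branches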
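A small hand-checkable case helps when implementing these two methods. The sketch below is a stand-in under stated assumptions, not the skeleton's code: `gini_of_labels` and `gain` are hypothetical names, they take plain label lists rather than full rows, and they use `collections.Counter` for the counting. Gini impurity is taken as one minus the sum of squared class proportions, and information gain as the parent impurity minus the size-weighted impurity of the two branches. A parent with two "+" and two "-" labels then has impurity 1 - (0.5² + 0.5²) = 0.5, and splitting it into two pure halves gives a gain of 0.5.

```python
from collections import Counter


def gini_of_labels(labels):
    """Gini impurity of a list of class labels: 1 - sum(p_i ** 2)."""
    total = len(labels)
    if total == 0:
        return 0.0  # assumption: treat an empty branch as having no impurity
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())


def gain(left, right, current_uncertainty):
    """Parent impurity minus the size-weighted impurity of the two branches."""
    p_left = len(left) / (len(left) + len(right))
    p_right = 1.0 - p_left
    return (current_uncertainty
            - p_left * gini_of_labels(left)
            - p_right * gini_of_labels(right))


parent = ["+", "+", "-", "-"]
# A split that separates the classes completely.
perfect = gain(["+", "+"], ["-", "-"], gini_of_labels(parent))
# A split that leaves both branches as mixed as the parent.
useless = gain(["+", "-"], ["+", "-"], gini_of_labels(parent))
print(gini_of_labels(parent), perfect, useless)
```

A split that leaves both branches as mixed as the parent yields a gain of 0, which makes it a convenient second check.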

  • My code is as follows, for reference only

    def readfile(file_name):
       """
      This function reads data file and returns structured and cleaned data in a list
      :param file_name: relative path under data folder
      :return: data, in this case it should be a 2-D list of the form
      [[data1_1, data1_2, ...],
        [data2_1, data2_2, ...],
        [data3_1, data3_2, ...],
        ...]
       
      i.e.
      [['a', 58.67, 4.46, 'u', 'g', 'q', 'h', 3.04, 't', 't', 6.0, 'f', 'g', '00043', 560.0, '+'],
        ['a', 24.5, 0.5, 'u', 'g', 'q', 'h', 1.5, 't', 'f', 0.0, 'f', 'g', '00280', 824.0, '+'],
        ['b', 27.83, 1.54, 'u', 'g', 'w', 'v', 3.75, 't', 't', 5.0, 't', 'g', '00100', 3.0, '+'],
      ...]
       
      Couple things you should note:
      1. You need to handle missing data. In this case let's use "missing" to represent all missing data
      2. Be careful of data types. For instance,
          "58.67" and "0.2356" should be number and not a string
          "00043" should be string but not a number
          It is OK to treat all numbers as float in this case. (You don't need to worry about differentiating integer and float)
      """
       # Your code here
       # avoid shadowing built-in names (list, str, ...) unless you mean to replace them
       output = []
       # the with-block closes the file even if parsing raises an error
       with open(file_name, 'r') as data_file:
           for line in data_file:
               line = line.strip()
               if not line:
                   # skip blank lines, e.g. an empty line at the end of the file
                   continue
               row = []
               for substr in line.split(","):
                   if substr == '?':
                       # the raw file marks missing data with "?"
                       row.append('missing')
                   elif substr.isdigit():
                       if len(substr) > 1 and substr.startswith('0'):
                           # zero-padded codes such as "00043" stay strings
                           row.append(substr)
                       else:
                           row.append(int(substr))
                   else:
                       try:
                           row.append(float(substr))
                       except ValueError:
                           # not a number: keep the categorical value as a string
                           row.append(substr)
               output.append(row)
       return output




    def is_missing(value):
       """
      Determines if the given value is a missing data, please refer back to readfile() where we defined what is a "missing" data
      :param value: value to be checked
      :return: boolean (True, False) of whether the input value is the same as our "missing" notation
      """
       return value == 'missing'


    def class_counts(rows):
       """
      Count how many data samples there are for each label
      :param rows: Input is a 2D list in the form of what you have returned in readfile()
      :return: Output is a dictionary/map in the form:
      {"label_1": #count,
        "label_2": #count,
        "label_3": #count,
        ...
      }
      """
       # This first version is hard-coded: it only counts the labels given here ('+', '-'). The versions below extend it to arbitrary labels
       # label_dict = {}
       # count1 = 0
       # count2 = 0
       # # rows is the result returned by readfile
       # for row in rows:
       #     if row[-1] == '+':
       #         count1 += 1
       #     elif row[-1] == '-':
       #         count2 += 1
       # label_dict['+'] = count1
       # label_dict['-'] = count2
       # return label_dict

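       # Extension 1
       # This version counts any set of labels using two loops. The first loop collects the distinct labels found in the data into lable_list.
       # The second loop walks through lable_list and, in a nested loop over all rows, counts the rows whose label matches the current one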
       # label_dict = {}
       # lable_list = []
       # for row in rows:
       #     lable = row[-1]
       #     if lable_list == []:
       #         lable_list.append(lable)
       #     else:
       #         if lable in lable_list:
       #             continue
       #         else:
       #             lable_list.append(lable)
       #
       # for lable_i in lable_list:
       #     count_row_i = 0
       #     for row_i in rows:
       #         if lable_i == row_i[-1]:
       #             count_row_i += 1
       #     label_dict[lable_i] = count_row_i
       # print(label_dict)
       # return label_dict
       #

       # Extension 2
       # This version uses the dictionary's own keys to remember which labels have been seen and accumulates their counts in a single pass
       label_dict = {}
       for row in rows:
           label = row[-1]
           if label in label_dict:
               label_dict[label] += 1
           else:
               label_dict[label] = 1
       return label_dict


    def is_numeric(value):
       """
      Test if the input is a number(float/int)
      :param value: Input is a value to be tested
      :return: Boolean (True/False)
      """
       # Your code here
       # An earlier attempt used eval() to evaluate the string and inspect the type of the result:
       # if type(eval(str(value))) == int or type(eval(str(value))) == float:
       #     return True
       # eval() is not needed here, and some blog posts point out that it carries security risks

       # readfile() already stores numbers as int/float and leaves letters, "missing",
       # and zero-padded codes such as "00043" as strings, so a type check is enough.
       # This also accepts values such as 0 and 0.5, whose string form starts with "0".
       # bool is excluded explicitly because it is a subclass of int.
       return isinstance(value, (int, float)) and not isinstance(value, bool)


    class Determine:
       """
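      This class performs a comparison. It stores a column index and a value.
      The match method compares numbers or strings.
      Think of it as the "question" asked at each node of the decision tree, for example:
          Is today's temperature cold or hot?
          Is today's weather sunny, cloudy, or rainy?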
      """
       def __init__(self, column, value):
           """
          initial structure of our object
          :param column: column index of our "question"
          :param value: splitting value of our "question"
          """
           self.column = column
           self.value = value

       def match(self, example):
           """
          Compares example data and self.value
          note that you need to determine whether the data asked is numeric or categorical/string
          Be careful for missing data
          :param example: a full row of data
          :return: boolean(True/False) of whether input data is GREATER THAN self.value (numeric) or the SAME AS self.value (string)
          """
           # Your code here. "missing" is a string too, so it needs no special check
           e_index = self.column
           value_node = self.value
           # The second condition was added for e_index = 10: that column mixes zero-padded strings
           # with int values, and int and str cannot be compared with ">".
           # Both sides must be numbers; note that "a and b or c" parses as "(a and b) or c".
           if is_numeric(example[e_index]) and isinstance(value_node, (int, float)):
               return example[e_index] > value_node
           else:
               return example[e_index] == value_node


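       def __repr__(self):
           """
          Used when printing the tree
          :return: the question as a readable string
          """
           # header is the list of column names; it is not defined in this listing and must exist at module level
           if is_numeric(self.value):
               # match() tests "greater than", so print ">" rather than ">="
               condition = ">"
           else:
               condition = "is"
           return "{} {} {}?".format(
               header[self.column], condition, str(self.value))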


    def partition(rows, question):
       """
      Split the data: rows that satisfy the question go into true_rows, the rest go into false_rows
      :param rows: data set/subset
      :param question: Determine object you implemented above
      :return: 2 lists based on the answer of the question
      """
       # Your code here. question is a Determine object
       true_rows, false_rows = [], []
       # Iterate over the 2-D list because Determine.match only looks at one row (a 1-D list) and its chosen column at a time
       for row in rows:
           if question.match(row):
               true_rows.append(row)
           else:
               false_rows.append(row)
       return true_rows, false_rows


    def gini(rows):
       """
      Compute the Gini value of a set of rows, a measure of how mixed the labels are
      :param rows: data set/subset
      :return: Gini value, the "impurity"
      """
       data_set_size = len(rows)    # total number of rows
       class_dict = class_counts(rows)
       sum_subgini = 0
       for class_dict_value in class_dict.values():
           sub_gini = (class_dict_value/data_set_size) ** 2
           sum_subgini += sub_gini
       # named impurity so the local variable does not shadow the gini() function
       impurity = 1 - sum_subgini
       return impurity



    def info_gain(left, right, current_uncertainty):
       """
      Compute the information gain
      Please refer to the .md tutorial for details
      :param left: left branch
      :param right: right branch
      :param current_uncertainty: current uncertainty (data)
      """
       p_left = len(left) / (len(left) + len(right))
       p_right = 1 - p_left
       return current_uncertainty - p_left * gini(left) - p_right * gini(right)




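    # Use this data to test the quality of your own code
    data = readfile("E:datacrx.data")
    # the split value must be a number (1.8, not '1.8') so that column 2 is compared numerically
    t, f = partition(data, Determine(2, 1.8))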
    print(info_gain(t, f, gini(data)))
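Because the check above depends on a local copy of the data file, a tiny in-memory table can serve as a quicker smoke test. The following is a minimal sketch and not the code above: `matches` and `split_rows` are hypothetical stand-ins for Determine.match and partition, and the rows are invented. It assumes the same conventions as readfile(): numbers are int/float and everything else, including "missing", is a string.

```python
def matches(row, column, value):
    """Numeric question: is row[column] greater than value? Otherwise: is it equal?"""
    cell = row[column]
    both_numeric = isinstance(cell, (int, float)) and isinstance(value, (int, float))
    return cell > value if both_numeric else cell == value


def split_rows(rows, column, value):
    """Send each row to the true or false branch depending on the question."""
    true_rows, false_rows = [], []
    for row in rows:
        (true_rows if matches(row, column, value) else false_rows).append(row)
    return true_rows, false_rows


# Invented rows: [category, amount, label]
rows = [
    ["a", 4.46, "+"],
    ["b", 0.5, "-"],
    ["a", "missing", "-"],
    ["b", 3.0, "+"],
]

high, low = split_rows(rows, 1, 1.8)    # numeric question on column 1
is_a, not_a = split_rows(rows, 0, "a")  # categorical question on column 0
print(len(high), len(low), len(is_a), len(not_a))
```

In this sketch the row whose amount is "missing" lands in the false branch of the numeric question instead of raising a TypeError, which is the behavior worth confirming in your own match implementation.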

 

January 2, 2019

Original article: https://www.cnblogs.com/jcjc/p/10234562.html