1. Introduction

This post describes a simple way to extract and cluster opinions from text without supervision.

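2. Opinion Extraction

Opinion extraction works on the dependency parse: following a fixed set of dependency structures, it pulls the important parts out of the original text to represent the main meaning of the whole sentence.

I consider three dependency relations to be the important ones: "动补结构" (verb-complement), "动宾关系" (verb-object), and "介宾关系" (preposition-object). The unimportant ones are "定中关系" (attribute-head), "状中结构" (adverbial-head), and "主谓关系" (subject-verb). Extraction starts from the head word, ROOT.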
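To make the rule concrete, here is a minimal sketch; it is not the code from the repository. It assumes a hand-written parse: the `Token` tuple, the `extract_opinion` function, and the relation labels on the sample tokens are my own illustration, not real pyhanlp output. The rule is also simplified: start from the word attached to ROOT and keep pulling in dependents that are linked by one of the three important relations.

```python
from typing import List, NamedTuple, Set


class Token(NamedTuple):
    index: int     # 0-based position in the sentence
    word: str
    head: int      # 0-based index of the head word, -1 means ROOT
    relation: str  # dependency label


# The three relations treated as important in this post.
KEEP_RELATIONS = {"动补结构", "动宾关系", "介宾关系"}


def extract_opinion(tokens: List[Token], keep: Set[str] = KEEP_RELATIONS) -> str:
    # Start from the word(s) attached directly to ROOT.
    selected = {t.index for t in tokens if t.head == -1}
    # Keep adding dependents whose relation is in `keep`
    # and whose head has already been selected.
    changed = True
    while changed:
        changed = False
        for t in tokens:
            if t.index not in selected and t.head in selected and t.relation in keep:
                selected.add(t.index)
                changed = True
    # Emit the selected words in their original sentence order.
    return "".join(t.word for t in tokens if t.index in selected)


# Hypothetical parse of "这个手机是正品吗？", written by hand for illustration.
tokens = [
    Token(0, "这个", 1, "定中关系"),
    Token(1, "手机", 2, "主谓关系"),
    Token(2, "是", -1, "核心关系"),
    Token(3, "正品", 2, "动宾关系"),
    Token(4, "吗", 2, "右附加关系"),
    Token(5, "？", 2, "标点符号"),
]
print(extract_opinion(tokens))
```

This simplified rule keeps only 是 and 正品 for the sample sentence. The real method also handles keywords and multiple roots, so its output can contain more words.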

The main extraction method is shown below; for the complete code, please see GitHub.

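'''
Keyword-based opinion extraction: given the keyword key, find the root path that contains it and extract the opinion under that root. The extraction works in basically the same way as parseSentence.
Extracting opinions from multiple roots is supported.
'''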
def parseSentWithKey(self, sentence, key=None):
    # key is the keyword: if it is given, only sentences that contain it are analyzed; if it is not given, no check is made.
    if key:
        keyIndex = 0
        if key not in sentence:
            return []
    rootList = []
    parse_result = str(self.hanlp.parseDependency(sentence)).strip().split('\n')
    # Shift the indices by -1, because the indices returned by pyhanlp start at 1.
    for i in range(len(parse_result)):
        parse_result[i] = parse_result[i].split('\t')
        parse_result[i][0] = int(parse_result[i][0]) - 1
        parse_result[i][6] = int(parse_result[i][6]) - 1
        if key and parse_result[i][1] == key:
            keyIndex = i

    for i in range(len(parse_result)):
        self_index = int(parse_result[i][0])
        target_index = int(parse_result[i][6])
        relation = parse_result[i][7]
        if relation in self.main_relation:
            if self_index not in rootList:
                rootList.append(self_index)
        # Find multiple roots: a word coordinated with a root ("并列关系") is also a root
        elif relation == "并列关系" and target_index in rootList:
            if self_index not in rootList:
                rootList.append(self_index)


        if len(parse_result[target_index]) == 10:
            parse_result[target_index].append([])

        # Add an 11th field to each dependency row: the list of indices of the words that depend on it
        if target_index != -1 and not (relation == "并列关系" and target_index in rootList):
            parse_result[target_index][10].append(self_index)
    
    # Find the root path that contains key
    if key:
        rootIndex = 0
        if len(rootList) > 1:
            target = keyIndex
            while True:
                if target in rootList:
                    rootIndex = rootList.index(target)
                    break
                next_item = parse_result[target]
                target = int(next_item[6])
        loopRoot = [rootList[rootIndex]]
    else:
        loopRoot = rootList

    result = {}
    related_words = set()
    for root in loopRoot:
        # Add key and root to result
        if key:
            self.addToResult(parse_result, keyIndex, result, related_words)
        self.addToResult(parse_result, root, result, related_words)

    # Select the opinion words according to '动补结构', '动宾关系', '介宾关系'
    for item in parse_result:
        relation = item[7]
        target = int(item[6])
        index = int(item[0])
        if relation in self.reverse_relation and target in result and target not in related_words:
            self.addToResult(parse_result, index, result, related_words)

    # Add the keyword
    for item in parse_result:
        word = item[1]
        if word == key:
            result[int(item[0])] = word

    # Order the words in result by their original position in the sentence (needs `import operator` at module level)
    sorted_keys = sorted(result.items(), key=operator.itemgetter(0))
    selected_words = [w[1] for w in sorted_keys]
    return selected_words

This method gives us the opinion for each sentence. Next, we cluster all the opinions.

2.1 Opinion extraction results

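Original sentence	Opinion
这个手机是正品吗？ (Is this phone genuine?)	手机是正品 (phone is genuine)
礼品是一些什么东西？ (What kind of things are the gifts?)	礼品是什么东西 (gifts are what things)
现在都送什么礼品啊 (What gifts are being given away now?)	都送什么礼品 (give what gifts)
直接付款是怎么付的啊 (How does direct payment work?)	付款是怎么付 (payment is how paid)
如果不满意也可以退货的吧 (If I'm not satisfied I can return it, right?)	不满意可以退货 (not satisfied can return)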

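3. Opinion Clustering

There are a few ways to cluster opinions:

Directly compute the similarity between two opinions. (This is the method I use.)
Convert the opinions to vectors and compare their cosine distance.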
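Both options can be sketched in a few lines. This is only an illustration under assumptions: `difflib_similarity` and `cosine_similarity` are hypothetical helper names, and the vectors here are plain character counts, which is just one simple way to vectorize an opinion.

```python
import difflib
import math
from collections import Counter


def difflib_similarity(a: str, b: str) -> float:
    # Ratio in [0, 1] based on the matching character blocks of the two strings.
    return difflib.SequenceMatcher(None, a, b).ratio()


def cosine_similarity(a: str, b: str) -> float:
    # Represent each string as a bag of characters and compare the count vectors.
    va, vb = Counter(a), Counter(b)
    dot = sum(va[ch] * vb[ch] for ch in va)
    norm_a = math.sqrt(sum(v * v for v in va.values()))
    norm_b = math.sqrt(sum(v * v for v in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


print(difflib_similarity("可以优惠多少", "能优惠多少"))
print(cosine_similarity("可以优惠多少", "能优惠多少"))
```

The difflib score depends on character order, while the bag-of-characters cosine ignores it, so the two can rank pairs such as 花呗可以分期 and 可以花呗分期 differently.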

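My approach uses difflib to compare every pair of opinions. Its time complexity is high (O(n^2)), so I use a small trick to speed it up. The code is as follows: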

def extractor(self):
    de = DependencyExtraction()
    opinionList = OpinionCluster()
    for sent in self.sentences:
        # Use the first configured keyword that appears in the sentence, if any.
        keyword = ""
        for word in self.keyword or []:
            if word in sent:
                keyword = word
                break

        opinion = "".join(de.parseSentWithKey(sent, keyword))
        if self.filterOpinion(opinion):
            opinionList.addOpinion(Opinion(sent, opinion, keyword))


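    '''
        Two thresholds are used here. The small threshold first cuts the large data set into small blocks; because it is small, opinions that belong to the same class mostly still land in the same block.
        Each small block is then clustered separately, which makes clustering much faster: thresholds=[0.2, 0.6] is roughly 30 times faster than thresholds=[0.6].
        However, [0.2, 0.6] and [0.6] do not give identical final results; some identical opinions get split apart.
    '''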
    thresholds = self.json_config["thresholds"]
    clusters = [opinionList]
    for threshold in thresholds:
        newClusters = []
        for cluster in clusters:
            newClusters += self.clusterOpinion(cluster, threshold)
        clusters = newClusters

    resMaxLen = {}
    for oc in clusters:
        if len(oc.getOpinions()) >= self.json_config["minClusterLen"]:
            summaryStr = oc.getSummary(self.json_config["freqStrLen"])
            resMaxLen[summaryStr] = oc.getSentences()

    return self.sortRes(resMaxLen)
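The staged threshold loop can be tried out in isolation. The sketch below is a stand-in built on assumptions: `clusterOpinion` is not shown in this post, so `greedy_cluster` is a hypothetical replacement that compares each opinion with the first member of every existing cluster using difflib, and the sample opinions are made up.

```python
import difflib
from typing import List


def similarity(a: str, b: str) -> float:
    return difflib.SequenceMatcher(None, a, b).ratio()


def greedy_cluster(opinions: List[str], threshold: float) -> List[List[str]]:
    # Put each opinion into the first cluster whose first member is
    # similar enough; otherwise start a new cluster.
    clusters: List[List[str]] = []
    for op in opinions:
        for cluster in clusters:
            if similarity(cluster[0], op) >= threshold:
                cluster.append(op)
                break
        else:
            clusters.append([op])
    return clusters


def staged_cluster(opinions: List[str], thresholds: List[float]) -> List[List[str]]:
    # Apply the thresholds in order: each stage re-clusters the blocks
    # produced by the previous stage, never mixing opinions across blocks.
    clusters = [list(opinions)]
    for threshold in thresholds:
        new_clusters: List[List[str]] = []
        for cluster in clusters:
            new_clusters.extend(greedy_cluster(cluster, threshold))
        clusters = new_clusters
    return clusters


opinions = ["可以优惠多少", "能优惠多少", "花呗可以分期", "可以花呗分期", "送无线充电器"]
for cluster in staged_cluster(opinions, [0.2, 0.6]):
    print(cluster)
```

Because a later stage only looks inside one block at a time, two opinions that the coarse threshold separates can never be merged again, which matches the caveat that [0.2, 0.6] and [0.6] may give different results.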

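3.1 Opinion summary

For the opinions grouped into one cluster, pick one that represents the whole cluster well.

My approach is to count character frequencies over all opinions in the cluster, build a string from the high-frequency characters, compute its similarity to every opinion, and take the most similar opinion as the summary of the whole cluster.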

def getSummary(self, freqStrLen):
    opinionStrs = []
    for op in self._opinions:
        opinion = op.opinion
        opinionStrs.append(opinion)

    # Count character frequencies (needs `import collections` at module level)
    word_counter = collections.Counter(list("".join(opinionStrs))).most_common()

    freqStr = ""
    for item in word_counter:
        if item[1] >= freqStrLen:
            freqStr += item[0]

    maxSim = -1
    maxOpinion = ""
    for opinion in opinionStrs:
        sim = similarity(freqStr, opinion)
        if sim > maxSim:
            maxSim = sim
            maxOpinion = opinion

    return maxOpinion
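For experimenting outside the class, the same idea fits in a small standalone function. This is a sketch under assumptions: `similarity` is not defined in this post, so a difflib ratio is used in its place; `get_summary` and `min_freq` are hypothetical names; and the sample opinions are made up.

```python
import collections
import difflib
from typing import List


def similarity(a: str, b: str) -> float:
    # Assumed stand-in for the similarity function used in the post.
    return difflib.SequenceMatcher(None, a, b).ratio()


def get_summary(opinions: List[str], min_freq: int) -> str:
    # Count how often each character occurs across all opinions.
    counts = collections.Counter("".join(opinions)).most_common()
    # Build a string from the characters that occur at least min_freq times,
    # most frequent first.
    freq_str = "".join(ch for ch, n in counts if n >= min_freq)
    # Return the opinion closest to that string; ties go to the earliest one.
    return max(opinions, key=lambda op: similarity(freq_str, op), default="")


print(get_summary(["可以优惠多少", "能优惠多少", "优惠多少"], min_freq=2))
```

Note that the parameter acts as a minimum character count, not as a length, even though the original name is `freqStrLen`.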

3.2 Opinion summary results

Cluster summary	All opinions
手机是全新正品	手机是全新正品手机是全新手机是不是正品保证是全新手机
能送无线充电器	能送无线充电器人家送无线充电器送无线充电器买能送无线充电器
可以优惠多少	可以优惠多少你好可优惠多少能优惠多少可以优惠多少
是不是翻新机	是不是翻新机不会是翻新机手机是还是翻新会不会是翻新机
花呗可以分期	花呗不够可以分期花呗分期可以可以花呗分期花呗可以分期
没有给发票	我没有发票发票有开给我没有给发票你们有给发票

4. Summary

The above is some simple opinion extraction and clustering work of my own. It can be applied in some simple scenarios.