利用pyltp进行实体识别

一.实体识别作为信息抽取中基础的也是重要的一步，其技术可以分为三类，分别是其于规则的方法、其于统计模型的方法以及基于深度学习的方法。

基于规则的方法，主要依靠构建大量的实体抽取规则，一般由具有一定领域知识的专家手工构建。然后将规则与文本进行匹配，识别出实体。

基于统计的方法，需要一定的标注语料进行训练，采用的基本模型有马尔可夫HMM、条件马尔可夫CMM、最大熵ME以及条件随机场CRF等，这此方法作为序列标注问题进行处理，主要涉及步骤有语料标注、特征定义和模型训练。

基于深度的方法，也是目前比较大热的研究方向。最常用的也是大家熟悉的模型有LSTM-CRF、LSTM-CNN-CRF以及基于attention注意力机制的方法等。

下面初步使用哈工大的pyltp进行实体识别的程序，程序开始前需要下载官方的一些模型有分词模型“cws.model”，词性标注模型“pos.model”以及实体识别模型“ner.model“。

下面是pyltp官网的模型下载页面，地址是http://ltp.ai/download.html。

二.程序

  1 # -*- coding: utf-8 -*-
  2 import os
  3 from pyltp import Segmentor, Postagger, Parser, NamedEntityRecognizer
  4 from collections import OrderedDict
  5 
  6 class LtpParser():
  7     def __init__(self):
  8         LTP_DIR = "../ltp_model"
  9         self.segmentor = Segmentor()
 10         self.segmentor.load_with_lexicon(os.path.join(LTP_DIR, "cws.model"), os.path.join(LTP_DIR, "word_dict.txt")) #加载外部词典
 11 
 12         self.postagger = Postagger()
 13         self.postagger.load_with_lexicon(os.path.join(LTP_DIR, "pos.model"), os.path.join(LTP_DIR, "n_word_dict.txt")) #加载外部词典
 14 
 15         # self.parser = Parser()
 16         # self.parser.load(os.path.join(LTP_DIR, "parser.model")) #依存句法分析
 17 
 18         self.recognizer = NamedEntityRecognizer()
 19         self.recognizer.load(os.path.join(LTP_DIR, "ner.model"))#实体识别
 20 
 21         # #加载停词
 22         # with open(LTP_DIR + '/stopwords.txt', 'r', encoding='utf8') as fread:
 23         #     self.stopwords = set()
 24         #     for line in fread:
 25         #         self.stopwords.add(line.strip())
 26 
 27     '''把实体和词性给进行对应'''
 28     def wordspostags(self, name_entity_dist, words, postags):
 29         pre = ' '.join([item[0] + '/' + item[1] for item in zip(words, postags)])
 30         post = pre
 31         for et, infos in name_entity_dist.items():
 32             if infos:
 33                 for info in infos:
 34                     post = post.replace(' '.join(info['consist']), info['name'])
 35         post = [word for word in post.split(' ') if len(word.split('/')) == 2 and word.split('/')[0]]
 36         words = [tmp.split('/')[0] for tmp in post]
 37         postags = [tmp.split('/')[1] for tmp in post]
 38 
 39         return words, postags
 40 
 41     '''根据实体识别结果,整理输出实体列表'''
 42     def entity(self, words, netags, postags):
 43         '''
 44         :param words: 词
 45         :param netags: 实体
 46         :param postags: 词性
 47         :return:
 48         '''
 49         name_entity_dict = {}
 50         name_entity_list = []
 51         place_entity_list = []
 52         organization_entity_list = []
 53         ntag_E_Nh = ""
 54         ntag_E_Ni = ""
 55         ntag_E_Ns = ""
 56         index = 0
 57         for item in zip(words, netags):
 58             word = item[0]
 59             ntag = item[1]
 60             if ntag[0] != "O":
 61                 if ntag[0] == "S":
 62                     if ntag[-2:] == "Nh":
 63                         name_entity_list.append(word + '_%s ' % index)
 64                     elif ntag[-2:] == "Ni":
 65                         organization_entity_list.append(word + '_%s ' % index)
 66                     else:
 67                         place_entity_list.append(word + '_%s ' % index)
 68                 elif ntag[0] == "B":
 69                     if ntag[-2:] == "Nh":
 70                         ntag_E_Nh = ntag_E_Nh + word + '_%s ' % index
 71                     elif ntag[-2:] == "Ni":
 72                         ntag_E_Ni = ntag_E_Ni + word + '_%s ' % index
 73                     else:
 74                         ntag_E_Ns = ntag_E_Ns + word + '_%s ' % index
 75                 elif ntag[0] == "I":
 76                     if ntag[-2:] == "Nh":
 77                         ntag_E_Nh = ntag_E_Nh + word + '_%s ' % index
 78                     elif ntag[-2:] == "Ni":
 79                         ntag_E_Ni = ntag_E_Ni + word + '_%s ' % index
 80                     else:
 81                         ntag_E_Ns = ntag_E_Ns + word + '_%s ' % index
 82                 else:
 83                     if ntag[-2:] == "Nh":
 84                         ntag_E_Nh = ntag_E_Nh + word + '_%s ' % index
 85                         name_entity_list.append(ntag_E_Nh)
 86                         ntag_E_Nh = ""
 87                     elif ntag[-2:] == "Ni":
 88                         ntag_E_Ni = ntag_E_Ni + word + '_%s ' % index
 89                         organization_entity_list.append(ntag_E_Ni)
 90                         ntag_E_Ni = ""
 91                     else:
 92                         ntag_E_Ns = ntag_E_Ns + word + '_%s ' % index
 93                         place_entity_list.append(ntag_E_Ns)
 94                         ntag_E_Ns = ""
 95             index += 1
 96         name_entity_dict['nhs'] = self.modify(name_entity_list, words, postags, 'nh')
 97         name_entity_dict['nis'] = self.modify(organization_entity_list, words, postags, 'ni')
 98         name_entity_dict['nss'] = self.modify(place_entity_list, words, postags, 'ns')
 99         return name_entity_dict
100 
101     def modify(self, entity_list, words, postags, tag):
102         modify = []
103         if entity_list:
104             for entity in entity_list:
105                 entity_dict = {}
106                 subs = entity.split(' ')[:-1]
107                 start_index = subs[0].split('_')[1]
108                 end_index = subs[-1].split('_')[1]
109                 entity_dict['stat_index'] = start_index
110                 entity_dict['end_index'] = end_index
111                 if start_index == entity_dict['end_index']:
112                     consist = [words[int(start_index)] + '/' + postags[int(start_index)]]
113                 else:
114                     consist = [words[index] + '/' + postags[index] for index in
115                                range(int(start_index), int(end_index) + 1)]
116                 entity_dict['consist'] = consist
117                 entity_dict['name'] = ''.join(tmp.split('_')[0] for tmp in subs) + '/' + tag
118                 modify.append(entity_dict)
119         return modify
120 
121     '''词性和实体'''
122     def post_ner(self, words):
123         postags = list(self.postagger.postag(words))
124         # words_filter =[]
125         # postags = []
126         # for word, postag in zip(words, self.postagger.postag(words)):
127         #     if 'n' in postag:
128         #         postags.append(postag)
129         #         words_filter.append(word)
130         nerags = self.recognizer.recognize(words, postags)
131         return postags, nerags
132 
133     def parser_process(self, sentence):
134         words = list(self.segmentor.segment(sentence))
135         post, ner = self.post_ner(words)  # 词性和实体
136         name_entity_dist = self.entity(words, ner, post)
137         words, postags = self.wordspostags(name_entity_dist, words, post)
138         return words, postags
139 
140 if __name__ == '__main__':
141     content_1 = '提起本山传媒，相信大家应该再熟悉不过了。近日，文化产业新闻查询资料显示，赵本山旗下的本山传媒有限公司，在今年的8月1号已经改名，改成了辽宁民间艺术团有限公司，去掉了赵本山的效应。而此前很多以赵本山名字冠名的组织，也已经更名了。　　　　　　　　此前在2015年6月9日，辽宁大学官宣将辽宁大学本山艺术学院更名为辽宁大学艺术学院。　　　　其实，公司改名也早有预料，这几年本山传媒的演员参加《欢乐喜剧人》《喜剧总动员》等综艺节目时，一直都宣称来自辽宁民间艺术团，杨树林更是以团长自居。　　　　据文化产业新闻查询，本山传媒是以辽宁民间艺术团为核心组建成的大型文化产业集团，由表演艺术家赵本山任集团董事长。被文化部授予“文化企业三十强”。本山传媒前身为辽宁民间艺术团，成立于2003年，是辽宁省文化厅直属的民营文化企业。　　　　本山传媒是集演艺、影视、艺术教育于一身的大型文化产业集团。2004年，被文化部授予首批“文化产业示范基地”；2010年，“刘老根大舞台”被文化部、国家旅游局联合评为首批“国家文化旅游重点项目”；2010年起连续三年被中宣部评为“全国文化企业三十强”。　　　　本山传媒拍摄有“刘老根”“乡村爱情故事”“马大帅”等享誉全国的优秀影视作品。　　　　　　改名后的“本山传媒”股东结构如何？　　　　虽然改名了，但持股人都是赵本山和马立娟，实打实的肥水不流外人田。　　　　通过股份占有树状图可以看到，股东分别是本山控股有限公司、赵本山及其妻子马立娟。　　　　　　　　　　其中，本山控股有限公司占股60%、赵本山和马立娟分别占19.6%、20.4%，马丽娟是疑似实际控制人，但最终受益人仍是赵本山和马丽娟夫妻二人，没有第三者，可以说是实打实的“自家产业”了。　　　　　　　　为何“去本山化”　　　　赵本山之所以将公司改名，肯定不只是因为好听，还有更深的意义！　　　　一方面是为了更进一步的“去本山化”。这几年本山传媒公司一直在努力的“去本山化”，为的当然就是摆脱对赵本山个人名望的依赖。　　　　之前，杨树林等人在进行活动的时候，就介绍自己是辽宁民间艺术团，实质就已经在为人气和知名度做出了一定的铺垫，毕竟到了一定程度，公司也是需要转型的。　　　　值得一提的是，其实本山传媒的前身是辽宁民间艺术团，但在赵本山事业鼎盛时期接受了他，并用自己的名字来命名，可以看出是为了打造一定的知名度。尽管现在培育出的人才众多，但是被大家熟知的却极少，因此“去本山化”可谓是迟早的事情。　　　　随着本山传媒的更名，很多网友不禁感叹，这是一个时代的结束。　　　　这几年本山传媒是在走下坡路。不但赵本山自己也不上春晚了，就连徒弟们也相继淡出大家的视野。以前上过春晚的丫蛋，小沈阳等，现在已经沦为十八线明星了。就连曾经和马云拍过小品的宋小宝近年来也消声灭迹了。　　　　　　　　这次本山传媒更名，大众认为去本山化倾向明显，回望本山传媒早期，是依靠赵本山老师一个人的知名度去闯天地。但是，随着现代文化公司运营的逐渐正规化，越来越多的文化公司开始注重打造公司品牌而不再注重个人品牌效应，所以这一次公司的改名，是对现代文化公司商业运营的重大转变。　　　　仓促中拟写此文，纪念本山传媒，纪念本山时代。'
142     ltp = LtpParser()
144     words, postags = ltp.parser_process(content_1)
145 
146     NER_1 = OrderedDict()
147     for index, tag in enumerate(postags):
148         if tag == 'ni' and len(words[index]) > 1: #组织名
149             NER_1.setdefault('ORG', set()).add(words[index])
150         elif tag == 'nh' and len(words[index]) > 1:#人名
151             NER_1.setdefault('PER', set()).add(words[index])
152         elif tag == 'ns' and len(words[index]) > 1:#地名
153             NER_1.setdefault('LOC', set()).add(words[index])
154     print(NER_1)

三.结果

OrderedDict([('LOC', {'赵本山', '辽宁省', '本山', '沈阳'}), ('PER', {'赵本山', '马丽娟', '本山', '马立娟', '杨树林', '刘老根', '宋小宝', '官宣'}), ('ORG', {'国家旅游局', '辽宁民间艺术团有限公司', '本山传媒有限公司', '辽宁民间艺术团', '文化部', '辽宁大学', '本山传媒公司', '中宣部'})])
结果中把''赵本山‘，’本山‘当作LOC了。这里可以引入外部词典解决。