Installing and Using polyglot

I. polyglot Overview

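I had a batch of Indian-language documents that needed processing such as keyword extraction, which is how I came across polyglot.
Among current open-source NLP toolkits, English is covered by NLTK, spaCy, Stanford CoreNLP, GATE, and OpenNLP, and Chinese by Jieba, ICTCLAS, THU LAC, and HIT LTP, but most of these tools support only particular languages. polyglot, by contrast, covers a wide range of languages.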
Features of polyglot:
Language Detection (196 languages)
Sentence splitting and word Tokenization (165 languages)
Named Entity Recognition (40 languages)
Part of Speech Tagging (16 languages)
Sentiment analysis (136 languages)
Word Embeddings (137 languages)
Transliteration (69 languages)
Pipelines

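II. Installing polyglot

1. Install pyicu

On Windows, download a prebuilt PyICU wheel that matches your Python version and architecture from the page below, then install it with pip:
https://www.lfd.uci.edu/~gohlke/pythonlibs/#pyicu

2. Install polyglot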

pip install polyglot

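3. Install pycld2

polyglot's language detection depends on pycld2 and cld2; cld2 (Compact Language Detector 2) is a multilingual language-detection library developed by Google. A prebuilt wheel for Python 3.6 on 64-bit Windows is available at:

https://github.com/snazari/topModel/blob/master/install_moduls/pycld2-0.31-cp36-cp36m-win_amd64.whl

4. Install morfessor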

pip install morfessor
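
Before moving on, it can help to confirm that all four packages are visible to the interpreter you plan to use. The following is a minimal sketch using only the standard library; the helper name check_dependencies is made up for this example, and it assumes the import names icu (for pyicu), pycld2, morfessor, and polyglot.

```python
import importlib.util

# Import names assumed for the packages installed above
# (pyicu is imported as "icu").
REQUIRED_MODULES = ["icu", "pycld2", "morfessor", "polyglot"]


def check_dependencies(modules=REQUIRED_MODULES):
    """Return a dict mapping each module name to True if it can be found."""
    return {name: importlib.util.find_spec(name) is not None for name in modules}


# Report the status of each dependency.
for name, found in check_dependencies().items():
    print(f"{name}: {'OK' if found else 'MISSING'}")
```

Any module reported as MISSING was installed into a different environment or not installed at all.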

III. Usage

1. Tokenization

Test case:

from polyglot.text import Text

# Test text
text_en = ("Japan's last pager provider has announced it will end its service in September 2019 - bringing a national end to telecommunication beepers , "
           "50 years after their introduction.Around 1,500 users remain subscribed to Tokyo Telemessage , which has not made the devices in 20 years .")
print(Text(text_en).words)
["Japan's", 'last', 'pager', 'provider', 'has', 'announced', 'it', 'will', 'end', 'its', 'service', 'in', 'September', '2019', '-', 'bringing', 'a', 'national', 'end', 'to', 'telecommunication', 'beepers', ',', '50', 'years', 'after', 'their', 'introduction.Around', '1,500', 'users', 'remain', 'subscribed', 'to', 'Tokyo', 'Telemessage', ',', 'which', 'has', 'not', 'made', 'the', 'devices', 'in', '20', 'years', '.']

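2. Named Entity Recognition

polyglot's named entity recognition models are trained on Wikipedia. The pretrained models are not included in the initial installation and have to be downloaded separately. polyglot recognizes three entity types (person, location, and organization names) in 40 languages.

(1) Download the models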

# Run in a Jupyter notebook cell; in a terminal, drop the leading "!"
!polyglot download ner2.id embeddings2.id  # "id" is the language code for Indonesian; likewise, "zh" is Chinese
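
The same download can also be started from Python through polyglot's downloader module instead of the shell. A minimal sketch, assuming the package ids follow the task.language pattern used above; the helper names model_ids and download_models are made up for this example.

```python
def model_ids(language_code, tasks=("ner2", "embeddings2")):
    """Build package ids such as 'ner2.id' from task names and a language code."""
    return [f"{task}.{language_code}" for task in tasks]


def download_models(language_code):
    """Download the NER and embedding models for one language."""
    # Imported lazily so this file can be loaded even if polyglot is missing.
    from polyglot.downloader import downloader

    for package_id in model_ids(language_code):
        downloader.download(package_id)


# Example (requires polyglot and network access):
# download_models("id")  # Indonesian
```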

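At this point I got the following error on my machine; if you don't run into it, so much the better. It occurs on Windows, where Python's signal module does not define SIGPIPE.

from signal import signal, SIGPIPE, SIG_DFL
ImportError: cannot import name 'SIGPIPE'

To fix it, go to the installed polyglot folder, which is located in your Python environment's site-packages directory.

Find the file __main__.py, open it, and comment out the following two lines: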

#from signal import signal, SIGPIPE, SIG_DFL
#signal(SIGPIPE, SIG_DFL)
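
Instead of commenting the lines out, a guarded version keeps the original behaviour on Unix-like systems and skips it on Windows. This is a sketch of an alternative edit, not what polyglot ships; the function name restore_default_sigpipe is made up here.

```python
import signal


def restore_default_sigpipe():
    """Reset SIGPIPE to its default handler where the platform defines it.

    Returns True if the handler was set, False if SIGPIPE is unavailable
    (as on Windows).
    """
    if hasattr(signal, "SIGPIPE"):
        signal.signal(signal.SIGPIPE, signal.SIG_DFL)
        return True
    return False
```

Calling restore_default_sigpipe() in place of the two original lines would leave the file usable on both kinds of platform.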

Next, find the file downloader.py and open it.
Locate the method def fromcsobj(csobj) and replace every occurrence of path.sep inside it with '/' (the single quotes are required).
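
The likely reason this change helps is that names in the remote package index are separated by forward slashes, while path.sep is a backslash on Windows, so splitting on it does not separate the components. The sketch below illustrates that failure mode with the standard library's ntpath and posixpath modules; the sample name is hypothetical and the code is not taken from polyglot's downloader.py.

```python
import ntpath      # Windows path rules, importable on any platform
import posixpath   # POSIX path rules

# Hypothetical remote name, for illustration only.
remote_name = "ner2/id/ner2.id.tar.bz2"

# Splitting on the Windows separator ("\\") leaves the name in one piece ...
windows_parts = remote_name.split(ntpath.sep)
# ... while splitting on "/" (the POSIX separator) yields the components.
slash_parts = remote_name.split(posixpath.sep)

print(windows_parts)
print(slash_parts)
```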

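(2) Run entity recognition with the model

from polyglot.text import Text
text_id = "Aku orang Cina."
print(Text(text_id).entities)

In my experiments, however, entity recognition returned an empty list for English, Chinese, and Indonesian text alike.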

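3. Language Detection

from polyglot.detect import Detector
text_cn = "Celana dalam Wanita Sexy Jala Sisi Transparan Dasi gstring Bikini C188Harga "
r = Detector(text_cn).language  # the most confident language for the text
print(r)       # language name, code, confidence, and bytes read
print(r.name)  # language name only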
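
Besides the single best guess, Detector also exposes a list of candidate languages with confidence values, which can be useful for short or mixed-language text such as the product title above. A minimal sketch, assuming polyglot and pycld2 are installed; the helper names summarize and detect_candidates are made up for this example, and quiet=True is passed so that an unreliable detection does not raise an exception.

```python
def summarize(candidates):
    """Format (name, code, confidence) tuples as readable lines."""
    return [
        f"{name} ({code}): confidence {confidence:.0f}"
        for name, code, confidence in candidates
    ]


def detect_candidates(text):
    """Return (name, code, confidence) for each language polyglot reports."""
    # Imported lazily so the helper above works without polyglot installed.
    from polyglot.detect import Detector

    detector = Detector(text, quiet=True)
    return [(lang.name, lang.code, lang.confidence) for lang in detector.languages]


# Example (requires polyglot and pycld2):
# print("\n".join(summarize(detect_candidates(text_cn))))
```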
