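Natural Language Processing

1. NLTK: an NLP toolkit for Python

2. WordNet: a semantic network for English

3. BabelNet: a multilingual semantic network. The full dataset is 29 GB and must be requested as a researcher; otherwise only the online API is available, limited to 1k requests per day. Will try the API when I have time.

4. Decision tree classifier

5. Naive Bayes classifier

6. Maximum entropy classifier (need to brush up on probability theory)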

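Update 2018-06-15: Read two chapters of the Zhejiang University probability theory textbook (4th edition) and now understand Bayes' formula. There was no chapter on maximum entropy, so I went back to NLTK and searched Baidu for explanations of maximum entropy; this time I roughly understood it. The later material on part-of-speech tagging and syntactic annotation is still confusing, though, and seems to require a lot of language-specific prior knowledge. I downloaded the Chinese edition of the Deep Learning book and plan to go straight to deep learning.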

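Basics

Word vectors and language models http://licstar.net/archives/328

Word segmentation references

Text data mining on SNS http://www.matrix67.com/blog/archives/5044

1. Word segmentation based on an Aho-Corasick (AC) automaton

https://spaces.ac.cn/archives/3908

This approach segments against a dictionary (loaded into an AC automaton) and uses dynamic programming to find the segmentation with the highest probability.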

import ahocorasick
import math


def load_dic(dicfile):
    dic = ahocorasick.Automaton()
    total = 0.0
    words = []
    # Each dictionary line is expected to start with "word count", separated by spaces
    with open(dicfile, 'r', encoding='utf-8') as f:
        for line in f:
            fields = line.split(' ')
            count = max(int(fields[1]), 1)  # clamp non-positive counts to 1
            words.append((fields[0], count))
            total += count
    for word, occurs in words:
        # Store log probabilities to avoid floating-point underflow
        dic.add_word(word, (word, math.log(occurs / total)))
    dic.make_automaton()
    return dic

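def print_all_cut(sentence):
    # Print each match as (end_index, (word, log_probability))
    for i, j in dic.iter(sentence):
        print(i, j)


dic = load_dic('dict.txt')
# The printed index is the position in the sentence of each matched word's last character
print_all_cut('最大匹配法是指从左到右逐渐匹配词库中的词语')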

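'''
Dynamic programming:
for each matched word, add its log probability to the score stored at start; if end has no score yet, or its score is lower, store the new path at end.
If start has no score yet, build one from the largest earlier position that has a path,
treating the uncovered text in between as a single unknown chunk with a penalty of 10.
'''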
def max_proba_cut(sentence):
    paths = {0:([], 0)}
    end = 0
    for i,j in dic.iter(sentence):
        start,end = 1+i-len(j[0]), i+1
        if start not in paths:
            last = max([i for i in paths if i < start])
            paths[start] = (paths[last][0]+[sentence[last:start]], paths[last][1]-10)
        proba = paths[start][1]+j[1]
        if end not in paths or proba > paths[end][1]:
            paths[end] = (paths[start][0]+[j[0]], proba)
    if end < len(sentence):
        return paths[end][0] + [sentence[end:]]
    else:
        return paths[end][0]

path = max_proba_cut('最大匹配法是指从左到右逐渐匹配词库中的词语')
print(path)
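
The dynamic programming step can also be tried without a dictionary file. The block below is a minimal sketch under stated assumptions, not the code from the linked post: a plain dict lookup stands in for the AC automaton, the toy counts in `toy_counts` are made up, and every unknown single character gets a fixed penalty (`unknown_penalty`, a hypothetical parameter).

```python
import math


def max_prob_cut(sentence, word_counts, unknown_penalty=10.0):
    """Return the segmentation with the highest total log probability."""
    total = sum(word_counts.values())
    logp = {w: math.log(c / total) for w, c in word_counts.items()}
    max_len = max(len(w) for w in logp)
    n = len(sentence)
    # best[i] = (score, words) for the best segmentation of sentence[:i]
    best = [(0.0, [])] + [None] * n
    for end in range(1, n + 1):
        candidates = []
        for start in range(max(0, end - max_len), end):
            piece = sentence[start:end]
            if piece in logp:
                score = best[start][0] + logp[piece]
            elif len(piece) == 1:
                # Unknown single character: fixed penalty (an assumption)
                score = best[start][0] - unknown_penalty
            else:
                continue
            candidates.append((score, best[start][1] + [piece]))
        best[end] = max(candidates, key=lambda c: c[0])
    return best[n][1]


# Made-up counts, for illustration only
toy_counts = {'最大': 5, '匹配': 5, '最大匹配': 2, '法': 3, '匹配法': 1}
print(max_prob_cut('最大匹配法', toy_counts))
```

Scanning every start position for every end position is slower than the automaton, which finds all dictionary matches in a single pass, but it makes the recurrence easier to see.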

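2A. New word discovery based on information entropy

https://spaces.ac.cn/archives/3491

This is a dictionary-free segmentation method. Preprocess the corpus by removing Chinese and English punctuation, keeping only Chinese characters, English letters, and digits, and splitting it into sentences; then generate all 1- to N-grams and compute the support and the entropy of each.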

import numpy as np
import pandas as pd
import re
from numpy import log

with open('d:/WorkPython/tmp/TianLongBaBu.txt', 'r', encoding='utf-8') as f:  # read the novel
    s = f.read()  # read everything into a single string

# Punctuation and whitespace characters to remove
drop_dict = [u'，', u'\n', u'。', u'、', u'：', u'(', u')', u'[', u']', u'.', u',', u' ', u'\u3000', u'”', u'“', u'？', u'?',
             u'！', u'‘', u'’', u'…']
for i in drop_dict:  # strip each of them
    s = s.replace(i, '')

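# For convenience, a dictionary of regular expressions keyed by word length
myre = {2: '(..)', 3: '(...)', 4: '(....)', 5: '(.....)', 6: '(......)', 7: '(.......)'}

min_count = 10  # minimum number of occurrences for a candidate word
min_support = 30  # minimum support for a candidate word; 1 corresponds to a random combination
min_s = 3  # minimum information entropy; the larger it is, the more likely the candidate is an independent word
max_sep = 4  # maximum number of characters in a candidate word
t = []  # holds the frequency tables

l = list(s)  # split the string into a list of characters
serial = pd.Series(l)  # wrap the characters in a Series, indexed from 0
t.append(serial.value_counts())  # character frequencies, sorted in descending order, stored as t[0]
tsum = t[0].sum()  # total number of characters; .sum() is a pandas.Series method
rt = []  # holds the surviving candidates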

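for m in range(2, max_sep + 1):
    print(u'Generating %s-character words...' % m)
    t.append([])
    for i in range(m):  # collect every possible m-character word into t[m-1]
        t[m - 1] = t[m - 1] + re.findall(myre[m], s[i:])

    t[m - 1] = pd.Series(t[m - 1]).value_counts()  # word frequencies, sorted in descending order, stored as t[m-1]
    t[m - 1] = t[m - 1][t[m - 1] > min_count]  # drop words whose frequency does not exceed min_count
    tt = t[m - 1][:]  # copy the Series just produced
    '''
    Use map to apply the lambda below to every element of the Series just produced. Here
    ms (taken from tt.index) is the candidate itself, e.g. '段誉'
    ms[:m - 1 - k] is the front part, e.g. '段'
    ms[m - 1 - k:] is the back part, e.g. '誉'
    t[m - 1][ms] is the frequency of the candidate
    t[m - 2 - k][ms[:m - 1 - k]] is the frequency of the front part
    t[k][ms[m - 1 - k:]] is the frequency of the back part
    k runs over every split point, so each way of cutting the candidate in two is tested
    '''
    ld = lambda ms: tsum * t[m - 1][ms] / t[m - 2 - k][ms[:m - 1 - k]] / t[k][ms[m - 1 - k:]]
    for k in range(m - 1):
        qq = np.array(list(map(ld, tt.index))) > min_support  # filter by minimum support
        tt = tt[qq]  # a pandas.Series can be subset with a boolean mask; this keeps dropping candidates below the minimum support
    rt.append(tt.index)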


def cal_S(sl):  # information entropy of a Series of counts
    return -((sl / sl.sum()).apply(log) * sl / sl.sum()).sum()


for i in range(2, max_sep + 1):
    print(u'Filtering %s-character words by entropy (%s candidates)...' % (i, len(rt[i - 2])))
    pp = []  # holds every (left neighbor, word, right neighbor) triple
    for j in range(i + 2):
        pp = pp + re.findall('(.)%s(.)' % myre[i], s[j:])
    pp = pd.DataFrame(pp).set_index(1).sort_index()  # sort first; this matters because it speeds up lookups
    index = np.sort(np.intersect1d(rt[i - 2], pp.index))  # intersect with the surviving candidates
    # The next two statements filter by left-neighbor and right-neighbor entropy respectively
    index = index[np.array(list(map(lambda s: cal_S(pd.Series(pp[0][s]).value_counts()), index))) > min_s]
    rt[i - 2] = index[np.array(list(map(lambda s: cal_S(pd.Series(pp[2][s]).value_counts()), index))) > min_s]

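# Prepare the output: keep only the surviving words and sort them by frequency
for i in range(len(rt)):
    t[i + 1] = t[i + 1][rt[i]]
    t[i + 1] = t[i + 1].sort_values(ascending=False)  # sort_values returns a new Series, so assign it back

# Save the result
pd.DataFrame(pd.concat(t[1:])).to_csv('result.txt', header=False)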
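
The two scores the script filters on, support and left/right neighbor entropy, can be checked by hand on a short string. The block below is a minimal sketch in pure Python, not the code from the linked post; the helper names (`positions`, `entropy`, `neighbor_entropy`, `support`) and the toy string are assumptions made for illustration.

```python
from collections import Counter
from math import log


def positions(text, word):
    """Start index of every (possibly overlapping) occurrence of word."""
    found, start = [], text.find(word)
    while start != -1:
        found.append(start)
        start = text.find(word, start + 1)
    return found


def entropy(counter):
    """Shannon entropy (natural log) of a Counter of neighbor characters."""
    n = sum(counter.values())
    return -sum(c / n * log(c / n) for c in counter.values()) if n else 0.0


def neighbor_entropy(text, word):
    """Entropy of the characters directly to the left and right of word."""
    left, right = Counter(), Counter()
    for start in positions(text, word):
        end = start + len(word)
        if start > 0:
            left[text[start - 1]] += 1
        if end < len(text):
            right[text[end]] += 1
    return entropy(left), entropy(right)


def support(text, word):
    """Minimum over all two-way splits of N * count(word) / (count(front) * count(back))."""
    n = len(text)
    count = lambda sub: len(positions(text, sub))
    return min(n * count(word) / (count(word[:k]) * count(word[k:]))
               for k in range(1, len(word)))


toy = 'abxabyabz'  # 'ab' always sticks together and has varied neighbors
print(support(toy, 'ab'), neighbor_entropy(toy, 'ab'))
```

A candidate is kept only when its support exceeds `min_support` and both neighbor entropies exceed `min_s`; on real text these thresholds need a corpus far larger than this toy string.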

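2B. New word discovery based on cutting

https://spaces.ac.cn/archives/3913

Simply score how strongly each pair of adjacent characters co-occurs (the code below uses pointwise mutual information) and cut between characters whose score falls below a threshold, which yields a segmentation.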

import re
import json

from collections import defaultdict # defaultdict is a dict wrapper that lets us set a default value
from tqdm import tqdm # tqdm is an easy-to-use library for showing progress
from math import log

def load_texts(text_file):
    with open(text_file, 'r', encoding='utf-8') as articles:
        for article in articles:
            json_obj = json.loads(article)
            if 'content' in json_obj:
                yield json_obj['content']


class Find_Words:
    def __init__(self, min_count=10, min_pmi=0):
        self.min_count = min_count
        self.min_pmi = min_pmi
        self.chars, self.pairs = defaultdict(int), defaultdict(int)
        # If a key is missing, it is initialized by calling int(), which returns 0
        self.total = 0.

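    # Pre-split sentences so we do not collect too many meaningless strings (anything that is not Chinese, English, or digits)
    def text_filter(self, texts):
        for a in tqdm(texts):
            for t in re.split(u'[^\u4e00-\u9fa50-9a-zA-Z]+', a):
                # The regex matches any run of characters that are not Chinese, English letters, or digits, so sentences are cut at those runs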
                if t:
                    yield t

    # Counting: frequency of single characters and of adjacent character pairs
    def count(self, texts):
        for text in self.text_filter(texts):
            self.chars[text[0]] += 1
            for i in range(len(text)-1):
                self.chars[text[i+1]] += 1
                self.pairs[text[i:i+2]] += 1
                self.total += 1
        self.chars = {i:j for i,j in self.chars.items() if j >= self.min_count} # minimum frequency filter
        self.pairs = {i:j for i,j in self.pairs.items() if j >= self.min_count} # minimum frequency filter
        self.strong_segments = set()
        for i,j in self.pairs.items(): # use mutual information to find strongly associated adjacent characters
            pmi = log(self.total*j/(self.chars[i[0]]*self.chars[i[1]]))
            if pmi >= self.min_pmi:
                self.strong_segments.add(i)

    # Find words using the results above
    def find_words(self, texts):
        self.words = defaultdict(int)
        for text in self.text_filter(texts):
            s = text[0]
            for i in range(len(text)-1):
                if text[i:i+2] in self.strong_segments: # strongly associated pair: do not cut
                    s += text[i+1]
                else:
                    self.words[s] += 1 # otherwise cut here and count the preceding piece as a word
                    s = text[i+1]
        self.words = {i:j for i,j in self.words.items() if j >= self.min_count} # finally filter by frequency again


fw = Find_Words(16, 1)
fw.count(load_texts('D:/WorkPython/tmp/articles.json'))
fw.find_words(load_texts('D:/WorkPython/tmp/articles.json'))

import pandas as pd
words = pd.Series(fw.words).sort_values(ascending=False)
words.to_csv('output.csv')
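
The same cutting idea fits in a few lines without JSON input or tqdm. The block below is a minimal sketch, not the code from the linked post: `pmi_cut` and the four toy strings are assumptions made for illustration, the frequency filters are left out, and unlike the class above it also counts the trailing piece of each text.

```python
from collections import Counter
from math import log


def pmi_cut(texts, min_pmi=0.0):
    """Cut each text wherever the PMI of two adjacent characters is below min_pmi."""
    chars, pairs, total = Counter(), Counter(), 0
    for text in texts:
        chars.update(text)  # single-character frequencies
        for a, b in zip(text, text[1:]):
            pairs[a + b] += 1  # adjacent-pair frequencies
            total += 1
    # Pairs whose pointwise mutual information reaches the threshold stay joined
    strong = {p for p, c in pairs.items()
              if log(total * c / (chars[p[0]] * chars[p[1]])) >= min_pmi}
    words = Counter()
    for text in texts:
        piece = text[0]
        for a, b in zip(text, text[1:]):
            if a + b in strong:
                piece += b
            else:
                words[piece] += 1
                piece = b
        words[piece] += 1  # also count the trailing piece (a choice made in this sketch)
    return words


# Toy corpus: 'ab' and 'cd' behave like words, other adjacent pairs are rare
toy_texts = ['abcd', 'cdab', 'abab', 'cdcd']
print(pmi_cut(toy_texts))
```

`texts` has to be a list of non-empty strings here because it is traversed twice; the class above gets the same effect by calling `load_texts` once for each pass.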

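Use KenLM as the training tool: train on a pre-segmented corpus to produce a language model in klm format, then load it in Python to perform segmentation.

6. Chinese word segmentation based on fully convolutional networks https://spaces.ac.cn/archives/4195

7. Deep learning word segmentation using only a dictionary https://spaces.ac.cn/archives/4245

8. An improved new word discovery algorithm https://spaces.ac.cn/archives/6920

Unsupervised mining of domain-specific vocabulary https://spaces.ac.cn/archives/6540

Automatic title generation with Keras seq2seq https://kexue.fm/archives/5861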