[Natural Language Processing with Python] 7.3 Developing and Evaluating Chunkers

Reading IOB Format and the CoNLL-2000 Chunking Corpus

The CoNLL-2000 corpus contains text that has already been tagged and chunked using IOB notation.

This corpus provides three chunk types: NP, VP, and PP.

For example:

he PRP B-NP
accepted VBD B-VP
the DT B-NP
position NN I-NP
...

The function nltk.chunk.conllstr2tree() builds a tree representation from a string in this format.

For example:

>>> import nltk
>>> text = '''
... he PRP B-NP
... accepted VBD B-VP
... the DT B-NP
... position NN I-NP
... of IN B-PP
... vice NN B-NP
... chairman NN I-NP
... of IN B-PP
... Carlyle NNP B-NP
... Group NNP I-NP
... , , O
... a DT B-NP
... merchant NN I-NP
... banking NN I-NP
... concern NN I-NP
... . . O
... '''
>>> nltk.chunk.conllstr2tree(text, chunk_types=['NP']).draw()

Running this opens a window with the tree drawing:

[figure: chunk tree for the sentence, with the NP chunks grouped under S]

We can access the CoNLL-2000 chunked corpus as follows:

# access the chunked corpus files
>>> from nltk.corpus import conll2000
>>> print conll2000.chunked_sents('train.txt')[99]
(S
    (PP Over/IN)
    (NP a/DT cup/NN)
    (PP of/IN)
    (NP coffee/NN)
    ,/,
    (NP Mr./NNP Stone/NNP)
    (VP told/VBD)
    (NP his/PRP$ story/NN)
    ./.)
# if we are only interested in the NP chunks, we can select them like this
>>> print conll2000.chunked_sents('train.txt', chunk_types=['NP'])[99]
(S
    Over/IN
    (NP a/DT cup/NN)
    of/IN
    (NP coffee/NN)
    ,/,
    (NP Mr./NNP Stone/NNP)
    told/VBD
    (NP his/PRP$ story/NN)
    ./.)

Simple Evaluation and Baselines
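
As a point of comparison, the book first establishes a trivial baseline: a RegexpParser with an empty grammar creates no chunks at all, so every token is left outside any chunk. A minimal sketch (imports repeated so the snippet stands alone; scores not reproduced here):

>>> import nltk
>>> from nltk.corpus import conll2000
>>> test_sents = conll2000.chunked_sents('test.txt', chunk_types=['NP'])
>>> cp = nltk.RegexpParser("")  # empty grammar: no chunks are created
>>> print cp.evaluate(test_sents)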

A slightly more informed baseline chunks any sequence of tags beginning with the letters C, D, J, or N (covering tags such as CD, DT, JJ, and NN):

>>> grammar = r"NP: {<[CDJNP].*>+}"
>>> cp = nltk.RegexpParser(grammar)
>>> print cp.evaluate(test_sents)
ChunkParse score:
IOB Accuracy: 87.7%
Precision: 70.6%
Recall: 67.8%
F-Measure: 69.2%

We can also construct a unigram tagger and use it to build a chunker.

# We define a chunker with a constructor and a parse method for chunking new sentences
Example 7-4. Noun phrase chunking with a unigram tagger.

class UnigramChunker(nltk.ChunkParserI):
    def __init__(self, train_sents):
        # convert each training tree into (tag, chunktag) pairs
        train_data = [[(t, c) for w, t, c in nltk.chunk.tree2conlltags(sent)]
                      for sent in train_sents]
        self.tagger = nltk.UnigramTagger(train_data)
    def parse(self, sentence):
        pos_tags = [pos for (word, pos) in sentence]
        tagged_pos_tags = self.tagger.tag(pos_tags)
        chunktags = [chunktag for (pos, chunktag) in tagged_pos_tags]
        conlltags = [(word, pos, chunktag) for ((word, pos), chunktag)
                     in zip(sentence, chunktags)]
        return nltk.chunk.conlltags2tree(conlltags)

Note how the parse method works:

1. It takes a tagged sentence as its input.

2. It begins by extracting the part-of-speech tags from that sentence.

3. It tags the part-of-speech tags with IOB chunk tags, using the tagger self.tagger trained in the constructor.

4. It extracts the chunk tags and combines them with the original sentence.

5. It combines the result into a chunk tree.

The conversion between chunk trees and IOB triples that underlies this workflow is illustrated in the sketch below.
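
The constructor and step 5 rely on nltk.chunk.tree2conlltags() and nltk.chunk.conlltags2tree(), which map between chunk trees and lists of (word, pos, chunktag) triples. A minimal sketch using the corpus sentence printed earlier (output abbreviated):

>>> sent = conll2000.chunked_sents('train.txt', chunk_types=['NP'])[99]
>>> nltk.chunk.tree2conlltags(sent)[:4]
[('Over', 'IN', 'O'), ('a', 'DT', 'B-NP'), ('cup', 'NN', 'I-NP'), ('of', 'IN', 'O')]

nltk.chunk.conlltags2tree() performs the inverse mapping, turning such triples back into a chunk tree.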

Once the chunker is defined, we can train it on the chunked corpus and evaluate it.

>>> test_sents = conll2000.chunked_sents('test.txt', chunk_types=['NP'])
>>> train_sents = conll2000.chunked_sents('train.txt', chunk_types=['NP'])
>>> unigram_chunker = UnigramChunker(train_sents)
>>> print unigram_chunker.evaluate(test_sents)
ChunkParse score:
IOB Accuracy: 92.9%
Precision: 79.9%
Recall: 86.8%
F-Measure: 83.2%
# we can inspect what the chunker has learned with these lines
>>> postags = sorted(set(pos for sent in train_sents
...                      for (word, pos) in sent.leaves()))
>>> print unigram_chunker.tagger.tag(postags)
[('#', 'B-NP'), ('$', 'B-NP'), ("''", 'O'), ('(', 'O'), (')', 'O'),
(',', 'O'), ('.', 'O'), (':', 'O'), ('CC', 'O'), ('CD', 'I-NP'),
('DT', 'B-NP'), ('EX', 'B-NP'), ('FW', 'I-NP'), ('IN', 'O'),
('JJ', 'I-NP'), ('JJR', 'B-NP'), ('JJS', 'I-NP'), ('MD', 'O'),
('NN', 'I-NP'), ('NNP', 'I-NP'), ('NNPS', 'I-NP'), ('NNS', 'I-NP'),
('PDT', 'B-NP'), ('POS', 'B-NP'), ('PRP', 'B-NP'), ('PRP$', 'B-NP'),
('RB', 'O'), ('RBR', 'O'), ('RBS', 'B-NP'), ('RP', 'O'), ('SYM', 'O'),
('TO', 'O'), ('UH', 'O'), ('VB', 'O'), ('VBD', 'O'), ('VBG', 'O'),
('VBN', 'O'), ('VBP', 'O'), ('VBZ', 'O'), ('WDT', 'B-NP'),
('WP', 'B-NP'), ('WP$', 'B-NP'), ('WRB', 'O'), ('``', 'O')]

Similarly, we can build a BigramChunker; as shown below, the only change is to swap the unigram tagger for a bigram tagger in the constructor.
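
A minimal sketch of the class, which the book obtains by simply replacing nltk.UnigramTagger with nltk.BigramTagger in Example 7-4:

class BigramChunker(nltk.ChunkParserI):
    def __init__(self, train_sents):
        train_data = [[(t, c) for w, t, c in nltk.chunk.tree2conlltags(sent)]
                      for sent in train_sents]
        # the only change from UnigramChunker: a bigram tagger
        self.tagger = nltk.BigramTagger(train_data)
    def parse(self, sentence):
        pos_tags = [pos for (word, pos) in sentence]
        tagged_pos_tags = self.tagger.tag(pos_tags)
        chunktags = [chunktag for (pos, chunktag) in tagged_pos_tags]
        conlltags = [(word, pos, chunktag) for ((word, pos), chunktag)
                     in zip(sentence, chunktags)]
        return nltk.chunk.conlltags2tree(conlltags)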

>>> bigram_chunker = BigramChunker(train_sents)
>>> print bigram_chunker.evaluate(test_sents)
ChunkParse score:
IOB Accuracy: 93.3%
Precision: 82.3%
Recall: 86.8%
F-Measure: 84.5%

Training Classifier-Based Chunkers

All the chunkers discussed so far, the regular-expression chunker and the n-gram chunkers, decide what chunks to create based entirely on part-of-speech tags. However, sometimes part-of-speech tags are insufficient to determine how a sentence should be chunked.

For example:

(3) a. Joey/NN sold/VBD the/DT farmer/NN rice/NN ./.
    b. Nick/NN broke/VBD my/DT computer/NN monitor/NN ./.

Although the two tag sequences are identical, they are clearly chunked differently: in the first sentence, the farmer and rice are separate chunks, while in the second, computer monitor forms a single chunk, as the sketch below makes concrete.
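
Here is a minimal sketch of the intended chunking of (3a), written out in IOB form; the IOB tags are our own illustration, and the printed tree layout may differ slightly:

>>> print nltk.chunk.conllstr2tree('''
... Joey NN B-NP
... sold VBD O
... the DT B-NP
... farmer NN I-NP
... rice NN B-NP
... . . O
... ''', chunk_types=['NP'])
(S (NP Joey/NN) sold/VBD (NP the/DT farmer/NN) (NP rice/NN) ./.)

In (3b), by contrast, my/DT computer/NN monitor/NN would form a single three-word NP chunk.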

So we need to make use of information about the content of the words, in addition to their part-of-speech tags.

One way to incorporate word content is to use a classifier-based tagger to chunk the sentence.

The basic code for a classifier-based NP chunker is shown below:

# The second class below is basically a wrapper around the tagger that turns it into a chunker.
# During training, it maps the chunk trees in the training corpus onto tag sequences;
# in its parse method, it converts a tag sequence provided by the tagger back into a chunk tree.
class ConsecutiveNPChunkTagger(nltk.TaggerI):
    def __init__(self, train_sents):
        train_set = []
        for tagged_sent in train_sents:
            untagged_sent = nltk.tag.untag(tagged_sent)
            history = []
            for i, (word, tag) in enumerate(tagged_sent):
                featureset = npchunk_features(untagged_sent, i, history)
                train_set.append( (featureset, tag) )
                history.append(tag)
        # 'megam' needs the external megam binary; 'iis' or 'gis' can be
        # substituted if it is not installed
        self.classifier = nltk.MaxentClassifier.train(
            train_set, algorithm='megam', trace=0)
    def tag(self, sentence):
        history = []
        for i, word in enumerate(sentence):
            featureset = npchunk_features(sentence, i, history)
            tag = self.classifier.classify(featureset)
            history.append(tag)
        return zip(sentence, history)

class ConsecutiveNPChunker(nltk.ChunkParserI):
    def __init__(self, train_sents):
        tagged_sents = [[((w, t), c) for (w, t, c) in
                         nltk.chunk.tree2conlltags(sent)]
                        for sent in train_sents]
        self.tagger = ConsecutiveNPChunkTagger(tagged_sents)
    def parse(self, sentence):
        tagged_sents = self.tagger.tag(sentence)
        conlltags = [(w, t, c) for ((w, t), c) in tagged_sents]
        return nltk.chunk.conlltags2tree(conlltags)

Next, we define a simple feature extractor:

>>> def npchunk_features(sentence, i, history):
...     word, pos = sentence[i]
...     return {"pos": pos}
>>> chunker = ConsecutiveNPChunker(train_sents)
>>> print chunker.evaluate(test_sents)
ChunkParse score:
IOB Accuracy: 92.9%
Precision: 79.9%
Recall: 86.7%
F-Measure: 83.2%

We can improve this classifier-based chunker by also adding a feature for the previous part-of-speech tag.

>>> def npchunk_features(sentence, i, history):
...     word, pos = sentence[i]
...     if i == 0:
...         prevword, prevpos = "<START>", "<START>"
...     else:
...         prevword, prevpos = sentence[i-1]
...     return {"pos": pos, "prevpos": prevpos}
>>> chunker = ConsecutiveNPChunker(train_sents)
>>> print chunker.evaluate(test_sents)
ChunkParse score:
IOB Accuracy: 93.6%
Precision: 81.9%
Recall: 87.1%
F-Measure: 84.4%

Rather than relying on just the two part-of-speech tags as features, we can also add the word itself as a feature.

>>> def npchunk_features(sentence, i, history):
...     word, pos = sentence[i]
...     if i == 0:
...         prevword, prevpos = "<START>", "<START>"
...     else:
...         prevword, prevpos = sentence[i-1]
...     return {"pos": pos, "word": word, "prevpos": prevpos}
>>> chunker = ConsecutiveNPChunker(train_sents)
>>> print chunker.evaluate(test_sents)
ChunkParse score:
IOB Accuracy: 94.2%
Precision: 83.4%
Recall: 88.6%
F-Measure: 85.9%

To improve the chunker's performance further, we can experiment with additional kinds of features; the code below adds a lookahead feature, paired tag features, and a complex contextual feature. The last of these, tags-since-dt, creates a string describing the set of all part-of-speech tags encountered since the most recent determiner.

>>> def npchunk_features(sentence, i, history):
...     word, pos = sentence[i]
...     if i == 0:
...         prevword, prevpos = "<START>", "<START>"
...     else:
...         prevword, prevpos = sentence[i-1]
...     if i == len(sentence)-1:
...         nextword, nextpos = "<END>", "<END>"
...     else:
...         nextword, nextpos = sentence[i+1]
...     return {"pos": pos,
...             "word": word,
...             "prevpos": prevpos,
...             "nextpos": nextpos,
...             "prevpos+pos": "%s+%s" % (prevpos, pos),
...             "pos+nextpos": "%s+%s" % (pos, nextpos),
...             "tags-since-dt": tags_since_dt(sentence, i)}
>>> def tags_since_dt(sentence, i):
...     tags = set()
...     for word, pos in sentence[:i]:
...         if pos == 'DT':
...             tags = set()
...         else:
...             tags.add(pos)
...     return '+'.join(sorted(tags))
>>> chunker = ConsecutiveNPChunker(train_sents)
>>> print chunker.evaluate(test_sents)
ChunkParse score:
IOB Accuracy: 95.9%
Precision: 88.3%
Recall: 90.7%
F-Measure: 89.5%
Original post: https://www.cnblogs.com/createMoMo/p/3109333.html