Python自然语言处理学习笔记(63)：7.4 语言结构中的递归

7.4 Recursion in Linguistic Structure 语言结构中的递归

Building Nested Structure with Cascaded Chunkers 用逐位分块器构建嵌套结构

So far, our chunk structures have been relatively flat. Trees consist of tagged tokens, optionally grouped under a chunk node such as NP. However, it is possible to build chunk structures of arbitrary depth, simply by creating a multi-stage chunk grammar containing recursive rules. Example 7.10 has patterns for noun phrases, prepositional phrases, verb phrases, and sentences. This is a four-stage chunk grammar, and can be used to create structures having a depth of at most four.

grammar = r"""

NP: {<DT|JJ|NN.*>+} # Chunk sequences of DT, JJ, NN

PP: {<IN><NP>} # Chunk prepositions followed by NP

VP: {<VB.*><NP|PP|CLAUSE>+$} # Chunk verbs and their arguments

CLAUSE: {<NP><VP>} # Chunk NP, VP

"""

cp = nltk.RegexpParser(grammar)

sentence = [("Mary", "NN"), ("saw", "VBD"), ("the", "DT"), ("cat", "NN"),

("sit", "VB"), ("on", "IN"), ("the", "DT"), ("mat", "NN")]

>>> print cp.parse(sentence)

(NP Mary/NN)

saw/VBD

(CLAUSE

(NP the/DT cat/NN)

(VP sit/VB (PP on/IN (NP the/DT mat/NN)))))

Example 7.10 (code_cascaded_chunker.py)

Unfortunately this result misses the VP headed by saw. It has other shortcomings too. Let's see what happens when we apply this chunker to a sentence having deeper nesting. Notice that it fails to identify the VP chunk starting at.

>>> sentence = [("John", "NNP"), ("thinks", "VBZ"), ("Mary", "NN"),

... ("saw", "VBD"), ("the", "DT"), ("cat", "NN"), ("sit", "VB"),

... ("on", "IN"), ("the", "DT"), ("mat", "NN")]

>>> print cp.parse(sentence)

(NP John/NNP)

thinks/VBZ

(NP Mary/NN)

saw/VBD # [_saw-vbd]

(CLAUSE

(NP the/DT cat/NN)

(VP sit/VB (PP on/IN (NP the/DT mat/NN)))))

The solution to these problems is to get the chunker to loop over its patterns: after trying all of them, it repeats the process. We add an optional second argument loop to specify the number of times the set of patterns should be run:

>>> cp = nltk.RegexpParser(grammar, loop=2)

>>> print cp.parse(sentence)

(NP John/NNP)

thinks/VBZ

(CLAUSE

(NP Mary/NN)

(VP

saw/VBD

(CLAUSE

(NP the/DT cat/NN)

(VP sit/VB (PP on/IN (NP the/DT mat/NN)))))))

Note

This cascading process enables us to create deep structures. However, creating and debugging a cascade is difficult, and there comes a point where it is more effective to do full parsing (see Chapter 8). Also, the cascading process can only produce trees of fixed depth (no deeper than the number of stages in the cascade), and this is insufficient for complete syntactic analysis.

Trees 树

A tree is a set of connected labeled nodes, each reachable by a unique path from a distinguished root node. Here's an example of a tree (note that they are standardly drawn upside-down（颠倒的）):

（4）

We use a 'family' metaphor（比喻） to talk about the relationships of nodes in a tree: for example, S is the parent of VP; conversely VP is a child of S. Also, since NP and VP are both children of S, they are also siblings（兄弟）. For convenience, there is also a text format for specifying trees:

(NP Alice)

(VP

(V chased)

(NP

(Det the)

(N rabbit))))

Although we will focus on syntactic trees, trees can be used to encode any homogeneous hierarchical structure（齐次分层结构） that spans a sequence of linguistic forms (e.g. morphological structure（形态结构）, discourse structure（话语结构）). In the general case, leaves and node values do not have to be strings.

In NLTK, we create a tree by giving a node label and a list of children:

>>> tree1 = nltk.Tree('NP', ['Alice'])

>>> print tree1

(NP Alice)

>>> tree2 = nltk.Tree('NP', ['the', 'rabbit'])

>>> print tree2

(NP the rabbit)

We can incorporate these into successively larger trees as follows:

>>> tree3 = nltk.Tree('VP', ['chased', tree2])

>>> tree4 = nltk.Tree('S', [tree1, tree3])

>>> print tree4

(S (NP Alice) (VP chased (NP the rabbit)))

Here are some of the methods available for tree objects:

>>> print tree4[1]

(VP chased (NP the rabbit))

>>> tree4[1].node

'VP'

>>> tree4.leaves()

['Alice', 'chased', 'the', 'rabbit']

>>> tree4[1][1][1]

'rabbit'

The bracketed representation for complex trees can be difficult to read. In these cases, the draw method can be very useful. It opens a new window, containing a graphical representation of the tree. The tree display window allows you to zoom in and out, to collapse and expand subtrees, and to print the graphical representation to a postscript file（备注文件） (for inclusion in a document).

>>> tree3.draw()

Tree Traversal 树遍历

It is standard to use a recursive function to traverse a tree. The listing in Example 7.11 demonstrates this.

def traverse(t):

try:

t.node

except AttributeError:

print t,

else:

# Now we know that t.node is defined

print '(', t.node,

for child in t:

traverse(child)

print ')',

>>> t = nltk.Tree('(S (NP Alice) (VP chased (NP the rabbit)))')

>>> traverse(t)

( S ( NP Alice ) ( VP chased ( NP the rabbit ) ) )

Example 7.11 (code_traverse.py):

Note

We have used a technique called duck typing（动态类型） to detect that t is a tree (i.e. t.node is defined).