8 Analyzing Sentence Structure

1. Some Grammatical Dilemmas

Ubiquitous Ambiguity

2. The Use of Grammar

One benefit of studying grammar is that it provides a conceptual framework and a vocabulary for spelling out these intuitions.

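Constituent structure is based on the observation that words combine with other words to form units. The evidence that a sequence of words forms such a unit is substitutability: a sequence of words in a well-formed sentence can be replaced by a shorter sequence without making the sentence ill-formed.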

3. Context-Free Grammar

A Simple Grammar

import nltk

groucho_grammer = nltk.CFG.fromstring("""
s -> NP VP
VP -> V NP | V NP PP
PP -> P NP
V -> "saw" | "ate" | "walked"
NP -> "John" | "Mary" | "Bob" | Det N | Det N PP
Det -> "a" | "an" | "the" | "my"
N -> "man" | "dog" | "cat" | "telescope" | "park"
P -> "in" | "on" | "by" | "with"
""")

sent = ['Mary','saw' ,'Bob']
parser = nltk.RecursiveDescentParser(groucho_grammer)
trees = parser.parse(sent)
for tree in trees:
    print(tree)  #(s (NP Mary) (VP (V saw) (NP Bob)))

4. Parsing with Context-Free Grammar

# Load the grammar from a local file (mygrammar.cfg must exist in the working directory)
grammar1 = nltk.data.load('file:mygrammar.cfg')
sent = 'Mary saw Bob'.split()
rd_parser = nltk.RecursiveDescentParser(grammar1, trace=2)
for tree in rd_parser.parse(sent):
    print(tree)  # (s (NP Mary) (VP (V saw) (NP Bob)))
for p in grammar1.productions():
    print(p)

# Recursive descent parser [top-down]
rd_parser = nltk.RecursiveDescentParser(grammar1)
sent = 'Mary saw a dog'.split()
for t in rd_parser.parse_all(sent):
    print(t)#(s (NP Mary) (VP (V saw) (NP (Det a) (N dog))))

# Shift-reduce parser [bottom-up]
sr_parse = nltk.ShiftReduceParser(grammar1)
sent = 'Mary saw a dog'.split()
for t in sr_parse.parse(sent):
    print(t)#(s (NP Mary) (VP (V saw) (NP (Det a) (N dog))))
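
To make the bottom-up idea concrete, here is a minimal sketch of a greedy shift-reduce recognizer in plain Python. It is not NLTK's implementation: the `LEXICON` and `RULES` tables and the `shift_reduce` function are hypothetical names, loosely mirroring the simple grammar from section 3, and the reduce-before-shift strategy without backtracking is an assumption made for brevity.

```python
# Hypothetical tables modeled on the simple grammar in section 3.
LEXICON = {"Mary": "NP", "Bob": "NP", "saw": "V", "a": "Det", "the": "Det",
           "dog": "N", "park": "N", "in": "P"}
RULES = [
    ("S", ["NP", "VP"]),
    ("VP", ["V", "NP", "PP"]),
    ("VP", ["V", "NP"]),
    ("PP", ["P", "NP"]),
    ("NP", ["Det", "N", "PP"]),
    ("NP", ["Det", "N"]),
]

def shift_reduce(tokens):
    """Return True if greedy shift-reduce ends with exactly one S on the stack."""
    stack = []
    remaining = list(tokens)
    while True:
        # Reduce: replace the top of the stack if it matches a right-hand side.
        for lhs, rhs in RULES:
            if stack[-len(rhs):] == rhs:
                stack[-len(rhs):] = [lhs]
                break
        else:
            # No reduction applies: shift the next word's category, or stop.
            if not remaining:
                break
            stack.append(LEXICON[remaining.pop(0)])
    return stack == ["S"]

print(shift_reduce("Mary saw a dog".split()))
print(shift_reduce("Mary saw a dog in the park".split()))
```

The second sentence is licensed by the grammar, yet this greedy strategy reduces to S too early and rejects it. A shift-reduce parser without backtracking can miss a parse that exists, which is why the choice between shifting and reducing matters.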

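# Left-corner parser
# A top-down parser with bottom-up filtering; it does not get trapped by left-recursive productions.
# Each time the parser considers a production, it checks that the next input word is compatible with at least one of the nonterminal categories in the left-corner table.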
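
The left-corner table mentioned above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the `PRODUCTIONS` list is a hand-copied subset of the simple grammar from section 3 (lexical rules omitted), `left_corner_table` is a hypothetical helper, and the table is closed transitively so that, for example, `Det` counts as a possible start of `S`.

```python
# Hypothetical (lhs, rhs) pairs; only the phrasal rules of the simple grammar.
PRODUCTIONS = [
    ("S", ("NP", "VP")),
    ("VP", ("V", "NP")),
    ("VP", ("V", "NP", "PP")),
    ("PP", ("P", "NP")),
    ("NP", ("Det", "N")),
    ("NP", ("Det", "N", "PP")),
]

def left_corner_table(productions):
    """Map each nonterminal to every category that can begin one of its expansions."""
    table = {}
    for lhs, rhs in productions:
        table.setdefault(lhs, set()).add(rhs[0])
    # Transitive closure: a left corner of my left corner is also my left corner.
    changed = True
    while changed:
        changed = False
        for corners in table.values():
            for corner in list(corners):
                extra = table.get(corner, set()) - corners
                if extra:
                    corners |= extra
                    changed = True
    return table

table = left_corner_table(PRODUCTIONS)
for lhs in sorted(table):
    print(lhs, sorted(table[lhs]))
```

With such a table, a parser about to expand `VP` can skip the attempt right away if the next word is not a `V`.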

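# Well-formed substring table (WFST)
# Dynamic programming stores intermediate results and reuses them when appropriate, which significantly improves efficiency. This approach is known as chart parsing.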
def init_wfst(tokens, grammar):
    numtokens = len(tokens)
    wfst = [[None for i in range(numtokens+1)] for j in range(numtokens+1)]
    for i in range(numtokens):
        productions = grammar.productions(rhs=tokens[i])
        wfst[i][i+1] = productions[0].lhs()
    return wfst

def complete_wfst(wfst, tokens, grammar, trace=False):
    index = dict((p.rhs(), p.lhs()) for p in grammar.productions())
    numtokens = len(tokens)
    for span in range(2, numtokens + 1):
        for start in range(numtokens + 1):
            end = start + span
            if end > numtokens: break
            for mid in range(start+1, end):
                nt1, nt2 = wfst[start][mid], wfst[mid][end]
                if nt1 and nt2 and (nt1, nt2) in index:
                    wfst[start][end] = index[(nt1, nt2)]
                    if trace:
                        print("[%s] %3s [%s] %3s [%s] ==> [%s] %3s [%s]"
                              %(start, nt1, mid, nt2, end, start, index[(nt1, nt2)], end))
    return wfst

def display(wfst, tokens):
    print('\nWFST ' + ' '.join(("%-4d" % i) for i in range(1, len(wfst))))
    for i in range(len(wfst)-1):
        print("%d    " %i, end="")
        for j in range(1, len(wfst)):
            print("%-4s" % (wfst[i][j] or '.'), end="")
        print("")
        
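tokens = "I shot an elephant in my pajamas".split()
# Note: grammar1 must contain a lexical production for every token,
# otherwise productions[0] in init_wfst raises IndexError.
wfst0 = init_wfst(tokens, grammar1)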
display(wfst0, tokens)
WFST 1    2    3    4    5    6    7
0    NP  .   .   .   .   .   .
1    .   V   .   .   .   .   .
2    .   .   Det .   .   .   .
3    .   .   .   N   .   .   .
4    .   .   .   .   P   .   .
5    .   .   .   .   .   Det .
6    .   .   .   .   .   .   N

wfst1 = complete_wfst(wfst0,tokens,grammar1,trace=True)
display(wfst1,tokens)
[2] Det [3]   N [4] ==> [2]  NP [4]
[5] Det [6]   N [7] ==> [5]  NP [7]
[1]   V [2]  NP [4] ==> [1]  VP [4]
[4]   P [5]  NP [7] ==> [4]  PP [7]
[0]  NP [1]  VP [4] ==> [0]   S [4]
[1]  VP [4]  PP [7] ==> [1]  VP [7]
[0]  NP [1]  VP [7] ==> [0]   S [7]

WFST 1    2    3    4    5    6    7
0    NP  .   .   S   .   .   S
1    .   V   .   VP  .   .   VP
2    .   .   Det NP  .   .   .
3    .   .   .   N   .   .   .
4    .   .   .   .   P   .   PP
5    .   .   .   .   .   Det NP
6    .   .   .   .   .   .   N

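5. Dependencies and Dependency Grammar

Phrase structure grammar is concerned with how words and sequences of words combine to form constituents.

A distinct and complementary approach, dependency grammar, focuses instead on how words relate to other words.

A dependency is a binary asymmetric relation that holds between a head and its dependents. The head of a sentence is usually taken to be the verb; every other word either depends on the head or is connected to it through a path of dependencies.

In contrast to phrase structure grammar, dependency grammar can express grammatical functions directly, as a type of dependency.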
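
The head and dependent relation described above can be illustrated without any parser. The sketch below hand-codes one reading of the sentence "I shot an elephant in my pajamas" as a dependent-to-head mapping. The names `HEAD_OF`, `find_root`, and `path_to_root` are hypothetical, and keying by word form only works because no word repeats in this sentence.

```python
words = "I shot an elephant in my pajamas".split()

# dependent -> head, for the reading where "in my pajamas" modifies "shot".
HEAD_OF = {
    "I": "shot",
    "elephant": "shot",
    "in": "shot",
    "an": "elephant",
    "pajamas": "in",
    "my": "pajamas",
}

def find_root(words, head_of):
    """The root is the only word that depends on nothing."""
    roots = [w for w in words if w not in head_of]
    if len(roots) != 1:
        raise ValueError("expected exactly one root, found %r" % roots)
    return roots[0]

def path_to_root(word, head_of):
    """Follow head links upward until the root is reached."""
    path = [word]
    while path[-1] in head_of:
        path.append(head_of[path[-1]])
    return path

print(find_root(words, HEAD_OF))
for w in words:
    print(" -> ".join(path_to_root(w, HEAD_OF)))
```

Every word reaches the verb through a chain of heads, and no pair of words heads each other, which is what "asymmetric" means here.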

groucho_dep_grammer = nltk.grammar.DependencyGrammar.fromstring("""
'shot' -> 'I' | 'elephant' | 'in'
'elephant' -> 'an' | 'in'
'in' -> 'pajamas'
'pajamas' -> 'my'
""")
print(groucho_dep_grammer)  # A dependency grammar captures only dependency information; it cannot specify the type of dependency
Dependency grammar with 7 productions
   'shot' -> 'I'
   'shot' -> 'elephant'
   'shot' -> 'in'
   'elephant' -> 'an'
   'elephant' -> 'in'
   'in' -> 'pajamas'
   'pajamas' -> 'my'

pdp = nltk.ProjectiveDependencyParser(groucho_dep_grammer)
sent = 'I shot an elephant in my pajamas'.split()
trees = pdp.parse(sent)
for tree in trees:
    print(tree)
(shot I (elephant an (in (pajamas my))))
(shot I (elephant an) (in (pajamas my)))

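Valency and the Lexicon

Transitive verbs, intransitive verbs, and so on are said to have different valencies. Valency restrictions apply not only to verbs but also to heads of other classes.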
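
One way to picture valency is as a lookup from each verb to the complement patterns it allows. The following is a rough sketch, not an NLTK feature: `VALENCY` and `licenses` are hypothetical names, and the three verbs follow the intransitive, transitive, and dative verb classes used in the probabilistic grammar later in these notes.

```python
# Hypothetical mini-lexicon: each verb lists the complement sequences it licenses.
VALENCY = {
    "ate": [()],                            # intransitive: no complement
    "saw": [("NP",)],                       # transitive: one NP object
    "gave": [("NP", "NP"), ("NP", "PP")],   # ditransitive: two complements
}

def licenses(verb, complements):
    """Return True if the verb allows exactly this sequence of complements."""
    return tuple(complements) in VALENCY.get(verb, [])

print(licenses("saw", ["NP"]))
print(licenses("ate", ["NP"]))
print(licenses("gave", ["NP", "PP"]))
```

In a context-free grammar the same restriction is usually encoded by splitting the verb category into subcategories, with one VP production per subcategory.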

6. Grammar Development

Treebanks and Grammars

from nltk.corpus import treebank
t = treebank.parsed_sents('wsj_0001.mrg')[0]
print(t)
 (S
   (NP-SBJ
     (NP (NNP Pierre) (NNP Vinken))
     (, ,)
     (ADJP (NP (CD 61) (NNS years)) (JJ old))
     (, ,))
   (VP
     (MD will)
     (VP
       (VB join)
       (NP (DT the) (NN board))
       (PP-CLR (IN as) (NP (DT a) (JJ nonexecutive) (NN director)))
       (NP-TMP (NNP Nov.) (CD 29))))
   (. .))

# Search the treebank for VPs that take a sentential complement
def filter(tree):
    child_nodes = [child.label() for child in tree if isinstance(child, nltk.Tree)]
    return (tree.label() == 'VP') and ('S' in child_nodes)

from nltk.corpus import treebank
res = [subtree for tree in treebank.parsed_sents()
       for subtree in tree.subtrees(filter)]
print(res)


from collections import defaultdict

entries = nltk.corpus.ppattach.attachments('training')
table = defaultdict(lambda: defaultdict(set))
for entry in entries:
    key = entry.noun1 + '-' + entry.prep + '-' + entry.noun2
    table[key][entry.attachment].add(entry.verb)

for key in sorted(table):
    if len(table[key]) > 1:
        print(key, 'N:', sorted(table[key]['N']), 'V:', sorted(table[key]['V']))

 %-below-level N: ['left'] V: ['be']
 %-from-year N: ['was'] V: ['declined', 'dropped', 'fell', 'grew', 'increased', 'plunged', 'rose', 'was']


nltk.corpus.sinica_treebank.parsed_sents()[3450].draw()

# Pernicious ambiguity
grammar = nltk.CFG.fromstring("""
 S -> NP V NP
 NP -> NP Sbar
 Sbar -> NP V
 NP -> 'fish'
 V -> 'fish'
 """)
tokens = ["fish"] * 5
cp = nltk.ChartParser(grammar)
for tree in cp.parse(tokens):
    print(tree)
(S (NP fish) (V fish) (NP (NP fish) (Sbar (NP fish) (V fish))))
(S (NP (NP fish) (Sbar (NP fish) (V fish))) (V fish) (NP fish))
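
To see how quickly this ambiguity grows, the parses can be counted without building any trees. The sketch below is a hand-written counter for this specific "fish" grammar, not an NLTK call: the functions `count_np`, `count_sbar`, and `count_s` are hypothetical helpers that mirror the productions one by one, relying on the fact that every token is the same word.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def count_np(n):
    """Number of ways n copies of 'fish' can form an NP."""
    if n < 1:
        return 0
    total = 1 if n == 1 else 0          # NP -> 'fish'
    for k in range(1, n):               # NP -> NP Sbar, split as k + (n - k)
        total += count_np(k) * count_sbar(n - k)
    return total

@lru_cache(maxsize=None)
def count_sbar(n):
    """Sbar -> NP V, where V always covers exactly one token."""
    return count_np(n - 1) if n >= 2 else 0

def count_s(n):
    """S -> NP V NP: choose the size i of the first NP; V takes one token."""
    return sum(count_np(i) * count_np(n - i - 1) for i in range(1, n - 1))

for n in (3, 5, 7, 9):
    print(n, count_s(n))
```

For 5 tokens the count is 2, matching the two trees printed above; for 7 and 9 tokens it is already 5 and 14.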

# Weighted grammar
# Usage of "give" and "gave" in the Penn Treebank sample

def give(t):
    return (t.label() == 'VP' and len(t) > 2 and t[1].label() == 'NP'
            and (t[2].label() == 'PP-DTV' or t[2].label() == 'NP')
            and ('give' in t[0].leaves() or 'gave' in t[0].leaves()))

def sent(t):
    return ' '.join(token for token in t.leaves() if token[0] not in '*-0')

def print_node(t, width):
    output = "%s %s: %s / %s: %s" % (
        sent(t[0]), t[1].label(), sent(t[1]), t[2].label(), sent(t[2]))
    # Truncate long lines to the requested width
    if len(output) > width:
        output = output[:width] + "..."
    print(output)

for tree in nltk.corpus.treebank.parsed_sents():
    for t in tree.subtrees(give):
        print_node(t, 72)

 gave NP: the chefs / NP: a standing ovation
 give NP: advertisers / NP: discounts for maintaining or increasing ad sp...
 give NP: it / PP-DTV: to the politicians
 gave NP: them / NP: similar help
 give NP: them / NP:
 give NP: only French history questions / PP-DTV: to students in a Europe...
 give NP: federal judges / NP: a raise
 give NP: consumers / NP: the straight scoop on the U.S. waste crisis
 gave NP: Mitsui / NP: access to a high-tech medical product
 give NP: Mitsubishi / NP: a window on the U.S. glass industry
 give NP: much thought / PP-DTV: to the rates she was receiving , nor to ...
 give NP: your Foster Savings Institution / NP: the gift of hope and free...
 give NP: market operators / NP: the authority to suspend trading in futu...
 gave NP: quick approval / PP-DTV: to $ 3.18 billion in supplemental appr...
 give NP: the Transportation Department / NP: up to 50 days to review any...
 give NP: the president / NP: such power
 give NP: me / NP: the heebie-jeebies
 give NP: holders / NP: the right , but not the obligation , to buy a cal...
 gave NP: Mr. Thomas / NP: only a `` qualified '' rating , rather than ``...
 give NP: the president / NP: line-item veto power


# Probabilistic context-free grammar (PCFG): the probabilities of all productions with a given left-hand side must sum to 1
grammar = nltk.PCFG.fromstring("""
S -> NP VP         [1.0]
VP -> TV NP        [0.4]
VP -> IV           [0.3]
VP -> DatV NP NP   [0.3]
TV -> 'saw'        [1.0]
IV -> 'ate'        [1.0]
DatV -> 'gave'     [1.0]
NP -> 'telescopes' [0.8]
NP -> 'Jack'       [0.2]
""")
print(grammar)

 Grammar with 9 productions (start state = S)
     S -> NP VP [1.0]
     VP -> TV NP [0.4]
     VP -> IV [0.3]
     VP -> DatV NP NP [0.3]
     TV -> 'saw' [1.0]
     IV -> 'ate' [1.0]
     DatV -> 'gave' [1.0]
     NP -> 'telescopes' [0.8]
     NP -> 'Jack' [0.2]

viterbi_parser = nltk.ViterbiParser(grammar)
for t in viterbi_parser.parse(['Jack','saw','telescopes']):  # the trees returned by parse carry their probability
    print(t)  #(S (NP Jack) (VP (TV saw) (NP telescopes))) (p=0.064)
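
The probability reported for this tree is simply the product of the probabilities of the productions it uses. The sketch below re-derives it by hand in plain Python, and also checks the sum-to-one constraint. The `RULES` table is a manual copy of the grammar above, and `USED` lists the productions in the parse of "Jack saw telescopes"; both names are hypothetical.

```python
import math
from collections import defaultdict

# Hand-copied (lhs, rhs) -> probability table for the PCFG above.
RULES = {
    ("S", ("NP", "VP")): 1.0,
    ("VP", ("TV", "NP")): 0.4,
    ("VP", ("IV",)): 0.3,
    ("VP", ("DatV", "NP", "NP")): 0.3,
    ("TV", ("saw",)): 1.0,
    ("IV", ("ate",)): 1.0,
    ("DatV", ("gave",)): 1.0,
    ("NP", ("telescopes",)): 0.8,
    ("NP", ("Jack",)): 0.2,
}

# Constraint: probabilities sharing a left-hand side must sum to 1.
totals = defaultdict(float)
for (lhs, _rhs), p in RULES.items():
    totals[lhs] += p
for lhs, total in sorted(totals.items()):
    print(lhs, math.isclose(total, 1.0))

# Productions used in (S (NP Jack) (VP (TV saw) (NP telescopes))).
USED = [
    ("S", ("NP", "VP")),
    ("NP", ("Jack",)),
    ("VP", ("TV", "NP")),
    ("TV", ("saw",)),
    ("NP", ("telescopes",)),
]
tree_prob = math.prod(RULES[rule] for rule in USED)
print(round(tree_prob, 3))
```

The product 1.0 × 0.2 × 0.4 × 1.0 × 0.8 comes to 0.064, the value printed next to the tree by the Viterbi parser.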