Natural Language Processing 3.5: Useful Applications of Regular Expressions

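1. Extracting word pieces

The re.findall() method returns all (non-overlapping) substrings that match a given regular expression. For example, we can find all the vowels in a word and count them: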

>>> import re
>>> word = 'supercalifragilisticexpialidocious'
>>> print(re.findall(r'[aeiou]', word))
['u', 'e', 'a', 'i', 'a', 'i', 'i', 'i', 'e', 'i', 'a', 'i', 'o', 'i', 'o', 'u']
>>> print(len(re.findall(r'[aeiou]', word)))
16
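
As a small extension of the example above, the matches returned by re.findall() can be tallied per vowel with collections.Counter from the standard library. This is only a sketch: the names vowel_counts and pairs are my own, and the made-up string 'aaaaa' is there just to show what "non-overlapping" means.

```python
import re
from collections import Counter

word = 'supercalifragilisticexpialidocious'

# Tally how often each vowel occurs in the word.
vowel_counts = Counter(re.findall(r'[aeiou]', word))
print(vowel_counts['i'])            # 7
print(sum(vowel_counts.values()))   # 16

# Matches do not overlap: after one 'aa' is consumed, scanning resumes
# right after it, so the fifth 'a' is left unmatched.
pairs = re.findall(r'aa', 'aaaaa')
print(pairs)                        # ['aa', 'aa']
```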

As another example, we can find all sequences of two or more vowels in a text and determine their relative frequency:

>>> import nltk
>>> wsj = sorted(set(nltk.corpus.treebank.words()))
>>> fd = nltk.FreqDist(vs for word in wsj
...                    for vs in re.findall(r'[aeiou]{2,}', word))
>>> print(list(fd.items()))
[('iao', 1), ('ioa', 1), ('oa', 59), ('ao', 6), ('uu', 1), ('eou', 5), ('eo', 39), ('aiia', 1), 
('uo', 8), ('eea', 1), ('ai', 261), ('ui', 95), ('oei', 1), ('iai', 1), ('oui', 6), ('uie', 3), ('aii', 1), ('ooi', 1), ...]
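
The same counting idea can be tried without downloading the treebank corpus. In the sketch below, collections.Counter stands in for nltk.FreqDist, and the five-word list is a hypothetical mini-vocabulary I made up for illustration; the relative frequency is simply a count divided by the total.

```python
import re
from collections import Counter

# Hypothetical mini-vocabulary standing in for the treebank word list.
words = sorted(set(['queue', 'beautiful', 'bookkeeper', 'audio', 'tree']))

# Count every run of two or more vowels across all words.
vowel_runs = Counter(vs for w in words
                     for vs in re.findall(r'[aeiou]{2,}', w))

total = sum(vowel_runs.values())
print(vowel_runs.most_common(1))   # [('ee', 2)]
print(vowel_runs['ee'] / total)    # relative frequency of 'ee' (2 out of 7)
```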

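2. Doing more with word pieces

It is often observed that English text is highly redundant: it remains easy to read even when word-internal vowels are left out. In the next example, the regular expression matches word-initial vowel sequences, word-final vowel sequences, and all consonants; everything else is dropped. The three alternatives are tried from left to right at each position, and as soon as one of them matches, the remaining alternatives are skipped.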

>>> regexp = r'^[aeiouAEIOU]+|[aeiouAEIOU]+$|[^aeiouAEIOU]'  # the pattern to match
>>> def compress(word):
...     pieces = re.findall(regexp, word)
...     return ''.join(pieces)  # glue the pieces back together with join()
...
>>> english_udhr = nltk.corpus.udhr.words('English-Latin1')
>>> print(nltk.tokenwrap(compress(w) for w in english_udhr[:75]))
Unvrsl Dclrtn of Hmn Rghts Prmble Whrs rcgntn of the inhrnt dgnty and
of the eql and inlnble rghts of all mmbrs of the hmn fmly is the fndtn
of frdm , jstce and pce in the wrld , Whrs dsrgrd and cntmpt fr hmn
rghts hve rsltd in brbrs acts whch hve outrgd the cnscnce of mnknd ,
and the advnt of a wrld in whch hmn bngs shll enjy frdm of spch and
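
To see which of the three alternatives fires for each piece, it can help to print the intermediate list returned by re.findall() for a few single words. The sketch below reuses the pattern and compress() from above and needs no corpus; the sample words are my own choice.

```python
import re

regexp = r'^[aeiouAEIOU]+|[aeiouAEIOU]+$|[^aeiouAEIOU]'

def compress(word):
    # Keep initial vowels, final vowels and every non-vowel character.
    return ''.join(re.findall(regexp, word))

for w in ['Universal', 'declaration', 'inalienable', 'equal', 'area']:
    print(w, '->', re.findall(regexp, w), '->', compress(w))
```

Note that a word such as 'area' comes out unchanged, because its only vowels are in initial or final position.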

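Next, we combine regular expressions with conditional frequency distributions. Here we extract all consonant-vowel sequences, such as ka and si, from the Rotokas vocabulary. Since each match is a pair of characters, the list of matches can be used directly to initialize a conditional frequency distribution.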

>>> rotokas_words = nltk.corpus.toolbox.words('rotokas.dic')
>>> cvs = [cv for w in rotokas_words for cv in re.findall(r'[ptksvr][aeiou]', w)]
>>> cfd = nltk.ConditionalFreqDist(cvs)
>>> cfd.tabulate()
    a   e   i   o   u 
k 418 148  94 420 173 
p  83  31 105  34  51 
r 187  63  84  89  79 
s   0   0 100   2   1 
t  47   8   0 148  37 
v  93  27 105  48  49
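
The reason a list of two-character strings works here is that each string unpacks into a (condition, sample) pair: the consonant becomes the condition and the vowel the sample. A minimal sketch of the same bookkeeping, using defaultdict and Counter as a stand-in for nltk.ConditionalFreqDist, and three Rotokas words taken from the index output shown later in this section:

```python
import re
from collections import Counter, defaultdict

# Three words from the Rotokas lexicon (copied from the output further down).
words = ['kasuari', 'kaipori', 'kakiri']

cvs = [cv for w in words for cv in re.findall(r'[ptksvr][aeiou]', w)]

# table[consonant][vowel] -> count; a rough stand-in for ConditionalFreqDist.
table = defaultdict(Counter)
for consonant, vowel in cvs:      # each 2-character string unpacks into a pair
    table[consonant][vowel] += 1

for consonant in sorted(table):
    print(consonant, dict(table[consonant]))
```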

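Looking at the s and t rows, we see that they are nearly complementary: s occurs almost exclusively before i, which is exactly where t never occurs. This is evidence that they are not distinct phonemes in the language. We could therefore drop s from the Rotokas alphabet and add a pronunciation rule instead: t is pronounced s when it is followed by i.

If we want to inspect the words behind the numbers in the table, we need an index that maps each consonant-vowel pair to the list of words containing it. For example, cv_index['su'] gives all the words that contain 'su'.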

>>> cv_word_pairs = [(cv, w) for w in rotokas_words for cv in re.findall(r'[ptksvr][aeiou]', w)]
>>> cv_index = nltk.Index(cv_word_pairs)
>>> print(cv_index['su'])
['kasuari']
>>> print(cv_index['ri'])
['kaipori', 'kaiporipie', 'kaiporivira', 'kairi', 'kairiro', 'kakiri', 'kapokari', 'kapokarito', 'Karepirie',...]
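
The index itself is just a mapping from a key to the list of values seen with that key. The sketch below builds the same kind of structure with collections.defaultdict(list), used here as a stand-in for nltk.Index, over three of the words listed above.

```python
import re
from collections import defaultdict

words = ['kasuari', 'kaipori', 'kakiri']

# Map every consonant-vowel pair to the words in which it occurs.
cv_index = defaultdict(list)
for w in words:
    for cv in re.findall(r'[ptksvr][aeiou]', w):
        cv_index[cv].append(w)

print(cv_index['su'])   # ['kasuari']
print(cv_index['ri'])   # ['kasuari', 'kaipori', 'kakiri']
```

A word is appended once per match, so a word that contains the same pair twice would appear twice in its list.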

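3. Finding word stems

When we use a web search engine, we usually do not mind if the words in a document differ from our search terms only in their endings. For example, 'laptops' and 'laptop' are just two forms of the same word. For some processing tasks we want to ignore the endings and work with the word stems only.

There are many ways to extract a stem. Here is a simple, direct approach: strip off anything that looks like a suffix.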

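>>> def stem(word):
...     for suffix in ['ing', 'ly', 'ed', 'ious', 'ies', 'ive', 'es', 's', 'ment']:
...         if word.endswith(suffix):
...             return word[:-len(suffix)]
...     return word  # no known suffix: return the word unchanged instead of None
...
>>> print(stem('string'))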
str
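
The result 'str' for 'string' shows how blunt this approach is: 'ing' is stripped even though it is part of the word itself. One possible mitigation, which is my own assumption and not part of the original method, is to refuse a suffix when the remaining stem would be too short. The function name strip_suffix and the parameter min_stem below are hypothetical.

```python
SUFFIXES = ['ing', 'ly', 'ed', 'ious', 'ies', 'ive', 'es', 's', 'ment']

def strip_suffix(word, min_stem=4):
    """Strip the first matching suffix, unless the stem would get too short.

    min_stem is an illustrative threshold, not a linguistically motivated value.
    """
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[:-len(suffix)]
    return word

for w in ['processing', 'string', 'ponds', 'is']:
    print(w, '->', strip_suffix(w))
```

This only hides some of the errors; it does not make the suffix list any smarter.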

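The same task can be done with a regular expression. The first step is to build a disjunction of all the suffixes and enclose it in parentheses to limit the scope of the disjunction.

>>> print(re.findall(r'^.*(ing|ly|ed|ious|ies|ive|es|s|ment)$', 'processing'))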
['ing']

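Although the regular expression matched the whole word, re.findall() returned only the suffix. This is because parentheses have a second function: they select the substring to be extracted. If we want parentheses to delimit the scope of the disjunction without selecting what is returned, we have to add '?:' after the opening parenthesis. Here is the revised version:

>>> print(re.findall(r'^.*(?:ing|ly|ed|ious|ies|ive|es|s|ment)$', 'processing'))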
['processing']
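
The two roles of parentheses can also be seen on a match object, where group(0) is always the whole match and group(1) is the first capturing group. The snippet below is a small sketch with my own variable names; it builds the two patterns from a shared suffix string so that they differ only in '?:'.

```python
import re

suffixes = r'ing|ly|ed|ious|ies|ive|es|s|ment'
word = 'processing'

# Capturing group: the whole word matches, but group(1) is only the suffix.
m = re.search(r'^.*(' + suffixes + r')$', word)
print(m.group(0))   # processing
print(m.group(1))   # ing

# findall() returns the group if there is one, otherwise the whole match.
with_group = re.findall(r'^.*(' + suffixes + r')$', word)
without_group = re.findall(r'^.*(?:' + suffixes + r')$', word)
print(with_group)      # ['ing']
print(without_group)   # ['processing']
```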

However, what we actually want is to split the word into a stem and a suffix, so we should put parentheses around both parts:

>>> print(re.findall(r'(^.*)(ing|ly|ed|ious|ies|ive|es|s|ment)$', 'processing'))
[('process', 'ing')]

This looks promising, but there is still a problem. Let's try a different word:

>>> print(re.findall(r'(^.*)(ing|ly|ed|ious|ies|ive|es|s|ment)$', 'processes'))
[('processe', 's')]

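The regular expression incorrectly found the suffix '-s' instead of '-es'. This reveals another subtlety: the '*' operator is greedy, so '.*' tries to consume as much of the input as possible. If we switch to the non-greedy version '*?', we get the result we want:

>>> print(re.findall(r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)$', 'processes'))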
[('process', 'es')]
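
The difference between greedy and non-greedy matching is easier to see on a tiny made-up string first. The sketch below then repeats the comparison on the word 'flies', which is my own example rather than one from the text above.

```python
import re

# Greedy '.*' runs to the last 'b'; non-greedy '.*?' stops at the first one.
greedy = re.findall(r'a.*b', 'aXbYb')
lazy = re.findall(r'a.*?b', 'aXbYb')
print(greedy)   # ['aXbYb']
print(lazy)     # ['aXb']

suffixes = r'ing|ly|ed|ious|ies|ive|es|s|ment'

# Greedy stem: as long as possible, so only the final 's' is left as suffix.
greedy_split = re.findall(r'^(.*)(' + suffixes + r')$', 'flies')
# Non-greedy stem: as short as possible, so the longest usable suffix wins.
lazy_split = re.findall(r'^(.*?)(' + suffixes + r')$', 'flies')
print(greedy_split)   # [('flie', 's')]
print(lazy_split)     # [('fl', 'ies')]
```

Neither split of 'flies' is a good stem, which is a reminder that the non-greedy version fixes the choice of suffix but not the crudeness of the suffix list.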

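The pattern also handles words with no suffix at all, once we make the second parenthesized group optional so that the suffix can be empty:

>>> print(re.findall(r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)$', 'language'))
[]
>>> print(re.findall(r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)?$', 'language'))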
[('language', '')]

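Compare the two calls above carefully. Without the '?', a word that ends in none of the listed suffixes does not match at all, so findall() returns an empty list; with it, the whole word is returned as the stem, paired with an empty suffix.

This approach still has many problems, but we will go ahead and define a function that performs stemming and apply it to a whole text.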
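
This matters in practice as soon as the result is indexed with [0]: an empty list would raise an IndexError. The sketch below makes the two behaviours explicit; the helper name stem_or_none and the pattern names strict and optional are hypothetical and only serve the comparison.

```python
import re

strict = r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)$'     # suffix required
optional = r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)?$'  # suffix may be empty

def stem_or_none(word, regexp):
    """Return the stem, or None when the pattern does not match at all."""
    hits = re.findall(regexp, word)
    return hits[0][0] if hits else None

print(stem_or_none('language', strict))     # None
print(stem_or_none('language', optional))   # language
print(stem_or_none('processes', optional))  # process
```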

>>> def stem(word):
...     regexp = r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)?$'
...     stem, suffix = re.findall(regexp, word)[0]
...     return stem
...
>>> raw = """DENNIS: Listen, strange women lying in ponds distributing swords
... is no basis for a system of government. Supreme executive power derives from
... a mandate from the masses, not from some farcical aquatic ceremony."""
>>> tokens = nltk.word_tokenize(raw)
>>> print([stem(t) for t in tokens])
['DENNIS', ':', 'Listen', ',', 'strange', 'women', 'ly', 'in', 'pond', 'distribut', 'sword', 'i', 'no', 'basi', 'for', 'a', 'system', 'of', 'govern', '.', 'Supreme', 'execut', 'power', 'deriv', 'from', 'a', 'mandate', 'from', 'the', 'mass', ',', 'not', 'from', 'some', 'farcical', 'aquatic', 'ceremony', '.']
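
The output shows the weaknesses directly: 'lying' becomes 'ly', 'is' becomes 'i' and 'basis' becomes 'basi'. For readers who want to reproduce the first part of this run without installing the NLTK tokenizer data, here is a sketch that swaps in a crude regular-expression tokenizer. The pattern r"\w+|[^\w\s]" is my own rough stand-in for nltk.word_tokenize, not an equivalent, and STEM_RE is a name I introduced.

```python
import re

# Same stemming pattern as above, compiled once for reuse.
STEM_RE = re.compile(r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)?$')

def stem(word):
    # The optional suffix group means the pattern matches any single-line token.
    return STEM_RE.match(word).group(1)

raw = "DENNIS: Listen, strange women lying in ponds distributing swords"

# Crude tokenizer: runs of word characters, or single punctuation marks.
tokens = re.findall(r"\w+|[^\w\s]", raw)
stems = [stem(t) for t in tokens]
print(stems)
```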

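4. Searching tokenized text

A special kind of regular expression can be used to search across multiple words in a text. For example, '<a><man>' finds all instances of 'a man' in the text. The angle brackets mark token boundaries, and any whitespace between the angle brackets is ignored (this behavior is specific to NLTK's findall() method for texts). In the following example we include <.*>, which matches any single token, and enclose it in parentheses so that only the matched word (e.g. monied) is returned and not the whole matched phrase (e.g. a monied man).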

>>> from nltk.corpus import gutenberg, nps_chat
>>> moby = nltk.Text(gutenberg.words('melville-moby_dick.txt'))
>>> moby.findall(r'<a>(<.*>)<man>')
monied; nervous; dangerous; white; white; white; pious; queer; good;
mature; white; Cape; great; wise; wise; butterless; white; fiendish;
pale; furious; better; certain; complete; dismasted; younger; brave;
brave; brave; brave
>>> # For comparison, the same search without the parentheses
>>> moby.findall(r'<a><.*><man>')
a monied man; a nervous man; a dangerous man; a white man; a white
man; a white man; a pious man; a queer man; a good man; a mature man;
a white man; a Cape man; a great man; a wise man; a wise man; a
butterless man; a white man; a fiendish man; a pale man; a furious
man; a better man; a certain man; a complete man; a dismasted man; a
younger man; a brave man; a brave man; a brave man; a brave man

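The next example finds three-word phrases that end with the word 'bro', and then sequences of three or more words that each start with the letter 'l'.

>>> chat = nltk.Text(nps_chat.words())
>>> chat.findall(r'<.*><.*><bro>')
you rule bro; telling you bro; u twizted bro
>>> chat.findall(r'<l.*>{3,}')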
lol lol lol; lmao lol lol; lol lol lol; la la la la la; la la la; la
la la; lovely lol lol love; lol lol lol.; la la la; la la la
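
To make the angle-bracket notation less mysterious, here is a simplified sketch of how such a token search could be implemented with the plain re module: wrap every token in angle brackets, rewrite the pattern so that '.' cannot cross a token boundary, and search the joined string. This is an illustration under my own assumptions, not NLTK's actual code; the function name token_findall is hypothetical, and the sketch assumes tokens never contain '<' or '>'.

```python
import re

def token_findall(tokens, pattern):
    """Search a token list with an NLTK-style '<...>' pattern (simplified)."""
    # Encode the text as one string: ['a', 'man'] -> '<a><man>'.
    raw = ''.join('<' + t + '>' for t in tokens)
    pattern = re.sub(r'\s', '', pattern)              # whitespace is ignored
    pattern = re.sub(r'<', '(?:<(?:', pattern)        # open a token
    pattern = re.sub(r'>', ')>)', pattern)            # close a token
    pattern = re.sub(r'(?<!\\)\.', '[^>]', pattern)   # '.' stays inside a token
    hits = re.findall(pattern, raw)
    # Decode each hit back into a list of tokens.
    return [h[1:-1].split('><') for h in hits]

tokens = ['a', 'monied', 'man', 'saw', 'a', 'white', 'man']
print(token_findall(tokens, r'<a>(<.*>)<man>'))  # [['monied'], ['white']]
print(token_findall(tokens, r'<a> <.*> <man>'))  # whole phrases, spaces ignored

chat_tokens = ['lol', 'lol', 'lol', 'ok', 'la', 'la']
print(token_findall(chat_tokens, r'<l.*>{3,}'))  # [['lol', 'lol', 'lol']]
```

Because each token is wrapped in a non-capturing group, a quantifier such as {3,} applies to whole tokens, and the only capturing groups are the ones written explicitly in the pattern.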