1. Text Tokenization


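Earlier we discussed text structure, components, and representation. Specifically, a token is the smallest independent textual component that carries definite syntactic and semantic meaning. A passage of text or a text document has several components, including sentences, which can be broken down further into clauses, phrases, and words. The most popular tokenization techniques are sentence tokenization and word tokenization, which break a text corpus into sentences and each sentence into words. Tokenization can therefore be defined as the process of breaking down or splitting textual data into smaller, meaningful components called tokens.

Sentence Tokenization

Sentence tokenization is the process of splitting a text corpus into sentences, which act as the first level of tokens that make up the corpus. It is also known as sentence segmentation, because we try to segment the text into meaningful sentences. Any text corpus is a body of text in which each paragraph comprises several sentences.

There are various techniques for performing sentence tokenization. The basic ones look for specific delimiters between sentences, such as a period (.), a newline character (\n), or a semicolon (;). We will use the NLTK framework, which provides various interfaces for sentence tokenization, and focus mainly on the following sentence tokenizers: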

sent_tokenize
PunktSentenceTokenizer
RegexpTokenizer
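Before turning to the NLTK tokenizers, the basic delimiter-based technique described above can be sketched with the standard library alone. This is only an illustration: the function name naive_sentence_split and the demo text are made up for this example and are not part of NLTK.

```python
import re

def naive_sentence_split(text):
    """Split text on a period, semicolon, or newline and drop empty pieces."""
    pieces = re.split(r'[.;\n]', text)
    return [piece.strip() for piece in pieces if piece.strip()]

# Hypothetical demo text containing an abbreviation ("Dr.").
demo = "Dr. Smith arrived late; the talk had started.\nNobody minded."
print(naive_sentence_split(demo))
# ['Dr', 'Smith arrived late', 'the talk had started', 'Nobody minded']
```

Note that the abbreviation "Dr." is wrongly treated as a sentence boundary and the delimiters themselves are lost, which is why trained or more carefully designed tokenizers are preferred.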

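Before splitting text into sentences, we need some text to test on. We will load some sample text along with part of the Gutenberg corpus available in NLTK. The following snippet loads the necessary dependencies: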

import nltk
from nltk.corpus import gutenberg
from pprint import pprint

Note:

If you are running this for the first time, first execute:

import nltk
nltk.download('gutenberg')

This downloads the required collection of books. Once the download succeeds, run the following to inspect it:

In [7]: nltk.corpus.gutenberg.fileids()
Out[7]:
['austen-emma.txt',
 'austen-persuasion.txt',
 'austen-sense.txt',
 'bible-kjv.txt',
 'blake-poems.txt',
 'bryant-stories.txt',
 'burgess-busterbrown.txt',
 'carroll-alice.txt',
 'chesterton-ball.txt',
 'chesterton-brown.txt',
 'chesterton-thursday.txt',
 'edgeworth-parents.txt',
 'melville-moby_dick.txt',
 'milton-paradise.txt',
 'shakespeare-caesar.txt',
 'shakespeare-hamlet.txt',
 'shakespeare-macbeth.txt',
 'whitman-leaves.txt']

If the following error occurs when you run the code:


In [14]: alice = gutenberg.raw(fileids='carrolll-alice.txt')
---------------------------------------------------------------------------
BadZipFile                                Traceback (most recent call last)
<ipython-input-14-158d1a6a9aa4> in <module>()
----> 1 alice = gutenberg.raw(fileids='carrolll-alice.txt')
 
/usr/local/lib/python3.6/site-packages/nltk/corpus/util.py in __getattr__(self, attr)
    114             raise AttributeError("LazyCorpusLoader object has no attribute '__bases__'")
    115
--> 116         self.__load()
    117         # This looks circular, but its not, since __load() changes our
    118         # __class__ to something new:
 
/usr/local/lib/python3.6/site-packages/nltk/corpus/util.py in __load(self)
     76         else:
     77             try:
---> 78                 root = nltk.data.find('{}/{}'.format(self.subdir, self.__name))
     79             except LookupError as e:
     80                 try: root = nltk.data.find('{}/{}'.format(self.subdir, zip_name))
 
/usr/local/lib/python3.6/site-packages/nltk/data.py in find(resource_name, paths)
    653                                      [pieces[i] + '.zip'] + pieces[i:])
    654             try:
--> 655                 return find(modified_name, paths)
    656             except LookupError:
    657                 pass
 
/usr/local/lib/python3.6/site-packages/nltk/data.py in find(resource_name, paths)
    639                 if os.path.exists(p):
    640                     try:
--> 641                         return ZipFilePathPointer(p, zipentry)
    642                     except IOError:
    643                         # resource not in zipfile
 
/usr/local/lib/python3.6/site-packages/nltk/compat.py in _decorator(*args, **kwargs)
    219     def _decorator(*args, **kwargs):
    220         args = (args[0], add_py3_data(args[1])) + args[2:]
--> 221         return init_func(*args, **kwargs)
    222     return wraps(init_func)(_decorator)
    223
 
/usr/local/lib/python3.6/site-packages/nltk/data.py in __init__(self, zipfile, entry)
    486         """
    487         if isinstance(zipfile, string_types):
--> 488             zipfile = OpenOnDemandZipFile(os.path.abspath(zipfile))
    489
    490         # Normalize the entry string, it should be relative:
 
/usr/local/lib/python3.6/site-packages/nltk/compat.py in _decorator(*args, **kwargs)
    219     def _decorator(*args, **kwargs):
    220         args = (args[0], add_py3_data(args[1])) + args[2:]
--> 221         return init_func(*args, **kwargs)
    222     return wraps(init_func)(_decorator)
    223
 
/usr/local/lib/python3.6/site-packages/nltk/data.py in __init__(self, filename)
   1012         if not isinstance(filename, string_types):
   1013             raise TypeError('ReopenableZipFile filename must be a string')
-> 1014         zipfile.ZipFile.__init__(self, filename)
   1015         assert self.filename == filename
   1016         self.close()
 
/usr/local/Cellar/python/3.6.4_4/Frameworks/Python.framework/Versions/3.6/lib/python3.6/zipfile.py in __init__(self, file, mode, compression, allowZip64)
   1106         try:
   1107             if mode == 'r':
-> 1108                 self._RealGetContents()
   1109             elif mode in ('w', 'x'):
   1110                 # set the modified flag so central directory gets written
 
/usr/local/Cellar/python/3.6.4_4/Frameworks/Python.framework/Versions/3.6/lib/python3.6/zipfile.py in _RealGetContents(self)
   1173             raise BadZipFile("File is not a zip file")
   1174         if not endrec:
-> 1175             raise BadZipFile("File is not a zip file")
   1176         if self.debug > 1:
   1177             print(endrec)
 
BadZipFile: File is not a zip file

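then the problem lies with the network: the downloaded corpus archive is not a valid zip file. Retry the download from a machine that can reach the overseas servers hosting the NLTK data.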
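A truncated download can leave a broken archive on disk, so it may help to delete it before downloading again. The following is a minimal sketch using only the standard library; the helper name remove_if_corrupt is hypothetical, and the location of the real archive (typically a gutenberg.zip file under the corpora folder of your NLTK data directory) is an assumption that depends on your installation.

```python
import os
import tempfile
import zipfile

def remove_if_corrupt(archive_path):
    """Delete archive_path if it exists but is not a valid zip file.

    Returns True when a corrupt file was removed, otherwise False.
    """
    if os.path.exists(archive_path) and not zipfile.is_zipfile(archive_path):
        os.remove(archive_path)
        return True
    return False

# Demonstration on a throwaway file that mimics a truncated download.
with tempfile.TemporaryDirectory() as tmp_dir:
    fake_archive = os.path.join(tmp_dir, 'gutenberg.zip')
    with open(fake_archive, 'wb') as handle:
        handle.write(b'not really a zip archive')
    print(remove_if_corrupt(fake_archive))   # True
    print(os.path.exists(fake_archive))      # False
```

After removing the broken file, running nltk.download('gutenberg') again fetches a fresh copy.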

alice = gutenberg.raw(fileids='carroll-alice.txt')
sample_text = 'We will discuss briefly about the basic syntax, structure and design philosophies. There is a defined hierarchical syntax for Python code which you should remember when writing code! Python is a really powerful programming language!'

The following code shows the length of the "Alice in Wonderland" corpus and its first few lines:

In [12]: print(len(alice))
144395
  
In [13]: print(alice[0:100])
[Alice's Adventures in Wonderland by Lewis Carroll 1865]
 
CHAPTER I. Down the Rabbit-Hole
 
Alice was

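The nltk.sent_tokenize function is the default sentence tokenization function recommended by nltk. Internally it uses an instance of the PunktSentenceTokenizer class. This is not just an ordinary instance of that class, however: it has been pre-trained on models for several languages and works well on many languages besides English.

The following snippet shows the basic usage of this function on our sample text:

Note:

The first time you run this, you need to execute: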

nltk.download('punkt')

default_st = nltk.sent_tokenize
alice_sentences = default_st(text=alice)
sample_sentences = default_st(text=sample_text)
 
print('Total sentences in sample_text:', len(sample_sentences))
print('Sample text sentences :-')
pprint(sample_sentences)
print('\nTotal sentences in alice:', len(alice_sentences))
pprint(alice_sentences[0:5])

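Running the snippet above gives the following output, which shows the total number of sentences and what those sentences look like in the text corpus: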

Total sentences in sample_text: 3
Sample text sentences :-
['We will discuss briefly about the basic syntax, structure and design '
 'philosophies.',
 'There is a defined hierarchical syntax for Python code which you should '
 'remember when writing code!',
 'Python is a really powerful programming language!']
Total sentences in alice: 1625
["[Alice's Adventures in Wonderland by Lewis Carroll 1865]\n\nCHAPTER I.",
 'Down the Rabbit-Hole\n'
 '\n'
 'Alice was beginning to get very tired of sitting by her sister on the\n'
 'bank, and of having nothing to do: once or twice she had peeped into the\n'
 'book her sister was reading, but it had no pictures or conversations in\n'
 "it, 'and what is the use of a book,' thought Alice 'without pictures or\n"
 "conversation?'",
 'So she was considering in her own mind (as well as she could, for the\n'
 'hot day made her feel very sleepy and stupid), whether the pleasure\n'
 'of making a daisy-chain would be worth the trouble of getting up and\n'
 'picking the daisies, when suddenly a White Rabbit with pink eyes ran\n'
 'close by her.',
 'There was nothing so VERY remarkable in that; nor did Alice think it so\n'
 "VERY much out of the way to hear the Rabbit say to itself, 'Oh dear!",
 'Oh dear!']

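As you can see, the sentence tokenizer is quite intelligent: it does not rely on periods alone to delimit sentences, but also takes other punctuation and the capitalization of words into account.

We can also tokenize text in other languages. For German text, we can either use sent_tokenize, which is already trained, or load a pre-trained German tokenization model into a PunktSentenceTokenizer instance and perform the same operation. The following snippets show sentence tokenization for German.

First, load the German text corpus and inspect it:

Note:

The first time you run this, you need to execute: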

nltk.download('europarl_raw')

In [35]: from nltk.corpus import europarl_raw
 
In [36]: german_text = europarl_raw.german.raw(fileids='ep-00-01-17.de')
 
In [37]: print(len(german_text))
157171
 
In [38]: print(german_text[0:100])
 
Wiederaufnahme der Sitzungsperiode Ich erkläre die am Freitag , dem 17. Dezember unterbrochene Sit

Next, split the text corpus into sentences using both the default sent_tokenize tokenizer and a pre-trained German tokenizer loaded from the nltk resources:

In [40]: german_sentences_def = default_st(text=german_text, language='german')
 
In [41]: german_tokenizer = nltk.data.load(resource_url='tokenizers/punkt/german.pickle')
 
In [42]: german_sentences = german_tokenizer.tokenize(german_text)
 
In [43]: print(type(german_tokenizer))
<class 'nltk.tokenize.punkt.PunktSentenceTokenizer'>

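This shows that german_tokenizer is an instance of PunktSentenceTokenizer specialized for German.

Next, compare whether the sentences obtained from the default tokenizer are the same as those obtained from the pre-trained tokenizer; ideally the result should be True. After that, print some of the tokenized sentences: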

In [45]: print(german_sentences_def == german_sentences)
True
  
In [46]: for sent in german_sentences[0:5]:
   ....:     print(sent)
   ....:
 
Wiederaufnahme der Sitzungsperiode Ich erkläre die am Freitag , dem 17. Dezember unterbrochene Sitzungsperiode des Europäischen Parlaments für wiederaufgenommen , wünsche Ihnen nochmals alles Gute zum Jahreswechsel und hoffe , daß Sie schöne Ferien hatten .
Wie Sie feststellen konnten , ist der gefürchtete " Millenium-Bug " nicht eingetreten .
Doch sind Bürger einiger unserer Mitgliedstaaten Opfer von schrecklichen Naturkatastrophen geworden .
Im Parlament besteht der Wunsch nach einer Aussprache im Verlauf dieser Sitzungsperiode in den nächsten Tagen .
Heute möchte ich Sie bitten - das ist auch der Wunsch einiger Kolleginnen und Kollegen - , allen Opfern der Stürme , insbesondere in den verschiedenen Ländern der Europäischen Union , in einer Schweigeminute zu gedenken .

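The result confirms our earlier assumption: there are two ways to tokenize sentences in languages other than English. Using the default PunktSentenceTokenizer class directly is also a convenient way to tokenize sentences, as shown here: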

In [47]: punkt_st = nltk.tokenize.PunktSentenceTokenizer()
 
In [48]: sample_sentences = punkt_st.tokenize(sample_text)
 
In [49]: pprint(sample_sentences)
['We will discuss briefly about the basic syntax, structure and design '
 'philosophies.',
 'There is a defined hierarchical syntax for Python code which you should '
 'remember when writing code!',
 'Python is a really powerful programming language!']

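As you can see, we get the expected output. The last topic in sentence tokenization is using an instance of the RegexpTokenizer class to split text into sentences, where the boundaries are found with a regular-expression-based pattern.

The following code shows how to use a regular expression to split sentences: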

In [50]: SENTENCE_TOKENS_PATTERN = r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<![A-Z]\.)(?<=\.|\?|\!)\s'
 
In [51]: regex_st = nltk.tokenize.RegexpTokenizer(
   ....:        pattern=SENTENCE_TOKENS_PATTERN,
   ....:        gaps=True
   ....: )
 
In [52]: sample_sentences = regex_st.tokenize(sample_text)
 
In [53]: pprint(sample_sentences)
['We will discuss briefly about the basic syntax, structure and design '
 'philosophies.',
 'There is a defined hierarchical syntax for Python code which you should '
 'remember when writing code!',
 'Python is a really powerful programming language!']

The output above shows that, for this sample text, we obtain the same sentences as with the other tokenizers.
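To see what each lookbehind in such a pattern contributes, it can be tried with the standard library re module alone. The sketch below assumes the backslash-escaped form of the sentence pattern and uses re.split, which behaves much like RegexpTokenizer with gaps=True here; the demo text is made up for illustration.

```python
import re

# Assumed backslash-escaped form of the sentence-boundary pattern:
#   (?<!\w\.\w.)        do not split after abbreviations such as "i.e."
#   (?<![A-Z][a-z]\.)   do not split after titles such as "Mr."
#   (?<![A-Z]\.)        do not split after initials such as "J."
#   (?<=\.|\?|\!)\s     split at whitespace that follows ., ? or !
SENTENCE_BOUNDARY = r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<![A-Z]\.)(?<=\.|\?|\!)\s'

demo_text = "Mr. Fox met J. Doe at noon. They talked, i.e. argued! Was it useful?"
demo_sentences = re.split(SENTENCE_BOUNDARY, demo_text)
for sentence in demo_sentences:
    print(sentence)
# Mr. Fox met J. Doe at noon.
# They talked, i.e. argued!
# Was it useful?
```

The pattern is a heuristic: a sentence that genuinely ends in a short capitalized word followed by a period (for example "Go.") would not be split.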

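Word Tokenization

Word tokenization is the process of splitting a sentence into its constituent words. A sentence is a collection of words; word tokenization essentially splits a sentence into a list of words from which the sentence can be rebuilt. Word tokenization is important in many processes, especially text cleaning and normalization, where operations such as stemming and lemmatization work on each individual word based on its stem and lemma. As with sentence tokenization, nltk provides a variety of useful interfaces for word tokenization:

word_tokenize
TreebankWordTokenizer
RegexpTokenizer
Tokenizers inherited from RegexpTokenizer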

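We will use the sentence "The brown fox wasn't that quick and he couldn't win the race" as input to the various tokenizers. The nltk.word_tokenize function is the default and recommended word tokenizer in nltk. It is actually an instance of the TreebankWordTokenizer class and acts as a wrapper around that core class. The following code illustrates its usage: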

In [9]: sentence = "The brown fox wasn't that quick and he couldn't win the race"
 
In [10]: default_wt = nltk.word_tokenize
 
In [11]: words = default_wt(sentence)
 
In [12]: print(words)
['The', 'brown', 'fox', 'was', "n't", 'that', 'quick', 'and', 'he', 'could', "n't", 'win', 'the', 'race']

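TreebankWordTokenizer is based on the Penn Treebank and uses various regular expressions to tokenize text. A key assumption here is that sentence tokenization has already been performed. The original tokenizer used by the Penn Treebank is a sed script, available from https://catalog.ldc.upenn.edu/ldc99t42, which gives a brief idea of the patterns used to split sentences into words. Some of the main features of this tokenizer are:

Splits off periods that appear at the end of a sentence.
Splits off commas and single quotes that are followed by whitespace.
Separates most punctuation characters into independent tokens.
Splits common contractions, for example "don't" into "do" and "n't".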
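Two of the rules listed above (sentence-final periods and "n't" contractions) can be imitated with a couple of regular expressions. This is a deliberately simplified sketch and not the actual Treebank implementation; the function name split_contractions is hypothetical.

```python
import re

def split_contractions(sentence):
    """Toy imitation of two Treebank-style rules; not the real tokenizer."""
    # Insert a space before an "n't" suffix so that it becomes its own token.
    spaced = re.sub(r"(\w)(n't)\b", r"\1 \2", sentence, flags=re.IGNORECASE)
    # Detach a period that ends the sentence.
    spaced = re.sub(r"\.\s*$", " .", spaced)
    return spaced.split()

print(split_contractions("The brown fox wasn't that quick."))
# ['The', 'brown', 'fox', 'was', "n't", 'that', 'quick', '.']
```

The real tokenizer applies many more rules, covering quotes, brackets, commas, and further contraction types.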

The following snippet shows the use of TreebankWordTokenizer for word tokenization:

In [13]: treebank_wt = nltk.TreebankWordTokenizer()
 
In [14]: words = treebank_wt.tokenize(sentence)
 
In [15]: print(words)
['The', 'brown', 'fox', 'was', "n't", 'that', 'quick', 'and', 'he', 'could', "n't", 'win', 'the', 'race']

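As expected, the output is identical to that of word_tokenize(), because both use the same tokenization mechanism.

Now let's see how to tokenize sentences with regular expressions using the RegexpTokenizer class. Keep in mind that two main parameters matter for word tokenization: pattern and gaps. The pattern parameter is the regular expression used to build the tokenizer. If gaps is set to True, the pattern is used to find the gaps between tokens; otherwise, it is used to find the tokens themselves.

The following snippets show some examples of word tokenization using regular expressions: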

In [21]: TOKEN_PATTERN = r'\w+'
 
In [22]: regex_wt = nltk.RegexpTokenizer(pattern=TOKEN_PATTERN,gaps=False)
 
In [23]: words = regex_wt.tokenize(sentence)
 
In [24]: print(words)
['The', 'brown', 'fox', 'wasn', 't', 'that', 'quick', 'and', 'he', 'couldn', 't', 'win', 'the', 'race']

In [25]: GAP_PATTERN = r'\s+'
 
In [26]: regex_wt = nltk.RegexpTokenizer(pattern=GAP_PATTERN,gaps=True)
 
In [27]: words = regex_wt.tokenize(sentence)
 
In [28]: print(words)
['The', 'brown', 'fox', "wasn't", 'that', 'quick', 'and', 'he', "couldn't", 'win', 'the', 'race']

In [29]: word_indices = list(regex_wt.span_tokenize(sentence))
 
In [30]: print(word_indices)
[(0, 3), (4, 9), (10, 13), (14, 20), (21, 25), (26, 31), (32, 35), (36, 38), (39, 47), (48, 51), (52, 55), (56, 60)]
 
In [31]: print([sentence[start:end] for start, end in word_indices])
['The', 'brown', 'fox', "wasn't", 'that', 'quick', 'and', 'he', "couldn't", 'win', 'the', 'race']
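The difference between the two gaps modes can also be seen with the standard library re module, without NLTK. As a rough analogy (an illustration, not a description of how RegexpTokenizer is implemented), matching tokens corresponds to re.findall and matching gaps corresponds to re.split, while token offsets can be obtained with re.finditer.

```python
import re

sentence = "The brown fox wasn't that quick and he couldn't win the race"

# Pattern describes the tokens themselves (compare gaps=False).
token_mode = re.findall(r'\w+', sentence)

# Pattern describes the gaps between tokens (compare gaps=True).
gap_mode = [piece for piece in re.split(r'\s+', sentence) if piece]

# Start and end offsets of each whitespace-delimited token.
spans = [match.span() for match in re.finditer(r'\S+', sentence)]

print(token_mode[3:5])   # ['wasn', 't']
print(gap_mode[3])       # wasn't
print(spans[:4])         # [(0, 3), (4, 9), (10, 13), (14, 20)]
```

Because the apostrophe is not a word character, the token pattern breaks "wasn't" into two pieces, whereas splitting on whitespace keeps it intact.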

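Besides the base RegexpTokenizer class, there are several derived classes that perform different types of word tokenization. WordPunctTokenizer uses the pattern r'\w+|[^\w\s]+' to split a sentence into separate alphabetic and non-alphabetic tokens. WhitespaceTokenizer splits a sentence into words based on whitespace characters such as tabs, newlines, and spaces.

The following code illustrates the use of these derived classes: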

In [32]: wordpunkt_wt = nltk.WordPunctTokenizer()
 
In [33]: words = wordpunkt_wt.tokenize(sentence)
 
In [34]: print(words)
['The', 'brown', 'fox', 'wasn', "'", 't', 'that', 'quick', 'and', 'he', 'couldn', "'", 't', 'win', 'the', 'race']

In [35]: whitespace_wt = nltk.WhitespaceTokenizer()
 
In [36]: words = whitespace_wt.tokenize(sentence)
 
In [37]: print(words)
['The', 'brown', 'fox', "wasn't", 'that', 'quick', 'and', 'he', "couldn't", 'win', 'the', 'race']
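The behavior of the two derived tokenizers is easier to compare on text that contains more punctuation. The sketch below approximates them with the standard library only: re.findall with the pattern r'\w+|[^\w\s]+' for the word/punctuation split, and str.split for the whitespace split. The demo string is made up for illustration.

```python
import re

demo = "Hello, world!! It's 5:30\tnow."

# Runs of word characters, or runs of characters that are neither
# word characters nor whitespace (punctuation).
word_punct_tokens = re.findall(r'\w+|[^\w\s]+', demo)

# Split on any run of whitespace (spaces, tabs, newlines).
whitespace_tokens = demo.split()

print(word_punct_tokens)
# ['Hello', ',', 'world', '!!', 'It', "'", 's', '5', ':', '30', 'now', '.']
print(whitespace_tokens)
# ['Hello,', 'world!!', "It's", '5:30', 'now.']
```

The first approach isolates every punctuation run, which suits vocabulary building; the second keeps punctuation attached to words, which preserves contractions and times such as 5:30.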