Natural Language Processing with Python (2): Accessing Text Corpora and Lexical Resources

I. Accessing Text Corpora


1. The Gutenberg Corpus

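  NLTK includes a small selection of texts from the Project Gutenberg electronic text archive. To use this corpus, load the nltk package in the Python interpreter and call nltk.corpus.gutenberg.fileids() to list the available files. For example: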

>>> import nltk
>>> nltk.corpus.gutenberg.fileids()
['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-kjv.txt',
 'blake-poems.txt', 'bryant-stories.txt', 'burgess-busterbrown.txt',
 'carroll-alice.txt', 'chesterton-ball.txt', 'chesterton-brown.txt',
 'chesterton-thursday.txt', 'edgeworth-parents.txt', 'melville-moby_dick.txt',
 'milton-paradise.txt', 'shakespeare-caesar.txt', 'shakespeare-hamlet.txt',
 'shakespeare-macbeth.txt', 'whitman-leaves.txt']



>>> emma = nltk.corpus.gutenberg.words('austen-emma.txt')
>>> len(emma)
192427


>>> emma = nltk.Text(nltk.corpus.gutenberg.words('austen-emma.txt'))
>>> emma.concordance("surprise")
Displaying 1 of 1 matches:
 that Emma could not but feel some surprise , and a little displeasure , on he
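
The idea behind concordance() can be illustrated without NLTK. The following is a minimal sketch, not NLTK's implementation: the helper name concordance and its token-based width parameter are assumptions made for illustration (NLTK measures the context window in characters), and the token list is a hand-split fragment of the line shown above.

```python
def concordance(tokens, word, width=3):
    """Return each occurrence of `word` with `width` tokens of context on each side."""
    target = word.lower()
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == target:
            left = tokens[max(0, i - width):i]      # tokens before the match
            right = tokens[i + 1:i + 1 + width]     # tokens after the match
            lines.append(" ".join(left + [tok] + right))
    return lines

# Hand-split fragment of the concordance line printed above.
tokens = "Emma could not but feel some surprise , and a little displeasure".split()
hits = concordance(tokens, "surprise")
for line in hits:
    print(line)
```

Matching is case-insensitive here, which is why the target and each token are lowercased before comparison.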


>>> from nltk.corpus import gutenberg
>>> for fileid in gutenberg.fileids():
...     raw = gutenberg.raw(fileid)          # whole file as a single string
...     num_chars = len(raw)
...     words = gutenberg.words(fileid)      # list of word tokens
...     num_words = len(words)
...     sents = gutenberg.sents(fileid)      # list of sentences, each a list of words
...     num_sents = len(sents)
...     vocab = set(w.lower() for w in words)
...     num_vocab = len(vocab)
...     print("%d %d %d %s" % (num_chars, num_words, num_sents, fileid))
...
887071 192427 7752 austen-emma.txt
466292 98171 3747 austen-persuasion.txt
673022 141576 4999 austen-sense.txt
4332554 1010654 30103 bible-kjv.txt
38153 8354 438 blake-poems.txt
249439 55563 2863 bryant-stories.txt
84663 18963 1054 burgess-busterbrown.txt
144395 34110 1703 carroll-alice.txt
457450 96996 4779 chesterton-ball.txt
406629 86063 3806 chesterton-brown.txt
320525 69213 3742 chesterton-thursday.txt
935158 210663 10230 edgeworth-parents.txt
1242990 260819 10059 melville-moby_dick.txt
468220 96825 1851 milton-paradise.txt
112310 25833 2163 shakespeare-caesar.txt
162881 37360 3106 shakespeare-hamlet.txt
100351 23140 1907 shakespeare-macbeth.txt
711215 154883 4250 whitman-leaves.txt
>>> raw[:1000]
"[Leaves of Grass by Walt Whitman 1855]

Come, said my soul,
Such verses for my Body let us write, (for we are one,)
That should I after return,
Or, long, long hence, in other spheres,
There to some group of mates the chants resuming,
(Tallying Earth's soil, trees, winds, tumultuous waves,)
Ever with pleas'd smile I may keep on,
Ever and ever yet the verses owning--as, first, I here and now
Signing for Soul and Body, set to them my name,

Walt Whitman


}  One's-Self I Sing

One's-self I sing, a simple separate person,
Yet utter the word Democratic, the word En-Masse.

Of physiology from top to toe I sing,
Not physiognomy alone nor brain alone is worthy for the Muse, I say
    the Form complete is worthier far,
The Female equally with the Male I sing.

Of Life immense in passion, pulse, and power,
Cheerful, for freest action form'd under the laws divine,
The Modern Man I sing.

}  As I Ponder'd in Silence

As I ponder'd in silence,
Returning upon my poems, c"
>>> words
['[', 'Leaves', 'of', 'Grass', 'by', 'Walt', 'Whitman', ...]
>>> sents
[['[', 'Leaves', 'of', 'Grass', 'by', 'Walt', 'Whitman', '1855', ']'],
 ['Come', ',', 'said', 'my', 'soul', ',', 'Such', 'verses', 'for', 'my', 'Body',
  'let', 'us', 'write', ',', '(', 'for', 'we', 'are', 'one', ',)', 'That', 'should',
  'I', 'after', 'return', ',', 'Or', ',', 'long', ',', 'long', 'hence', ',', 'in',
  'other', 'spheres', ',', 'There', 'to', 'some', 'group', 'of', 'mates', 'the',
  'chants', 'resuming', ',', '(', 'Tallying', 'Earth', "'", 's', 'soil', ',',
  'trees', ',', 'winds', ',', 'tumultuous', 'waves', ',)', 'Ever', 'with', 'pleas',
  "'", 'd', 'smile', 'I', 'may', 'keep', 'on', ',', 'Ever', 'and', 'ever', 'yet',
  'the', 'verses', 'owning', '--', 'as', ',', 'first', ',', 'I', 'here', 'and',
  'now', 'Signing', 'for', 'Soul', 'and', 'Body', ',', 'set', 'to', 'them', 'my',
  'name', ','], ...]

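raw() returns the contents of a file as a single string with no linguistic processing, so len(raw) counts characters (including whitespace); words() returns the word tokens; sents() returns the sentences. As the output shows, each sentence is stored as a list of words. Besides words(), raw(), and sents(), most NLTK corpus readers offer a variety of other access methods.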
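
The three counts become more informative once they are turned into ratios. The sketch below is an illustration rather than part of the original session: the helper name corpus_stats and the toy sentence are assumptions, while the austen-emma.txt figures are the ones printed by the loop above. Note that the average word length computed this way is inflated, because the character count includes whitespace.

```python
def corpus_stats(num_chars, num_words, num_sents, num_vocab=None):
    """Derive simple ratios from the raw counts of a text."""
    stats = {
        "avg_word_len": num_chars / num_words,   # characters per token (whitespace included)
        "avg_sent_len": num_words / num_sents,   # tokens per sentence
    }
    if num_vocab is not None:
        # Average number of times each distinct (lowercased) word is used.
        stats["lexical_diversity"] = num_words / num_vocab
    return stats

# Counts for austen-emma.txt taken from the output above.
emma_stats = corpus_stats(887071, 192427, 7752)

# Toy text (made up for this example), tokenized by hand.
raw = "The cat sat. The dog sat too."
sents = [["The", "cat", "sat", "."], ["The", "dog", "sat", "too", "."]]
words = [w for sent in sents for w in sent]
vocab = {w.lower() for w in words}
toy_stats = corpus_stats(len(raw), len(words), len(sents), len(vocab))

print(emma_stats)
print(toy_stats)
```

With a real corpus, the same function could be fed len(gutenberg.raw(fileid)), len(gutenberg.words(fileid)), len(gutenberg.sents(fileid)), and the num_vocab value that the loop already computes.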

2. Web and Chat Text

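Project Gutenberg contains thousands of books, but they are relatively formal and represent established literature. NLTK also includes a small collection of web text, with content from a Firefox discussion forum, conversations overheard in New York, the movie script of Pirates of the Caribbean, personal advertisements, and wine reviews. The following example accesses these texts: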

>>> from nltk.corpus import webtext
>>> for fileid in webtext.fileids():
...     # Print the file name and the first 65 characters of its raw text.
...     print("%s   %s ..." % (fileid, webtext.raw(fileid)[:65]))
...
firefox.txt   Cookie Manager: "Don't allow sites that set removed cookies to se ...
grail.txt   SCENE 1: [wind] [clop clop clop]
KING ARTHUR: Whoa there!  [clop ...
overheard.txt   White guy: So, do you have any plans for this evening?
Asian girl ...
pirates.txt   PIRATES OF THE CARRIBEAN: DEAD MAN'S CHEST, by Ted Elliott & Terr ...
singles.txt   25 SEXY MALE, seeks attrac older single lady, for discreet encoun ...
wine.txt   Lovely delicate, fragrant Rhone wine. Polished leather and strawb ...

3. The Instant Messaging Chat Session Corpus





4. The Brown Corpus

>>> from nltk.corpus import brown
>>> brown.categories()
['adventure', 'belles_lettres', 'editorial', 'fiction', 'government', 'hobbies',
 'humor', 'learned', 'lore', 'mystery', 'new', 'news', 'religion', 'reviews',
 'romance', 'science_fiction']
>>> brown.words(categories='news')
['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...]
>>> brown.words(fileids=['cg22'])
['Does', 'our', 'society', 'have', 'a', 'runaway', ',', ...]
>>> brown.sents(categories=['news', 'editorial', 'reviews'])
[['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an',
  'investigation', 'of', "Atlanta's", 'recent', 'primary', 'election', 'produced',
  '``', 'no', 'evidence', "''", 'that', 'any', 'irregularities', 'took', 'place', '.'],
 ['The', 'jury', 'further', 'said', 'in', 'term-end', 'presentments', 'that', 'the',
  'City', 'Executive', 'Committee', ',', 'which', 'had', 'over-all', 'charge', 'of',
  'the', 'election', ',', '``', 'deserves', 'the', 'praise', 'and', 'thanks', 'of',
  'the', 'City', 'of', 'Atlanta', "''", 'for', 'the', 'manner', 'in', 'which', 'the',
  'election', 'was', 'conducted', '.'], ...]


>>> import nltk
>>> from nltk.corpus import brown
>>> news_text = brown.words(categories='news')
>>> # Count every lowercased token in the news category.
>>> fdist = nltk.FreqDist(w.lower() for w in news_text)
>>> modals = ['can', 'could', 'may', 'might', 'must', 'will']
>>> for m in modals:
...     print("%s:%d" % (m, fdist[m]))
...
can:94
could:87
may:93
might:38
must:53
will:389
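
In NLTK 3, FreqDist is built on collections.Counter, so the counting step above can be reproduced with the standard library alone. The sketch below is only an illustration: the token list is made up and stands in for brown.words(categories='news'), so its counts have nothing to do with the Brown figures.

```python
from collections import Counter

# Hypothetical tokens standing in for brown.words(categories='news').
tokens = ["You", "can", "go", ",", "but", "you", "must", "return", ".",
          "Can", "it", "wait", "?", "It", "will", "."]

# Lowercase before counting so that "Can" and "can" are merged.
fdist = Counter(w.lower() for w in tokens)

modals = ["can", "could", "may", "might", "must", "will"]
counts = {m: fdist[m] for m in modals}   # missing words count as 0, no KeyError
for m in modals:
    print("%s:%d" % (m, fdist[m]))
```

Looking up a word that never occurs returns 0 instead of raising an error, which is what makes the loop over a fixed list of modals safe.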

5. The Reuters Corpus


>>> from nltk.corpus import reuters
>>> reuters.categories()
['acq', 'alum', 'barley', 'bop', 'carcass', 'castor-oil', 'cocoa', ...]
>>> reuters.categories('training/9865')
['barley', 'corn', 'grain', 'wheat']
>>> reuters.categories(['training/9865', 'training/9880'])
['barley', 'corn', 'grain', 'money-fx', 'wheat']
>>> reuters.fileids('barley')
['test/15618', 'test/15649', 'test/15676', 'test/15728', 'test/15871', 'test/15875', ...]
>>> reuters.fileids(['barley', 'corn'])
['test/14832', 'test/14858', 'test/15033', 'test/15043', 'test/15106', 'test/15287',
 'test/15341', 'test/15618', 'test/15648', 'test/15649', ...]
>>> reuters.words('training/9865')[:14]
['FRENCH', 'FREE', 'MARKET', 'CEREAL', 'EXPORT', 'BIDS', 'DETAILED', 'French',
 'operators', 'have', 'requested', 'licences', 'to', 'export']
>>> reuters.words(['training/9865', 'training/9880'])
['FRENCH', 'FREE', 'MARKET', 'CEREAL', 'EXPORT', ...]
>>> reuters.words(categories='barley')
['FRENCH', 'FREE', 'MARKET', 'CEREAL', 'EXPORT', ...]
>>> reuters.words(categories=['barley', 'corn'])
['THAI', 'TRADE', 'DEFICIT', 'WIDENS', 'IN', 'FIRST', ...]



>>> from nltk.corpus import inaugural
>>> inaugural.fileids()
['1789-Washington.txt', '1793-Washington.txt', '1797-Adams.txt', '1801-Jefferson.txt',
 '1805-Jefferson.txt', '1809-Madison.txt', '1813-Madison.txt', '1817-Monroe.txt',
 '1821-Monroe.txt', '1825-Adams.txt', '1829-Jackson.txt', '1833-Jackson.txt',
 '1837-VanBuren.txt', '1841-Harrison.txt', '1845-Polk.txt', '1849-Taylor.txt',
 '1853-Pierce.txt', '1857-Buchanan.txt', '1861-Lincoln.txt', '1865-Lincoln.txt',
 '1869-Grant.txt', '1873-Grant.txt', '1877-Hayes.txt', '1881-Garfield.txt',
 '1885-Cleveland.txt', '1889-Harrison.txt', '1893-Cleveland.txt', '1897-McKinley.txt',
 '1901-McKinley.txt', '1905-Roosevelt.txt', '1909-Taft.txt', '1913-Wilson.txt',
 '1917-Wilson.txt', '1921-Harding.txt', '1925-Coolidge.txt', '1929-Hoover.txt',
 '1933-Roosevelt.txt', '1937-Roosevelt.txt', '1941-Roosevelt.txt', '1945-Roosevelt.txt',
 '1949-Truman.txt', '1953-Eisenhower.txt', '1957-Eisenhower.txt', '1961-Kennedy.txt',
 '1965-Johnson.txt', '1969-Nixon.txt', '1973-Nixon.txt', '1977-Carter.txt',
 '1981-Reagan.txt', '1985-Reagan.txt', '1989-Bush.txt', '1993-Clinton.txt',
 '1997-Clinton.txt', '2001-Bush.txt', '2005-Bush.txt', '2009-Obama.txt']
>>> [fileid[:4] for fileid in inaugural.fileids()]
['1789', '1793', '1797', '1801', '1805', '1809', '1813', '1817', '1821', '1825',
 '1829', '1833', '1837', '1841', '1845', '1849', '1853', '1857', '1861', '1865',
 '1869', '1873', '1877', '1881', '1885', '1889', '1893', '1897', '1901', '1905',
 '1909', '1913', '1917', '1921', '1925', '1929', '1933', '1937', '1941', '1945',
 '1949', '1953', '1957', '1961', '1965', '1969', '1973', '1977', '1981', '1985',
 '1989', '1993', '1997', '2001', '2005', '2009']


>>> import nltk
>>> from nltk.corpus import inaugural
>>> # Condition: the target word; sample: the year taken from the file name.
>>> cfd = nltk.ConditionalFreqDist(
...     (target, fileid[:4])
...     for fileid in inaugural.fileids()
...     for w in inaugural.words(fileid)
...     for target in ['america', 'citizen']
...     if w.lower().startswith(target))
>>> cfd.plot()
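
A conditional frequency distribution is essentially one counter per condition. The following standard-library sketch mirrors the generator expression above to show what ends up inside cfd. It is not NLTK's implementation, and the two miniature "speeches" with their file names are invented for the example.

```python
from collections import Counter, defaultdict

# Hypothetical miniature corpus: file name -> list of tokens.
speeches = {
    "1789-Sample.txt": ["Citizens", "of", "America", ",", "fellow", "citizens"],
    "1793-Sample.txt": ["American", "liberty", "and", "the", "citizen"],
}
targets = ["america", "citizen"]

# One Counter per condition (the target word); samples are four-digit years.
cfd = defaultdict(Counter)
for fileid, words in speeches.items():
    for w in words:
        for target in targets:
            if w.lower().startswith(target):
                cfd[target][fileid[:4]] += 1

for target in targets:
    print(target, sorted(cfd[target].items()))
```

Because the test is startswith(), inflected forms such as "American" and "Citizens" are counted under their target word, exactly as in the NLTK version.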






>>> import nltk
>>> from nltk.corpus import udhr
>>> languages = ['Chickasaw', 'English', 'German_Deutsch', 'Greenlandic_Inuktikut',
...              'Hungarian_Magyar', 'Ibibio_Efik']
>>> # Condition: the language; sample: the length of each word.
>>> cfd = nltk.ConditionalFreqDist(
...     (lang, len(word))
...     for lang in languages
...     for word in udhr.words(lang + '-Latin1'))
>>> cfd.plot(cumulative=True)
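
With cumulative=True, each point on the curve is the number of words whose length is less than or equal to the value on the x-axis. A minimal sketch of that calculation for a single language follows. The function name cumulative_length_counts and the sample word list are assumptions for illustration, and the code does not reproduce NLTK's plotting.

```python
from collections import Counter
from itertools import accumulate

def cumulative_length_counts(words):
    """Return [(length, number of words with len <= length), ...] sorted by length."""
    counts = Counter(len(w) for w in words)
    lengths = sorted(counts)
    running_totals = accumulate(counts[n] for n in lengths)
    return list(zip(lengths, running_totals))

# Made-up word list standing in for udhr.words('English-Latin1').
sample = ["all", "human", "beings", "are", "born", "free"]
curve = cumulative_length_counts(sample)
print(curve)
```

The last running total always equals the number of words, so curves for texts of different sizes level off at different heights.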


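Example                       Description
fileids()                     the files of the corpus
fileids([categories])         the files of the corpus corresponding to these categories
categories()                  the categories of the corpus
categories([fileids])         the categories of the corpus corresponding to these files
raw()                         the raw content of the corpus
raw(fileids=[f1,f2,f3])       the raw content of the specified files
raw(categories=[c1,c2])       the raw content of the specified categories
words()                       the words of the whole corpus
words(fileids=[f1,f2,f3])     the words of the specified files
words(categories=[c1,c2])     the words of the specified categories
sents()                       the sentences of the whole corpus
sents(fileids=[f1,f2,f3])     the sentences of the specified files
sents(categories=[c1,c2])     the sentences of the specified categories
abspath(fileid)               the location of the given file on disk
encoding(fileid)              the encoding of the file (if known)
open(fileid)                  open a stream for reading the given corpus file
root()                        the path to the root of the locally installed corpus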




>>> from nltk.corpus import PlaintextCorpusReader
>>> # Local directory holding the texts; the original NLTK data directory is on D:
>>> corpus_root = r"E:corpora"
>>> wordlists = PlaintextCorpusReader(corpus_root, '.*')
>>> # List the files; aaaaaaaaaaa.txt in the output is a custom file.
>>> wordlists.fileids()
['README', 'aaaaaaaaaaa.txt', 'austen-emma.txt', 'austen-persuasion.txt',
 'austen-sense.txt', 'bible-kjv.txt', 'blake-poems.txt', 'bryant-stories.txt',
 'burgess-busterbrown.txt', 'carroll-alice.txt', 'chesterton-ball.txt',
 'chesterton-brown.txt', 'chesterton-thursday.txt', 'edgeworth-parents.txt',
 'luo.txt', 'melville-moby_dick.txt', 'milton-paradise.txt',
 'shakespeare-caesar.txt', 'shakespeare-hamlet.txt', 'shakespeare-macbeth.txt',
 'whitman-leaves.txt']
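
To make it clearer what a plain-text corpus reader does with a root directory and a file-name pattern, here is a rough standard-library approximation. It is a sketch under stated assumptions, not how PlaintextCorpusReader is implemented: the class name TinyPlaintextCorpus, the sample file, and the simple regular-expression tokenizer are all made up for this example, and a temporary directory stands in for the corpus root.

```python
import os
import re
import tempfile

class TinyPlaintextCorpus:
    """Toy stand-in for a plain-text corpus reader (illustration only)."""

    def __init__(self, root, fileid_pattern):
        self.root = root
        self._pattern = re.compile(fileid_pattern)

    def fileids(self):
        # Files directly under the root whose whole name matches the pattern.
        return sorted(
            name for name in os.listdir(self.root)
            if os.path.isfile(os.path.join(self.root, name))
            and self._pattern.fullmatch(name)
        )

    def raw(self, fileid):
        with open(os.path.join(self.root, fileid), encoding="utf-8") as f:
            return f.read()

    def words(self, fileid):
        # Runs of word characters, or runs of punctuation, become tokens.
        return re.findall(r"\w+|[^\w\s]+", self.raw(fileid))

# A temporary directory plays the role of corpus_root.
with tempfile.TemporaryDirectory() as corpus_root:
    with open(os.path.join(corpus_root, "sample.txt"), "w", encoding="utf-8") as f:
        f.write("Hello, corpus world!")
    corpus = TinyPlaintextCorpus(corpus_root, r".*\.txt")
    found = corpus.fileids()
    tokens = corpus.words("sample.txt")

print(found)
print(tokens)
```

Passing '.*' as the pattern, as in the session above, matches every file in the directory, which is why README shows up next to the .txt files.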

