Python NLTK Study Notes (2)

| Example | Description |
| --- | --- |
| `fileids()` | the files of the corpus |
| `fileids([categories])` | the files of the corpus corresponding to these categories |
| `categories()` | the categories of the corpus |
| `categories([fileids])` | the categories of the corpus corresponding to these files |
| `raw()` | the raw content of the corpus |
| `raw(fileids=[f1,f2,f3])` | the raw content of the specified files |
| `raw(categories=[c1,c2])` | the raw content of the specified categories |
| `words()` | the words of the whole corpus |
| `words(fileids=[f1,f2,f3])` | the words of the specified fileids |
| `words(categories=[c1,c2])` | the words of the specified categories |
| `sents()` | the sentences of the whole corpus |
| `sents(fileids=[f1,f2,f3])` | the sentences of the specified fileids |
| `sents(categories=[c1,c2])` | the sentences of the specified categories |
| `abspath(fileid)` | the location of the given file on disk |
| `encoding(fileid)` | the encoding of the file (if known) |
| `open(fileid)` | open a stream for reading the given corpus file |
| `root()` | the path to the root of locally installed corpus |
| `readme()` | the contents of the README file of the corpus |

Load your own corpus
>>> from nltk.corpus import PlaintextCorpusReader
>>> corpus_root = '/usr/share/dict' 
>>> wordlists = PlaintextCorpusReader(corpus_root, '.*') 
>>> wordlists.fileids()
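
As a minimal sketch that does not depend on `/usr/share/dict` being present, the example below builds a throwaway corpus in a temporary directory and queries it with a few of the reader methods listed earlier. The file names and contents are made up for illustration. It assumes that `words()` uses the reader's default regex-based word tokenizer, which should not require any downloaded NLTK data.

```python
import os
import tempfile

from nltk.corpus import PlaintextCorpusReader

# Build a tiny throwaway corpus (file names and text are illustrative only).
corpus_root = tempfile.mkdtemp()
samples = {
    "first.txt": "Hello world. This is a test.",
    "second.txt": "Another small file.",
}
for name, text in samples.items():
    with open(os.path.join(corpus_root, name), "w", encoding="utf8") as f:
        f.write(text)

# Match only the .txt files under corpus_root.
reader = PlaintextCorpusReader(corpus_root, r".*\.txt")

file_ids = reader.fileids()                       # file names relative to corpus_root
first_raw = reader.raw("first.txt")               # raw text of a single file
second_words = list(reader.words("second.txt"))   # tokens of a single file
all_words = list(reader.words())                  # tokens of the whole corpus

print(file_ids)
print(second_words)
print(len(all_words))
```

Restricting the pattern to `.*\.txt` rather than `.*` keeps unrelated files in the directory out of the corpus.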

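import nltk

def unusual_words(text):
    # Lowercased alphabetic tokens that occur in the text
    text_vocab = set(w.lower() for w in text if w.isalpha())
    # Lowercased entries of the NLTK English word list
    english_vocab = set(w.lower() for w in nltk.corpus.words.words())
    unusual = text_vocab.difference(english_vocab)
    return sorted(unusual)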
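
To try the set-difference logic of `unusual_words` without downloading the NLTK word list, here is a sketch of a variant that takes the reference vocabulary as an argument. The name `unusual_words_in` and the tiny vocabulary are hypothetical stand-ins; in practice the vocabulary would come from `nltk.corpus.words.words()`.

```python
def unusual_words_in(text, vocabulary):
    """Return the alphabetic words of `text` that are missing from `vocabulary`."""
    text_vocab = {w.lower() for w in text if w.isalpha()}
    known = {w.lower() for w in vocabulary}
    return sorted(text_vocab - known)


# Hypothetical stand-in for nltk.corpus.words.words().
tiny_vocab = ["the", "cat", "sat", "on", "mat"]
sample_tokens = ["The", "cat", "sat", "on", "teh", "matt", ".", "2"]

unusual = unusual_words_in(sample_tokens, tiny_vocab)
print(unusual)  # ['matt', 'teh']
```

Punctuation and digits drop out through `isalpha()`, so only the misspelled words remain.
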
Set:

| Operation | Equivalent | Result |
| --- | --- | --- |
| `len(s)` | | cardinality of set s |
| `x in s` | | test x for membership in s |
| `x not in s` | | test x for non-membership in s |
| `s.issubset(t)` | `s <= t` | test whether every element in s is in t |
| `s.issuperset(t)` | `s >= t` | test whether every element in t is in s |
| `s.union(t)` | `s \| t` | new set with elements from both s and t |
| `s.intersection(t)` | `s & t` | new set with elements common to s and t |
| `s.difference(t)` | `s - t` | new set with elements in s but not in t |
| `s.symmetric_difference(t)` | `s ^ t` | new set with elements in either s or t but not both |
| `s.copy()` | | new set with a shallow copy of s |
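
A short sketch of these operations, using two made-up sets of words:

```python
s = {"apple", "banana", "cherry"}
t = {"banana", "cherry", "date"}

size = len(s)                     # cardinality of s
has_apple = "apple" in s          # membership test
union = s | t                     # same as s.union(t)
common = s & t                    # same as s.intersection(t)
only_in_s = s - t                 # same as s.difference(t)
either_not_both = s ^ t           # same as s.symmetric_difference(t)
common_is_subset = common <= s    # same as common.issubset(s)
union_is_superset = union >= t    # same as union.issuperset(t)

s_copy = s.copy()
s_copy.add("elderberry")          # modifies the copy only, s is unchanged

print(sorted(union))
print(sorted(common))
print(sorted(only_in_s))
print(sorted(either_not_both))
```

The operator forms require both operands to be sets, while the method forms such as `s.union(t)` also accept any iterable.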

>>> from nltk.corpus import stopwords

>>> stopwords.words('english')
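
One common use of a stopword list is measuring how much of a text consists of content words. The sketch below assumes a small hand-written stopword list so that it runs without the NLTK data; passing `stopwords.words('english')` instead should work the same way. The function name `content_fraction` and the sample tokens are chosen here for illustration.

```python
def content_fraction(tokens, stopword_list):
    """Return the fraction of tokens that are not stopwords (case-insensitive)."""
    if not tokens:
        return 0.0
    stops = set(stopword_list)  # set lookup avoids scanning the list per token
    content = [w for w in tokens if w.lower() not in stops]
    return len(content) / len(tokens)


# Hypothetical stand-in for stopwords.words('english').
sample_stops = ["the", "a", "of", "is", "on"]
sample_text = ["The", "cat", "is", "on", "the", "mat"]

fraction = content_fraction(sample_text, sample_stops)
print(round(fraction, 2))
```
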

WordNet:

>>> from nltk.corpus import wordnet as wn

>>> wn.synsets('motorcar')