Python 自然语言处理(1) 计数词汇

Python有一个自然语言处理的工具包，叫做NLTK（Natural Language ToolKit），可以帮助你实现自然语言挖掘，语言建模等等工作。但是没有NLTK，也一样可以实现简单的词类统计。

假如有一段文字：

a = 'Return a list of the words in the string S, using sep as the delimiter string. If maxsplit is given, at most maxsplit splits are done. If sep is not specified or is None, any whitespace string is a separator and empty strings are removed from the result.'

单词个数查询：我想查这段文字有多少个单词，那么可以用下面这段代码：

def words(text):
    return text.split()

--> words(a)

['Return', 'a', 'list', 'of', 'the', 'words', 'in', 'the', 'string', 'S,', 'using', 'sep', 'as', 'the', 'delimiter', 'string.', 'If', 'maxsplit', 'is', 'given,', 'at', 'most', 'maxsplit', 'splits', 'are', 'done.', 'If', 'sep', 'is', 'not', 'specified', 'or', 'is', 'None,', 'any', 'whitespace', 'string', 'is', 'a', 'separator', 'and', 'empty', 'strings', 'are', 'removed', 'from', 'the', 'result.']

这样我就知道这段话有多少个词。

单词数量查询：然后我又想知道这段话中用来多少个词，相当于对这段话中的词汇做一个dicstinct，可以这么做：

-->print set(words(a))

set(['and', 'sep', 'is', 'in', 'as', 'at', 'S,', 'done.', 'any', 'given,', 'string.', 'Return', 'whitespace', 'specified', 'empty', 'from', 'string', 'result.', 'most', 'words', 'not', 'using', 'removed', 'a', 'None,', 'splits', 'of', 'maxsplit', 'list', 'strings', 'delimiter', 'separator', 'the', 'If', 'or', 'are'])

个别单词数量查询：那如果我想知道这段话中包含多少个'string'呢。

-->c= a.count('string')
-->print c

4

个别单词数所占百分比：想要知道某个单词在单词总数中占到的百分比，那就像下面以下样：

-->from __future__ import division #引入浮点型除法
-->d = a.count('string') / len(words(a))*100
-->print d

8.33333333333