Common NLTK functions

The required data packages can be downloaded here: http://www.nltk.org/nltk_data/

1. WordNetLemmatizer: lemmatization

Finding the base form (lemma) of a word:

import nltk
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()  # determine the lemma (dictionary form)
print(lemmatizer.lemmatize('gathering', 'v'))  # treat as a verb
print(lemmatizer.lemmatize('gathering', 'n'))  # treat as a noun

Output:

gather
gathering

2. word_tokenize: tokenization

https://kite.com/python/docs/nltk.word_tokenize

Tokenization:

import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize

sentence = "At eight o'clock on Thursday morning, Arthur didn't feel very good."
print(word_tokenize(sentence))

# Output:
['At', 'eight', "o'clock", 'on', 'Thursday', 'morning', ',', 'Arthur', 'did', "n't", 'feel', 'very', 'good', '.']

Update 2020-09-18 ————————

1. Basic functionality

https://www.jianshu.com/p/9d232e4a3c28

  • Sentence segmentation
  • Word tokenization
  • Part-of-speech tagging
  • Named entity recognition

Sentence segmentation and tokenization:

import nltk
# Sentence segmentation
sents = nltk.sent_tokenize("And now for something completely different. I love you.")  # simple sentences like these are handled well
word = []
for s in sents:
    print(s)
# Tokenize within each sentence
for sent in sents:
    word.append(nltk.word_tokenize(sent))
print(word)

POS tagging:

nltk.download('averaged_perceptron_tagger')
s = "And now for something completely different."
text = nltk.word_tokenize(s)
print(text)
# POS tagging: pos_tag expects a token list; given a raw string,
# it would treat every single character as a token
tagged = nltk.pos_tag(text)

tagged
[('And', 'CC'),
 ('now', 'RB'),
 ('for', 'IN'),
 ('something', 'NN'),
 ('completely', 'RB'),
 ('different', 'JJ'),
 ('.', '.')]

Chunking:

nltk.download('maxent_ne_chunker')
nltk.download('words')
# ne_chunk requires POS-tagged input; it groups the tagged tokens into chunks
entities = nltk.chunk.ne_chunk(tagged)
print(entities)

Output:

Tree('S', [('And', 'CC'), ('now', 'RB'), ('for', 'IN'), ('something', 'NN'), ('completely', 'RB'), ('different', 'JJ'), ('.', '.')])

2. FreqDist: word frequency distributions over a text

import nltk
from nltk import FreqDist  # the class lives in nltk.probability; nltk.book merely re-exports it

text1 = nltk.word_tokenize("And now for something completely different. I love you. This is my friend. You are my friend.")

# FreqDist() gives the frequency distribution of each token in the text
fdist = FreqDist(text1)
print(fdist)
# total number of tokens (outcomes)
print(fdist.N())
# number of distinct tokens (samples)
print(fdist.B())

>>><FreqDist with 16 samples and 21 outcomes>
21
16
# relative frequency of a token, as a percentage
print(fdist.freq('friend') * 100)
# raw count of a token
print(fdist['friend'])
# the most frequent token
fdist.max()

>>>9.523809523809524
2
'.'

The original post goes on with code for stemming an entire document; I read through it, but since I don't need it at the moment I'm not including it here.

3. The Text and TextCollection classes

The former analyzes a single text; the latter is a collection of Text objects and can compute corpus-level statistics such as a term's inverse document frequency (IDF).

Original post: https://www.cnblogs.com/BlueBlueSea/p/13154590.html