综合练习：词频统计

词频统计预处理

下载一首英文的歌词或文章

news = '''
She hangs out every day near by the beach  .  Havin’ a HEINEKEN fallin’ asleep She looks so sexy when she’s walking the sand    Nobody ever put a ring on her hand    She is the story the story is she   She sings to the moon and the stars in the sky   Shining from high above you shouldn’t ask why  She is the one that you never forget  She is the heaven-sent angel you met   Oh, she must be the reason why God made a girl She is so pretty all over the world  She puts the rhythm, the beat in the drum  She comes in the morning and the evening she’s gone  Every little hour every second you live   Trust in eternity that’s what she gives   She looks like Marilyn, walks like Suzanne Suzanne  She talks like Monica and Marianne  Monica Marianne  She wins in everything that she might do   And she will respect you forever just you  She is the one that you never forget She is the heaven-sent angel you met    Oh, she must be the reason why God made a girl  She is so pretty all over the world  She is so pretty all over the world  She is so pretty  She is like you and me    Like them like we She is in you and me   She is the one that you never forget   She is the heaven-sent angel you met   Oh, she must be the reason why God made a girl   She is so pretty all over the world  (She is the one) She is the one   (That you never forget) That you never forget  She is the heaven-sent angel you met  She’s the reason (oh she must be the reason) why God made a girl    She is so pretty all over the world (oh...)   Na na na na na ….
'''

将所有,.？！’:等分隔符全部替换为空格

sep = '''.,?!'''
for c in sep:
	news = news.replace(c,'')

将所有大写转换为小写

生成单词列表　　

wordList = news.lower().split()
for w in wordList:
	print(w)

生成词频统计

wordDict = {}
wordSet = set(wordList)
for w in wordSet:
	wordDict[w] = wordList.count(w)

排序

dictList = list(wordDict.items())
dictList.sort(key = lambda x: x[1], reverse=True)

排除语法型词汇，代词、冠词、连词

exclude ={'she','the','you','a','and','why','that','one','she’s','me','by','when'}
wordSet = set(wordList)-exclude
for w in wordSet:
	wordDict[w] = wordList.count(w)

输出词频最大TOP20

for i in range(20):
	print(dictList[i])

将分析对象存为utf-8编码的文件，通过文件读取的方式获得词频分析内容。

f = open('news.txt','r',encoding='utf-8')
news = f.read()
f.close()
print(news)

f = open('newscount.txt','a')
for i in range(20):
    f.write(dictList[i][0]+''+str(dictList[i][1])+'
')
f.close()

2.中文词频统计

下载一长篇中文文章。

从文件读取待分析文本。

news = open('gzccnews.txt','r',encoding = 'utf-8')

安装与使用jieba进行中文分词。

pip install jieba

import jieba

list(jieba.lcut(news))

生成词频统计

排序

排除语法型词汇，代词、冠词、连词

输出词频最大TOP20（或把结果存放到文件里）

import jieba
f = open('princekin.txt','r',encoding='utf-8')
princekin = f.read()
f.close()
str1 = '''“”，。？：、（）’！
   …'''
exclude = {'那','你','的','我','一','它','他','这','人','了','啊','呢','唉'}
jieba.add_word('小王子')
for i in str1:
    princekin = princekin.replace(i,'')
result = list(jieba.cut(princekin))
wordDict = {}
words = list(set(result)-exclude)
for i in words:
    wordDict[i]= result.count(i)
wordList = list(wordDict.items())
wordList.sort(key = lambda x: x[1], reverse=True)
print(wordList)
f = open('saveall.txt','a',encoding='utf-8')
for i in range(20):
    f.write(wordList[i][0] + '' + str(wordList[i][1]) + '
')
f.close()