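Comprehensive exercise: word frequency counting

Download the lyrics of an English song or an English article

Generate the word frequency counts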

news='''At the same time, the market of TV dramas has also maintained rapid development. In 2017, the production volume of TV dramas in China reaches 310 and 13,000 sets, and continues to be the no.1 in the world. The "national treasure", "national treasure", "if the national treasure can talk" and other TV variety shows, documentaries, vividly spread the excellent Chinese traditional culture.
With modern technology, traditional culture is rejuvenated. Hangzhou songcheng group with new technology to interpret ancient Chinese traditional story, the Qingdao publishing group is using virtual reality, 3 d printing technology, the audience can feel the charm of traditional culture anytime and anywhere.
In recent years, China's cultural industry has been growing rapidly, and the pace of "going out" has been accelerating. As of last year, China's publishing enterprises set up more than 400 branches overseas and established cooperative partnership with over 500 publishing institutions in over 70 countries. People's day boat publishing co., LTD. Was set up in less than two years, has published "Chinese traditional festival" (in Arabic), "in a pocket of father" (French version) and so on more than 40 foreign language books.
 '''
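# Punctuation to treat as word separators (parentheses added so that
# tokens such as "(in" and "arabic)" are not counted as distinct words)
sep = ',.;:\'"()'
for c in sep:
    news = news.replace(c, ' ')

# Lowercase the text and split it on whitespace
wordlist = news.lower().split()

# Count every word in a single pass over the list
wordDict = {}
for w in wordlist:
    wordDict[w] = wordDict.get(w, 0) + 1

# Alternative (slower: rescans the whole list for every distinct word):
# wordSet = set(wordlist)
# for w in wordSet:
#     wordDict[w] = wordlist.count(w)

for w in wordDict:
    print(w, wordDict[w])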
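
The counting step above can also be sketched with the standard library's collections.Counter. This is only an illustration under a few assumptions: the helper name count_words and the sample sentence are made up for the example, and every ASCII punctuation character (string.punctuation) is treated as a separator, which is broader than the sep string used above.

```python
import string
from collections import Counter


def count_words(text):
    """Return a Counter mapping each lowercased word to its number of occurrences."""
    # Map every ASCII punctuation character to a space (assumption: this is
    # a broader separator set than the one used in the original post)
    table = str.maketrans(string.punctuation, ' ' * len(string.punctuation))
    return Counter(text.translate(table).lower().split())


# Hypothetical sample text, used only for this illustration
sample = 'The cat saw the dog. The dog ran!'
counts = count_words(sample)
for word, n in counts.items():
    print(word, n)
```

Counter is a dict subclass, so the result can be used wherever wordDict is used.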

  

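Sorting

# Recount per distinct word (gives the same result as the loop above)
wordSet = set(wordlist)
for w in wordSet:
    wordDict[w] = wordlist.count(w)
# Turn the dict into a list of (word, count) pairs, most frequent first
dictList = list(wordDict.items())
dictList.sort(key=lambda x: x[1], reverse=True)

print(dictList)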
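
Sorting by count alone leaves words with equal counts in whatever order the dict produced them. One possible way to get a reproducible ranking is to sort by descending count and then alphabetically. The sketch below assumes a small hand-made dict (word_counts and its values are invented for the example, not taken from the article).

```python
# Hypothetical counts, used only to illustrate the sort key
word_counts = {'culture': 3, 'china': 4, 'traditional': 4, 'tv': 3}

# Negating the count sorts it in descending order; the word itself breaks ties
ranked = sorted(word_counts.items(), key=lambda kv: (-kv[1], kv[0]))
print(ranked)
```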

  

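Exclude function words: pronouns, articles, conjunctions

exclude = {'the', 'a', 'an', 'and', 'of', 'with', 'to', 'by', 'am', 'are', 'is', 'which', 'on'}
# Start from an empty dict; otherwise the excluded words counted in the
# earlier steps would still be present in the result
wordDict = {}
wordSet = set(wordlist) - exclude
for w in wordSet:
    wordDict[w] = wordlist.count(w)
dictList = list(wordDict.items())
dictList.sort(key=lambda x: x[1], reverse=True)

print(dictList)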
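
The exclusion step can also be written as a filter applied while counting, so that excluded words never enter the result. A minimal sketch, assuming a helper named count_content_words and a short made-up word list (both are illustrative, not part of the original exercise):

```python
from collections import Counter

exclude = {'the', 'a', 'an', 'and', 'of', 'with', 'to', 'by',
           'am', 'are', 'is', 'which', 'on'}


def count_content_words(words, stop_words):
    """Count only the words that are not in stop_words."""
    return Counter(w for w in words if w not in stop_words)


# Hypothetical word list, used only for this illustration
words = 'the charm of traditional culture and the pace of the market'.split()
counts = count_content_words(words, exclude)
print(counts.most_common())
```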

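Print the 20 most frequent words, then save the text to be analyzed as a UTF-8 encoded file and obtain the content for the word frequency analysis by reading that file.

# Slicing avoids an IndexError when there are fewer than 20 distinct words
for item in dictList[:20]:
    print(item)


print('author:xujinpei')
# Read the text back from the UTF-8 encoded file
with open('news.txt', 'r', encoding='utf-8') as f:
    news = f.read()
print(news)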
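
Putting the file step and the TOP 20 step together might look like the sketch below. It is an illustration under stated assumptions: the function name top_words_from_file is made up, and the demo writes its own short sample into a temporary directory instead of using the real news.txt.

```python
import os
import tempfile
from collections import Counter


def top_words_from_file(path, n=20):
    """Read a UTF-8 text file and return its n most frequent words."""
    with open(path, encoding='utf-8') as f:
        text = f.read()
    # Same idea as above: turn punctuation into spaces before splitting
    for c in ',.;:\'"':
        text = text.replace(c, ' ')
    return Counter(text.lower().split()).most_common(n)


# Demo with a hypothetical sample file written to a temporary directory
path = os.path.join(tempfile.mkdtemp(), 'news.txt')
with open(path, 'w', encoding='utf-8') as f:
    f.write('Traditional culture, modern technology; traditional culture.')

top = top_words_from_file(path)
for word, n in top:
    print(word, n)
```

Passing encoding='utf-8' explicitly on both the write and the read keeps the result independent of the platform's default encoding.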

Chinese word frequency

import jieba

t = '在国内电影票房连创新高的同时,电视剧市场同样保持快速发展,2017年,我国电视剧生产量达310部、1.3万集,继续稳居世界第一。《中国诗词大会》《国家宝藏》《如果国宝会说话》等电视综艺节目、纪录片,生动传播了中华优秀传统文化。'
# jieba.cut returns a generator of tokens; turn it into a list to print it
words = list(jieba.cut(t))
print(words)
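
Once the text is segmented, the tokens can be counted the same way as the English words. The sketch below deliberately does not call jieba, so that it runs without the library: the token list is a hand-written stand-in and not actual jieba output, and the helper count_tokens with its min_len filter (dropping punctuation and single-character tokens) is an assumption made for the example.

```python
from collections import Counter


def count_tokens(tokens, min_len=2):
    """Count tokens, skipping punctuation and tokens shorter than min_len."""
    kept = [t for t in tokens
            if len(t) >= min_len and all(ch.isalnum() for ch in t)]
    return Counter(kept)


# Hand-written stand-in for a segmenter's output (not real jieba output)
tokens = ['电视剧', '市场', '同样', '保持', '快速', '发展', ',',
          '我国', '电视剧', '生产量', '达', '310', '部', '。']
counts = count_tokens(tokens)
print(counts.most_common(3))
```

In practice the list passed to count_tokens would come from the segmentation step shown above.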

  

Original article: https://www.cnblogs.com/zhongchengzhe/p/8658569.html