Simple NLP

Input text

I have a dream that my four little children will one day live in a nation where they will not be judged by the color of their skin but by the content of their character. I have a dream today.

I have a dream that one day down in Alabama, with its vicious racists, . . . one day right there in Alabama little black boys and black girls will be able to join hands with little white boys and white girls as sisters and brothers. I have a dream today.

I have a dream that one day every valley shall be exalted, every hill and mountain shall be made low, the rough places will be made plain, and the crooked places will be made straight, and the glory of the Lord shall be revealed, and all flesh shall see it together.

This is our hope. . . With this faith we will be able to hew out of the mountain of despair a stone of hope. With this faith we will be able to transform the jangling discords of our nation into a beautiful symphony of brotherhood. With this faith we will be able to work together, to pray together, to struggle together, to go to jail together, to stand up for freedom together, knowing that we will be free one day. . . .

And when this happens, and when we allow freedom ring, when we let it ring from every village and every hamlet, from every state and every city, we will be able to speed up that day when all of God's children, black men and white men, Jews and Gentiles, Protestants and Catholics, will be able to join hands and sing in the words of the old Negro spiritual: "Free at last! Free at last! Thank God Almighty, we are free at last!"

Requirements

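  1. Read the file
  2. Remove all punctuation and newline characters, and convert all uppercase letters to lowercase
  3. Merge identical words, count how often each word occurs, and sort by frequency in descending order
  4. Write the results to the file out.txt, one word per line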
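
To make the expected result concrete, here is a minimal sketch of steps 2 to 4 applied to a short excerpt of the text rather than the whole file. The `sample` string and the use of `collections.Counter` are assumptions made for illustration and are not part of the original post.

```python
import re
from collections import Counter

# Assumed sample: the last few sentences of the input text
sample = "Free at last! Free at last!\nThank God Almighty, we are free at last!"

# Step 2: replace everything that is not a word character or a space, then lowercase
cleaned = re.sub(r'[^\w ]', ' ', sample).lower()

# Step 3: count the words and sort by frequency, highest first
ranked = Counter(cleaned.split()).most_common()

# Step 4: one "word count" pair per line
lines = [f'{word} {freq}' for word, freq in ranked]
print('\n'.join(lines))  # first line printed: free 3
```

Words with the same count keep the order in which they first appear, so `free`, `at` and `last` (3 each) come before the words that occur once.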

Implementation

import re


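def parse(text):
    # Replace punctuation and newlines with spaces using a regular expression
    text = re.sub(r'[^\w ]', ' ', text)

    # Convert to lowercase
    text = text.lower()

    # Split the text into a list of words
    word_list = text.split(' ')

    # Drop the empty strings left behind by consecutive spaces
    word_list = filter(None, word_list)

    # Build a dictionary mapping each word to its count
    word_cnt = {}
    for word in word_list:
        if word not in word_cnt:
            word_cnt[word] = 0
        word_cnt[word] += 1

    # Sort by frequency, highest first
    sorted_word_cnt = sorted(word_cnt.items(), key=lambda kv: kv[1], reverse=True)

    return sorted_word_cnt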


with open('in.txt', 'r', encoding='utf-8') as fin:
    text = fin.read()

word_and_freq = parse(text)

with open('out.txt', 'w', encoding='utf-8') as fout:
    for word, freq in word_and_freq:
        fout.write(f'{word} {freq}\n')
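
One side effect of replacing every punctuation mark with a space is that contractions and possessives are split: "God's" is counted as `god` and `s`. If that is not wanted, one possible variation is to extract tokens directly instead of stripping characters. The `tokenize` helper below is hypothetical and not part of the original post; note that, unlike `\w`, its pattern also drops digits and underscores.

```python
import re


def tokenize(text):
    # Hypothetical helper: lowercase the text and extract runs of letters,
    # allowing apostrophes inside a word (e.g. "god's")
    return re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())


sample = "all of God's children"

# Approach used in the post: strip punctuation, then split
stripped = re.sub(r'[^\w ]', ' ', sample).lower().split()

# Variation: keep the apostrophe inside the word
kept = tokenize(sample)

print(stripped)  # ['all', 'of', 'god', 's', 'children']
print(kept)      # ['all', 'of', "god's", 'children']
```

The list returned by `tokenize` could be counted with the same dictionary loop used in `parse`.
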
Original article: https://www.cnblogs.com/wuyongqiang/p/10935816.html