A small program for reading data, counting word frequencies, and building a dictionary

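This session was probably the most rewarding of the past few days: packed with practical material, and a real boost to my confidence. I hope I can keep my head down, keep practicing, and stay on the road toward being good enough to enter competitions! (Thanks to my big Boss for the hands-on teaching. I'll stop there, in case people team up to steal my big Boss.) Over the past few days some readers have messaged me to say they had lost confidence and wanted to follow this blog and start learning too. It's true that having a good guide is a stroke of luck, and of course not everyone is as lucky as I am (ha, I probably just made a few enemies). So in the end, not giving up is what counts. Keep going, keep running!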

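In the code below, the three steps (reading the data, counting word frequencies, building the dictionary) are separated by blank lines, and every step is commented in detail. One thing worth pointing out: if you want to see what a variable looks like at some step, print it right after that step with print(variable). The commented-out print() calls in the program are the places where I checked a variable's form myself.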

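Reading the data: read every sentence and its label out of the text you want to train on, for use in the later steps.
Counting word frequencies: count how many times each word occurs; the counts can later be used to decide how large the dictionary should be.
Building the dictionary: assign each word a unique identifier.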

My data looks like this (five lines picked as a sample). The leading 0 or 1 is the sentence's sentiment (the label):

   0 one long string of of cliches .
   1 if you 've ever entertained the notion of doing what the title of this film implies , what sex with strangers actually shows may put you off the idea forever .
   0 k-19 exploits our substantial collective fear of nuclear holocaust to generate cheap hollywood tension .
   0 it 's played in the most straight-faced fashion , with little humor to lighten things up .
   1 there is a fabric of complex ideas here , and feelings that profoundly deepen them .
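
To make the format concrete, here is a minimal sketch of how one such line breaks down into a label and a list of words. The helper name `parse_line` is my own invention for illustration; the full program does the same thing inline.

```python
def parse_line(line):
    """Split one raw line into (label, words).

    Hypothetical helper for illustration; assumes tokens are separated
    by single spaces and the first token is the label.
    """
    tokens = line.strip().split(" ")  # drop surrounding whitespace, then split
    label = tokens[0]                 # "0" or "1"
    words = tokens[1:]                # everything after the label
    return label, words


label, words = parse_line("0 one long string of of cliches .\n")
print(label)  # 0
print(words)  # ['one', 'long', 'string', 'of', 'of', 'cliches', '.']
```

Note that the label stays a string here ("0", not 0), and punctuation such as "." counts as a word of its own.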

Here is the code itself, with comments:

import collections  # currently unused in this script

class Instance:  # see Python classes and objects for background
    def __init__(self):
        self.words = []
        self.label = []

insts = []  # holds every sentence and label that is read
path = "./dev.txt"  # file path, relative to the current directory
with open(path, encoding="utf-8") as f:
    for line in f.readlines():  # read the whole file, then loop over its lines
        inst = Instance()
        # print(line)
        line = line.strip()  # remove the trailing newline and surrounding whitespace
        line = line.split(" ")  # split the sentence into individual words
        # print(line)
        label = line[0]  # the label
        word = line[1:]  # all the words of the sentence
        # print(label)
        # print(word)
        inst.words = word  # store the sentence's words in the inst object
        inst.label.append(label)  # add the sentence's label to the inst object
        # print(inst.words)
        # print(inst.label)
        insts.append(inst)  # add the inst object to the overall insts list
        # print(insts)
        if len(insts) == 5:  # read only five sentences
            break
print("all {} sentences".format(len(insts)))  # format() fills {} with its argument
print("****" * 40)
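
One variation on the reading step, offered as a sketch rather than as what the program above does: `readlines()` loads the entire file into memory first, while looping over the file object directly reads one line at a time, which is handy when you stop after a few sentences. The function name `read_instances` and the `max_sentences` parameter are hypothetical, and I use plain (label, words) tuples instead of the Instance class to keep it short. The demo writes a throwaway temporary file so it runs without dev.txt.

```python
import os
import tempfile


def read_instances(path, max_sentences=None):
    """Return a list of (label, words) pairs read from path.

    Hypothetical helper; assumes one sentence per line with the label first.
    """
    instances = []
    with open(path, encoding="utf-8") as f:
        for line in f:  # reads lazily, one line at a time
            line = line.strip()
            if not line:
                continue  # skip blank lines instead of failing on them
            tokens = line.split(" ")
            instances.append((tokens[0], tokens[1:]))
            if max_sentences is not None and len(instances) >= max_sentences:
                break
    return instances


# Demo on a temporary file (stands in for the real data file)
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False,
                                 encoding="utf-8") as tmp:
    tmp.write("0 one long string of of cliches .\n")
    tmp.write("\n")
    tmp.write("1 there is a fabric of complex ideas here .\n")
    demo_path = tmp.name

demo = read_instances(demo_path, max_sentences=5)
os.remove(demo_path)
print("all {} sentences".format(len(demo)))
```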



word_dict = {}  # maps each word to its frequency
label_dict = {}  # maps each label to its frequency
for inst in insts:  # loop over all sentences
    for word in inst.words:  # loop over every word in the sentence
        if word in word_dict:  # word already seen: add 1 to its count
            word_dict[word] += 1
        else:  # first time we see the word: add it with a count of 1
            word_dict[word] = 1

    for label in inst.label:  # same as above, for labels
        if label in label_dict:
            label_dict[label] += 1
        else:
            label_dict[label] = 1

print(word_dict)
print("****" * 20)
print(label_dict)
print("#" * 40)
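
For comparison, the standard library can do the same counting with `collections.Counter`. This is an alternative sketch, not the method used above; the toy `sentences` list is made up so the snippet runs on its own.

```python
from collections import Counter

# Toy data standing in for the words of each sentence
sentences = [
    ["one", "long", "string", "of", "of", "cliches", "."],
    ["there", "is", "a", "fabric", "of", "complex", "ideas", "here", "."],
]

word_counts = Counter()
for words in sentences:
    word_counts.update(words)  # adds 1 for every word in the list

print(word_counts["of"])           # 3
print(word_counts["movie"])        # 0 (missing words count as zero)
print(word_counts.most_common(2))  # [('of', 3), ('.', 2)]
```

Writing the if/else by hand, as in the main program, is still worth doing once, because it shows exactly what the counter is doing underneath.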



id2words = []  # look up a word by its index (its unique identifier)
words2id = {}  # look up the index by the word
unk = "<unk>"
pad = "<pad>"
id2words.append(unk)
id2words.append(pad)
words2id[unk] = 0
words2id[pad] = 1
newid = 2  # next free index; increases by one for each word that is kept
min_freq = 2  # keep only words that occur at least this many times
for key in word_dict:  # loop over every word in the frequency dictionary
    if word_dict[key] >= min_freq:
        id2words.append(key)  # add the word to id2words only if it is frequent enough
        words2id[key] = newid  # give each kept word a unique identifier
        newid += 1
print(id2words)
print("#" * 40)
print(words2id)
print("*" * 40)
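
To see how the frequency threshold controls the dictionary size, here is a small sketch that wraps the same logic in a function and tries several thresholds. `build_vocab` and the toy frequencies in `toy_freq` are made up for illustration.

```python
def build_vocab(word_freq, min_freq=1, unk="<unk>", pad="<pad>"):
    """Build (id2word, word2id) from a word -> frequency dict.

    Hypothetical helper mirroring the loop above: <unk> gets id 0,
    <pad> gets id 1, and kept words are numbered from 2.
    """
    id2word = [unk, pad]
    for word, freq in word_freq.items():
        if freq >= min_freq:
            id2word.append(word)
    word2id = {word: idx for idx, word in enumerate(id2word)}
    return id2word, word2id


toy_freq = {"of": 3, ".": 2, "one": 1, "long": 1}
sizes = {}
for threshold in (1, 2, 3):
    id2word, word2id = build_vocab(toy_freq, min_freq=threshold)
    sizes[threshold] = len(id2word)
    print(threshold, len(id2word))
# prints:
# 1 6
# 2 4
# 3 3
```

A higher threshold gives a smaller dictionary, and every word that is dropped will later be treated as `<unk>`.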



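test_word = 'you'
print(id2words[2])  # the first word added after <unk> and <pad>
print(words2id[test_word])  # raises KeyError if test_word is not in the dictionary

unkword = 'of'
if unkword not in words2id:  # a word missing from the dictionary maps to the <unk> id
    print(words2id[unk])
else:
    print(words2id[unkword])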
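
The program defines `<pad>` but never uses it. My assumption is that it is meant for padding sentences to a common length later on. Under that assumption, a sketch of turning a whole sentence into ids could look like this; the `encode` function, the `max_len` parameter, and the toy dictionary are hypothetical.

```python
def encode(words, word2id, max_len, unk="<unk>", pad="<pad>"):
    """Map words to ids, truncating or padding to exactly max_len.

    Hypothetical helper: unknown words fall back to the <unk> id,
    and short sentences are filled up with the <pad> id.
    """
    ids = [word2id.get(w, word2id[unk]) for w in words[:max_len]]
    ids += [word2id[pad]] * (max_len - len(ids))
    return ids


toy_word2id = {"<unk>": 0, "<pad>": 1, "of": 2, ".": 3}
encoded = encode(["string", "of", "cliches", "."], toy_word2id, max_len=6)
print(encoded)  # [0, 2, 0, 3, 1, 1]
```

Using `dict.get` with the `<unk>` id as the default does in one line what the if/else check in the main program does.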

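Once you have read the code and more or less understand it, type it out yourself by hand (even if you don't fully understand it, write it out while looking at it). You will learn a lot in the process and catch many of your own mistakes. I, for example, twice forgot to indent my code, and now the lesson has stuck and I no longer make that mistake.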