First Personal Programming Assignment

I. GitHub Link
GitHub link
II. Design Approach and Implementation
I originally planned to use C++, but later found that Python was more convenient to implement, mainly because of the libraries it can call.
After looking into methods for word segmentation and similarity computation, I settled on jieba for segmentation and the Jaccard coefficient for similarity. jieba was chosen because it is a Python library that is easy to call directly; Jaccard was chosen because, after comparing several similarity measures, it seemed the best fit for text similarity.
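Before breaking the design into modules, here is a minimal sketch of how the pieces are intended to fit together (a rough outline only; function names such as tokens_of and jaccard_of are illustrative, and the actual module code appears in Section III):

    import sys
    import jieba

    def tokens_of(path):
        # read a file and segment it with jieba (full mode); a simplified stand-in
        # for the jieba_list helper shown in Section III
        with open(path, 'r', encoding='utf-8') as f:
            return jieba.lcut(f.read(), cut_all=True)

    def jaccard_of(tokens1, tokens2):
        # set-based Jaccard coefficient, as defined below
        s1, s2 = set(tokens1), set(tokens2)
        return len(s1 & s2) / len(s1 | s2)

    if __name__ == '__main__':
        # usage: python jaccard.py <original file> <suspect file> <answer file>
        sim = jaccard_of(tokens_of(sys.argv[1]), tokens_of(sys.argv[2]))
        print(sim)
        with open(sys.argv[3], 'w', encoding='utf-8') as out:
            out.write(str(sim))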
Code design diagram:

  • jieba segmentation:
    jieba supports four segmentation modes:
    * Precise mode: splits the sentence as accurately as possible; suitable for text analysis.
    jieba.cut(str, cut_all=False)
    * Full mode: scans out every word in the sentence that could form a term; very fast, but cannot resolve ambiguity.
    jieba.cut(str, cut_all=True)
    * Search-engine mode: builds on precise mode and re-splits long words to improve recall; suitable for search-engine indexing.
    jieba.cut_for_search(str)
    * Paddle mode: uses the PaddlePaddle deep-learning framework and a sequence-labelling (bidirectional GRU) model for segmentation; it also supports part-of-speech tagging.
    jieba.cut(str, use_paddle=True)
    In the code for this assignment I ended up using the full mode (cut_all=True, as in the jieba_list function below); a short comparison of the modes follows.
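    A quick way to see how the modes differ (the sample sentence here is my own, not from the assignment data):

    import jieba

    sentence = "我来到北京清华大学"
    # precise mode: non-overlapping segmentation
    print("/".join(jieba.cut(sentence, cut_all=False)))
    # full mode: every word that can be formed, possibly overlapping
    print("/".join(jieba.cut(sentence, cut_all=True)))
    # search-engine mode: long words are split again to improve recall
    print("/".join(jieba.cut_for_search(sentence)))
    # paddle mode additionally requires the paddlepaddle package, so it is omitted here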

  • Jaccard similarity:
    Given two sets A and B, the Jaccard coefficient is defined as the size of the intersection of A and B divided by the size of their union: J(A, B) = |A ∩ B| / |A ∪ B|.
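    For example (a made-up illustration), if A = {我, 喜欢, 苹果} and B = {我, 喜欢, 香蕉}, then |A ∩ B| = 2 and |A ∪ B| = 4, so J(A, B) = 2 / 4 = 0.5.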

III. Module Overview

  • Input and output

      from sys import argv

      # open the two input files and the output file; paths come from the command line
      # (encoding is specified explicitly here; adjust it if the test files are not UTF-8)
      f1 = open(argv[1], 'r', encoding='utf-8')
      f2 = open(argv[2], 'r', encoding='utf-8')
      f3 = open(argv[3], 'w', encoding='utf-8')

      f1_text = f1.read()
      f2_text = f2.read()
      f3.write("...")

      # print(f1.read())
      f1.close()
      f2.close()
      f3.close()
  • jieba segmentation

    import jieba

    def jieba_list(text):
        items = ""
        s = ""
        for i in range(0, len(text)):
            # keep only Chinese characters (the CJK range \u4e00 to \u9fff)
            if '\u4e00' <= text[i] <= '\u9fff':
                s += text[i]
            elif text[i] == '。':
                # flush the buffered characters at the end of a sentence
                if s != "":
                    items += s
                    s = ""
        if s != "":
            items += s
            s = ""
        # print(items)
        # segment the cleaned-up text with jieba in full mode
        test_items = jieba.lcut(items, cut_all=True)
        return test_items
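
    A quick sanity check of the helper (the sample text is my own, not from the assignment data):

    print(jieba_list("今天天气很好。我们去公园散步。"))
    # prints the list of tokens produced by full-mode segmentation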
    
  • Jaccard similarity

    def jaccard(text1, text2):
        # deduplicate the token lists by turning them into sets
        delete_text1 = set(text1)
        delete_text2 = set(text2)
        # print(delete_text1)
        # print(delete_text2)

        # count the tokens that appear in both sets (the intersection)
        temp = 0
        for i in delete_text1:
            if i in delete_text2:
                temp += 1
        fenmu = len(delete_text2) + len(delete_text1) - temp  # size of the union
        jaccard_coefficient = float(temp / fenmu)  # intersection / union
        return jaccard_coefficient
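
    The same computation can be written more compactly with Python's set operators; a minimal equivalent sketch (my own variant, not the submitted code):

    def jaccard_sets(text1, text2):
        s1, s2 = set(text1), set(text2)
        # |intersection| / |union|, with a guard for two empty token lists
        return len(s1 & s2) / len(s1 | s2) if (s1 | s2) else 1.0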
    

IV. Results

D:\study\编程专用\python\learn1>python jaccard.py D:\study\sim_0.8\s1.txt D:\study\sim_0.8\s1.txt
Building prefix dict from the default dictionary ...
Loading model from cache C:\Users\57457\AppData\Local\Temp\jieba.cache
Loading model cost 2.501 seconds.
Prefix dict has been built successfully.
1.0

D:\study\编程专用\python\learn1>python jaccard.py D:\study\sim_0.8\orig.txt D:\study\sim_0.8\orig_0.8_add.txt
Building prefix dict from the default dictionary ...
Loading model from cache C:\Users\57457\AppData\Local\Temp\jieba.cache
Loading model cost 1.772 seconds.
Prefix dict has been built successfully.
0.4635416666666667

D:\study\编程专用\python\learn1>python jaccard.py D:\study\sim_0.8\orig.txt D:\study\sim_0.8\orig_0.8_del.txt
Building prefix dict from the default dictionary ...
Loading model from cache C:\Users\57457\AppData\Local\Temp\jieba.cache
Loading model cost 1.799 seconds.
Prefix dict has been built successfully.
0.6505073280721533

D:\study\编程专用\python\learn1>python jaccard.py D:\study\sim_0.8\orig.txt D:\study\sim_0.8\orig_0.8_dis_1.txt
Building prefix dict from the default dictionary ...
Loading model from cache C:\Users\57457\AppData\Local\Temp\jieba.cache
Loading model cost 1.834 seconds.
Prefix dict has been built successfully.
0.9166666666666666

D:\study\编程专用\python\learn1>python jaccard.py D:\study\sim_0.8\orig.txt D:\study\sim_0.8\orig_0.8_dis_3.txt
Building prefix dict from the default dictionary ...
Loading model from cache C:\Users\57457\AppData\Local\Temp\jieba.cache
Loading model cost 2.310 seconds.
Prefix dict has been built successfully.
0.8378220140515222

D:\study\编程专用\python\learn1>python jaccard.py D:\study\sim_0.8\orig.txt D:\study\sim_0.8\orig_0.8_dis_7.txt
Building prefix dict from the default dictionary ...
Loading model from cache C:\Users\57457\AppData\Local\Temp\jieba.cache
Loading model cost 1.915 seconds.
Prefix dict has been built successfully.
0.7338842975206612

D:\study\编程专用\python\learn1>python jaccard.py D:\study\sim_0.8\orig.txt D:\study\sim_0.8\orig_0.8_dis_10.txt
Building prefix dict from the default dictionary ...
Loading model from cache C:\Users\57457\AppData\Local\Temp\jieba.cache
Loading model cost 2.036 seconds.
Prefix dict has been built successfully.
0.6769558275678552

D:\study\编程专用\python\learn1>python jaccard.py D:\study\sim_0.8\orig.txt D:\study\sim_0.8\orig_0.8_dis_15.txt
Building prefix dict from the default dictionary ...
Loading model from cache C:\Users\57457\AppData\Local\Temp\jieba.cache
Loading model cost 1.846 seconds.
Prefix dict has been built successfully.
0.4877932024892293

D:\study\编程专用\python\learn1>python jaccard.py D:\study\sim_0.8\orig.txt D:\study\sim_0.8\orig_0.8_mix.txt
Building prefix dict from the default dictionary ...
Loading model from cache C:\Users\57457\AppData\Local\Temp\jieba.cache
Loading model cost 2.271 seconds.
Prefix dict has been built successfully.
0.6966233766233766

D:\study\编程专用\python\learn1>python jaccard.py D:\study\sim_0.8\orig.txt D:\study\sim_0.8\orig_0.8_rep.txt
Building prefix dict from the default dictionary ...
Loading model from cache C:\Users\57457\AppData\Local\Temp\jieba.cache
Loading model cost 2.032 seconds.
Prefix dict has been built successfully.
0.3995140576188823

  Result analysis: my scores differ noticeably from other students'. Based on my analysis, this is mainly because I use set/list union and intersection, which drops duplicate tokens, so the resulting similarity comes out lower. I had planned to switch to an approach based on sklearn's CountVectorizer class and numpy, but the sklearn installation kept failing, so I gave up on that.
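
  For reference, here is a minimal sketch of that count-based alternative (my own reconstruction, assuming sklearn and numpy are installed; it computes a multiset Jaccard that keeps duplicate tokens instead of discarding them):

    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer

    def jaccard_counts(tokens1, tokens2):
        # build term-count vectors from the two token lists;
        # analyzer=lambda doc: doc tells CountVectorizer the input is already tokenized
        vec = CountVectorizer(analyzer=lambda doc: doc)
        counts = vec.fit_transform([tokens1, tokens2]).toarray()
        # multiset Jaccard: sum of per-term minimum counts over sum of maximum counts
        inter = np.minimum(counts[0], counts[1]).sum()
        union = np.maximum(counts[0], counts[1]).sum()
        return inter / union

    # e.g. jaccard_counts(jieba_list(f1_text), jieba_list(f2_text))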

V. PSP Table

Original post: https://www.cnblogs.com/yaningscnblogs/p/13688022.html