First Individual Programming Assignment

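Software Engineering	link
Assignment requirements	link
Assignment goals	Become proficient with Git and using it together with GitHub; the PSP model; get familiar with testing tools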

GitHub

Link:

https://github.com/MrEdge123/MrEdge123/tree/master/3118005414

Directory structure:

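module                      # modules
|-- gen_vec.py              # generates the word-frequency vector
|-- simhash.py              # simhash algorithm

unit_test                   # unit tests
|-- gen_vec
    |-- test.py             # test code
    |-- htmlcov
        |-- test_py.html    # code coverage
|-- simhash
    |-- test.py
    |-- htmlcov
        |-- test_py.html

performance                 # performance tests
|-- time1.txt               # timing before the change
|-- time2.txt               # timing after the change
|-- memory.txt              # memory usage

main.py                     # main program
orig.txt                    # original paper
orig_0.8*.txt               # papers to compare against
requirements.txt            # dependency file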

PSP

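PSP2.1	Personal Software Process Stages	Estimated time (min)	Actual time (min)
Planning	Planning	50	100
· Estimate	· Estimate how long the task will take	50	100
Development	Development	300	470
· Analysis	· Requirements analysis (including learning new technologies)	100	200
· Design Spec	· Write the design document	30	50
· Design Review	· Design review	20	20
· Coding Standard	· Coding standard (set suitable conventions for this development)	20	20
· Design	· Detailed design	30	30
· Coding	· Coding	40	60
· Code Review	· Code review	20	30
· Test	· Testing (self-testing, fixing code, committing changes)	40	60
Reporting	Reporting	80	120
· Test Report	· Test report	30	40
· Size Measurement	· Measure the amount of work	20	30
· Postmortem & Process Improvement Plan	· Postmortem and process improvement plan	30	50
	· Total	430	690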

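Interface Design and Implementation

Libraries

Because the task involves Chinese word segmentation, I used the jieba segmentation library, which is convenient to work with.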

Design approach

[Figure: design overview]

Interfaces

gen_vec.py

Contains two functions: gen_vec() and check()

1. Function prototypes:

check(str: word) -> bool : checks whether word is valid
gen_vec(list: words, dict: vec) -> none : builds the word-weight vector vec from words

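2. Function check():
Purpose: check whether word is valid
How it works: looks at the Unicode code point of every character in word and accepts only Chinese characters and digits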

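3. Function gen_vec():
Purpose: build the word-weight vector vec
How it works: uses a dictionary (hash map) to map each word to its weight. Because the IDF part of TF-IDF is hard to obtain, the length of the word is used as an approximate weight.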

simhash.py

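Contains three functions: hash64(), simhash() and cmp_simhash()

1. Function prototypes:

hash64(str: word) -> num: returns a 64-bit hash value of word
simhash(dict: vec) -> num: generates the simhash signature from the vector vec
cmp_simhash(num: simhash1, num: simhash2) -> num: compares two simhash values by their Hamming distance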

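2. Function hash64():
Purpose: generate a 64-bit hash value from the string word
How it works: calling hash() directly might produce only a 32-bit hash value, so the result of hash() is squared and then reduced modulo 2^64 to obtain a 64-bit hash value.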
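A side note on reproducibility: in Python 3 the built-in hash() of a str is salted per process by default, so its values change between runs unless PYTHONHASHSEED is fixed. That does not matter when both signatures are computed in the same run, but it does if signatures are stored or compared across runs. The sketch below shows one possible alternative under that assumption. It is not the project's code: hash64_stable is a hypothetical name, and it swaps in a BLAKE2b digest from hashlib in place of the squared hash().

```python
import hashlib


def hash64_stable(word: str) -> int:
    """Hypothetical replacement for hash64(): a 64-bit hash that is stable across runs."""
    # digest_size=8 asks BLAKE2b for exactly 8 bytes, i.e. 64 bits.
    digest = hashlib.blake2b(word.encode("utf-8"), digest_size=8).digest()
    # Interpret the 8 bytes as an unsigned integer in [0, 2**64).
    return int.from_bytes(digest, "big")
```

Because the value depends only on the UTF-8 bytes of the word, the same word yields the same 64-bit value in every process.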

3. Function simhash():
Purpose: convert the word-weight vector into a simhash signature
How it works: the simhash algorithm
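To make the weighted per-bit vote behind simhash concrete, here is a toy example with 4-bit hashes instead of 64-bit ones. The hash values and the name weighted_bit_vote are made up for illustration, and bit 0 is taken to be the least significant bit, which is a choice of this sketch rather than something fixed by the article.

```python
def weighted_bit_vote(hashes, weights, bits):
    """Return (per-bit totals, signature) for a simhash-style weighted vote."""
    totals = [0] * bits
    for word, h in hashes.items():
        for pos in range(bits):
            # A 1 bit votes +weight for this position, a 0 bit votes -weight.
            if (h >> pos) & 1:
                totals[pos] += weights[word]
            else:
                totals[pos] -= weights[word]

    signature = 0
    for pos in range(bits):
        # A positive total sets the corresponding signature bit.
        if totals[pos] > 0:
            signature |= 1 << pos
    return totals, signature


# Hypothetical 4-bit hashes; the weights are the word lengths.
toy_hashes = {"跑步": 0b1010, "我": 0b0110}
toy_weights = {"跑步": 2, "我": 1}
totals, signature = weighted_bit_vote(toy_hashes, toy_weights, bits=4)
print(totals, bin(signature))  # [-3, 3, -1, 1] 0b1010
```

The heavier word "跑步" wins the two positions where the hashes disagree, so the signature ends up equal to its hash.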

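4. Function cmp_simhash():
Purpose: compute the Hamming distance between two simhash values and derive the similarity from it
How it works: similarity = (64 - Hamming distance) / 64. The Hamming distance is the number of bit positions at which the two simhash values, written as 64-bit binary numbers, differ.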
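The similarity formula above can be sketched in a few lines. The names hamming_distance and similarity are hypothetical, and this is only one way to count differing bits (XOR, then count the 1 bits), not necessarily how cmp_simhash() does it.

```python
def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions at which a and b differ."""
    # XOR leaves a 1 exactly where the two values disagree.
    return bin(a ^ b).count("1")


def similarity(a: int, b: int, bits: int = 64) -> float:
    """similarity = (bits - Hamming distance) / bits."""
    return (bits - hamming_distance(a, b)) / bits


print(hamming_distance(0b1010, 0b0110))  # 2
print(similarity(0b1010, 0b0110))        # 0.96875
```

Identical signatures give a similarity of 1.0, and signatures that differ in all 64 bits give 0.0.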

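Interface Performance Improvement: simhash

Profiling chart (time)

[Figure: time profiling results]
As the profile shows, the most time-consuming step is jieba segmentation. The reason is that segmentation first has to build a trie (prefix dictionary), which takes a noticeable amount of time for longer texts. This is also why segmenting the plagiarized article, which happens once the trie already exists, takes far less time.

The second most time-consuming module is the generation of the simhash signature. Its time complexity is O(64 * cnt), where cnt is the number of words in vec. The next step is to reduce the time this module takes.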

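Code before the improvement

# Compute a 64-bit hash
def hash64(word) :
    return (hash(word) ** 2) % (2 ** 64) 

# Compute the simhash of vec
def simhash(vec) :
    cnt = [0] * 64
    for key in vec :
        val = hash64(key)
        pos = 0
        while val > 0 :
            # Lowest bit set: add the weight, otherwise subtract it
            if val % 2 == 1 : cnt[pos] += vec[key]
            else : cnt[pos] -= vec[key]
            val = val // 2  # integer division, so val stays an int in Python 3
            pos += 1
    
    # Turn the 64 per-bit totals into the signature
    pos = 0
    ans = 0
    while pos < 64 :
        ans = ans * 2
        if cnt[pos] > 0 : ans += 1
        pos += 1

    return ans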

After optimizing with bitwise operators:

Code after the improvement

# Compute a 64-bit hash
def hash64(word) :
    return (hash(word) ** 2) % (1 << 64) 

# Compute the simhash of vec
def simhash(vec) :
    cnt = [0] * 64
    for key in vec :
        val = hash64(key)
        pos = 0
        while val > 0 :
            # Lowest bit set: add the weight, otherwise subtract it
            if val & 1 : cnt[pos] += vec[key]
            else : cnt[pos] -= vec[key]
            val = val >> 1
            pos += 1
    
    # Turn the 64 per-bit totals into the signature
    pos = 0
    ans = 0
    while pos < 64 :
        ans = ans << 1
        if cnt[pos] > 0 : ans += 1
        pos += 1

    return ans
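To check a change like this in isolation, the two bit-extraction loops can be timed side by side with timeit. The sketch below is an assumption about how one might do that, not the measurement behind the charts in this post. bits_arithmetic, bits_bitwise, compare and SAMPLE are hypothetical names, and the numbers depend on the machine and Python version.

```python
import timeit


def bits_arithmetic(val):
    """Extract bits with % and //, lowest bit first."""
    out = []
    while val > 0:
        out.append(val % 2)
        val //= 2
    return out


def bits_bitwise(val):
    """Extract bits with & and >>, lowest bit first."""
    out = []
    while val > 0:
        out.append(val & 1)
        val >>= 1
    return out


SAMPLE = (1 << 64) - 1  # a value with all 64 bits set


def compare(number=20_000):
    """Return the total seconds each variant needs for `number` calls."""
    return {
        fn.__name__: timeit.timeit(lambda: fn(SAMPLE), number=number)
        for fn in (bits_arithmetic, bits_bitwise)
    }
```

Calling compare() returns a dictionary of timings. Both functions must return the same bit list for the comparison to be meaningful.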

Comparison charts

Before the improvement:
[Figure: timing before the improvement]
After the improvement:

Module Unit Test: gen_vec

Module and test code

# Check whether word is valid
def check(word) :
    if len(word) == 0 : return False

    ok = True
    for ch in word :
        if ch >= '\u4e00' and ch <= '\u9fa5' : ok = True # range of Chinese characters
        # elif ch >= 'a' and ch <= 'z' : ok = True
        # elif ch >= 'A' and ch <= 'Z' : ok = True
        elif ch >= '0' and ch <= '9' : ok = True
        else : return False

    return True

# Build the vector vec from words (word length is used as the weight)
def gen_vec(words) :
    vec = {}
    for key in words :
        if check(key) : vec[key] = len(key)
    return vec

if __name__ == "__main__" :
    words = ["我", "", "\n", "去", "跑步", "2", "w", "word", "HaHa"]
    vec = gen_vec(words)

    for key in vec :
        print(key + ":" + str(vec[key]))

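Test data and rationale

Test data: words = ["我", "", "\n", "去", "跑步", "2", "w", "word", "HaHa"]

Rationale: as the code above shows, the main thing to test is check(). check() returns True when word is non-empty and consists only of Chinese characters (this is a plagiarism checker for Chinese papers) or digits. Therefore:
Inputs that return True: ["我", "去", "跑步", "2"]
Inputs that return False: ["", "\n", "w", "word", "HaHa"]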
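The two groups above map naturally onto two unittest cases. The following is a minimal sketch of how that could look, not the actual contents of unit_test/gen_vec/test.py. CheckTest is a hypothetical name, and check() is restated in compact form so the block is self-contained.

```python
import unittest


def check(word):
    """Restates the rule above: non-empty, only Chinese characters or digits."""
    if not word:
        return False
    return all("\u4e00" <= ch <= "\u9fa5" or "0" <= ch <= "9" for ch in word)


class CheckTest(unittest.TestCase):
    def test_valid_words(self):
        for word in ["我", "去", "跑步", "2"]:
            with self.subTest(word=word):
                self.assertTrue(check(word))

    def test_invalid_words(self):
        # Empty string, whitespace, and Latin letters are all rejected.
        for word in ["", " ", "\n", "w", "word", "HaHa"]:
            with self.subTest(word=word):
                self.assertFalse(check(word))
```

Saved as a test module, this can be run with python -m unittest. Using subTest means a failure report names the exact word that failed.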