20190919-3 Performance Analysis

Assignment requirements: https://edu.cnblogs.com/campus/nenu/2019fall/homework/7628

Repository: https://e.coding.net/hejw031/hejw08.git

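Requirement 0: Use War and Peace as the input file, redirected from the file system. Run the program three times in a row and report the elapsed time and CPU parameters for each run.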

First run:

Second run:

Third run:

First run time: 1.507 s

Second run time: 1.274 s

Third run time: 1.393 s
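The three timings above were taken from outside the program. As a rough cross-check, the same measurement could be repeated inside Python. The sketch below is only an illustration: `time_runs` and `sample_workload` are hypothetical names, and `sample_workload` is a stand-in for the real word-count call.

```python
import time


def time_runs(func, runs=3):
    """Call func() `runs` times and return the elapsed seconds of each call."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        func()
        timings.append(time.perf_counter() - start)
    return timings


def sample_workload():
    # Hypothetical stand-in for the word-count program.
    return sum(i * i for i in range(100_000))


if __name__ == "__main__":
    for n, seconds in enumerate(time_runs(sample_workload), start=1):
        print("run %d: %.3f s" % (n, seconds))
```

Consecutive runs usually differ a little, as they do above, because of file-system caching and other processes on the machine.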

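Requirement 1: State your guess at the program's bottleneck: the place where you think optimization would have the greatest effect, or where you already optimized last week (or considered optimization, so that worse code was never written).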

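import collections
import re

def count(words):
    # Tally how often each word occurs.
    collect = collections.Counter(words)
    # Count the number of distinct words.
    num = 0
    for i in collect:
        num += 1
    print('total %d words\n' % num)
    # Print the ten most frequent words with their counts.
    result = collect.most_common(10)
    for j in result:
        print('%-8s%5d' % (j[0], j[1]))

def doCount(accept):
    # Append the .txt extension when the argument does not contain it.
    s = '.txt'
    if s in accept:
        path = accept
    else:
        path = accept + '.txt'
    # Read the whole file, lowercase it, and extract the words.
    with open(path, encoding='utf-8') as f:
        words = re.findall(r'[a-z0-9^-]+', f.read().lower())
    count(words)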

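Bottleneck: after the file is read, the text has to be split into words with a regular expression, and the resulting words are then traversed to count them. I expect this step to take the most time, so I consider it the program's bottleneck.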
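One way to test this guess before running a full profiler is to time the tokenizing and counting stages separately. The following is a minimal sketch, assuming the same regular expression as the program; `time_stages` is a hypothetical helper and the sample text is made up.

```python
import collections
import re
import time


def time_stages(text):
    """Return (tokenize seconds, count seconds, Counter) for the given text."""
    start = time.perf_counter()
    words = re.findall(r'[a-z0-9^-]+', text.lower())
    tokenize_seconds = time.perf_counter() - start

    start = time.perf_counter()
    counts = collections.Counter(words)
    count_seconds = time.perf_counter() - start
    return tokenize_seconds, count_seconds, counts


if __name__ == "__main__":
    # Made-up sample text; the real input would be the novel.
    sample = "War and peace. Peace and war. " * 1000
    tokenize_seconds, count_seconds, _ = time_stages(sample)
    print("findall: %.4f s" % tokenize_seconds)
    print("Counter: %.4f s" % count_seconds)
```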

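Requirement 2: Use a profiler to find the program's bottleneck. Report the three functions (or code fragments) that take the most time. Screenshots are required.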

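The screenshot shows that the three most time-consuming functions are read(), findall(), and the collections.Counter construction (reported under collections).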
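A profile like this can be collected with the standard-library cProfile and pstats modules. The sketch below is one possible setup and not necessarily the tool used for the screenshot; `count_words` and `profile_top` are hypothetical names introduced for illustration.

```python
import cProfile
import collections
import io
import pstats
import re


def count_words(text):
    # Same tokenizing and counting steps as the program under test.
    words = re.findall(r'[a-z0-9^-]+', text.lower())
    return collections.Counter(words)


def profile_top(func, *args, limit=3):
    """Run func(*args) under cProfile; return (result, report text)."""
    profiler = cProfile.Profile()
    profiler.enable()
    result = func(*args)
    profiler.disable()
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    # Sort by time spent inside each function and keep the top entries.
    stats.sort_stats("tottime").print_stats(limit)
    return result, stream.getvalue()


if __name__ == "__main__":
    _, report = profile_top(count_words, "war and peace and war " * 10000)
    print(report)
```

Sorting by `tottime` ranks functions by the time spent in their own bodies, which is usually what a "most expensive function" list is meant to show.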

Requirement 3: Based on the bottleneck, optimize the program's performance as far as you can.

Code before optimization:

def doCountByPurText(inputText):
    words = re.findall(r'[a-z0-9^-]+', inputText.lower())
    collect = collections.Counter(words)
    num = 0
    for i in collect:
        num += 1
    print('total %d words\n' % num)
    result = collect.most_common(10)
    for j in result:
        print('%-8s%5d' % (j[0], j[1]))

Code after optimization:

def doCountByPurText(inputText):
    words = re.findall(r'[a-z0-9^-]+', inputText.lower())
    count(words)

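The word-frequency code is wrapped in the count function, which is called wherever frequencies are needed, and the running time went down. The same findall and Counter calls still run, so the refactoring mainly removes duplicated code; the speedup attributable to it alone is probably small.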
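A further small simplification, offered only as a sketch: the number of distinct words is the size of the Counter, so `len()` can replace the explicit counting loop. `summarize` is a hypothetical name, and it returns the values instead of printing them so that the result is easy to check.

```python
import collections


def summarize(words, top=10):
    """Return (number of distinct words, the `top` most common (word, count) pairs)."""
    collect = collections.Counter(words)
    # len() gives the number of distinct keys without iterating in Python code.
    return len(collect), collect.most_common(top)


if __name__ == "__main__":
    total, common = summarize(["war", "and", "peace", "and", "war", "and"])
    print('total %d words\n' % total)
    for word, freq in common:
        print('%-8s%5d' % (word, freq))
```

The counting loop is cheap compared with findall on a whole novel, so the measurable gain from this change is likely minor.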

Requirement 4: Profile again and report the current cost of the three most time-consuming functions from Requirement 1. Screenshots are required.

findall 0.201s

collections 0.079s

read 0.044s

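Requirement 5: Program running time. Submissions are ranked by speed on the instructor's machine (Windows 8.1) and divided into three tiers. For this item, tier 1 earns 20 points, tier 2 earns 10 points, and tier 3 earns 5 points. Submissions that fail the functional tests receive 0 points.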