202103226-1 Programming Assignment

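Course this assignment belongs to	https://edu.cnblogs.com/campus/zswxy/computer-science-class1-2018
Assignment requirements	https://edu.cnblogs.com/campus/zswxy/computer-science-class1-2018/homework/11877
Goals of this assignment	Learn basic GitHub usage; complete the word-frequency counting program
Other references	《构建之法》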

GitHub repository:

https://github.com/Duyaqian-259/-

https://github.com/Duyaqian-259/-/blob/master/Codewords

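PSP table:

PSP2.1	Personal Software Process Stages	Estimated time (minutes)	Actual time (minutes)
Planning	Plan
· Estimate	· Estimate how long this task will take	20	10
Development	Develop
· Analysis	· Requirements analysis (including learning new technologies)	120	120
· Design Spec	· Write the design document	30	20
· Design Review	· Review the design	20	20
· Coding Standard	· Coding standard (define suitable conventions for the current development)	30	20
· Design	· Detailed design	40	50
· Coding	· Actual coding	100	120
· Code Review	· Code review	30	20
· Test	· Testing (self-testing, fixing code, committing changes)	150	180
Reporting	Report
· Test Report	· Test report	20	30
· Size Measurement	· Measure the workload	20	30
· Postmortem & Process Improvement Plan	· Postmortem and process improvement plan	20	20
Total		600	640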

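Approach:

Count the characters in the file (first line of output).
Count the total number of words in the file (second line of output). A word starts with at least 4 English letters and may continue with letters or digits, as the pattern in getWordsCount() requires.
Count the valid lines in the file (third line of output): any line that contains a non-whitespace character is counted.
Count the occurrences of each word in the file (the next 10 lines of output); only the 10 most frequent words are printed.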
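
To make the word rule concrete, here is a minimal sketch that applies the same regular expression used in getWordsCount() to a few sample tokens. The class name WordRuleDemo, the helper isWord, and the sample tokens are made up for illustration and are not part of the submitted code.

```java
import java.util.regex.Pattern;

public class WordRuleDemo {
    // Same pattern as in getWordsCount(): at least four letters first,
    // then any number of letters or digits.
    private static final Pattern WORD = Pattern.compile("[A-Za-z]{4,}[A-Za-z0-9]*");

    // Hypothetical helper: true if the whole token satisfies the word rule.
    static boolean isWord(String token) {
        return WORD.matcher(token).matches();
    }

    public static void main(String[] args) {
        String[] samples = {"file123", "File", "abc", "123file", "abc1def"};
        for (String token : samples) {
            System.out.println(token + " -> " + isWord(token));
        }
        // file123 -> true, File -> true, abc -> false,
        // 123file -> false, abc1def -> false
    }
}
```

Tokens such as "abc1def" are rejected because the first four characters must all be letters.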

Code standard link:

https://github.com/Duyaqian-259/-/blob/master/codedefine

Design and implementation:

1. Get the total number of words

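import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.regex.Pattern;

// Counts the words in the file at inPath and records each word's frequency in map.
public int getWordsCount() {
    int cnt = 0;
    // A word: at least four letters, optionally followed by letters or digits.
    Pattern wordPattern = Pattern.compile("[A-Za-z]{4,}[A-Za-z0-9]*");
    try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(new FileInputStream(inPath)))) {
        String s;
        while ((s = reader.readLine()) != null) {
            // Split on every character that is not a letter or digit.
            for (String tp : s.split("[^a-zA-Z0-9]")) {
                if (wordPattern.matcher(tp).matches()) {
                    cnt++;
                    // Frequencies are case-insensitive, so store the lower-case form.
                    map.merge(tp.toLowerCase(), 1, Integer::sum);
                }
            }
        }
        return cnt;
    } catch (IOException ie) {
        ie.printStackTrace();
    }
    return 0;
}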

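2. Get the number of valid lines: read the file one line at a time, call trim() to strip leading and trailing whitespace, and check whether the result equals ""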

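import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;

// Counts the lines of the file at inPath that contain a non-whitespace character.
public int getValidLines() {
    int cnt = 0;
    try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(new FileInputStream(inPath)))) {
        String s;
        while ((s = reader.readLine()) != null) {
            // A line that is empty after trimming holds only whitespace.
            if (!s.trim().isEmpty()) {
                cnt++;
            }
        }
        return cnt;
    } catch (IOException ie) {
        ie.printStackTrace();
    }
    return 0;
}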
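
The trim-and-compare check can be tried without a file. The following sketch runs the same loop over an in-memory string; the class name ValidLinesDemo, the helper countValidLines, and the sample text are assumptions made for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;

public class ValidLinesDemo {
    // Hypothetical helper: counts the lines of text that contain
    // at least one non-whitespace character.
    static int countValidLines(String text) {
        int count = 0;
        try (BufferedReader reader = new BufferedReader(new StringReader(text))) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (!line.trim().isEmpty()) {
                    count++;
                }
            }
        } catch (IOException e) {
            // Not expected for an in-memory reader; rethrow unchecked.
            throw new UncheckedIOException(e);
        }
        return count;
    }

    public static void main(String[] args) {
        // Four lines: "hello", an empty line, a whitespace-only line, "world 1234".
        String text = "hello\n\n   \t \nworld 1234\n";
        System.out.println(countValidLines(text)); // prints 2
    }
}
```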

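3. Get the character count: read the file one character at a time and check whether the character is less than 128 (ASCII range 0~127); if so, increment the counter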

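import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;

// Counts the ASCII characters (codes 0~127) in the file at inPath.
public int getCharCount() {
    int cnt = 0;
    try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(new FileInputStream(inPath)))) {
        int tp;
        // read() returns one character per call, or -1 at end of file.
        while ((tp = reader.read()) != -1) {
            if (tp < 128) {
                cnt++;
            }
        }
        return cnt;
    } catch (IOException ie) {
        ie.printStackTrace();
    }
    return 0;
}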
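
As a quick check of which characters this rule counts, the sketch below applies the same test to an in-memory string. CharCountDemo, countAsciiChars, and the sample input are hypothetical names and data used only for illustration.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

public class CharCountDemo {
    // Hypothetical helper: counts the characters of text whose code is below 128.
    static int countAsciiChars(String text) {
        int count = 0;
        try (Reader reader = new StringReader(text)) {
            int c;
            while ((c = reader.read()) != -1) {
                if (c < 128) {
                    count++;
                }
            }
        } catch (IOException e) {
            // Not expected for an in-memory reader; rethrow unchecked.
            throw new UncheckedIOException(e);
        }
        return count;
    }

    public static void main(String[] args) {
        // 'a', 'b', one Chinese character (U+4E2D), '\r', '\n', '\t'
        System.out.println(countAsciiChars("ab\u4e2d\r\n\t")); // prints 5
    }
}
```

Whitespace and line terminators such as '\r', '\n', and '\t' are counted, while the non-ASCII character is skipped.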

4. Sort to select the 10 most frequent words. Because many frequencies repeat, a three-way partitioning quicksort is used.

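// Sorts list[l..r] by descending frequency using three-way partitioning.
// Only the region that can reach the top 10 is sorted completely; this
// assumes the top-level call is quick_sort(list, 0, list.length - 1).
void quick_sort(String[] list, int l, int r) {
    if (l >= r) return;
    int i = l, j = r;
    int k = l + 1;
    int x = map.get(list[l]); // pivot frequency
    // Invariant: list[l..i-1] > x, list[i..k-1] == x, list[j+1..r] < x.
    while (k <= j) {
        int freq = map.get(list[k]);
        if (freq < x) {
            // Smaller than the pivot: move to the right end.
            String tp = list[j];
            list[j] = list[k];
            list[k] = tp;
            j--;
        } else if (freq > x) {
            // Larger than the pivot: move to the left end.
            String tp = list[i];
            list[i] = list[k];
            list[k] = tp;
            k++;
            i++;
        } else {
            k++;
        }
    }
    quick_sort(list, l, i - 1);
    // The right part matters only if it overlaps the first 10 positions.
    if (j + 1 < 10) {
        quick_sort(list, j + 1, r);
    }
}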
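
For readers who want to see three-way partitioning in isolation, the following self-contained sketch sorts a plain int array of frequencies in descending order. It sorts the whole array rather than only the top 10, and the names ThreeWayQuickSortDemo, sortDescending, and swap are illustrative, not taken from the submitted code.

```java
import java.util.Arrays;

public class ThreeWayQuickSortDemo {
    // Sorts a[lo..hi] in descending order with three-way partitioning.
    static void sortDescending(int[] a, int lo, int hi) {
        if (lo >= hi) {
            return;
        }
        int pivot = a[lo];
        int gt = lo;     // a[lo..gt-1] holds values greater than pivot
        int k = lo + 1;  // a[gt..k-1] holds values equal to pivot
        int lt = hi;     // a[lt+1..hi] holds values less than pivot
        while (k <= lt) {
            if (a[k] > pivot) {
                swap(a, gt++, k++);
            } else if (a[k] < pivot) {
                swap(a, k, lt--);
            } else {
                k++;
            }
        }
        // Elements equal to the pivot are already in their final place.
        sortDescending(a, lo, gt - 1);
        sortDescending(a, lt + 1, hi);
    }

    static void swap(int[] a, int i, int j) {
        int tmp = a[i];
        a[i] = a[j];
        a[j] = tmp;
    }

    public static void main(String[] args) {
        int[] counts = {3, 7, 3, 1, 7, 3, 9, 1};
        sortDescending(counts, 0, counts.length - 1);
        System.out.println(Arrays.toString(counts)); // prints [9, 7, 7, 3, 3, 3, 1, 1]
    }
}
```

Because every element equal to the pivot is excluded from both recursive calls, inputs with many repeated frequencies need fewer comparisons than with a two-way partition.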

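Performance improvements:

Word lookup: at first every token was matched against the regular expression, and 100,000,000 characters took 6.9 s. The strings produced by splitting turned out to be very short, so not every one of them needs to be matched. After adding the s.length() and tp.length() >= 4 checks, the same input took 5.8 s.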
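
A minimal sketch of this length pre-check follows, assuming the s.length() check skips lines shorter than four characters and the tp.length() check skips short tokens before the regular expression runs. The class LengthPrecheckDemo and the helper countWords are hypothetical, and a TreeMap is used only so the printed order is predictable.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Pattern;

public class LengthPrecheckDemo {
    private static final Pattern WORD = Pattern.compile("[A-Za-z]{4,}[A-Za-z0-9]*");

    // Hypothetical helper: counts the words of one line, skipping the regex
    // for lines and tokens that are too short to contain a word.
    static Map<String, Integer> countWords(String line) {
        Map<String, Integer> counts = new TreeMap<>();
        if (line.length() < 4) {
            return counts; // the line cannot contain a 4-letter word
        }
        for (String token : line.split("[^a-zA-Z0-9]")) {
            // The cheap length test runs first; the regex runs only when needed.
            if (token.length() >= 4 && WORD.matcher(token).matches()) {
                counts.merge(token.toLowerCase(), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(countWords("Word word a1 abc WORD123, 12ab test"));
        // prints {test=1, word=2, word123=1}
    }
}
```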

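Unit tests:

WordCountTest: the normal case, passing null, passing an invalid file, passing a nonexistent file, passing an empty file, and passing an empty CountCore object.
CountCore: fill a file with random characters, then run the tests.
getCharCount(): compare the number of characters known to have been written during setup with the method's result: Assert.assertEquals(count, countCore.getCharCount());

getValidLines(): compare the file's known number of valid lines with the method's result: Assert.assertEquals(c,countCore.getValidLines());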
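
The random-file setup could look roughly like the sketch below. It is written in plain Java without a test framework so that it runs on its own; the class RandomFileFixtureDemo, the helper writeRandomAsciiFile, the seed, and the temp-file prefix are all assumptions, and the real tests would pass the generated file name to CountCore instead of checking the file size.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Random;

public class RandomFileFixtureDemo {
    // Hypothetical helper: writes `length` random printable ASCII characters
    // to a temporary file and returns its path.
    static Path writeRandomAsciiFile(int length, long seed) {
        Random random = new Random(seed);
        StringBuilder content = new StringBuilder(length);
        for (int i = 0; i < length; i++) {
            // Printable ASCII runs from ' ' (32) to '~' (126): 95 characters.
            content.append((char) (' ' + random.nextInt(95)));
        }
        try {
            Path file = Files.createTempFile("wordcount", ".txt");
            Files.write(file, content.toString().getBytes(StandardCharsets.US_ASCII));
            return file;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        int expected = 1000;
        Path file = writeRandomAsciiFile(expected, 42L);
        // Each ASCII character is one byte, so the file size equals the
        // number of characters a character counter should report.
        System.out.println(Files.size(file) == expected); // prints true
        Files.delete(file);
    }
}
```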

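Exception handling:

CountCore takes a String parameter as the file name. Two exceptions are possible: the file name is invalid, or the file does not exist.
Methods that read the file: getValidLines(); getWordsCount(); getCharCount();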
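
One way to catch both cases before any of these methods opens the file is to validate the name up front. The sketch below is an assumption about how such a check might look, not the submitted implementation; InputFileCheckDemo and describeProblem are hypothetical names.

```java
import java.nio.file.Files;
import java.nio.file.InvalidPathException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Optional;

public class InputFileCheckDemo {
    // Hypothetical helper: returns a description of what is wrong with the
    // file name, or an empty Optional if the file can be read.
    static Optional<String> describeProblem(String fileName) {
        if (fileName == null || fileName.trim().isEmpty()) {
            return Optional.of("file name is missing");
        }
        Path path;
        try {
            path = Paths.get(fileName);
        } catch (InvalidPathException e) {
            return Optional.of("file name is invalid: " + e.getReason());
        }
        if (!Files.exists(path)) {
            return Optional.of("file does not exist");
        }
        if (!Files.isRegularFile(path) || !Files.isReadable(path)) {
            return Optional.of("not a readable regular file");
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(describeProblem(null).orElse("ok"));
        System.out.println(describeProblem("no-such-file-" + System.nanoTime() + ".txt").orElse("ok"));
    }
}
```

Checking once in the constructor would let the three reading methods assume a usable file instead of each returning 0 after an IOException.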

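Reflections and takeaways:

Because .gitignore was not configured properly during development, all the code was written in a separate local project in VS and then copied into the GitHub directory to commit. Moving code back and forth was messy and caused a pile of errors: the two editors had different code-style settings, so the formatting had to be changed again and again, and files were sometimes copied to the wrong place. Next time, .gitignore should be configured before any code is written.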