Counting Word Frequencies in a Text

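Requirement: write a program that analyzes how often each word occurs in a text file and prints the 10 most frequent words. The text file is roughly 30 KB to 300 KB in size.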

1. Approach

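① Data structure: a Word class wraps the word (a String) and its frequency (count), and overrides equals so that two Word objects are considered equal when their key (the String) is the same.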

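First, read dictionary.txt line by line, use a regular expression to extract the words, and store them in an ArrayList. Then traverse the list, wrapping each string in a Word and putting it into a WordList. Finally, sort with the sort() method of the Collections utility class, passing a comparator that orders by count.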
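
The first version described above might look roughly like the sketch below. The original code for this version is not shown, so the class name WordListSketch, the fields of Word, and the method names update and count are assumptions made purely for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

class WordListSketch {

    // Hypothetical Word class: a word plus its occurrence count.
    static class Word {
        final String key;
        int count = 1;

        Word(String key) {
            this.key = key;
        }

        // Two Word objects are equal when their keys match; count is ignored.
        @Override
        public boolean equals(Object o) {
            return o instanceof Word && key.equals(((Word) o).key);
        }

        // Kept consistent with equals, as the equals/hashCode contract requires.
        @Override
        public int hashCode() {
            return key.hashCode();
        }
    }

    // indexOf() scans the list from the start and calls equals() on each element,
    // so every lookup costs time proportional to the number of distinct words.
    static void update(List<Word> words, String s) {
        Word w = new Word(s);
        int i = words.indexOf(w);
        if (i >= 0) {
            words.get(i).count++;
        } else {
            words.add(w);
        }
    }

    // Builds the word list from the tokens and sorts it by count, highest first.
    static List<Word> count(List<String> tokens) {
        List<Word> words = new ArrayList<Word>();
        for (String s : tokens) {
            update(words, s);
        }
        Collections.sort(words, new Comparator<Word>() {
            @Override
            public int compare(Word a, Word b) {
                return Integer.compare(b.count, a.count);
            }
        });
        return words;
    }
}
```

Note that this sketch allocates a new Word for every token, even when the word is already in the list.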

2. Analysis

Performance was analyzed with YourKit Java Profiler, looking at both CPU and memory.

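Processing a 340 KB text took 561 ms. The profiler output shows that the update1 method is quite expensive. Inside the method, every string is wrapped in a new object, and indexOf() is then used to find its position in the list before the count is changed. Since indexOf() is a linear scan, each lookup costs time proportional to the number of words already in the list.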

3. Improvement analysis

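Suppose we stop storing Word objects in an ArrayList and searching through it, and instead use a HashMap with the String as the key and count as the value. The update method can then call the map's get method directly to check whether the key (the string) is already present. At the end, the map's entries only need to be sorted by count.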

4. Improved implementation

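import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CountOfWordsTest1 {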

    public static void main(String[] args) {

        test();
    }

    public static void test() {
        ArrayList<String> list = readFromFile();
        List<Map.Entry<String, Integer>> WordList = countOfWords(list);
        printList(WordList);
    }

    /*
     * @Description read the file and extract the words with a regex
     * @return the ArrayList containing every word, in file order
     */
    public static ArrayList<String> readFromFile() {

        ArrayList<String> list = new ArrayList<String>();
        // Created up front so the matcher below never sees null if opening the file fails
        StringBuilder sb = new StringBuilder();

        // try-with-resources closes the reader even if reading fails
        try (BufferedReader br = new BufferedReader(new InputStreamReader(
                new FileInputStream("./src/dictionary.txt")))) {
            String line;
            while ((line = br.readLine()) != null) {
                // The trailing space keeps words on adjacent lines apart
                sb.append(line).append(' ');
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
        // Extract the words, dropping digits, punctuation and other special characters
        String regEx = "[a-zA-Z]+";
        Pattern p = Pattern.compile(regEx);
        Matcher m = p.matcher(sb);
        while (m.find()) {
            list.add(m.group());
        }

        return list;
    }

    /*
     * @Description count every word from the list in a map, then sort by count
     * @param list contains all the words
     * @return list of (word, count) entries sorted by count, highest first
     */
    public static List<Map.Entry<String, Integer>> countOfWords(
            ArrayList<String> list) {

        Map<String, Integer> adjWords = new HashMap<String, Integer>();

        for (String word : list) {
            update(adjWords, word);
        }

        // Copy the entries of adjWords into a list and sort it by count, descending
        ArrayList<Map.Entry<String, Integer>> wordList = new ArrayList<Map.Entry<String, Integer>>(
                adjWords.entrySet());
        Collections.sort(wordList,
                new Comparator<Map.Entry<String, Integer>>() {
                    @Override
                    public int compare(Map.Entry<String, Integer> o1,
                            Map.Entry<String, Integer> o2) {
                        // o2 before o1 gives descending order
                        return o2.getValue().compareTo(o1.getValue());
                    }
                });

        return wordList;
    }

    /*
     * @Description count++ if existed or put(string,1) to the map
     * @param map adjWords to record string and count
     * @param s the new String
     */
    public static void update(Map<String, Integer> map, String s) {

        Integer count = map.get(s);
        if (count == null) {
            count = 1;
        } else {
            count++;
        }
        map.put(s, count);
    }

    /*
     * @Description print the 10 most frequent words from the list sorted by count
     */
    public static void printList(List<Map.Entry<String, Integer>> wordList) {

        // Guard against texts with fewer than 10 distinct words
        int limit = Math.min(10, wordList.size());
        for (int i = 0; i < limit; i++) {
            Map.Entry<String, Integer> entry = wordList.get(i);
            String key = entry.getKey();
            Integer value = entry.getValue();
            System.out.println(key + "..." + value);
        }
    }

}
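
If Java 8 or later is available, the update method above could be condensed with Map.merge. This is an optional alternative and not the version that was profiled. The class name MergeUpdateSketch is made up for this example.

```java
import java.util.Map;

class MergeUpdateSketch {

    // Stores 1 when s is not in the map yet; otherwise adds 1 to the existing count.
    static void update(Map<String, Integer> map, String s) {
        map.merge(s, 1, Integer::sum);
    }
}
```

The behavior matches the get/put version: a missing key starts at 1, and an existing key is incremented.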

5. Performance analysis

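As can be seen, the time spent in the update method has clearly dropped, and the total running time of the program has improved considerably as well. The most expensive parts are now reading the file and matching with the regular expression.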
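
One possible next step for the reading and matching cost is to match each line as it is read and count the words immediately. That avoids building one large StringBuilder and an intermediate list of tokens. The sketch below is only an illustration under that assumption: it was not profiled, the class name StreamingCountSketch and the method count are hypothetical, and it needs Java 8 for UncheckedIOException.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class StreamingCountSketch {

    // Same word pattern as in the implementation above, compiled once.
    private static final Pattern WORD = Pattern.compile("[a-zA-Z]+");

    // Counts words line by line; only the current line and the map are kept in memory.
    static Map<String, Integer> count(Reader in) {
        Map<String, Integer> counts = new HashMap<String, Integer>();
        Matcher m = WORD.matcher("");
        try (BufferedReader br = new BufferedReader(in)) {
            String line;
            while ((line = br.readLine()) != null) {
                m.reset(line); // reuse one Matcher for every line
                while (m.find()) {
                    String w = m.group();
                    Integer c = counts.get(w);
                    counts.put(w, c == null ? 1 : c + 1);
                }
            }
        } catch (IOException e) {
            // Rethrown unchecked so callers do not need a throws clause
            throw new UncheckedIOException(e);
        }
        return counts;
    }
}
```

The resulting map can be sorted by count in the same way as in countOfWords. To read the file, a caller could pass the same InputStreamReader that readFromFile builds.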