Individual Assignment Project Report (3): Output Results and Test-Case Results (with Code)

Code call graph

Legend of the code call graph (thanks to 刘泽@kfk's VS2015 Enterprise edition for its powerful features):

 

Output results

The TA's reference results were produced by running the program on Windows, so I mainly show the Windows results here; the Linux results are left for later analysis.

 

My results

The number of character is:173669785
The number of line is:2278666
The number of word is:16639829
The top 10 words is:
HAVE    107383
WITH    158745
CLASS   192004
THIS    152454
THEY    145945
SPAN    116118
THAT    259186
SAID    208861
HARRY   184732
SPAN CLASS  62861
REFERENCE INTERNAL  26668
SPAN SPAN   41286
HREF LEAP   22569
THAT GOOD   61427
INTERNAL HREF   26668
SAID HARRY  24981
CLASS SPAN  23146
SAID HERMIONE   19193
CLASS REFERENCE 31289
Use Time :36
 

The TA's results

char_number :173654417
line_number :2278666
word_number :16629955

the top ten frequency of word :
THAT    259186
SAID    208861
CLASS   192004
HARRY   184732
WITH    158745
THIS    152454
THEY    145945
Span    116118
HAVE    107383
FROM    105494


the top ten frequency of phrase :
Span CLASS  62861
THAT GOOD   61427
Span Span   41286
CLASS Reference 31289
Reference INTERNAL  26668
INTERNAL href   26668
SAID HARRY  24981
CLASS Span  23146
href LEAP   22569
SAID HERMIONE   19193
 

Comparison of results

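The line counts match exactly. The character counts differ by 15,368 and the word counts by 9,874, both under 0.1% of the totals. The frequencies of the top-10 words and phrases shown are identical as well; my output just isn't sorted in descending order.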

 

Test results

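I created 10 test cases to exercise some special situations.
Of course, the fact that these tests can run at all shows that, on both platforms, the command-line argument can already be either a single file or a folder.
They are shown and analyzed below: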

 

1.pptx

Test results:
Linux:

The number of character is:184816
The number of line is:1565
The number of word is:1718
The top 10 words is:
SLIDELAYOUTS    48
RELSPK  38
RELS    114
SLIDES  48
SLIDE5  48
NOTESSLIDE10    40
NOTESSLIDES 40
XMLPK   46
SLIDELAYOUT3    48
STEVT   36
SLIDES SLIDE2   24
NOTESSLIDES NOTESSLIDE5 20
SLIDELAYOUTS RELS   24
SLIDELAYOUTS SLIDELAYOUT12  24
SLIDES RELS 24
RELS SLIDE5 24
RELSPK SLIDELAYOUTS 13
RELS NOTESSLIDE10   20
NOTESSLIDES RELS    20
RELS SLIDELAYOUT3   24
Use Time :0

Windows:

The number of character is:52
The number of line is:1
The number of word is:2
The top 10 words is:
CONTENT 1
TYPES   1
CONTENT TYPES   1
Use Time :13

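A rather embarrassing result. My guess is that the two platforms handle the bytes differently, and that on Windows the reader hit a -1 (an end-of-file value) somewhere inside the bytes of a Chinese character or other encoded data.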

 

1.txt

File content:

Good123 AREE Good456 AREE214 yesh

Characteristics: an ordinary file of English words

Windows:

The number of character is:33
The number of line is:1
The number of word is:5
The top 10 words is:
GOOD123 2
AREE    2
YESH    1
GOOD123 AREE    2
AREE GOOD456    1
AREE214 YESH    1
Use Time :4

Linux:

The number of character is:33
The number of line is:1
The number of word is:5
The top 10 words is:
GOOD123 2
AREE    2
YESH    1
GOOD123 AREE    2
AREE GOOD456    1
AREE214 YESH    1
Use Time :0

PS: a txt file that contains nothing but English is such a relief.

 

2.txt

Characteristics: an empty file, to see what happens

Windows:

The number of character is:0
The number of line is:0
The number of word is:0
The top 10 words is:
Use Time :0

Linux:

The number of character is:0
The number of line is:0
The number of word is:0
The top 10 words is:
Use Time :0

Analysis: both platforms handle an empty file correctly.

 

8.txt

File content:

翘课购房款签到表来开会多花费了可分类尽快不考虑呢情况你嚄很浪费和库布里克冰冷的空气不卡的我翻了翻和

发
asf
efwf
 2过3
跟4gfdgdsvsdvv去污粉3
跟65和K(……》《……》&。……*(。?&
¥JH/qQs>kFAHRH3%$t^%u5jREB zx.
V 
GF

F
 G3H4hg%wg?
5.J58
J4.5Fz
S
u
$ 

Characteristics: 19 lines in total, with many Chinese characters.

Windows:

The number of character is:90
The number of line is:19
The number of word is:4
The top 10 words is:
EFWF    1
GFDGDSVSDVV 1
KFAHRH3 1
JREB    1
EFWF GFDGDSVSDVV    1
GFDGDSVSDVV KFAHRH3 1
KFAHRH3 JREB    1
Use Time :0

Linux:

The number of character is:90
The number of line is:19
The number of word is:4
The top 10 words is:
EFWF    1
GFDGDSVSDVV 1
KFAHRH3 1
JREB    1
EFWF GFDGDSVSDVV    1
GFDGDSVSDVV KFAHRH3 1
KFAHRH3 JREB    1
Use Time :0

Surprisingly, Chinese text turns out not to be much of a problem either, although what on earth is all this junk...

 

1.pdf

Linux:

The number of character is:1174015
The number of line is:16897
The number of word is:15033
The top 10 words is:
TYPE    566
RECT    232
ENDOBJ  993
LENGTH  340
STREAM  234
LINK    232
ANNOT   232
FILTER  234
SUBTYPE 431
GOTO    233
GOTO BORDER 232
XOBJECT SUBTYPE 139
RECT ENDOBJ 232
ENDSTREAM ENDOBJ    198
SUBTYPE LINK    232
BORDER RECT 232
ENDOBJ ENDOBJ   215
TYPE ANNOT  232
ENDOBJ TYPE 524
FILTER FLATEDECODE  226
Use Time :1

Windows:

The number of character is:859
The number of line is:20
The number of word is:28
The top 10 words is:
TYPE    1
XOBJECT 2
TEXT    1
STREAM  1
PROCSET 1
IMAGEI  1
FLATEDECODE 1
FORMTYPE    1
LENGTH  1
PTEX    3
TYPE XOBJECT    1
XOBJECT SUBTYPE 1
XOBJECT STREAM  1
SHADING XOBJECT 1
PROCSET TEXT    1
IMAGEI SHADING  1
FLATEDECODE FORMTYPE    1
FORMTYPE LENGTH 1
LENGTH PTEX 1
PTEX FILENAME   1
Use Time :0

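Analysis: results like this are painful to look at... The way Linux handles the file looks more suitable. The Linux output above at least seems more reasonable, because the traversal did not stop at some inexplicable end-of-file marker.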

 

1.php

Characteristics: a php file I wrote myself, picked at random; it contains both Chinese and English

Windows:

The number of character is:1539
The number of line is:45
The number of word is:130
The top 10 words is:
INLINE  6
HTML    3
HREF    8
GLASS   3
ECHO    10
BOOK    4
INFO    4
CLASS   6
DISPLAY 6
STYLE   11
STYLE DISPLAY   6
HEIGHT AUTO 2
INFO HREF   3
ECHO TABLE  2
DISPLAY INLINE  6
ADMIN STYLE 2
STYLE HEIGHT    3
GLASS STYLE 3
SYSTEM EXCELSIOR    2
CLASS GLASS 3
Use Time :0

Linux:

The number of character is:1539
The number of line is:45
The number of word is:130
The top 10 words is:
INLINE  6
HTML    3
HREF    8
GLASS   3
ECHO    10
BOOK    4
INFO    4
CLASS   6
DISPLAY 6
STYLE   11
STYLE DISPLAY   6
HEIGHT AUTO 2
INFO HREF   3
ECHO TABLE  2
DISPLAY INLINE  6
ADMIN STYLE 2
STYLE HEIGHT    3
GLASS STYLE 3
SYSTEM EXCELSIOR    2
CLASS GLASS 3
Use Time :0

Analysis: as long as a file can be opened in a text editor on both platforms without garbled characters, the results are refreshingly consistent.

 

1.css

Characteristics: similar to 1.php; picked at random from a template library.

Windows:

The number of character is:6590
The number of line is:320
The number of word is:534
The top 10 words is:
COLOR   26
TEXT    23
BACKGROUND  22
HSLA    33
BORDER  27
WIDTH   21
SHADOW  22
FONT    26
GLASS   16
MARGIN  20
SHADOW HSLA 22
RGBA TEXT   11
FONT SIZE   13
MARGIN AUTO 14
SIZE FONT   13
OVERFLOW HIDDEN 11
FAMILY AVENIR   13
TEXT SHADOW 11
FONT FAMILY 13
BORDER RADIUS   12
Use Time :0

Linux:

The number of character is:6590
The number of line is:320
The number of word is:534
The top 10 words is:
COLOR   26
TEXT    23
BACKGROUND  22
HSLA    33
BORDER  27
WIDTH   21
SHADOW  22
FONT    26
GLASS   16
MARGIN  20
SHADOW HSLA 22
RGBA TEXT   11
FONT SIZE   13
MARGIN AUTO 14
SIZE FONT   13
OVERFLOW HIDDEN 11
FAMILY AVENIR   13
TEXT SHADOW 11
FONT FAMILY 13
BORDER RADIUS   12
Use Time :0

Analysis: works well.

 

jieshi.docx

Characteristics: similar to the pptx; a file that shows up as garbage in a text editor.

Linux:

The number of character is:4791
The number of line is:63
The number of word is:68
The top 10 words is:
CONTENT 2
TYPES   2
RELS    6
WORD    14
DOCUMENT    4
DOCPROPS    4
CORE    2
THEME   4
STYLES  2
XMLPK   9
XMLPK WORD  5
XMLPK DOCPROPS  2
WORD STYLES 2
RELS WORD   2
WORD RELS   2
WORD WEBSETTINGS    2
WORD SETTINGS   2
WORD DOCUMENT   2
WORD FONTTABLE  2
THEME THEME1    2
Use Time :0

Windows:

The number of character is:96
The number of line is:3
The number of word is:2
The top 10 words is:
CONTENT 1
TYPES   1
CONTENT TYPES   1
Use Time :0

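Analysis: I had already given up hope for this kind of file. Without a suitable way to decode it, it is no surprise that the two platforms give wildly different results.
That said, the Linux result still looks somewhat better.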

 

toefl folder

Characteristics: a folder of TOEFL material containing one rar and four pdf files

Linux:

The number of character is:2671446
The number of line is:37968
The number of word is:22789
The top 10 words is:
TYPE    784
ENDSTREAM   440
LEFT    405
STREAM  440
GROUP   360
RIGHT   405
FONT    379
LENGTH  453
ENDOBJ  1011
PAGE    342
FILTER FLATEDECODE  273
TYPE GROUP  180
ENDOBJ TYPE 203
TYPE PAGE   180
ENDOBJ FILTER   217
LENGTH STREAM   327
PROCSET TEXT    180
PARENT RESOURCES    180
FLATEDECODE LENGTH  244
ENDSTREAM ENDOBJ    440
Use Time :0

Windows:

The number of character is:3121
The number of line is:54
The number of word is:141
The top 10 words is:
TYPE    13
TRUE    3
ENDOBJ  10
DEVICERGB   3
STREAM  4
FLATEDECODE 4
IMAGE23 4
GROUP   6
FILTER  4
PAGES   6
ENDOBJ TYPE 6
LANG STRUCTTREEROOT 3
TRUE ENDOBJ 3
LENGTH STREAM   3
MARKINFO MARKED 3
PAGES LANG  3
STRUCTTREEROOT MARKINFO 3
TYPE CATALOG    3
KIDS ENDOBJ 3
FILTER FLATEDECODE  4
Use Time :1

Analysis: the Linux result is clearly closer to what this folder actually contains, both in content and in word count.

 

11.txt

File content:

Characteristics: designed specifically to test how words and phrases are stored

Linux:

The number of character is:62
The number of line is:1
The number of word is:7
The top 10 words is:
TEST123 5
TESTAFS 1
TEST123TEST324  1
TEST123 TEST3456    2
TEST3456 TESTAFS    1
TESTAFS TEST13  1
TEST13 TEST123TEST324   1
TEST123TEST324 TEST123  1
Use Time :0

Windows:

The number of character is:62
The number of line is:1
The number of word is:7
The top 10 words is:
TEST123 5
TESTAFS 1
TEST123TEST324  1
TEST123 TEST3456    2
TEST3456 TESTAFS    1
TESTAFS TEST13  1
TEST13 TEST123TEST324   1
TEST123TEST324 TEST123  1
Use Time :0

Analysis: no problems at all.

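Summary: I designed 11 test cases and found that on Windows the encoding of many files causes serious read problems, so I think the TA's choice of Linux (Ubuntu) as the final test platform is the better one.
For files that can be encoded and decoded as ASCII and opened in a text editor such as Notepad or gedit, the two platforms give identical results, and the word and phrase test cases confirm this.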

Experiment code:

 
    #include <fstream>
    #include <string>
    #include <vector>
    #include <sstream>
    #include <iostream>
    #include <stdio.h>
    #include <string.h>   // strcmp, used by the Windows getAllFiles
    #include <functional>
    #include <time.h>
    #include <unordered_map>
    
    
    
    
    using namespace std;
    
    
    
    class word_count;
    
    void getAllFiles(string path, vector<string>& files);
    int fin_to_s(string &str, vector<string> &files, int i);
    
    class word_classifier {
    public:
        string* str;
        //string* temp;//
        string* num_rear;
        int num;
        word_classifier();
        ~word_classifier();
        int judge(char c, word_count* word);//
                                            //void classify(char c);//
        void clear();
        void set(word_count* word);
    
    };
    
    class word_count
    {
    public:
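        //string* temp;//
        string* str;      // display form: the word plus its trailing digits
        string* num;      // trailing digit string
        int num_rear;     // length of the word without its trailing digits
        int str_count;    // number of occurrences
        int flag;         // 1 while the entry is in the top-10 list
        int size;         // total length of the token
        string* word;     // word without trailing digits, used as the hash key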
    
        word_count* next_ptr;
        word_count();
        ~word_count();
    };
    
    
    class phrase_count {
    public:
        word_count* phrase1;
        word_count* phrase2;
        int phr_count;//
        int flag;//
        phrase_count* next_ptr;
        phrase_count();
        ~phrase_count();
    
    
    };
    
    
    void word_to_word(word_count* word, word_count* word1);
    int freq_count(word_count* &arr1, word_count* temp, int flag);
    int compare(word_count* word, word_count* word1, int& flag);
    void phrase_to_phrase(phrase_count* phrase, phrase_count* phrase1);//
    int freq_countP(phrase_count* &arr1, phrase_count* temp, int flag);//
    int compareP(phrase_count* phrase, phrase_count* phrase1, int& flag1, int &flag2);//
    
    
    
    #ifdef WIN32
    #include <io.h>
    void getAllFiles(string path, vector<string>& files)
    {
    
        intptr_t hFile = 0;   // _findfirst returns intptr_t; long truncates the handle on 64-bit Windows
        struct _finddata_t fileinfo;
        string p;
        if ((hFile = _findfirst(p.assign(path).append("\\*").c_str(), &fileinfo)) != -1)
        {
            do
            {
                if ((fileinfo.attrib &  _A_SUBDIR))
                {
                    if (strcmp(fileinfo.name, ".") != 0 && strcmp(fileinfo.name, "..") != 0)
                    {
                        files.push_back(p.assign(path).append("\\").append(fileinfo.name));
                        getAllFiles(p.assign(path).append("\\").append(fileinfo.name), files);
                    }
                }
                else
                {
                    files.push_back(p.assign(path).append("\\").append(fileinfo.name));
                }
    
            } while (_findnext(hFile, &fileinfo) == 0);
    
            _findclose(hFile);
        }
    
    }
    #endif
    #ifdef __linux__
    #include <dirent.h>
    void getAllFiles(string path, vector<string>& files)
    {
        string name;
        DIR* dir = opendir(path.c_str());
        if (dir == NULL) return;   // not a directory, or not readable
        dirent* p = NULL;
        while ((p = readdir(dir)) != NULL)
        {
            if (p->d_name[0] != '.')
            {
                string name = path + "/" + string(p->d_name);
                files.push_back(name);
                //cout << name << endl;
                if (p->d_type == DT_DIR) {   // DT_DIR is the named constant for a directory entry
                    getAllFiles(name, files);
                }
            }
    
        }
        closedir(dir);
    
    }
    #endif
    
    
    int fin_to_s(string &str, vector<string> &files, int i) {
        ifstream infile;
        infile.open(files[i]);
        infile >> str;
        infile.close();
        return 0;
    }
    
    
    
    void word_to_word(word_count* word, word_count* word1) {
        //
        *(word1->str) = *(word->str);
    
        *(word1->num) = *(word->num);
    
        word1->num_rear = word->num_rear;
        //
        word1->str_count = word->str_count;
        //
        word1->flag = word->flag;
        word1->size = word->size;
        *(word1->word) = *(word->word);
        word1->next_ptr = word->next_ptr;
    
    }
    
    
    word_classifier::word_classifier() {
        //
        num = 0;
        str = new string();
        num_rear = new string();
        //temp = new string();
    
    }
    
    word_classifier::~word_classifier() {
        //
        delete str;
        delete num_rear;
    
    
    }
    
    void word_classifier::set(word_count* word) {
        //
        string stri;
        //word->temp = temp;
        stri = *str + *num_rear;
        *(word->str) = stri;
        *(word->word) = *str;
        word->str_count = 1;
        word->size = num;
        word->num = num_rear;
        word->num_rear = num - num_rear->size();//
        return;
    
    }
    void word_classifier::clear() {
        //
        num = 0;
    
        str->clear();
        num_rear->clear();
    
    }
    
    int word_classifier::judge(char c, word_count* word) {
        //
    
        if (c >= 'a'&&c <= 'z') c = c - 32;//
        if (c >= '0'&&c <= '9' || c <= 'Z'&&c >= 'A') {
            //*
            if (c >= '0'&&c <= '9') {
                // **
                if (num < 4) {
                    // ***
                    clear();
                    return 0;
                }
                else {
                    //
                    num_rear->append(1, c);
    
                    num++;
                    return 2;
                }
            }
            else {
                //
                if (num_rear->empty()) {
    
                    str->append(1, c);
    
                    num++;
                    return 2;
                }
                else {
    
                    str->append(*num_rear);
                    num += num_rear->size();
                    str->append(1, c);
                    num_rear->clear();
                    return 2;
                }
            }
        }
        else {
    
            if (num < 4) {
    
                clear();
                return 0;
            }
            else {
    
                set(word);
                clear();
                return 1;
            }
        }
    
    }
    
    word_count::word_count()
    {
        str = new string();
        num = new string();
        word = new string();
        next_ptr = NULL;
        size = 0;
        str_count = 0;
        num_rear = 0;
        flag = 0;
    }
    
    
    word_count::~word_count()
    {
        delete str;
    }
    
    bool operator==(const word_count& word1, const word_count& word2)
    {
    
        return (*(word1.word) == *(word2.word)) && (word1.num_rear == word2.num_rear);
    
    }
    
    int freq_count(word_count* &arr1, word_count* temp, int flag) {
    
        word_count* arr = arr1;
        if (arr == NULL) {
            //
            arr = new word_count();
            arr->str_count = -1;
        }
    
        if (arr->next_ptr == NULL)
        {
            //
            arr->next_ptr = temp;
            flag = 1;
            temp->flag = 1;
        }
        else
        {
            //
            if (temp->flag == 1) return 0;//
            word_count* parent = arr;
            word_count* change = NULL;
            arr = arr->next_ptr;
            int i = 0;//
            int t = temp->str_count;//
            int result = -100;//
            int flag_equal = -1;//
                                //
    
            while (i < 10 && arr->next_ptr != NULL) {
                //
                if (t > arr->str_count) {
                    //)    *
                    flag_equal = -1;
                    if (change == NULL)
                        change = parent;//  **
                    else {
                        //**
                        int j = change->str_count - arr->str_count;
                        if (j>0) {
                            //change  **
                            change = parent;
    
                        }
                        else if (j == 0) {
                            //j
                            int k = compare(change, arr, flag);
                            if (k == -1) change = parent;
                        }
                        //else 
                    }
    
                }//end if
                else if (t == arr->str_count) {
                    //
                    if (change == NULL) {
                        //
                        result = compare(arr, temp, flag);
                        if (result == -1) {
                            //
                            change = parent;
                            flag_equal = 1;//
                        }//end if    
                         //result=0,
                    }
                    else {
                        //
                        if (flag_equal == 1) {
                            //
                            result = compare(change, temp, flag);
                            if (result == -1) {
                                //
                                change = parent;
                            }//end if    
                        }//end if
                         //flag_equal!=1,
    
                    }//end else
    
    
                }//end else if
    
    
                i++;
                parent = arr;
                arr = arr->next_ptr;
    
            }//end while
    
            if (i<10) {
                //
                arr->next_ptr = temp;
                temp->flag = 1;
            }//end if
            else if (i == 10) {
                //10
                if (change != NULL) {
                    //change
                    temp->next_ptr = change->next_ptr->next_ptr;
                    change->next_ptr->flag = 0;
                    temp->flag = 1;
                    change->next_ptr = temp;
                }//end if
            }//end else if
        }//end else
    
        return 0;   // every path of a non-void function must return a value
    }//end freq_count
    
    int freq_countP(phrase_count* &arr1, phrase_count* temp, int flag) {
    
        phrase_count* arr = arr1;
        if (arr == NULL) {
    
            arr = new phrase_count();
            arr->phr_count = -1;
        }
    
        if (arr->next_ptr == NULL)
        {
    
            arr->next_ptr = temp;
            flag = 1;
            temp->flag = 1;
        }
        else
        {
    
            if (temp->flag == 1) return 0;
            phrase_count* parent = arr;
            phrase_count* change = NULL;
            arr = arr->next_ptr;
            int i = 0;//
            int t = temp->phr_count;//
            int para1, para2 = 0;//
            int result = -100;//
            int flag_equal = -1;
    
            while (i < 10 && arr->next_ptr != NULL) {
                //
                if (t > arr->phr_count) {
                    //    *
                    flag_equal = -1;
                    if (change == NULL)
                        change = parent;//  **
                    else {
                        //    **
                        int j = change->phr_count - arr->phr_count;
                        if (j>0) {
                            //  **
                            change = parent;
    
                        }
                        else if (j == 0) {
                            //
                            int k = compareP(change, arr, para1, para2);
                            if (k == -1) change = parent;
                        }
                        //else 
                    }
    
                }//end if
                else if (t == arr->phr_count) {
                    //
                    if (change == NULL) {
                        //
                        result = compareP(arr, temp, para1, para2);
                        if (result == -1) {
                            //
                            change = parent;
                            flag_equal = 1;//
                        }//end if    
                         //result=0,
                    }
                    else {
                        //
                        if (flag_equal == 1) {
                            //
                            result = compareP(change, temp, para1, para2);
                            if (result == -1) {
                                //
                                change = parent;
                            }//end if    
                        }//end if
    
                    }//end else
    
    
                }//end else if
    
                 //
                i++;
                parent = arr;
                arr = arr->next_ptr;
    
            }//end while
    
            if (i<10) {
                //
                arr->next_ptr = temp;
                temp->flag = 1;
            }//end if
            else if (i == 10) {
    
                if (change != NULL) {
                    //
                    temp->next_ptr = change->next_ptr->next_ptr;
                    change->next_ptr->flag = 0;
                    temp->flag = 1;
                    change->next_ptr = temp;
                }//end if
            }//end else if
        }//endelse
    
        return 0;   // every path of a non-void function must return a value
    }//end freq_countP
    
    phrase_count::phrase_count() {
        phrase1 = new word_count();
        phrase2 = new word_count();
        next_ptr = NULL;
        phr_count = 0;
    
    }
    
    phrase_count::~phrase_count() {
        delete phrase1;
        delete phrase2;
    
    }
    
    void phrase_to_phrase(phrase_count* phrase, phrase_count* phrase1) {
    
        phrase1->flag = phrase->flag;
        phrase1->next_ptr = phrase->next_ptr;
        phrase1->phr_count = phrase->phr_count;
        word_to_word((phrase->phrase1), (phrase1->phrase1));
        word_to_word((phrase->phrase2), (phrase1->phrase2));
    
    }
    
    int compare(word_count* word, word_count* word1, int& flag) {
        //
        int w0 = word->num_rear, w1 = word1->num_rear;

        flag = 0;
        string s0(*(word->str), 0, w0), s1(*(word1->str), 0, w1);
        if (s0 < s1)  return -1;//
        else if (s0 > s1) return 1;//
        else {
    
    
            if (word->num < word1->num) flag = -1;
            else if (word->num > word1->num) flag = 1;
    
            return 0;
        }
    }//end compare
    int compareP(phrase_count* phrase, phrase_count* phrase1, int& flag1, int &flag2) {
        //
        int s1, s2 = 0;
        s1 = compare(phrase->phrase1, phrase1->phrase1, flag1);
        if (s1<0) {
            return -1;
        }
        else if (s1 > 0) {
            return 1;
        }
        else {
            s2 = compare(phrase->phrase2, phrase1->phrase2, flag2);
            if (s2 < 0) {
                return -1;
            }
            else if (s2 > 0) {
                return 1;
            }
            else {
                return 0;
            }
        }
    
    }//end compareP
    
    
    int main(int argc, char* argv[])
    {
        unordered_map<string, word_count> wordmap;//Hash table for word
        unordered_map<string, phrase_count> phrasemap;//hash table for phrase
        vector<string> files;
        string path;
        if (argv[1] == NULL) {
    
            path.append("C:/test/11.txt");
        }
        else {
            path.append(argv[1]);
    
        }
    
        time_t start, stop;
        start = time(NULL);
        if (path.find(".") != string::npos) {// if the path is a file path
            files.push_back(path);
    
        }
        else {
            getAllFiles(path, files);   //get all file paths  
    
        }
    
    
    
        int size = files.size();//length of file
        int con = 1;//parameter;
        int line_count = 0;//count of line
        int char_count = 0;//count of character
        int word_all_count = 0;//count of word
        word_classifier classifier_word;//char-word analyzer
        word_count* word_temp = NULL;
        word_count* word_temp1 = NULL;
        word_count* arr = new word_count();
        word_count* word_test = NULL;//using for pointing to word_count in H_table;
        string str_test;//used to store *(word_temp->word)
        int flag = 0; //flag of judging word
        int phrase_all_count = 0;//count of phrase
        phrase_count* phrase_temp = NULL;
        phrase_count* arrp = new phrase_count();
        phrase_count* phrase_test = NULL;//using for pointing to phrase_count in H_table;
        word_temp = new word_count();
        word_temp1 = new word_count();
        phrase_temp = new phrase_count();
    
    
    
        int phr_flag = 0;//flag of phrase: whether a previous word is available to form a phrase
    
    
        char c = 0, optr = 0;//optr is a copy of c
    
        ifstream infile;//ptr of file
        for (int i = 0; i < size; i++) {
            //going to all files
            infile.open(files[i],ios::in);
            
            //judge if the path is a folder or document
            if (infile.fail()) {
                continue;}// fail,meaning a folder
            else {
                //get the length of file,and store in FileSize
                int begin = infile.tellg();
                int end = begin;
                int FileSize = 0;
                infile.seekg(0, ios_base::end);
                end = infile.tellg();
                infile.seekg(0, ios_base::beg);
                FileSize = end - begin;
                //end of getting file
                if (FileSize != 0) {
                    line_count += 1;
                    for (int j = 0; j <= FileSize; j++) {
                        //operation in each File
                        //get a char and count
                        infile.get(c);
                        if (32 <= c&&c <= 126) char_count = char_count + 1;
                        if (c == '\n') line_count = line_count + 1;
                        optr = c;
                        c = 0;// clear c,avoiding mistakes                                        
                              //end of counting char and line
                        flag = classifier_word.judge(optr, word_temp);
                        //judging if the word is ok
                        if (flag == 1) {
                            //get a word,
                            flag = 0;
                            //cout << *(word_temp->str) << endl;
                            word_all_count += 1;
                            str_test = *(word_temp->word);
                            if (wordmap.find(str_test) == wordmap.end()) {
                                //if don't exist
                                word_to_word(word_temp, &wordmap[str_test]);
                                word_test = &wordmap[str_test];
                                freq_count(arr, word_test, 1);
                            }
                            else {
                                //if exist
                                wordmap[str_test].str_count++;
                                word_test = &wordmap[str_test];
                                //change the rear
                                con = (*(word_test->num) > *(word_temp->num));
                                if (con == 1) {
                                    //word_temp has a smaller rear
                                    *(word_test->num) = *(word_temp->num);
                                    *(word_test->str) = *(word_test->word) + *(word_test->num);
    
                                }
                                freq_count(arr, word_test, 1);
    
                            }
                            if (phr_flag == 0) {
                                phr_flag = 1;
                                word_to_word(word_test, word_temp1);
                            }
                            else if (phr_flag == 1) {
    
                                phrase_temp->phrase2 = word_temp;
                                phrase_temp->phrase1 = word_temp1;
                                str_test = *(word_temp1->word) + *(word_temp->word);
                                if (phrasemap.find(str_test) == phrasemap.end()) {
                                    //if don't exist
                                    phrase_to_phrase(phrase_temp, &phrasemap[str_test]);
                                    phrasemap[str_test].phr_count = 1;
                                    phrase_test = &phrasemap[str_test];
                                    freq_countP(arrp, phrase_test, 1);
                                    word_to_word(word_temp, word_temp1);
    
                                }
                                else {
                                    //if exist
                                    phrasemap[str_test].phr_count++;
                                    phrase_test = &phrasemap[str_test];
                                    freq_countP(arrp, phrase_test, 1);
                                    word_to_word(word_temp, word_temp1);
    
                                }
                                if (optr == 0) phr_flag = 0;
                            }
    
                        }//end if else 
    
    
                    }//end for 2
    
                    
                        infile.get(optr);
    
                        if(infile.eof())infile.close();
    
                }//end if
            }//end else
    
        }//end for1
        string dist = "Result.txt";
        ofstream ofn(dist);
        ofn << "The number of character is:" << char_count << endl;
        ofn << "The number of line is:" << line_count << endl;
        ofn << "The number of word is:" << word_all_count << endl;
        word_count* ptr_temp = arr;
        phrase_count* ptr_tempp = arrp;
        word_count* q = arr;
        phrase_count*qp = arrp;
    
        int i = 0, j = 0;//
    
        ofn << "The top 10 words is:" << endl;
    
        while (ptr_temp->next_ptr != NULL&&i<10) {
    
            //print the top 10 word
            ptr_temp = ptr_temp->next_ptr;
            ofn << *(ptr_temp->str) + '\t';
            ofn << ptr_temp->str_count << endl;
            i++;
        }
        i = 0;
    
        while (ptr_tempp->next_ptr != NULL&&i<10) {
            //print the top 10 phrase
    
            ptr_tempp = ptr_tempp->next_ptr;
            ofn << *(ptr_tempp->phrase1->str) + ' ' + *(ptr_tempp->phrase2->str) + '\t';
            ofn << ptr_tempp->phr_count << endl;
            i++;
        }
        stop = time(NULL);
        ofn << "Use Time :" << stop - start << endl;
    
        infile.close();
        ofn.close();
    }//end main
Original post: https://www.cnblogs.com/ZucksLiu/p/8678520.html