Implementing WordCount in Hive

a. Create a database

create database word;

b. Create the table

create external table word_data(line string) row format delimited fields terminated by '\n' stored as textfile location '/home/hadoop/worddata';

Here we assume the data lives in HDFS under /home/hadoop/worddata. The directory contains plain text files of words, with contents roughly like this:

+-------------------------+--+
|     word_data.line      |
+-------------------------+--+
|                         |
| hello man               |
| what are you doing now  |
| my running              |
| hello                   |
| kevin                   |
| hi man                  |
| hadoop hive es          |
| storm hive es           |
|                         |
|                         |
+-------------------------+--+

Running the HQL above creates the table word_data. Each line of each file becomes one row, stored in the column line:

	select * from word_data;
	-- shows the following data:
+-------------------------+--+
|     word_data.line      |
+-------------------------+--+
|                         |
| hello man               |
| what are you doing now  |
| my running              |
| hello                   |
| kevin                   |
| hi man                  |
| hadoop hive es          |
| storm hive es           |
|                         |
|                         |
+-------------------------+--+
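The external-table step can be mimicked locally: every line of every file under the location directory becomes one row in the single line column. A minimal Python sketch of that mapping (the inline sample text is an assumption standing in for the HDFS files):

```python
# Simulate how the external table exposes raw text files:
# each line of the files becomes one row in the `line` column.
sample = """hello man
what are you doing now
my running
hello
kevin
hi man
hadoop hive es
storm hive es"""

# Each list element corresponds to one row of word_data.
rows = sample.split("\n")
for line in rows:
    print(line)
```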

c. Split the lines, following the MapReduce pattern

Split each line into individual words. This uses one of Hive's built-in table-generating functions (UDTF): explode(array), which takes an array and turns one row into multiple rows:

create table words(word string);
insert into table words select explode(split(line, " ")) as word from word_data;
0: jdbc:hive2://bd004:10000> select * from words;
+-------------+--+
| words.word  |
+-------------+--+
|             |
| hello       |
| man         |
| what        |
| are         |
| you         |
| doing       |
| now         |
| my          |
| running     |
| hello       |
| kevin       |
| hi          |
| man         |
| hadoop      |
| hive        |
| es          |
| storm       |
| hive        |
| es          |
|             |
|             |
+-------------+--+

split works like Java's String.split; here we split on spaces. After the HQL statement runs, the words table holds one word per row.
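The effect of explode(split(line, " ")) can be sketched in plain Python (illustration only, not Hive code): split each row on spaces, then flatten, so one input row yields one output row per word:

```python
# A few rows from word_data, as an illustrative subset
rows = ["hello man", "what are you doing now", "hello"]

# split(line, " ") -> array of words; explode -> one output row per element
words = [w for line in rows for w in line.split(" ")]
print(words)
# -> ['hello', 'man', 'what', 'are', 'you', 'doing', 'now', 'hello']
```

This flattened list is exactly what the insert … select explode(…) statement writes into the words table.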

d. Count the words

Since HQL supports group by, the final counting statement is:

select word, count(*) from word.words group by word;
-- word.words is database.table; the word in "group by word" is the column defined by create table words(word string)

+----------+------+--+
|   word   | _c1  |
+----------+------+--+
|          | 3    |
| are      | 1    |
| doing    | 1    |
| es       | 2    |
| hadoop   | 1    |
| hello    | 2    |
| hi       | 1    |
| hive     | 2    |
| kevin    | 1    |
| man      | 2    |
| my       | 1    |
| now      | 1    |
| running  | 1    |
| storm    | 1    |
| what     | 1    |
| you      | 1    |
+----------+------+--+
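The group by … count(*) aggregation is the classic reduce step. In plain Python the same counting can be done with collections.Counter over the flattened word list (a local illustration, using a subset of the data above):

```python
from collections import Counter

# Subset of the flattened words table
words = ["hello", "man", "hi", "man", "hive", "es", "hive", "es"]

# group by word + count(*) corresponds to counting occurrences per word
counts = Counter(words)
for word, cnt in sorted(counts.items()):
    print(word, cnt)
# man, hive and es each appear twice; the rest once
```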

Summary: compared with writing raw MapReduce, Hive is considerably simpler. For more complex aggregations you can build intermediate tables or views along the way.

Original article: https://www.cnblogs.com/ernst/p/12819169.html