Implementing WordCount in Hive

a. Create a database

create database word;

b. Create the table

create external table word_data(line string) row format delimited fields terminated by '\n' stored as textfile location '/home/hadoop/worddata';

Here we assume the data lives in HDFS under /home/hadoop/worddata. The directory contains plain text files of words, with contents roughly like this:

+-------------------------+--+
|     word_data.line      |
+-------------------------+--+
|                         |
| hello man               |
| what are you doing now  |
| my running              |
| hello                   |
| kevin                   |
| hi man                  |
| hadoop hive es          |
| storm hive es           |
|                         |
|                         |
+-------------------------+--+

Running the HQL above creates the table word_data. Each line of each file becomes one row, stored in the column line:

	select * from word_data;
	-- shows the following data:
+-------------------------+--+
|     word_data.line      |
+-------------------------+--+
|                         |
| hello man               |
| what are you doing now  |
| my running              |
| hello                   |
| kevin                   |
| hi man                  |
| hadoop hive es          |
| storm hive es           |
|                         |
|                         |
+-------------------------+--+
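The external-table step can be mimicked locally: every line of every file under the location directory becomes one row in the single line column. A minimal Python sketch of that mapping (the inline sample text is an assumption standing in for the HDFS files):

```python
# Simulate how the external table exposes raw text files:
# each line of the files becomes one row in the `line` column.
sample = """hello man
what are you doing now
my running
hello
kevin
hi man
hadoop hive es
storm hive es"""

# Each list element corresponds to one row of word_data.
rows = sample.split("\n")
for line in rows:
    print(line)
```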

c. Split the lines, following the MapReduce pattern

Split each line into individual words. This uses one of Hive's built-in table-generating functions (UDTF): explode(array), which takes an array and turns one row into multiple rows:

create table words(word string);
insert into table words select explode(split(line, " ")) as word from word_data;
0: jdbc:hive2://bd004:10000> select * from words;
+-------------+--+
| words.word  |
+-------------+--+
|             |
| hello       |
| man         |
| what        |
| are         |
| you         |
| doing       |
| now         |
| my          |
| running     |
| hello       |
| kevin       |
| hi          |
| man         |
| hadoop      |
| hive        |
| es          |
| storm       |
| hive        |
| es          |
|             |
|             |
+-------------+--+

split works like Java's String.split; here we split on spaces. After the HQL statement runs, the words table holds one word per row.
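The effect of explode(split(line, " ")) can be sketched in plain Python (illustration only, not Hive code): split each row on spaces, then flatten, so one input row yields one output row per word:

```python
# A few rows from word_data, as an illustrative subset
rows = ["hello man", "what are you doing now", "hello"]

# split(line, " ") -> array of words; explode -> one output row per element
words = [w for line in rows for w in line.split(" ")]
print(words)
# -> ['hello', 'man', 'what', 'are', 'you', 'doing', 'now', 'hello']
```

This flattened list is exactly what the insert … select explode(…) statement writes into the words table.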

d. Count the words

Since HQL supports group by, the final counting statement is:

select word, count(*) from word.words group by word;
-- word.words is database.table; the word in "group by word" is the column defined by create table words(word string)

+----------+------+--+
|   word   | _c1  |
+----------+------+--+
|          | 3    |
| are      | 1    |
| doing    | 1    |
| es       | 2    |
| hadoop   | 1    |
| hello    | 2    |
| hi       | 1    |
| hive     | 2    |
| kevin    | 1    |
| man      | 2    |
| my       | 1    |
| now      | 1    |
| running  | 1    |
| storm    | 1    |
| what     | 1    |
| you      | 1    |
+----------+------+--+
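The group by … count(*) aggregation is the classic reduce step. In plain Python the same counting can be done with collections.Counter over the flattened word list (a local illustration, using a subset of the data above):

```python
from collections import Counter

# Subset of the flattened words table
words = ["hello", "man", "hi", "man", "hive", "es", "hive", "es"]

# group by word + count(*) corresponds to counting occurrences per word
counts = Counter(words)
for word, cnt in sorted(counts.items()):
    print(word, cnt)
# man, hive and es each appear twice; the rest once
```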

Summary: compared with writing raw MapReduce, Hive is considerably simpler. For more complex aggregations you can build intermediate tables or views along the way.

Original article: https://www.cnblogs.com/ernst/p/12819169.html