WordCount with Hive Streaming

I. Write the processing scripts

The scripts are written in Python.

1. Write the map script

map.py

#!/usr/bin/env python
import sys

# Read lines from standard input and emit "word<TAB>1" for each word.
for line in sys.stdin:
    words = line.strip().split()
    for word in words:
        print("%s\t1" % word.lower())

2. Write the reduce script

reduce.py

#!/usr/bin/env python
import sys

# Each input line is "word<TAB>1"; since map.py emits one line per
# occurrence, counting lines per word gives the total count.
word_counts = dict()
for line in sys.stdin:
    word = line.split()[0]
    if word in word_counts:
        word_counts[word] += 1
    else:
        word_counts[word] = 1

# Emit "word:count"; the ':' matches the wordcount table's field delimiter.
for word in word_counts:
    print(word + ":" + str(word_counts[word]))

II. Create the tables

1. Create the docs table

hive (default)> create table docs(line string);

2. Prepare the data for the docs table

[hduser@yjt test]$ cat f.txt 
hadoop beijin
yjt hadoop test
spark hadoop shanghai
yjt
mm
jj
gg
gg
this is a beijin

3. Load the data into the docs table

hive (default)> load data local inpath '/data1/studby/test/f.txt' into table docs;
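As an optional sanity check, a quick query confirms the rows arrived (output omitted here):

hive (default)> select * from docs limit 3;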

4. Create the wordcount table to hold the final results

hive (default)> create table wordcount(word string, count int) row format delimited fields terminated by ':';

The ':' field delimiter matches the word:count format that reduce.py prints.

III. Run the job

hive (default)> from (
              >     from docs
              >     select transform(line) using "/data1/studby/test/map.py" as (word, count)
              >     cluster by word
              > ) as wc
              > insert overwrite table wordcount
              > select transform(wc.word, wc.count) using "/data1/studby/test/reduce.py" as (word, count);
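Two notes on this query. First, if the scripts are not present at the same local path on every node that runs a task, the streaming job can fail; a common remedy (an assumption here, not something the original setup necessarily required) is to ship them through the distributed cache with ADD FILE and then reference them by bare name in the using clauses:

hive (default)> add file /data1/studby/test/map.py;
hive (default)> add file /data1/studby/test/reduce.py;

Second, reduce.py emits word:count with no tab, so inside the transform it arrives as a single column and the count column written to the table is NULL; the query still appears to produce correct results because the wordcount table's ':' field delimiter splits the stored word:count string back into word and count when the table is read.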

Check the result:

hive (default)> select * from wordcount;
OK
wordcount.word	wordcount.count
a	1
mm	1
is	1
shanghai	1
beijin	2
hadoop	3
gg	2
this	1
yjt	2
jj	1
test	1
spark	1