Hive 求出连续三天有销售记录的店铺

原始数据

A,2017-10-11,300
A,2017-10-12,200
A,2017-10-13,100
A,2017-10-15,100
A,2017-10-16,300
A,2017-10-17,150
A,2017-10-18,340
A,2017-10-19,360
B,2017-10-11,400
B,2017-10-12,200
B,2017-10-15,600
C,2017-10-11,350
C,2017-10-13,250
C,2017-10-14,300
C,2017-10-15,400
C,2017-10-16,200
D,2017-10-13,500
E,2017-10-14,600
E,2017-10-15,500
D,2017-10-14,600

求出连续三天有销售记录的店铺,给每个用户一个编号,用日期减去编号,如果是同一天,那么就是连续的 。

A,2017-10-11,300,1,2017-10-10
A,2017-10-12,200,2,2017-10-10
A,2017-10-13,100,3,2017-10-10
A,2017-10-15,100,4,2017-10-11
A,2017-10-16,300,5,2017-10-11
A,2017-10-17,150,6,2017-10-11
A,2017-10-18,340,7,2017-10-11
A,2017-10-19,360,8,2017-10-11
B,2017-10-11,400
B,2017-10-12,200
B,2017-10-15,600
C,2017-10-11,350
C,2017-10-13,250
C,2017-10-14,300
C,2017-10-15,400
C,2017-10-16,200
D,2017-10-13,500
E,2017-10-14,600
E,2017-10-15,500
D,2017-10-14,600

step1 建表 并加载数据

hive (default)> drop table if exists t_jd;
OK
Time taken: 8.41 seconds
hive (default)> create table t_jd(shopid string,dt string,sale int)
              > row format delimited fields terminated by ',';
OK
Time taken: 0.119 seconds
hive (default)> load data local inpath '/opt/module/data/sales.txt' into table t_jd;
Loading data to table default.t_jd
Table default.t_jd stats: [numFiles=1, totalSize=363]
OK
Time taken: 1.811 seconds

下面在hive中执行的代码 省略了日志

step2 每一条记录按照店铺id分区 按照日期全局排序 row_number()编号

hive (default)> select shopid,dt,sale,
              > row_number() over(partition by shopid order by dt) as rn
              > from t_jd;
shopid  dt      sale    rn
A       2017-10-11      300     1
A       2017-10-12      200     2
A       2017-10-13      100     3
A       2017-10-15      100     4
A       2017-10-16      300     5
A       2017-10-17      150     6
A       2017-10-18      340     7
A       2017-10-19      360     8
B       2017-10-11      400     1
B       2017-10-12      200     2
B       2017-10-15      600     3
C       2017-10-11      350     1
C       2017-10-13      250     2
C       2017-10-14      300     3
C       2017-10-15      400     4
C       2017-10-16      200     5
D       2017-10-13      500     1
D       2017-10-14      600     2
E       2017-10-14      600     1
E       2017-10-15      500     2

step3 根据编号 生成连续日期

hive (default)> select shopid,dt,sale,rn,
              > date_sub(to_date(dt),rn)
              > from
              > (select shopid,dt,sale,
              > row_number() over(partition by shopid order by dt) as rn
              > from t_jd) tmp;

shopid  dt      sale    rn      _c4
A       2017-10-11      300     1       2017-10-10
A       2017-10-12      200     2       2017-10-10
A       2017-10-13      100     3       2017-10-10
A       2017-10-15      100     4       2017-10-11
A       2017-10-16      300     5       2017-10-11
A       2017-10-17      150     6       2017-10-11
A       2017-10-18      340     7       2017-10-11
A       2017-10-19      360     8       2017-10-11
B       2017-10-11      400     1       2017-10-10
B       2017-10-12      200     2       2017-10-10
B       2017-10-15      600     3       2017-10-12
C       2017-10-11      350     1       2017-10-10
C       2017-10-13      250     2       2017-10-11
C       2017-10-14      300     3       2017-10-11
C       2017-10-15      400     4       2017-10-11
C       2017-10-16      200     5       2017-10-11
D       2017-10-13      500     1       2017-10-12
D       2017-10-14      600     2       2017-10-12
E       2017-10-14      600     1       2017-10-13
E       2017-10-15      500     2       2017-10-13

step4 分组 求count

hive (default)> select shopid,count(1) as cnt
              > from
              > (select shopid,dt,sale,rn,
              > date_sub(to_date(dt),rn) as flag
              > from
              > (select shopid,dt,sale,
              > row_number() over(partition by shopid order by dt) as rn
              > from t_jd) tmp) tmp2
              > group by shopid,flag;

shopid  cnt
A       3
A       5
B       2
B       1
C       1
C       4
D       2
E       2

step5 筛选出连续天数大于等于3的

hive (default)> select shopid from
              > (select shopid,count(1) as cnt
              > from
              > (select shopid,dt,sale,rn,
              > date_sub(to_date(dt),rn) as flag
              > from
              > (select shopid,dt,sale,
              > row_number() over(partition by shopid order by dt) as rn
              > from t_jd) tmp) tmp2
              > group by shopid,flag) tmp3
              > where tmp3.cnt>=3;

shopid
A
A
C

step6 去重

hive (default)> select distinct shopid from
              > (select shopid,count(1) as cnt
              > from
              > (select shopid,dt,sale,rn,
              > date_sub(to_date(dt),rn) as flag
              > from
              > (select shopid,dt,sale,
              > row_number() over(partition by shopid order by dt) as rn
              > from t_jd) tmp) tmp2
              > group by shopid,flag) tmp3
              > where tmp3.cnt>=3;

shopid
A
C
原文地址:https://www.cnblogs.com/eugene0/p/13302958.html