Methods for Row-to-Column Transformation in Hive

1. Method 1: with (CTE)

Task: for each store, compute the 2017 monthly average spend per customer (store revenue / number of unique customers in that store).
Month-over-month: each month is compared with the previous month.
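The queries below assume a source table roughly like the following sketch; the column names come from the queries themselves, while the types are assumptions:

-- Hypothetical DDL for the source table (types are assumptions)
create table if not exists transaction_details (
    customer_name string,
    price         double,   -- transaction amount
    `date`        string,   -- 'yyyy-MM-dd'; backticked since DATE can be a reserved word
    store_id      string
);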

with
tab as (
    select
        date, store_id, avgCusPrice - lastAvgPrice priceDiff
    from
    (
        select
            date, store_id, avgCusPrice,
            -- previous month's average for the same store; the first month falls back to 0
            lag(avgCusPrice, 1, 0) over (partition by store_id order by date) lastAvgPrice
        from
        (
            -- average spend per customer = store revenue / unique customer count
            select totalprice / cusCount avgCusPrice, date, store_id
            from
            (
                select count(distinct customer_name) cusCount, sum(price) totalprice, date, store_id
                from
                (
                    -- truncate the transaction date to month granularity ('yyyy-MM')
                    select customer_name, price,
                        from_unixtime(unix_timestamp(date, 'yyyy-MM-dd'), 'yyyy-MM') date, store_id
                    from transaction_details
                ) T
                group by date, store_id
            ) T
        ) T
    ) T
)
-- pivot: join the CTE to itself once per month so each month's diff becomes a column
select
    t4.store_id,
    t4.priceDiff as diff4,
    t5.priceDiff as diff5,
    t6.priceDiff as diff6
from (
    select
        store_id, priceDiff
    from tab
    where
        date = '2017-04'
) t4
inner join tab t5 on t4.store_id = t5.store_id and t5.date = '2017-05'
inner join tab t6 on t4.store_id = t6.store_id and t6.date = '2017-06'
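As a side note, here is a minimal, self-contained illustration of how lag(col, 1, 0) backfills the first row of each partition with 0; the table and its values are hypothetical, built inline with stack():

select store_id, mon, v,
       lag(v, 1, 0) over (partition by store_id order by mon) prev_v
from (
    select stack(3,
        's1', '2017-04', 10.0,
        's1', '2017-05', 12.0,
        's1', '2017-06', 9.0) as (store_id, mon, v)
) t;
-- s1  2017-04  10.0   0.0   <- first month falls back to the lag default 0
-- s1  2017-05  12.0  10.0
-- s1  2017-06   9.0  12.0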

2. Method 2: case when

Example 1

-- pivot with case when: a single scan of the data instead of repeated self-joins
select store_id,
    max(case date when '2017-04' then priceDiff else null end) as `4月diff`,
    max(case date when '2017-05' then priceDiff else null end) as `5月diff`,
    max(case date when '2017-06' then priceDiff else null end) as `6月diff`,
    max(case date when '2017-07' then priceDiff else null end) as `7月diff`
from
(
    select
        date, store_id, avgCusPrice - lastAvgPrice priceDiff
    from
    (
        select
            date, store_id, avgCusPrice,
            lag(avgCusPrice, 1, 0) over (partition by store_id order by date) lastAvgPrice
        from
        (
            select totalprice / cusCount avgCusPrice, date, store_id
            from
            (
                select count(distinct customer_name) cusCount, sum(price) totalprice, date, store_id
                from
                (
                    select customer_name, price,
                        from_unixtime(unix_timestamp(date, 'yyyy-MM-dd'), 'yyyy-MM') date, store_id
                    from transaction_details
                ) T
                group by date, store_id
            ) T
        ) T
    ) T
) T
group by store_id
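Stripped of the business logic, the pattern looks like the sketch below; t, mon and v are hypothetical names. Each case branch is non-null for exactly one row inside a store_id group, and max() picks out that value:

-- one row per (store_id, mon) in; one row per store_id out
select store_id,
       max(case mon when '2017-04' then v end) as apr,
       max(case mon when '2017-05' then v end) as may
from t
group by store_id;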

Example 2

ex_exam_record has student_id, question_id and score, where score is one of 1, 0, 0.5, meaning right, error, half. For each student_id, produce the set of question_ids in each of the right, error and half categories.
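A sketch of the input and the target shape (all values hypothetical):

-- ex_exam_record                       desired output
-- student_id  question_id  score       student_id  right  half  error
-- 1001        q1           1.0         1001        q1     q3    q2
-- 1001        q2           0.0
-- 1001        q3           0.5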

Idea 1

select
    student_id,
    -- collect_list skips NULLs, so each outer group keeps exactly one string per category;
    -- `right` must be backticked because RIGHT is a reserved word in Hive
    concat_ws('', collect_list(t.`right`)) as `right`,
    concat_ws('', collect_list(t.`half`)) as `half`,
    concat_ws('', collect_list(t.`error`)) as `error`
from
(
    -- one row per (student_id, score); the matching branch holds that score's id list
    select
        student_id,
        case when score = 1.0 then concat_ws(',', collect_list(question_id)) else null end `right`,
        case when score = 0.5 then concat_ws(',', collect_list(question_id)) else null end `half`,
        case when score = 0.0 then concat_ws(',', collect_list(question_id)) else null end `error`
    from ex_exam_record
    group by student_id, score
) t
group by student_id;
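This works because both collect_list and concat_ws drop NULL inputs, so the outer query keeps only the non-null category strings per student. A minimal demonstration with hypothetical inline values:

select concat_ws(',', collect_list(c))
from (select stack(3, 'q1,q2', cast(null as string), 'q3') as (c)) t;
-- -> q1,q2,q3  (the NULL element is skipped; element order is not guaranteed)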

Idea 2

select
    student_id,
    concat_ws(',', collect_list(`right`)) as `right`,
    concat_ws(',', collect_list(`half`)) as `half`,
    concat_ws(',', collect_list(`error`)) as `error`
from
(
    -- group by (student_id, question_id): each question lands in exactly one category column
    select
        student_id,
        max(case score when 0 then question_id else null end) as `error`,
        max(case score when 1 then question_id else null end) as `right`,
        max(case score when 0.5 then question_id else null end) as `half`
    from ex_exam_record
    group by student_id, question_id
) T
group by student_id

A related aside: sum(case when …) as a conditional counter.

select
    ddd.d_date,
    -- orders within the two days up to 2018-10-20, counted per date
    sum(case when datediff('2018-10-20', ddd.d_date) <= 1 then 1 else 0 end) two_count
from dw_sales_source.dwd_fact_sales_order dfo
inner join dwd_dim_date ddd on dfo.date_sk = ddd.date_sk
inner join dwd_dim_customer ddc on dfo.customer_sk = ddc.customer_sk
inner join dwd_dim_product ddp on dfo.product_sk = ddp.product_sk
where ddd.d_date >= '2018-10-19' and ddd.d_date <= '2018-10-20'
group by
    ddd.d_date;

select
    ddd.d_date,
    -- without sum(): only a 0/1 flag per group, not an order count
    case when datediff('2018-10-20', ddd.d_date) <= 1 then 1 else 0 end two_count
from dw_sales_source.dwd_fact_sales_order dfo
inner join dwd_dim_date ddd on dfo.date_sk = ddd.date_sk
inner join dwd_dim_customer ddc on dfo.customer_sk = ddc.customer_sk
inner join dwd_dim_product ddp on dfo.product_sk = ddp.product_sk
where ddd.d_date >= '2018-10-19' and ddd.d_date <= '2018-10-20'
group by
    ddd.d_date;

Because the d_date used inside the case when also appears in the group by, per-row values like ('10-19', 1), ('10-19', 1) can be added up within each group, turning the flag into "how many times 10-19 occurs" — hence the sum() in the first query.
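A toy run of the same mechanism on three hypothetical rows (condition simplified to an equality check):

select d, sum(case when d = '2018-10-19' then 1 else 0 end) cnt
from (select stack(3, '2018-10-19', '2018-10-19', '2018-10-20') as (d)) t
group by d;
-- 2018-10-19  2   <- the two per-row 1s are summed inside the group
-- 2018-10-20  0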

Idea 3

Can grouping by student_id alone classify out the per-category question_id sets?

Answer: no. With only student_id in the group by, the max(case …) trick from Idea 2 keeps a single question_id per category instead of the whole set; the inner group by student_id, question_id is what preserves one row per question.
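A sketch of what goes wrong when only student_id is grouped (same table, Idea 2's trick):

select student_id,
       -- at most ONE id survives per category: max() collapses the set
       max(case score when 1 then question_id else null end) as `right`
from ex_exam_record
group by student_id;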

3. Method 3: lateral view explode

// load the user_friends CSV from HDFS; each row holds a user id plus one
// space-separated string of friend ids
val friends = spark.read.format("csv").option("header","true").load("hdfs://192.168.56.111:9000/party/data/user_friends.csv")
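For reference, the assumed shape of user_friends.csv (the header row matches option("header","true"); the ids are hypothetical). Note the double space in the friends field, which split(' ') turns into an empty token:

user,friends
1000001,2000001 2000002  2000003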

Approach 1: lateral view explode in Hive SQL

friends.createOrReplaceTempView("friends")
spark.sql("""
select count(1)
from
(
    select distinct user, friends_1
    from friends
    LATERAL VIEW explode(split(friends, ' ')) f as friends_1
    -- splitting on a space turns the text between two consecutive spaces into "",
    -- so empty tokens have to be filtered out
    where friends_1 is not null and friends_1 != ""
) T
""").show()
>>> 30386387

Approach 2: explode via the Spark SQL DataFrame API

import spark.implicits._                                // for the $"col" syntax
import org.apache.spark.sql.functions.{explode, split}  // DataFrame API explode/split

friends.select($"user".alias("user_id"), explode(split($"friends", " ")).alias("friends_1")).filter("friends_1 is not null").distinct.count
>>> 30386387

Both approaches return the same count.

Original post: https://www.cnblogs.com/sabertobih/p/13589760.html