电商离线数仓项目实战(下)--第五阶段模块二作业

题目

1.用拉链表实现核心交易分析中DIM层商家维表,并实现该拉链表的回滚(自己构造数据,编写SQL,并要有相应的文字说明);

2. 在会员分析中计算 沉默会员数流失会员数

    沉默会员的定义:只在安装当天启动过App,而且安装时间是在7天前
    流失会员的定义:最近30天未登录的会员

3. 在核心交易分析中完成如下指标的计算
    统计2020年每个季度的销售订单笔数、订单总额
    统计2020年每个月的销售订单笔数、订单总额
    统计2020年每周(周一到周日)的销售订单笔数、订单总额
    统计2020年国家法定节假日、休息日、工作日的订单笔数、订单总额

详解(1)

一、创建表加载数据(准备工作)

1.创建商家信息表--拉链表(dim层)

特点:

  • 相较于商家维表,多了两个字段startdate,enddate
  • 拉链表不是分区表
DROP TABLE IF EXISTS `dim.dim_trade_shops`;
CREATE EXTERNAL TABLE `dim.dim_trade_shops`(
`shopid` int,
`userid` int,
`areaid` int,
`shopname` string,
`shoplevel` tinyint,
`status` tinyint,
`createtime` string,
`modifytime` string,
`startdate` string,
`enddate` string
)COMMENT
'商家店铺表';

2.准备三天的数据

/root/data/shops/shop-2020-11-20.dat
100050,1,100225,WSxxx营超市,1,1,2020-06-28,2020-11-20 13:22:22
100052,2,100236,新鲜xxx旗舰店,1,1,2020-06-28,2020-11-20 13:22:22
100053,3,100011,华为xxx旗舰店,1,1,2020-06-28,2020-11-20 13:22:22
100054,4,100159,小米xxx旗舰店,1,1,2020-06-28,2020-11-20 13:22:22
100055,5,100211,苹果xxx旗舰店,1,1,2020-06-28,2020-11-20 13:22:22


/root/data/shops/shop-2020-11-21.dat
100057,7,100311,三只xxx鼠零食,1,1,2020-06-28,2020-11-21 13:22:22
100058,8,100329,良子xxx铺美食,1,1,2020-06-28,2020-11-21 13:22:22
100054,4,100159,小米xxx旗舰店,2,1,2020-06-28,2020-11-21 13:22:22
100055,5,100211,苹果xxx旗舰店,2,1,2020-06-28,2020-11-21 13:22:22


/root/data/shops/shop-2020-11-22.dat
100059,9,100225,乐居xxx日用品,1,1,2020-06-28,2020-11-22 13:22:22
100060,10,100211,同仁xxx大健康,1,1,2020-06-28,2020-11-22 13:22:22
100052,2,100236,新鲜xxx旗舰店,1,2,2020-06-28,2020-11-22 13:22:22

3.创建商家信息维表(ods层)

DROP TABLE IF EXISTS `ods.ods_trade_shops`;
CREATE EXTERNAL TABLE `ods.ods_trade_shops`(
`shopid` int,
`userid` int,
`areaid` int,
`shopname` string,
`shoplevel` tinyint,
`status` tinyint,
`createtime` string,
`modifytime` string
)
COMMENT '商家店铺表'
PARTITIONED BY (`dt` string)
row format delimited fields terminated by ','
location '/user/data/trade.db/shops';

二、拉链表的实现

  • ods_trade_shops(分区表) => 每日变更的数据(修改的+新增的) / 历史数据(第一天)
  • dim_trade_shops(拉链表)=> 多了两个字段 startdate 、 enddate

1、ods_trade_shops 初始化(2020-11-20)。获取历史数据

load data local inpath '/root/data/shops/shop-2020-11-20.dat' into table ods.ods_trade_shops partition(dt = '2020-11-20');

100050
,1,100225,WSxxx营超市,1,1,2020-06-28,2020-11-20 13:22:22 100052,2,100236,新鲜xxx旗舰店,1,1,2020-06-28,2020-11-20 13:22:22 100053,3,100011,华为xxx旗舰店,1,1,2020-06-28,2020-11-20 13:22:22 100054,4,100159,小米xxx旗舰店,1,1,2020-06-28,2020-11-20 13:22:22 100055,5,100211,苹果xxx旗舰店,1,1,2020-06-28,2020-11-20 13:22:22

2.初始化拉链表(2020-11-20)。ods_trade_shops => dim_trade_shops

insert overwrite table dim.dim_trade_shops
select shopid,userid,areaid,shopname,shoplevel,status,createtime,modifytime,
       dt as startdate,
       '9999-12-31' as enddate
   from ods.ods_trade_shops
where dt = '2020-11-20';

3.导入2020-11-21数据,构建拉链表


load data local inpath '/root/data/shops/shop-2020-11-21.dat' into table ods.ods_trade_shops partition(dt = '2020-11-21');

100057
,7,100311,三只xxx鼠零食,1,1,2020-06-28,2020-11-21 13:22:22
100058,8,100329,良子xxx铺美食,1,1,2020-06-28,2020-11-21 13:22:22
100054,4,100159,小米xxx旗舰店,2,1,2020-06-28,2020-11-21 13:22:22
100055,5,100211,苹果xxx旗舰店,2,1,2020-06-28,2020-11-21 13:22:22

4.构建拉链表(新增数据 + 历史数据)

新增数据处理逻辑与家在历史数据相似,startdate=dt enddate = '9999-12-31'
第一步:新增数据的处理办法

select shopid, userid, areaid, shopname, shoplevel, status, createtime, modifytime,
       dt as startdate,
'9999-12-31' AS enddate
from ods.ods_trade_shops
where dt = '2020-11-21';

第二步:历史数据的处理办法

select b.shopid, b.userid ,b.areaid , b.shopname, b.shoplevel,b.status, b.createtime, b.modifytime, 
b.startdate,
CASE WHEN a.shopid is not null and b.enddate ='9999-12-31' THEN date_add('2020-11-21',-1) ELSE b.enddate
end as enddate from (select *
from ods.ods_trade_shops
where dt='2020-11-21') a
right join dim.dim_trade_shops b
on a.shopid = b.shopid;

综上,拉链表的最终处理(新增+历史数据)

insert overwrite table dim.dim_trade_shops
select shopid, userid, areaid, shopname, shoplevel, status, createtime, modifytime,
       dt as startdate,
       '9999-12-31' AS enddate 
   from ods.ods_trade_shops
 where dt = '2020-11-21'
union all
select b.shopid, b.userid ,b.areaid , b.shopname, b.shoplevel,b.status, b.createtime, b.modifytime, b.startdate, CASE WHEN a.shopid is not null and b.enddate ='9999-12-31' THEN date_add('2020-11-21',-1) ELSE b.enddate end as enddate from (select * from ods.ods_trade_shops where dt='2020-11-21') a right join dim.dim_trade_shops b on a.shopid = b.shopid;

处理拉链表的脚本:shopszipper.sh

#!/bin/bash

source /etc/profile
if [ -n "$1" ] ;then
    do_date=$1
else
    do_date=`date -d "-1 day" +%F`
fi

sql="
insert overwrite table dim.dim_trade_shops
select shopid, userid, areaid, shopname, shoplevel, status, createtime, modifytime,
       dt as startdate,
       '9999-12-31' AS enddate 
   from ods.ods_trade_shops
 where dt = '$do_date'

union all

select b.shopid, b.userid ,b.areaid , b.shopname, b.shoplevel,b.status, b.createtime, b.modifytime, 
       b.startdate,
          CASE WHEN a.shopid is not null and b.enddate ='9999-12-31' 
               THEN date_add('$do_date',-1) 
               ELSE b.enddate 
          end as enddate
 from (select * 
          from ods.ods_trade_shops 
         where dt='$do_date') a 
right join dim.dim_trade_shops b 
on a.shopid = b.shopid;
"

hive -e "$sql"

5.结果验证

ods维表初始化(导入2020-11-20的数据)

ods维表导入2020-11-21日的数据

拉链表更新后(维表导入2020-11-21数据,并执行拉链表的操作后的结果)

 三、拉链表的回滚

回滚逻辑:将拉链表恢复到rollback_date那一天的数据

  • end_date < rollback_date,即结束日期 < 回滚日期。表示该行数据在rollback_date 之前产生,这些数据需要原样保留
  • start_date <= rollback_date <= end_date,即开始日期 <= 回滚日期 <= 结束日期。这些数据是回滚日期之后产生的,但是需要修改。将end_date 改为 9999-12-31
  • 其他数据不用管

例如,回滚到2020-11-21的数据

1.处理end_date < rollback_date 的数据:保留

select shopid, userid, areaid, shopname, shoplevel, status, createtime, modifytime,
       startdate,
       enddate 
1 as tag
from dim.dim_trade_shops where enddate < '2020-11-21';

2.处理start_date < rollback_date < end_date 的数据:设置enddate=9999-12-31

select shopid, userid, areaid, shopname, shoplevel, status, createtime, modifytime,
       startdate,
       enddate 
       2 as tag
   from dim.dim_trade_shops
 where startdate <= '2020-11-21' and '2020-11-21' <= enddate;

3.将数据写入拉链表

drop table tmp.test;
create table tmp.test as
select shopid, userid, areaid, shopname, shoplevel, status, createtime, modifytime,
       startdate,
       enddate, 
       1 as tag
   from dim.dim_trade_shops
 where enddate < '2020-11-21'union all

select shopid, userid, areaid, shopname, shoplevel, status, createtime, modifytime,
       startdate,
       enddate,
       2 as tag
   from dim.dim_trade_shops
 where startdate <= '2020-11-21' and '2020-11-21' <= enddate;

4.结果验证

回滚的表:

 原有的表:

 

详解(2)

2. 在会员分析中计算沉默会员数和流失会员数

沉默会员的定义:只在安装当天启动过App,而且安装时间是在7天前

select count(*) clientNum 
    from dws.dws_member_start_day 
    where dt <= date_add(current_date,-7) 
    and device_id in(
        select device_id 
            from (
                select device_id, count(*) cnt 
                    from dws.dws_member_start_day 
                    group by device_id 
                    having cnt = 1
                )tmp
        );

流失会员的定义:最近30天未登录的会员

select count(distinct device_id) lossNum 
    from dws.dws_member_start_day 
    where device_id not in (
        select distinct device_id 
        from dws.dws_member_start_day 
        where dt >=date_add(current_date,30) tem
    );

详解(3)

3. 在核心交易分析中完成如下指标的计算

分析:以createTime来确定当前订单属于哪个季度,哪个月,哪个周,维护一张节假日,休息日,工作日的配置表,判断属于哪个特殊日期,计算orderId的数量,计算totalMoney的总和

create table dwd.dwd_trade_orders(
`orderId` int,
`orderNo` string,
`userId` bigint,
`status` tinyint,
`productMoney` decimal,
`totalMoney` decimal,
`payMethod` tinyint,
`isPay` tinyint,
`areaId` int,
`tradeSrc` tinyint,
`tradeType` int,
`isRefund` tinyint,
`dataFlag` tinyint,
`createTime` string,
`payTime` string,
`modifiedTime` string,
`start_date` string,
`end_date` string
) COMMENT '订单事实拉链表'
partitioned by (dt string)
STORED AS PARQUET;
create table special_day(
`special_day_name` string,
`start_date` string,
`end_date` string
) COMMENT '订单事实拉链表'

    统计2020年每个季度的销售订单笔数、订单总额

select (weekofyear(createTime)/3.1)+1,count(orderId),sum(totalMoney)
    from dwd_trade_orders
  group by (weekofyear(createTime)/3.1)+1
 where year(createTime)='2020'

    统计2020年每个月的销售订单笔数、订单总额

select month(createTime),count(orderId),sum(totalMoney)
    from dwd_trade_orders
   group by month(createTime)
 where year(createTime)='2020'


    统计2020年每周(周一到周日)的销售订单笔数、订单总额

select weekofyear(createTime),count(orderId),sum(totalMoney)
    from dwd_trade_orders
   group by weekofyear(createTime)
 where year(createTime)='2020'

    统计2020年国家法定节假日、休息日、工作日的订单笔数、订单总额

select special_day_name,count(orderId),sum(totalMoney)
  from(
     select a.orderId,a.totalMoney,b.special_day_name
          from dwd_trade_orders a 
         right join special_day b 
        on b.start_date<=a.createTime<=end_date
        ) t1
  group by special_day_name
 where year(createTime)='2020';
原文地址:https://www.cnblogs.com/aloneme/p/15098777.html