Hive common functions and table creation

1. Partitioned tables:
    Shown as directories in HDFS.

    One of the optimization techniques: they avoid full table scans, e.g. select * from xxx where province='beijing'


Metadata:  stored in a relational database
Database:  a directory
Table:     a directory
Partition: a directory
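
A minimal sketch of creating a partitioned table, to make the following commands concrete (the table and column names are illustrative, not from the original notes):
    create table user_part(id int, name string, age int)
    partitioned by (province string, city string)
    row format delimited fields terminated by '\t';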


Adding a partition:
    alter table xxx add partition(province='beijing', city='beijing');

Dynamic partition insert: load an unpartitioned table into a partitioned table
    //enable the non-strict mode for dynamic partitioning
    set hive.exec.dynamic.partition.mode=nonstrict;
    insert into xxx partition(province,city) select * from xxx;
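
    A sketch of the full workflow, assuming user0 is the unpartitioned source and user_part is the partitioned target from the sketch above. Note that the dynamic partition columns must come last in the SELECT, in the order they are declared:
        set hive.exec.dynamic.partition=true;             -- usually already the default on recent versions
        set hive.exec.dynamic.partition.mode=nonstrict;
        insert into user_part partition(province, city)
        select id, name, age, province, city from user0;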


join: choosing between a map-side join and a reduce-side join
    join hint: written inside the SQL statement, it declares the small table that should be map-joined
        /*+ MAPJOIN(user0) */        //declares which table to load into memory

        select /*+MAPJOIN(user0) */ a.id,a.name,b.orderno,b.price from user0 a left outer join orders b on a.id=b.uid;


    Automatic map-side join:
        enabled by default
          <property>
            <name>hive.auto.convert.join</name>
            <value>true</value>
            <description>One of the optimization techniques: automatically turns a reduce-side join into a map-side join, based on file size</description>
          </property>

        Disabling the automatic map-side join:
           set hive.auto.convert.join=false;
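
        A related knob, added here as a hedged note: the threshold (in bytes) under which a table is considered small enough for the automatic map join; 25 MB is the usual default.
           set hive.mapjoin.smalltable.filesize=25000000;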


bucket: bucketed tables
=====================================

    Creating a bucketed table is roughly equivalent to specifying the number of partitions (here 2) plus a hash partitioner in MapReduce.

    Each bucket is stored as a file.

    Too many partitions would produce a very large number of directories,

    so a table can instead be stored as a set of bucket files.

    The number of buckets is fixed when the table is created; when bucketing by id, rows are assigned to buckets by the hash of id.

Creating a bucketed table:
    create table user1(id int, name string, age int, province string, city string)
    clustered by(province) into 2 buckets
    row format delimited fields terminated by ' ';

Inserting data dynamically:
    insert into user1 select * from user0;
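
    A hedged note: on older Hive versions (before 2.x) bucketing is not enforced by default, so the insert may not produce one file per bucket unless the following is set first:
        set hive.enforce.bucketing=true;
        insert into user1 select * from user0;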


Creating a partitioned and bucketed table:
    partitioned by: province+city
    bucketed by: id

    create table user2(id int, name string, age int ) partitioned by (province string , city string)
    clustered by(id) into 2 buckets
    row format delimited fields terminated by ' ';

Inserting data dynamically:
    set  hive.exec.dynamic.partition.mode=nonstrict ;
    insert into user2 partition(province,city) select * from user0;

Inserting multiple rows into a partitioned table:
    INSERT into TABLE user2 PARTITION (province='henan', city='kaifeng') select id,name,age FROM user0;

Inserting a single row into a partitioned table:
    INSERT into TABLE user2 PARTITION (province='henan', city='kaifeng') select 0,'jerry',20;

For files such as 000000_0_copy_1, the trailing number increases by one with every insert operation.
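
A sketch of how the bucketing pays off at query time: TABLESAMPLE can read just one of the two buckets instead of scanning the whole table.
    select * from user2 tablesample(bucket 1 out of 2 on id);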



Common functions:
==================================
    select size(arr) from xxx;            //length of a collection (map, array)
    select array_contains(arr,'str') from xxx;    //whether the array contains the given element, returns boolean
    select arr[0] from xxx;
    select array(1,2,3) as arr;            //build an array in a select
    select map(1,'tom',2,'tomas',3,'tomson') as map;//build a map in a select
    select struct('tom','tomas','tomson') as str;    //build a struct in a select    //default keys col1, col2, col3
    select named_struct('id',1,'name','tomas') as str;    //custom keys
    select current_date;
    select current_timestamp;

    select date_format('2018-11-22','yyyy/MM/dd');    //change the date format
    select cast(date_format(current_date,'yyyy') as int);
    select cast(date_format(current_timestamp,'yyyy') as int);
    select from_unixtime(123456789,'yyyyMMdd');    //format a unix timestamp (in seconds) as a date string; the resolution is at most seconds
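
    A related sketch (not in the original notes): unix_timestamp is the inverse of from_unixtime, so the two can be combined to round-trip between epoch seconds and formatted strings.
    select unix_timestamp();                            //current time as seconds since the epoch
    select unix_timestamp('2018-11-22 10:00:00');        //parse a string into epoch seconds
    select from_unixtime(unix_timestamp(), 'yyyy-MM-dd HH:mm:ss');    //and back to a formatted string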


Conditional expressions: if, case
=====================================
    if works like a ternary operator:
        select id, if( id<5 , 'true', 'false') from user0;    //returns 'true' if id<5, otherwise 'false'

    case when then else end:
         select age, case when age<30 then 'young' 
         when age>=30 and age<=45 then 'middle' 
         else 'old' end 
         from user0;
        
Hive file formats (FileFormat):
=====================================
    

Setting the file format:
    1. create table t1(id int) stored as textfile;
    2. alter table t1 set fileformat textfile;
    3. set hive.default.fileformat=textfile;

Loading data into the other file formats:
    1. A textfile table can be loaded directly with load data.
    2. The other formats must be populated from a textfile table via insert into xx select ...

    TextFile:
        Problem: text files can be compressed, but a splittable compression codec has to be chosen,
        otherwise the whole file ends up in a single map task running on one node. (See the sketch below.)
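
        A hedged sketch using a splittable codec (bzip2); the property names are the standard Hadoop/Hive ones:
            set hive.exec.compress.output=true;
            set mapreduce.output.fileoutputformat.compress=true;
            set mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.BZip2Codec;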

    SequenceFile: sequence file; the compression type can be chosen with
        set hive.exec.compress.output=true;
        set io.seqfile.compression.type=BLOCK;


    However, TextFile and SequenceFile are both row-oriented formats, and Hive has no effective optimization for projection queries over them. (select id,name from xx where id > 10)


Columnar storage:
-----------------------------------
    RCFile:
        Similar to SequenceFile: a flat file of binary key-value pairs. The file is split horizontally into row groups, and one or more row groups are stored per HDFS file.
        Inside a row group, RCFile stores the data column by column.
        The file is splittable, and columns can be skipped while reading.

    ORCFile:
        Optimized RC file.
        Block-compression unit sizes: SequenceFile about 1 MB, RCFile about 4 MB, ORCFile about 256 MB.

        Difference from RCFile: RCFile learns the data types from stored metadata,
                    while ORCFile uses type-specific encoders, which lets it optimize the compression.

        ORCFile also stores per-column statistics: min, max, sum, count.
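
        A hedged sketch (not from the original notes): creating an ORC table with an explicit compression codec via the orc.compress table property.
            create table user_orc(id int, name string, age int)
            stored as orc
            tblproperties("orc.compress"="SNAPPY");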


    Parquet:
        Similar in design to ORC, but Parquet is supported by more components of the Hadoop ecosystem than ORCFile,
        and it supports nested data structures better than ORCFile.


Comparing the performance of the file formats:
    
text:    create table text(id string, name string, pass string, email string, nick string) row format delimited
    fields terminated by '\t'
    stored as textfile;

seqFile:create table seq(id string, name string, pass string, email string, nick string) row format delimited
    fields terminated by '\t'
    stored as sequencefile;

RCFile:    create table rcfile(id string, name string, pass string, email string, nick string) row format delimited
    fields terminated by '\t'
    stored as rcfile;

ORC:    create table orcfile(id string, name string, pass string, email string, nick string) row format delimited
    fields terminated by '\t'
    stored as orcfile;

parquet:create table parquet(id string, name string, pass string, email string, nick string) row format delimited
    fields terminated by '\t'
    stored as parquet;
    

Load the data into the text table: load data local inpath 'd_users.txt' into table text;

Then populate each of the other tables from it with: insert into table xxx select * from text;



Results of count(*) over 8,307,711 rows:
    text:       444.48M    88.225s
    seqFile:    562.7M     66.581s
    RCFile:     232M       43.123s
    ORCFile:    244M       41.309s
    Parquet:    538.7M     OOM        //widely used nonetheless


SerDe:
======================================
    Serialize && Deserialize
    Hive uses a SerDe to map the fields of each record to table columns.

    The basic flow when Hive processes HDFS data:
        1. read the data
        2. determine the RecordReader from the specified InputFormat
        3. use the SerDe to map the fields of each record to the table's columns

    Relationship between SerDe and FileFormat:
        a SerDe works underneath the file format, parsing each record and mapping its fields

    OpenCSVSerDe:    handles ","-separated CSV files (see the sketch at the end of this section)
    LazySimpleSerDe: first reads every field as a string, then converts it to the declared type
    JSONSerDe:       dedicated to parsing JSON strings
    
    Usage (JSONSerDe):
        0. Copy json-serde-1.3.8-jar-with-dependencies.jar into ${HIVE_HOME}/lib and restart Hive.

        1. Create the table:
        create table user_log(`_location` string, `_ip` string, `_action` string, `_uid` string, `_timestamp` string)
        row format serde 'org.openx.data.jsonserde.JsonSerDe'
        stored as textfile;

        Note: the column names must match the keys in the JSON,
              and a column that starts with a special character must be quoted with `` backticks.


        2. Load the table:
            load data local inpath '1.json' into table user_log;

        3. select * from user_log;
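
    For the OpenCSVSerDe mentioned above, a hedged sketch (class and property names as shipped with Hive; the table itself is illustrative):
        create table csv_demo(id string, name string)
        row format serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
        with serdeproperties ("separatorChar" = ",", "quoteChar" = "\"")
        stored as textfile;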



Window functions:
=============================================
    The OVER clause:
        over(sort by age rows between 1 preceding and 1 following)    //row-based window frame
        over(sort by age range between 10 preceding and 10 following)    //range frames the window by a numeric value range
    
    Frame boundaries:
        n preceding    //n before; unbounded preceding = no lower bound
        current row    //the current row
        n following    //n after; unbounded following = no upper bound

    select name, age , sum(age)over(sort by age range between 10 preceding and 10 following) from user1;    //sort first, then aggregate
    select name, age , sum(age)over(rows between 1 preceding and 1 following) from user1;    //each value summed with the row directly before and after it
    select name, age , province ,sum(age)over(partition by province) from user_a;        //partition by [order by]

    

Sample data in user_a (id, name, age, province, city):

8    tom1    60    beijing    beijing
9    Tom2    15    henan    kaifeng
10    tom3    20    hebei    xiongan
11    tom4    25    henan    xinyang
12    tom7    30    henan    kaifeng
13    tom6    35    beijing    beijing
14    tom30    40    henan    xinyang
15    tom100    45    beijing    beijing
16    jerry45    50    hebei    shijiazhuang
17    jerry20    55    hebei    langfang
18    jerry7    60    tianjin    tianjin


Analytic functions
=====================================================
    rank()        //ties share a rank and the following rank is skipped: 1 1 3
    //get the three oldest people in each province (rank only numbers the rows; see the sketch below for the filter)
    select name, province, age , rank()over(partition by province order by age desc) from user_a;
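
    A sketch of the missing filtering step: wrap the ranked result in a subquery and keep rank <= 3.
    select name, province, age
    from (
        select name, province, age,
               rank() over(partition by province order by age desc) as r
        from user_a
    ) t
    where r <= 3;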

    dense_rank()    //ties share a rank, no gap: 1 1 2
    select name, province, age , dense_rank()over(partition by province order by age desc) from user_a;

    row_number()    //strictly sequential: 1 2 3
    select name, province, age , row_number()over(partition by province order by age desc) from user_a;

    
    
    ntile(n)        //splits each partition into n roughly equal groups
    //split the rows into three groups by age
    select name, province, age , ntile(3)over(partition by province order by age desc) from user_a;


    first_value(age)    //the first value in the window
    //get the maximum and minimum age per province
    select name, province, age , first_value(age)over(partition by province order by age desc) as big,
    first_value(age)over(partition by province order by age) as small from user_a;


lead and lag:
    task: find the users who were active in two consecutive months (sample data below, followed by a sketch query)



id    name    date
1    tom    2018-01-01
1    tom    2018-02-01
1    tom    2018-02-03
2    tomson    
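
A minimal sketch for the task above, assuming the sample rows live in a table named t_login(id int, name string, `date` string); the table name is illustrative. The idea: collapse to one row per user per month, use lag() to fetch the previous active month, and keep users whose gap is exactly one month.
    select distinct id, name
    from (
        select id, name, mon,
               lag(mon, 1) over (partition by id order by mon) as prev_mon
        from (
            -- one row per user per month; mon = year*12 + month as a single comparable number
            select distinct id, name,
                   cast(substr(`date`, 1, 4) as int) * 12 + cast(substr(`date`, 6, 2) as int) as mon
            from t_login
            where `date` is not null and `date` <> ''
        ) m
    ) w
    where mon - prev_mon = 1;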