Hive: Partitioned Tables and Bucketed Tables

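Partitioned tables and bucketed tables differ as follows:

1. Partition columns are declared outside the table's column list, so their types must be specified; bucketing uses columns that already exist in the table, so their types are already known and are not declared again.

2. Partitions are declared with the keywords partitioned by (partition_name string); buckets are declared with the keywords clustered by (column_name) into 3 buckets.

3. Partitioning divides data coarsely, while bucketing divides and manages it at a finer granularity. A table can be partitioned first and then bucketed within each partition.

4. A partition column is a pseudo-column: it is not stored in the data files and only corresponds to one directory level in the storage path.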

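I. Hive partitioned tables:

Partitioning in Hive is a logical division of the data. Partition columns are declared outside the table's column list and are not stored in the data files; each one is simply a directory level in the table's HDFS storage path. A table can declare several partition columns. Specifying a partition when inserting data either creates a new subdirectory or adds data to an existing one. The main purpose of partitioning is to avoid full table scans and so speed up queries and computation. By how the partition values are supplied, partitioning can be static, dynamic, or mixed.

Creating a partitioned table: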

create table if not exists tab_partition(
id int,
name string,
age int
)
PARTITIONED BY (year string , month string)
row format delimited 
fields terminated by ','
stored as orc
;

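With static partitioning, the target partition must be named when loading data. Note that load data only moves the file into the partition directory without converting it, so the file must already be in the table's storage format; a comma-delimited text file like the one below can only be read back if the table is declared stored as textfile rather than orc: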

load data local inpath '/data/test.txt' into table tab_partition partition(year='2019',month='05');
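
Once a partition has been loaded, the directory-per-partition layout described above can be observed from queries. The following is a minimal sketch that reuses tab_partition and the partition values from the load statement; the alias row_cnt is only illustrative.

```sql
-- List the partitions that exist for the table, one line per
-- year=.../month=... directory.
show partitions tab_partition;

-- Filtering on the partition columns lets Hive read only the matching
-- directory instead of scanning the whole table.
select id, name, age
from tab_partition
where year = '2019' and month = '05';

-- Partition columns can be selected like ordinary columns even though
-- they are not stored in the data files.
select year, month, count(*) as row_cnt
from tab_partition
group by year, month;
```

A query without a filter on year or month still works, but it has to read every partition directory.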

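With static partitioning, every load has to name its partition, which becomes tedious. Dynamic partitioning avoids this: rows are assigned to partitions according to the values returned by the query. The main difference from static partitioning is that the partition directory is not specified by the user but chosen by the system.

Dynamic partitioning has a strict mode (strict) and a non-strict mode (nonstrict). In strict mode an insert must specify at least one static partition; in non-strict mode an insert may leave all partitions dynamic.

To use dynamic partitioning, first configure hive-site.xml as shown below. Only hive.exec.dynamic.partition.mode concerns dynamic partitioning itself; the remaining properties enable concurrency and transaction support (hive.support.concurrency, hive.txn.manager, and the compactor settings) and bucketing enforcement: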

<property>
    <name>hive.support.concurrency</name>
    <value>true</value>
</property>
<property>
    <name>hive.exec.dynamic.partition.mode</name>
    <value>nonstrict</value>
</property>
<property>
    <name>hive.txn.manager</name>
    <value>org.apache.hadoop.hive.ql.lockmgr.DbTxnManager</value>
</property>
<property>
    <name>hive.compactor.initiator.on</name>
    <value>true</value>
</property>
<property>
    <name>hive.compactor.worker.threads</name>
    <value>1</value>
</property>
<property>
    <name>hive.enforce.bucketing</name>
    <value>true</value>
</property>
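
The dynamic-partition switches can also be set for a single session instead of in hive-site.xml. This is a minimal sketch, assuming a Hive CLI or Beeline session; the two limit values are illustrative and do not come from the configuration above.

```sql
-- Enable dynamic partitioning (already the default in recent Hive releases).
set hive.exec.dynamic.partition = true;

-- Allow inserts in which every partition column is dynamic.
set hive.exec.dynamic.partition.mode = nonstrict;

-- Optional safety limits on how many partitions a single statement may
-- create in total and per mapper/reducer (illustrative values).
set hive.exec.max.dynamic.partitions = 1000;
set hive.exec.max.dynamic.partitions.pernode = 100;
```

Session-level settings apply only to the current connection, which is convenient for trying out a load before changing the cluster-wide configuration.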

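Data cannot be loaded into dynamic partitions with load; use insert into ... select instead. The dynamic partition columns must come last in the select list, in the same order as in the partition clause: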

insert into tab_partition partition(year,month) select id,name,age,year,month from part_tmp;

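Loading data with mixed partitioning (static partition columns come first in the partition clause, and only the dynamic ones are selected):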

insert into tab_partition partition(year='2019',month) select id,name,age,month from part_tmp;
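
The insert statements above read from part_tmp, which the text does not define. One possible shape for it is sketched here, assuming a plain text staging table whose last two columns carry the partition values; the schema and the assumption that /data/test.txt has five comma-separated fields are made up for illustration.

```sql
-- Hypothetical staging table: ordinary columns only, no partitions.
create table if not exists part_tmp(
id int,
name string,
age int,
year string,
month string
)
row format delimited
fields terminated by ','
stored as textfile;

-- A text file can be loaded directly because the table is stored as text.
load data local inpath '/data/test.txt' into table part_tmp;

-- Fully dynamic insert: needs nonstrict mode because no partition is static.
insert into tab_partition partition(year,month)
select id,name,age,year,month from part_tmp;

-- Check which partitions the insert created.
show partitions tab_partition;
```

Going through a text staging table lets Hive rewrite the rows into the target table's own storage format during the insert.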

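II. Hive bucketed tables:

As the data in a table or in a single partition keeps growing, a point is reached where partitioning cannot divide it any further. Bucketing is then used to divide and manage the data at a finer granularity. Bucketing uses columns that are part of the table itself.

Creating a bucketed table: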

create table if not exists tab_bucket(
id int,
name string,
age int
)
clustered by (id) into 4 buckets
row format delimited
fields terminated by ','
stored as orc
;

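Data must be loaded into a bucketed table with insert into ... select, so that rows are hashed into the right buckets.

The number of reducers has to match the number of buckets in the table. There are two ways to ensure this: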

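-- Option 1 (recommended): let Hive enforce bucketing, so rows are distributed
-- automatically according to the table's bucket definition.
-- Hive 2.0 and later always enforce bucketing and no longer have this property.
set hive.enforce.bucketing = true;
insert into table tab_bucket select id,name,age from tmp;

-- Option 2: set the reducer count to the bucket count (4 for tab_bucket)
-- and add a CLUSTER BY clause to the SELECT.
-- mapred.reduce.tasks is the older, deprecated name of the same setting.
set mapreduce.job.reduces = 4;
insert into table tab_bucket select id,name,age from tmp cluster by id;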
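
To check that a load produced the expected buckets, the table metadata and the files on HDFS can be inspected. This is a sketch that assumes the Hive CLI and the default warehouse location; the path is an assumption, and the real one is shown in the Location field of the describe output.

```sql
-- Shows "Num Buckets" and "Bucket Columns" in the storage information.
describe formatted tab_bucket;

-- From the Hive CLI, list the table directory. A single correctly
-- bucketed insert normally yields one file per bucket (4 here).
dfs -ls /user/hive/warehouse/tab_bucket;
```

If the file count does not match the bucket count after one insert, the load most likely ran without bucketing enforcement or with a different number of reducers.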

Querying a bucketed table:

-- Query all rows
select * from tab_bucket;
-- Sampling query: take the hash of id modulo 4 and return the rows of bucket 1
select * from tab_bucket tablesample(bucket 1 out of 4 on id);
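
In tablesample(bucket x out of y on col), y sets how many slices the rows are divided into and x selects which slice is returned. The queries below sketch how y relates to the 4 buckets of tab_bucket; the stated fractions assume the id values hash evenly, so actual row counts depend on the data.

```sql
-- y equals the bucket count: returns the rows of one of the 4 buckets.
select * from tab_bucket tablesample(bucket 2 out of 4 on id);

-- y smaller than the bucket count: roughly half of the table.
select * from tab_bucket tablesample(bucket 1 out of 2 on id);

-- y larger than the bucket count: roughly one eighth of the table.
select * from tab_bucket tablesample(bucket 1 out of 8 on id);

-- Count the rows in one slice to see how evenly the data is spread.
select count(*) from tab_bucket tablesample(bucket 1 out of 4 on id);
```

Sampling on the same column the table is clustered by is what allows Hive to use the bucket layout instead of hashing every row at query time.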

Creating a table that is both partitioned and bucketed:

create table if not exists tab_partition_bucket(
id int,
name string,
age int
)
partitioned by (province string)
clustered by (id) sorted by (id desc) into 3 buckets
row format delimited 
fields terminated by ','
stored as orc
;
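
A table that is both partitioned and bucketed is loaded with insert ... select as well: each row goes to its province partition and is then hashed on id into one of that partition's 3 buckets. This is a minimal sketch, assuming a hypothetical source table tmp_province with columns id, name, age and province; the province value in the last query is also made up.

```sql
-- All partitions are dynamic here, so nonstrict mode is required.
set hive.exec.dynamic.partition.mode = nonstrict;

-- On Hive versions before 2.0, also run:
--   set hive.enforce.bucketing = true;
--   set hive.enforce.sorting = true;

-- The partition column comes last in the select list.
insert into table tab_partition_bucket partition(province)
select id,name,age,province from tmp_province;

-- Sample one bucket within a single partition.
select * from tab_partition_bucket tablesample(bucket 1 out of 3 on id)
where province = 'zhejiang';
```

Combining the partition filter with bucket sampling narrows the read first to one directory and then to one slice of it.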