2.6 Hive Partitioned Tables

I. Background

#########
A partitioned table corresponds to a dedicated directory on the HDFS file system, and that directory holds all of the partition's data files.
Partitioning in Hive simply means partitioning into subdirectories: a large dataset is split into smaller ones according to business needs.

At query time, an expression in the WHERE clause selects only the partitions the query needs, which makes such queries much more efficient.


#########
A Hive SELECT normally scans the entire table, wasting a lot of time on unnecessary work. Often only a small part of the table is of interest, so the partition concept was introduced at table-creation time.

A partitioned table is one whose partition layout is declared when the table is created.

To create a partitioned table, add the optional PARTITIONED BY clause to the CREATE TABLE statement; see the table-creation syntax for details.
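
As a minimal sketch of that clause (the access_log table and its columns are hypothetical, not from this chapter):

```sql
-- minimal sketch of a partitioned table; access_log and its columns are hypothetical
create table if not exists default.access_log (
  ip string,
  url string
)
partitioned by (month string)    -- partition column: declared here, NOT in the column list above
row format delimited fields terminated by '\t';
```

Note that the partition column behaves like an ordinary column in queries, but its value is stored in the directory name rather than in the data files.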


II. Operations

1. Create a partitioned table

# create
hive (default)> create EXTERNAL table IF NOT EXISTS default.emp_partition(
              > empno int,
              > ename string,
              > job string,
              > mgr int,
              > hiredate string,
              > sal double,
              > comm double,
              > deptno int
              > )
              > partitioned by (month string)
              > ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
OK
Time taken: 0.592 seconds


#
hive (default)> show tables;
OK
tab_name
......
emp_partition


2. Load data into the partitioned table

##
hive (default)> load data local inpath '/opt/datas/emp.txt' into table default.emp_partition partition(month='201509');    # month: the partition column declared at table creation
Copying data from file:/opt/datas/emp.txt
Copying file: file:/opt/datas/emp.txt
Loading data to table default.emp_partition partition (month=201509)
Partition default.emp_partition{month=201509} stats: [numFiles=1, numRows=0, totalSize=659, rawDataSize=0]
OK
Time taken: 0.549 seconds


# inspect the HDFS directory
# a month=201509 directory was created, with the data file underneath it
hive (default)> dfs -ls -R /user/hive/warehouse/emp_partition/;
drwxr-xr-x   - root supergroup          0 2019-04-22 17:17 /user/hive/warehouse/emp_partition/month=201509
-rw-r--r--   1 root supergroup        659 2019-04-22 17:17 /user/hive/warehouse/emp_partition/month=201509/emp.txt


3. Query

###
hive (default)> select * from emp_partition where month = '201509';    # reads only the files under that partition directory
OK
emp_partition.empno    emp_partition.ename    emp_partition.job    emp_partition.mgr    emp_partition.hiredate    emp_partition.sal    emp_partition.comm    emp_partition.deptno    emp_partition.month
7369    SMITH    CLERK    7902    1980-12-17    800.0    NULL    20    201509
7499    ALLEN    SALESMAN    7698    1981-2-20    1600.0    300.0    30    201509
7521    WARD    SALESMAN    7698    1981-2-22    1250.0    500.0    30    201509
7566    JONES    MANAGER    7839    1981-4-2    2975.0    NULL    20    201509
7654    MARTIN    SALESMAN    7698    1981-9-28    1250.0    1400.0    30    201509
7698    BLAKE    MANAGER    7839    1981-5-1    2850.0    NULL    30    201509
7782    CLARK    MANAGER    7839    1981-6-9    2450.0    NULL    10    201509
7788    SCOTT    ANALYST    7566    1987-4-19    3000.0    NULL    20    201509
7839    KING    PRESIDENT    NULL    1981-11-17    5000.0    NULL    10    201509
7844    TURNER    SALESMAN    7698    1981-9-8    1500.0    0.0    30    201509
7876    ADAMS    CLERK    7788    1987-5-23    1100.0    NULL    20    201509
7900    JAMES    CLERK    7698    1981-12-3    950.0    NULL    30    201509
7902    FORD    ANALYST    7566    1981-12-3    3000.0    NULL    20    201509
7934    MILLER    CLERK    7782    1982-1-23    1300.0    NULL    10    201509
Time taken: 0.355 seconds, Fetched: 14 row(s)
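

To confirm that the WHERE clause really prunes down to one partition, EXPLAIN can be used (a sketch; the exact plan output varies across Hive versions):

```sql
-- the plan should show only the month=201509 partition being scanned
explain select count(*) from emp_partition where month = '201509';
```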


For example, suppose each month's logs are stored in one partition and we want the distinct IP count for each of three months (illustrative; this assumes the log table has an ip column):

# query each month's partition, then merge the results; UNION ALL keeps all three rows, whereas plain UNION would deduplicate equal counts
select count(distinct ip) from emp_partition where month='201509'
union all
select count(distinct ip) from emp_partition where month='201508'
union all
select count(distinct ip) from emp_partition where month='201507'
;

bin/hive -f xx.sql        # can also be saved to a script file and run this way
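
As a hedged sketch (the file path is illustrative), the three-month query can be saved to a script and run non-interactively; tagging each row with its month makes the merged output readable:

```sql
-- hypothetical script file /opt/datas/ip_count.sql, run as: bin/hive -f /opt/datas/ip_count.sql
-- (assumes a log table with an ip column, as in the example above)
select '201509' as month, count(distinct ip) as ip_cnt from emp_partition where month='201509'
union all
select '201508', count(distinct ip) from emp_partition where month='201508'
union all
select '201507', count(distinct ip) from emp_partition where month='201507';
```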


III. Two-Level Partitions

The example above has a single partition level; here is a two-level partition:

1. Create the table

###
# note: a table named emp_partition already exists from the example above; drop it first or use a different name, otherwise IF NOT EXISTS makes this statement a no-op
create EXTERNAL table IF NOT EXISTS default.emp_partition(
empno int,
ename string,
job string,
mgr int,
hiredate string,
sal double,
comm double,
deptno int
)
partitioned by (month string, day string)        # just add a second partition column
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

2. Load data

###
load data local inpath '/opt/datas/emp.txt' into table default.emp_partition partition(month='201509', day='13');

3. Query

select * from emp_partition where month = '201509' and day = '13';
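
The nested layout can be inspected the same way as before (a sketch; paths assume the default warehouse location):

```sql
-- inside the hive CLI; expect one directory per partition level:
dfs -ls -R /user/hive/warehouse/emp_partition/;
-- expected layout (sketch): .../emp_partition/month=201509/day=13/emp.txt
```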


IV. Notes on Partitioned Tables

1. Repairing a partitioned table, method one

1. As the external-table example in the previous section showed, once the table schema is defined you can simply put data files into the table directory and query them.
A partitioned table does not work that way: if you create the partitioned table and put data straight into the data directory, queries find nothing.
The PARTITIONS table in the MySQL metastore database records a partitioned table's partition metadata, as follows:
mysql> select * from PARTITIONS;
+---------+-------------+------------------+--------------+-------+--------+
| PART_ID | CREATE_TIME | LAST_ACCESS_TIME | PART_NAME    | SD_ID | TBL_ID |
+---------+-------------+------------------+--------------+-------+--------+
|       1 |  1555924635 |                0 | month=201509 |    22 |     21 |
+---------+-------------+------------------+--------------+-------+--------+

Next, create a partitioned table to demonstrate how to make directly-placed data queryable:
# create
hive (default)> create table if not exists default.dept_part(
              > deptno int,
              > dname string,
              > loc string
              > )
              > partitioned by (day string)              
              > ROW FORMAT DELIMITED FIELDS TERMINATED BY '	';
OK
Time taken: 0.039 seconds

hive (default)> show tables;
OK
tab_name
dept_part


2. Put data into this table's partition directory:
hive (default)> dfs -mkdir -p /user/hive/warehouse/dept_part/day=20150913;    # first create a partition directory

hive (default)> dfs -put /opt/datas/dept.txt /user/hive/warehouse/dept_part/day=20150913;    # put the data file in it

hive (default)> select * from dept_part;        # no data visible yet
OK
dept_part.deptno    dept_part.dname    dept_part.loc    dept_part.day
Time taken: 0.034 seconds

And the PARTITIONS table in the MySQL metastore database still has no record for this partition:
mysql> select * from PARTITIONS;
+---------+-------------+------------------+--------------+-------+--------+
| PART_ID | CREATE_TIME | LAST_ACCESS_TIME | PART_NAME    | SD_ID | TBL_ID |
+---------+-------------+------------------+--------------+-------+--------+
|       1 |  1555924635 |                0 | month=201509 |    22 |     21 |
+---------+-------------+------------------+--------------+-------+--------+
1 row in set (0.00 sec)


3. How do we make the data queryable?
hive (default)> msck repair table dept_part;        # repair the partition metadata
OK
Partitions not in metastore:    dept_part:day=20150913
Repair: Added partition to metastore dept_part:day=20150913
Time taken: 0.092 seconds, Fetched: 2 row(s)


hive (default)> select * from dept_part;            # now the data is visible
OK
dept_part.deptno    dept_part.dname    dept_part.loc    dept_part.day
10    ACCOUNTING    NEW YORK    20150913
20    RESEARCH    DALLAS    20150913
30    SALES    CHICAGO    20150913
40    OPERATIONS    BOSTON    20150913
Time taken: 0.037 seconds, Fetched: 4 row(s)


mysql> select * from PARTITIONS;    # and the metastore's PARTITIONS table in MySQL gained a row for the new partition
+---------+-------------+------------------+--------------+-------+--------+
| PART_ID | CREATE_TIME | LAST_ACCESS_TIME | PART_NAME    | SD_ID | TBL_ID |
+---------+-------------+------------------+--------------+-------+--------+
|       1 |  1555924635 |                0 | month=201509 |    22 |     21 |
|       2 |  1555986070 |                0 | day=20150913 |    24 |     22 |
+---------+-------------+------------------+--------------+-------+--------+
2 rows in set (0.00 sec)


2. Repairing a partitioned table, method two

# create another partition directory under dept_part
hive (default)> dfs -mkdir -p /user/hive/warehouse/dept_part/day=20150914;               
hive (default)> dfs -put /opt/datas/dept.txt /user/hive/warehouse/dept_part/day=20150914;


# at this point dept_part still shows only 4 rows
hive (default)> select * from dept_part;                                                 
OK
dept_part.deptno    dept_part.dname    dept_part.loc    dept_part.day
10    ACCOUNTING    NEW YORK    20150913
20    RESEARCH    DALLAS    20150913
30    SALES    CHICAGO    20150913
40    OPERATIONS    BOSTON    20150913
Time taken: 0.038 seconds, Fetched: 4 row(s)


# now ALTER the table to register the new partition
hive (default)> alter table dept_part add partition(day='20150914');
OK
Time taken: 0.065 seconds
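
A related variant (a sketch; the path and date below are illustrative): ADD PARTITION also accepts a LOCATION clause, which registers a partition whose data lives at an explicit path instead of the default layout:

```sql
-- register a partition pointing at an explicit directory
alter table dept_part add partition (day='20150915')
location '/user/hive/warehouse/dept_part/day=20150915';
```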


# query again: 8 rows now
hive (default)> select * from dept_part;                            
OK
dept_part.deptno    dept_part.dname    dept_part.loc    dept_part.day
10    ACCOUNTING    NEW YORK    20150913
20    RESEARCH    DALLAS    20150913
30    SALES    CHICAGO    20150913
40    OPERATIONS    BOSTON    20150913
10    ACCOUNTING    NEW YORK    20150914
20    RESEARCH    DALLAS    20150914
30    SALES    CHICAGO    20150914
40    OPERATIONS    BOSTON    20150914
Time taken: 0.033 seconds, Fetched: 8 row(s)


# MySQL also gained one more partition record
mysql> select * from PARTITIONS;
+---------+-------------+------------------+--------------+-------+--------+
| PART_ID | CREATE_TIME | LAST_ACCESS_TIME | PART_NAME    | SD_ID | TBL_ID |
+---------+-------------+------------------+--------------+-------+--------+
|       1 |  1555924635 |                0 | month=201509 |    22 |     21 |
|       2 |  1555986070 |                0 | day=20150913 |    24 |     22 |
|       3 |  1555986603 |                0 | day=20150914 |    25 |     22 |
+---------+-------------+------------------+--------------+-------+--------+
3 rows in set (0.00 sec)


# we can also filter with a WHERE clause in the SELECT; this is the benefit of partitioning
hive (default)> select * from dept_part where day=20150914; 
OK
dept_part.deptno    dept_part.dname    dept_part.loc    dept_part.day
10    ACCOUNTING    NEW YORK    20150914
20    RESEARCH    DALLAS    20150914
30    SALES    CHICAGO    20150914
40    OPERATIONS    BOSTON    20150914
Time taken: 0.06 seconds, Fetched: 4 row(s)


3. List a table's partitions

hive (default)> show partitions dept_part;
OK
partition
day=20150913
day=20150914
Time taken: 0.048 seconds, Fetched: 2 row(s)
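
The reverse operation also exists (a sketch; note that dept_part was created without EXTERNAL, so for this managed table dropping a partition deletes its data directory too):

```sql
-- remove one partition's metadata (and, for a managed table, its data)
alter table dept_part drop partition (day='20150914');
-- verify: show partitions dept_part;   (only day=20150913 should remain)
```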
Original source: https://www.cnblogs.com/weiyiming007/p/10755211.html