Hive之Partition的使用

参考：
http://blog.csdn.net/qiaochao911/article/details/8613988
http://xm-king.iteye.com/blog/1088422

HIVE分区，实际上是通过一个路径来标识的，而不是在物理数据中。比如每天的数据，可能分区是pt=20121023这样，那么路径中它就会变成：/hdfs/path/pt=20121023/data_files。通过路径来标识的好处是，如果我们需要取特定分区的数据，只需要把这个路径下的数据取出来就可以了，不用扫描全部的数据。

HIVE默认是静态分区。但是有时候可能需要动态创建不同的分区，比如商品信息，我想根据它是否在线分成两个分区，这样后续如果要取在线商品，就只需要从在线的分区获取即可。动态分区可以通过下面的设置来打开：

[sql] view plaincopy

    set hive.exec.dynamic.partition=true;  
    set hive.exec.dynamic.partition.mode=nonstrict;  


然后代码里就可以这么写：

[sql] view plaincopy

    insert overwrite table tbl_name partition(pt, if_online)  
    select field1, field2, ..., pt, if_online  
    from tbl  
    where xxx;  


注意输入字段的最后面必须是动态分区字段。

看一下与静态分区写法的区别：

[sql] view plaincopy

    insert overwrite table tbl_name partition(pt=20121023, if_online=1)  
    select field1, field2, ..., fieldn  
    from tbl  
    where xxx;  


动态分区与静态分区还有一个细微的差别是，静态分区一定会创建分区，不管SELECT语句的结果有没有数据。而动态分区，只有在SELECT结果的记录数>0的时候，才会创建分区。因此在不同的业务场景下，可能会选择不同的方案。

另外使用动态分区时需要注意的比较重要的一点是，动态分区会为每一个分区分配reduce数。比如说你在脚本上面写了：set mapred.reduce.tasks=100;

并且有两个分区：pt, if_online。如果结果集中pt=20121023，if_online=0/1，那么它就会为pt=20121023/if_online=0，pt=20121023/if_online=1各分配100个reduce。也就是说，namenode会同时处理200个文件的写操作。这在分区值很多的情况下，会成为一个灾难，容易直接把namenode给搞挂掉，是非常危险的。因此使用动态分区时，一定要清楚地知道产生的动态分区值，并且合理地设置reduce数量。

hive可以通过partitions将表粗粒度划分为不同的目录来提高查询的效率，例如包含时间戳的日志文件，如果我们按照时间来把日志文件分在不同的目录下，那么相同日期的记录就会存储在同一个分区目录下面，那我们就可以更高效率地查询特定某个时间的记录了

hive (economy)> describe stocks;
OK
exchange    string    
symbol    string    
ymd    string    
price_open    float    
price_high    float    
price_low    float    
price_close    float    
volumn    int    
price_adj_close    float    
Time taken: 0.108 seconds

hive (economy)> alter table partition_stocks add if not exists
              > partition(exchange='NASDAQ',symbol='ABXA');
OK
Time taken: 1.834 seconds
hive (economy)> show partitions partition_stocks;
OK
exchange=NASDAQ/symbol=ABXA
Time taken: 0.155 seconds


hive (economy)> insert overwrite table partition_stocks partition(exchange='NASDAQ',      
              > symbol='ABXA')                                                            
              > select s.ymd,s.price_open,s.price_high,s.price_low,s.price_close,s.volumn,
              > s.price_adj_close from stocks s                                           
              > where s.exchange='NASDAQ' and s.symbol='ABXA';                            
Total MapReduce jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_201303271617_0021, Tracking URL = http://localhost:50030/jobdetails.jsp?jobid=job_201303271617_0021
Kill Command = /home/landen/UntarFile/hadoop-1.0.4/libexec/../bin/hadoop job  -kill job_201303271617_0021
Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 0
2013-03-31 20:50:01,631 Stage-1 map = 0%,  reduce = 0%
2013-03-31 20:50:10,656 Stage-1 map = 16%,  reduce = 0%, Cumulative CPU 8.34 sec
2013-03-31 20:50:11,659 Stage-1 map = 16%,  reduce = 0%, Cumulative CPU 8.34 sec
2013-03-31 20:50:12,662 Stage-1 map = 16%,  reduce = 0%, Cumulative CPU 8.34 sec
2013-03-31 20:50:13,665 Stage-1 map = 20%,  reduce = 0%, Cumulative CPU 8.34 sec
2013-03-31 20:50:14,681 Stage-1 map = 20%,  reduce = 0%, Cumulative CPU 8.34 sec
2013-03-31 20:50:15,684 Stage-1 map = 20%,  reduce = 0%, Cumulative CPU 8.34 sec
2013-03-31 20:50:16,687 Stage-1 map = 52%,  reduce = 0%, Cumulative CPU 8.34 sec
2013-03-31 20:50:17,720 Stage-1 map = 52%,  reduce = 0%, Cumulative CPU 8.34 sec
2013-03-31 20:50:18,723 Stage-1 map = 52%,  reduce = 0%, Cumulative CPU 8.34 sec
2013-03-31 20:50:19,787 Stage-1 map = 57%,  reduce = 0%, Cumulative CPU 8.34 sec
2013-03-31 20:50:20,791 Stage-1 map = 57%,  reduce = 0%, Cumulative CPU 8.34 sec
2013-03-31 20:50:21,797 Stage-1 map = 57%,  reduce = 0%, Cumulative CPU 8.34 sec
2013-03-31 20:50:22,800 Stage-1 map = 84%,  reduce = 0%, Cumulative CPU 8.34 sec
2013-03-31 20:50:23,803 Stage-1 map = 84%,  reduce = 0%, Cumulative CPU 8.34 sec
2013-03-31 20:50:24,807 Stage-1 map = 84%,  reduce = 0%, Cumulative CPU 8.34 sec
2013-03-31 20:50:25,837 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 21.0 sec
2013-03-31 20:50:26,841 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 21.0 sec
2013-03-31 20:50:27,844 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 21.0 sec
2013-03-31 20:50:28,848 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 21.0 sec
2013-03-31 20:50:29,852 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 21.0 sec
2013-03-31 20:50:30,884 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 21.0 sec
2013-03-31 20:50:31,888 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 21.0 sec
MapReduce Total cumulative CPU time: 21 seconds 0 msec
Ended Job = job_201303271617_0021
Ended Job = 61034062, job is filtered out (removed at runtime).
Ended Job = 914476855, job is filtered out (removed at runtime).
Launching Job 3 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_201303271617_0022, Tracking URL = http://localhost:50030/jobdetails.jsp?jobid=job_201303271617_0022
Kill Command = /home/landen/UntarFile/hadoop-1.0.4/libexec/../bin/hadoop job  -kill job_201303271617_0022
Hadoop job information for Stage-3: number of mappers: 1; number of reducers: 0
2013-03-31 20:50:40,727 Stage-3 map = 0%,  reduce = 0%
2013-03-31 20:50:46,740 Stage-3 map = 100%,  reduce = 0%, Cumulative CPU 1.05 sec
2013-03-31 20:50:47,743 Stage-3 map = 100%,  reduce = 0%, Cumulative CPU 1.05 sec
2013-03-31 20:50:48,746 Stage-3 map = 100%,  reduce = 0%, Cumulative CPU 1.05 sec
2013-03-31 20:50:49,749 Stage-3 map = 100%,  reduce = 0%, Cumulative CPU 1.05 sec
2013-03-31 20:50:50,753 Stage-3 map = 100%,  reduce = 0%, Cumulative CPU 1.05 sec
2013-03-31 20:50:51,760 Stage-3 map = 100%,  reduce = 0%, Cumulative CPU 1.05 sec
2013-03-31 20:50:52,763 Stage-3 map = 100%,  reduce = 100%, Cumulative CPU 1.05 sec
MapReduce Total cumulative CPU time: 1 seconds 50 msec
Ended Job = job_201303271617_0022
Loading data to table economy.partition_stocks partition (exchange=NASDAQ, symbol=ABXA)
Deleted hdfs://localhost:9000/home/landen/UntarFile/hive-0.10.0/user/hive/warehouse/economy.db/partition_stocks/exchange=NASDAQ/symbol=ABXA
Partition economy.partition_stocks{exchange=NASDAQ, symbol=ABXA} stats: [num_files: 1, num_rows: 0, total_size: 67679, raw_data_size: 0]
Table economy.partition_stocks stats: [num_partitions: 1, num_files: 1, num_rows: 0, total_size: 67679, raw_data_size: 0]
1608 Rows loaded to partition_stocks
MapReduce Jobs Launched: 
Job 0: Map: 2   Cumulative CPU: 21.0 sec   HDFS Read: 481098497 HDFS Write: 67679 SUCCESS
Job 1: Map: 1   Cumulative CPU: 1.05 sec   HDFS Read: 67989 HDFS Write: 67679 SUCCESS
Total MapReduce CPU Time Spent: 22 seconds 50 msec
OK
Time taken: 66.143 seconds

hive (economy)> select * from partition_stocks 
              > where exchange = 'NASDAQ' and symbol = 'ABXA';

2005-03-16    7.97    7.97    7.91    7.92    300200    7.92    NASDAQ    ABXA
2005-03-15    7.92    7.97    7.92    7.97    42500    7.97    NASDAQ    ABXA
2005-03-14    7.93    7.99    7.92    7.92    109700    7.92    NASDAQ    ABXA
2005-03-11    7.96    7.96    7.91    7.94    188200    7.94    NASDAQ    ABXA
2005-03-10    7.99    8.0    7.93    7.96    1275900    7.96    NASDAQ    ABXA
2005-03-09    8.01    8.02    7.78    8.0    198600    8.0    NASDAQ    ABXA
2005-03-08    8.1    8.1    8.0    8.01    81600    8.01    NASDAQ    ABXA
2005-03-07    8.05    8.11    8.05    8.11    70300    8.11    NASDAQ    ABXA
2005-03-04    8.19    8.19    8.05    8.12    116300    8.12    NASDAQ    ABXA
2005-03-03    8.09    8.2    8.09    8.2    64200    8.2    NASDAQ    ABXA
2005-03-02    8.29    8.45    8.18    8.2    197600    8.2    NASDAQ    ABXA
2005-03-01    8.07    8.35    8.06    8.25    332000    8.25    NASDAQ    ABXA
2005-02-28    8.0    8.09    7.81    8.05    241000    8.05    NASDAQ    ABXA
2005-02-25    7.55    8.3    7.55    8.1    1114100    8.1    NASDAQ    ABXA
2005-02-24    7.01    7.3    7.0    7.3    174100    7.3    NASDAQ    ABXA
2005-02-23    7.22    7.22    6.9    7.06    542100    7.06    NASDAQ    ABXA
2005-02-22    7.56    7.56    7.25    7.26    209000    7.26    NASDAQ    ABXA
2005-02-18    7.7    7.8    7.58    7.58    1525500    7.58    NASDAQ    ABXA
2005-02-17    7.41    7.7    7.41    7.65    75200    7.65    NASDAQ    ABXA
2005-02-16    7.51    7.51    7.38    7.45    1297600    7.45    NASDAQ    ABXA
2005-02-15    7.58    7.61    7.41    7.56    248400    7.56    NASDAQ    ABXA
2005-02-14    7.78    7.81    7.53    7.6    1487200    7.6    NASDAQ    ABXA
2005-02-11    7.92    7.92    7.73    7.88    265400    7.88    NASDAQ    ABXA
2005-02-10    8.03    8.03    7.91    7.93    205100    7.93    NASDAQ    ABXA

Now Let's learn about dynamic partitions properties:
1. hive.exec.dynamic.partition = false by default. Set to true to enable dynamic partitioning;
2. hive.exec.dynamic.partition.mode = strict by default. Set to nonstrict to enable all partitions to be determined dynamiclly;
3. hive.exec.max.dynamic.partition.pernode = 100 by default. The maximum number of dynamic partitions that can be created by mapper or reducer. Raises a fatal error if
one mapper or reducer attempts to create more than the threshold;
4. hive.exec.max.dynamic.partitions = +1000 by default. The total number of dynamic partitions that can be created by one statement with dynamic partitioning. Raises a fatal error
if the limit is exceeded;
5. hive.exec.max.created.files = 100000 by default. The maximum total number of files that can be created globally. A Hadoop counter is used to track the number of files created.
Raises a fatal error if the limit is exceeded.